A partially function-to-topic model for protein function prediction
BMC Genomics volume 19, Article number: 883 (2018)
Abstract
Background
Proteins are macromolecules and the main component of a cell, which makes them the most essential and versatile material of life. Research on protein function is of great significance in decoding the secrets of life. In recent years, researchers have introduced multi-label supervised topic models such as Labeled Latent Dirichlet Allocation (LabeledLDA) into protein function prediction, obtaining more accurate and more interpretable predictions. However, LabeledLDA associates each label (GO term) directly with a single corresponding topic, which makes the latent topics completely degenerate and ignores the differences between labels and latent topics.
Result
To model function labels probabilistically with greater accuracy, we propose a Partially Function-to-Topic Prediction (PFTP) model, which introduces a local topic subset corresponding to each function label. PFTP supports not only a latent topic subset within each given function label but also a background topic corresponding to a ‘fake’ function label, which represents the common semantics of protein function. Related definitions and the topic modeling process of PFTP are described in this paper. In a 5-fold cross-validation experiment on yeast and human datasets, PFTP significantly outperforms five widely adopted methods for protein function prediction. The impact of model parameters on prediction performance and the latent topics discovered by PFTP are also discussed.
Conclusion
All of the experimental results provide evidence that PFTP is effective and has potential value for predicting protein function. Because it can discover a more refined latent substructure of function labels, we anticipate that PFTP could help reveal deeper biological explanations for protein functions.
Background
Proteins are the main component of a cell and underlie the basic activity of cellular life. Research on protein function is of great significance in elucidating the phenomena of life [1]. Although a large number of protein sequences have accumulated in biological databases in recent years [2, 3], only a small percentage of these proteins have experimental function annotations because of the high cost of biochemical experiments. In comparison with biochemical experiments, computational methods predict the functional annotations of proteins from known information, such as sequence, structure, and functional behavior, which reduces time and effort; such methods have become an important, long-standing line of research in the post-genomic era [4].
The earliest computational approaches to predicting protein function used protein sequence or structure similarity to transfer functional information, as in BLAST [5]. With the rapid development of computational algorithms, an increasing variety of algorithms has been introduced into protein function prediction. At present, computational methods of protein function prediction can be classified into two types: classification-based approaches and graph-based approaches. In classification-based approaches, proteins are viewed as instances to be classified, and function annotations (such as Gene Ontology (GO) [6] terms) are regarded as labels. Each protein has a feature space composed of classification features extracted from the amino acid sequence, textual repositories, and so on. Based on annotated proteins and their features, we can train a classifier on a training dataset and then predict function labels for unannotated proteins. In graph-based approaches, the network structure information of proteins is used to compute the distance between proteins, and closely related proteins are then considered to have similar functional annotations [7, 8].
In classification-based approaches, since each protein is annotated with several functions, various multi-label classifiers can be adopted. Yu et al. [9] proposed a multiple kernels (ProMK) method to process multiple heterogeneous protein data sources for predicting protein functions; Fodeh et al. [10] used binary relevance with different classifiers to automatically assign molecular functions to genes; a new ant colony optimization algorithm was proposed in reference [11] and applied to a protein function dataset; Wang et al. [12] applied a new multi-label linear discriminant analysis approach to the protein function prediction problem; Liu et al. [4] introduced a multi-label supervised topic model called LabeledLDA into protein function prediction, and their experimental results on yeast and human datasets demonstrated its effectiveness. That work was the first effort to apply a multi-label supervised topic model to protein function prediction. In addition, Pinoli et al. [13,14,15] applied two standard topic models, latent Dirichlet allocation (LDA) and probabilistic latent semantic analysis (PLSA) [16, 17], to predict GO terms of proteins on the basis of available GO annotations.
In the topic modeling process of reference [4], each protein is viewed as a mixture of ‘topics’, where each ‘topic’ is in turn viewed as a mixture of amino acid blocks. In comparison with discriminative models, such as the support vector machine (SVM), a multi-label supervised topic model can transform the word-level statistics of each document into its label-level distribution, and model all labels simultaneously rather than treating each label independently. Specifically, a topic model can provide the function label probability distribution over proteins as an output, and each function label is explained as a probability distribution over amino acid blocks. Nonetheless, in the study of Liu et al. [4], LabeledLDA associates each label (GO term) directly with a single corresponding topic, which makes the latent topics completely degenerate and ignores the differences between labels and latent topics. Therefore, LabeledLDA is not able to discover a topic that represents the common semantics of protein functions. For interpretable text mining, Ramage et al. [18] proposed partially labeled LDA (PLDA), which associates each label with a topic subset partitioned from the global topic set. PLDA overcame the shortfalls of LabeledLDA and improved the precision of text classification in experimental research.
Inspired by the application of multi-label topic models to protein function prediction and by the PLDA model, we introduce a Partially Function-to-Topic Prediction model (called PFTP). First, we describe the related definitions by contrasting text data and protein function data. Then the topic modeling process of PFTP is described in detail, including its generative process and parameter estimation. In a 5-fold cross-validation experiment on predicting protein function, PFTP significantly outperforms the five algorithms it is compared with. All of the experimental results provide evidence that PFTP is effective and has potential value for predicting protein function.
Methods
Related definitions and notations
To better understand the objects involved in topic modeling, the correspondence between protein function prediction and multi-label classification of text is first depicted in Fig. 1.
Figure 1 displays several topic modeling concepts for protein function data and text data side by side. The text dataset is composed of documents numbered D1 to Dn, and the protein function dataset is composed of protein sequences numbered P1 to Pn. Words, such as ‘table’ and ‘database’, are the main component of a document. A protein sequence is treated as a text string defined over a fixed alphabet of 20 amino acids (G, A, V, L, I, F, P, Y, S, C, M, N, Q, T, D, E, K, R, H, W), so amino acid blocks, such as ‘MS’ and ‘TS’, are the main component of a protein sequence. Moreover, a protein annotated with GO terms is equivalent to a document labeled with tags, so each GO term or tag can be viewed as a label, such as ‘GO0003673’ and ‘language’. Accordingly, there are three equivalence relations between protein function data and text data: protein sequence and document, amino acid block and word, and GO term and document tag. In general, the GO terms (document tags), protein sequences (documents) and amino acid blocks (words) are the observable data of a dataset.
As the input of a topic model, the bag of words (BoW) is constructed by computing the word-document matrix, in which each element counts the occurrences of a word in a document. For instance, the word ‘table’ appears twice in document D1. Likewise, for protein function data, an amino acid block-protein sequence matrix is computed to construct the protein BoW. For example, the amino acid block ‘MS’ appears once in protein P1. The fixed set of amino acid blocks or words is also called the ‘vocabulary’.
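The counting step can be illustrated with a minimal Python sketch. It assumes that blocks are overlapping substrings of a fixed length; the exact segmentation follows reference [4] and is not restated here, so the function name `block_counts` and the toy sequences are hypothetical.

```python
from collections import Counter

def block_counts(sequence, k=2):
    """Count overlapping amino acid blocks of length k in one sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

# Toy sequences, not taken from the datasets used in this paper.
proteins = {"P1": "MSTSA", "P2": "MSMSK"}

# Per-protein block counts, then the vocabulary of all observed blocks.
counts = {name: block_counts(seq) for name, seq in proteins.items()}
vocabulary = sorted(set().union(*counts.values()))

# Block-protein matrix: one row per block, one column per protein.
matrix = [[counts[name][block] for name in proteins] for block in vocabulary]
```

In this toy input the block ‘MS’ occurs once in P1 and twice in P2, so its row in `matrix` holds those two counts.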
In a topic model, a ‘topic’ is viewed as a probability distribution over a fixed vocabulary. Taking the text data as an example, the probability of the word ‘table’ under ‘topic 1’ is 0.05. For the protein data, the probability of the amino acid block ‘MS’ under ‘topic 1’ is 0.21. Topics are latent and need to be inferred by topic modeling. Finally, to establish the connection between labels and topics, the latent topics discovered by PFTP are divided into several non-overlapping subsets, each of which is associated with a label. As can be seen in Fig. 1, we split the whole topic set into several groups: ‘label1’ connects with ‘topic1’ to ‘topic3’; ‘label2’ connects with ‘topic 4’ to ‘topic 5’, and so on. It is worth noting that PFTP defines a special type of topic, the background topics. The background topics are split off from the whole latent topic set and are not associated with any observable label; they express the common semantics of documents. For instance, background topics on a text dataset may be topics with a high probability on several universal words, such as ‘text’, ‘other’ and so on. To formalize the above description, the related notations are given below.
Suppose there are D proteins in the protein set, which compose the protein space \( \mathbb{D}=\left\{1,\dots, D\right\} \), and the vocabulary of amino acid blocks is the space \( \mathbb{W}=\left\{1,\dots, W\right\} \), where W is the size of the vocabulary. The topic space containing K topics is represented by \( \mathbb{K}=\left\{1,\dots, K\right\} \), which is shared by the whole protein set; \( \mathbb{K} \) is therefore also called the global topic space. The protein function label space is expressed as \( \mathbb{L}=\left\{1,\dots, L\right\} \).
In the PFTP model, the global topic space \( \mathbb{K} \) is divided into L non-overlapping groups, and each group corresponds to a topic subspace \( {\mathbb{K}}_l \). In addition, there is a ‘background topic subspace’ \( {\mathbb{K}}_B \).
Each label is then assigned a topic subspace \( {\mathbb{K}}_l \), and the background topic subspace \( {\mathbb{K}}_B \) is associated with a background label l_{B}. In this case, the label space is expanded to L + 1 dimensions and expressed as \( {\mathbb{L}}^{\prime } \). As in topic modeling of text with LabeledLDA, each topic can be represented as a multinomial distribution with parameter \( {\boldsymbol{\uptheta}}_k={\left\{{\theta}_{kw}\right\}}_{w=1}^W \) (the equivalent of the topic-word matrix in Fig. 1) on the vocabulary \( \mathbb{W} \), and θ_{k} follows a Dirichlet prior distribution with hyperparameter \( \boldsymbol{\uplambda} ={\left\{{\lambda}_w\right\}}_{w\in \mathbb{W}} \). What is different in PFTP is that each label l is represented as a multinomial distribution with parameter \( {\boldsymbol{\uppi}}_l={\left\{{\pi}_{lk}\right\}}_{k\in {\mathbb{K}}_l} \) (the equivalent of the label-topic probability in Fig. 1) on its topic subspace, where π_{lk} is the probability of topic k within the topic subspace \( {\mathbb{K}}_l \) corresponding to label l. We suppose that π_{l} follows a Dirichlet prior distribution with hyperparameter α.
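The partition of the global topic space can be sketched in Python as follows. The numbering scheme is an assumption made for illustration (topic 1 as the single background topic, then consecutive blocks of equal size per label); the function name `partition_topics` is hypothetical.

```python
def partition_topics(num_labels, topics_per_label):
    """Map each label to its own block of topic numbers; topic 1 is background."""
    background = [1]
    subspaces = {}
    next_topic = 2
    for label in range(1, num_labels + 1):
        subspaces[label] = list(range(next_topic, next_topic + topics_per_label))
        next_topic += topics_per_label
    return background, subspaces

# Sizes taken from the S.CCC example: 319 labels, three topics per label.
background, subspaces = partition_topics(num_labels=319, topics_per_label=3)
total_topics = len(background) + sum(len(t) for t in subspaces.values())
```

With 319 labels and three topics per label this layout yields 958 topics in total, and label 288 receives topics 863 to 865. These are the topic numbers reported for ‘GO0016020’ in Table 2, assuming that GO term is label 288; the numbering actually used in the experiments is not stated.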
We use a binary vector Λ_{d} to map the global label space \( {\mathbb{L}}^{\prime } \) to the label space \( {\mathbb{L}}_d \) of protein d: \( {\mathbb{L}}_d=\left\{l\in {\mathbb{L}}^{\prime }:{\Lambda}_{dl}=1\right\} \).
Λ_{d, L + 1} = 1 indicates that the latent background label l_{B} is assigned to every protein d. The probabilities of the \( {L}_d=\left|{\mathbb{L}}_d\right| \) labels of protein d are represented by a protein-label weight vector \( {\boldsymbol{\uppsi}}_d={\left\{{\psi}_{dl}\right\}}_{l\in {\mathbb{L}}_d}={\left\{{\psi}_{dl}{\Lambda}_{dl}\right\}}_{l\in {\mathbb{L}}^{\prime }} \), and ψ_{d} follows a Dirichlet prior distribution with hyperparameter β_{d}, constrained by β and Λ_{d}: \( {\boldsymbol{\upbeta}}_d={\left\{{\beta}_l{\Lambda}_{dl}\right\}}_{l\in {\mathbb{L}}^{\prime }} \).
In this paper, parameters shared by the whole protein set are called global parameters, and parameters specific to one protein are called local parameters.
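As a small illustration of how Λ_{d} restricts the label prior of one protein, the following Python sketch builds the binary vector and the masked hyperparameter. The helper name `label_mask`, the toy sizes and the prior value 0.1 are assumptions made for the example.

```python
import numpy as np

L = 5                          # number of real function labels (toy value)
beta = np.full(L + 1, 0.1)     # symmetric prior over the L + 1 labels (toy value)

def label_mask(observed_labels, num_labels):
    """Binary vector Lambda_d over L + 1 labels; the last entry is the background label."""
    mask = np.zeros(num_labels + 1, dtype=int)
    mask[[l - 1 for l in observed_labels]] = 1   # labels are numbered from 1
    mask[num_labels] = 1                         # the background label is always on
    return mask

lambda_d = label_mask([2, 4], L)   # protein d is annotated with labels 2 and 4
beta_d = beta * lambda_d           # prior restricted to the labels of protein d
L_d = int(lambda_d.sum())          # number of active labels, background included
```

Labels that are switched off receive a zero prior weight, so only the observed labels and the background label can be assigned to the word samples of protein d.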
The topic modeling process of PFTP
Based on the above definitions, the topic modeling process of PFTP is divided into three steps: BoW construction, model description (the generative process or graphical model) and parameter estimation (model training and prediction). These steps are depicted in Fig. 2.
As shown in Fig. 2, the PFTP model takes the BoW as input. Because we construct the protein BoW in exactly the same way as reference [4], this step is not repeated here. There are two ways to describe our topic model: the generative process and the graphical model. Once the model structure is identified, the joint distribution of the whole model is obtained. Based on this joint distribution, we can learn and infer the unknown parameters of our model, which are also the output of PFTP. These unknown parameters correspond to several matrices. For instance, \( {\boldsymbol{\uptheta}}_k={\left\{{\theta}_{kw}\right\}}_{w=1}^W \) represents the topic-word matrix in Fig. 2, and \( {\boldsymbol{\uppi}}_l={\left\{{\pi}_{lk}\right\}}_{k\in {\mathbb{K}}_l} \) represents the label-topic matrix in Fig. 2.
The second and third steps are discussed in the following sections. The third step includes two sub-steps for function prediction: model training and prediction. Both sub-steps require a learning and inference algorithm to estimate the parameters of the model, and are described in more detail below.
The process of model training
PFTP takes a training protein set with known functions as the input for model training. The unknown parameters are π_{l}, θ_{k} and ψ_{d}. The local hidden variables are the label number and topic number of each word sample. The unknown parameters and local hidden variables can be estimated by the inference algorithm during model training.
The process of model predicting
For unannotated proteins, the unknown local parameter ψ_{d} and the hidden variables are updated while the global parameters π_{l} and θ_{k} are held at the values estimated in training. The label probabilities of each protein are then obtained.
The description of PFTP model
According to the above definitions, the whole word sample set x is composed of the samples of every protein in the protein set, where \( {x}_d={\left\{{\mathbf{x}}_{dn}\right\}}_{n=1}^{N_d} \); that is, there are N_{d} word samples in protein d, and x_{dn} represents one word sample. Each word sample x_{dn} is not only associated with a word number w_{dn} (\( {\mathbf{w}}_{dn}\in \mathbb{W} \)), but is also assigned a label number l_{dn} (\( {\mathbf{l}}_{dn}\in \mathbb{L} \)) and a topic number z_{dn} (\( {\mathbf{z}}_{dn}\in \mathbb{K} \)).
The generative process of word sample can be described as follows. The corresponding graphical model is shown in Fig. 3.

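1. For each global label \( l\in {\mathbb{L}}^{\prime }=\left\{1,\dots, L,L+1\right\} \), sample the multinomial parameter vector π_{l} from a K_{l}-dimensional Dirichlet distribution: \( {\boldsymbol{\uppi}}_l\sim \mathrm{Dirichlet}\left(\alpha \right) \).

2. For each global topic \( k\in \mathbb{K}=\left\{1,\dots, K\right\} \), sample the multinomial parameter vector θ_{k} from a W-dimensional Dirichlet distribution: \( {\boldsymbol{\uptheta}}_k\sim \mathrm{Dirichlet}\left(\boldsymbol{\uplambda} \right) \).

3. For each local protein \( d\in \mathbb{D}=\left\{1,\dots, D\right\} \):

(a) Sample the label weight vector ψ_{d} of protein d from an L_{d}-dimensional Dirichlet distribution: \( {\boldsymbol{\uppsi}}_d\sim \mathrm{Dirichlet}\left({\boldsymbol{\upbeta}}_d\right) \), where \( {\boldsymbol{\upbeta}}_d={\left\{{\beta}_l{\Lambda}_{dl}\right\}}_{l\in {\mathbb{L}}^{\prime }} \).

(b) For each word sample x_{dn}:

i. Sample the label number l_{dn} of x_{dn} from an L_{d}-dimensional multinomial distribution with parameter ψ_{d}: \( {\mathbf{l}}_{dn}\sim \mathrm{Multinomial}\left({\boldsymbol{\uppsi}}_d\right) \).

ii. Sample the topic number z_{dn} of x_{dn} from a K-dimensional multinomial distribution with parameter \( {\boldsymbol{\uppi}}_{{\mathbf{l}}_{dn}} \): \( {\mathbf{z}}_{dn}\sim \mathrm{Multinomial}\left({\boldsymbol{\uppi}}_{{\mathbf{l}}_{dn}}\right) \).

iii. Sample the word number w_{dn} of x_{dn} from a W-dimensional multinomial distribution with parameter \( {\boldsymbol{\uptheta}}_{{\mathbf{z}}_{dn}} \): \( {\mathbf{w}}_{dn}\sim \mathrm{Multinomial}\left({\boldsymbol{\uptheta}}_{{\mathbf{z}}_{dn}}\right) \).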
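The generative steps can be simulated with a short NumPy sketch. All sizes, hyperparameter values and the topic layout below are toy assumptions; topics are numbered from 0 here for array indexing, and label 3 plays the role of the background label.

```python
import numpy as np

rng = np.random.default_rng(0)

W = 6                                        # vocabulary size (toy value)
K = 5                                        # number of global topics (toy value)
subspaces = {1: [1, 2], 2: [3, 4], 3: [0]}   # label -> topic ids; label 3 is background
alpha, lam, beta = 0.5, 0.5, 0.5             # symmetric hyperparameters (toy values)

# Step 1: one label-topic distribution per label, over its own topic subspace.
pi = {l: rng.dirichlet(np.full(len(ts), alpha)) for l, ts in subspaces.items()}
# Step 2: one topic-word distribution per global topic.
theta = rng.dirichlet(np.full(W, lam), size=K)

def generate_protein(active_labels, n_words):
    """Step 3: draw (label, topic, word) triples for one protein."""
    psi = rng.dirichlet(np.full(len(active_labels), beta))          # 3(a)
    samples = []
    for _ in range(n_words):                                         # 3(b)
        l = active_labels[rng.choice(len(active_labels), p=psi)]     # i.
        z = subspaces[l][rng.choice(len(subspaces[l]), p=pi[l])]     # ii.
        w = int(rng.choice(W, p=theta[z]))                           # iii.
        samples.append((l, z, w))
    return samples

# A protein annotated with label 1, plus the always-present background label.
protein = generate_protein(active_labels=[1, 3], n_words=10)
```

Every drawn topic belongs to the subspace of the label drawn just before it, which is the constraint that distinguishes this process from unsupervised LDA.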
Parameter estimation
In the PFTP model, the unknown parameters to be estimated are the global label multinomial parameters \( \boldsymbol{\uppi} ={\left\{{\boldsymbol{\uppi}}_l\right\}}_{l\in {\mathbb{L}}^{\prime }}={\left\{{\pi}_{lk}\right\}}_{l\in {\mathbb{L}}^{\prime },k\in {\mathbb{K}}_l} \), the global topic multinomial parameters \( \boldsymbol{\uptheta} ={\left\{{\boldsymbol{\uptheta}}_k\right\}}_{k\in \mathbb{K}}={\left\{{\theta}_{kw}\right\}}_{k\in \mathbb{K},w\in \mathbb{W}} \) and the local document label weight \( {\boldsymbol{\uppsi}}_d={\left\{{\psi}_{dl}\right\}}_{l\in {\mathbb{L}}_d} \); the local hidden variables are the document labels \( {L}_d={\left\{{\mathbf{l}}_{dn}\right\}}_{n=1}^{N_d} \) and topics \( {Z}_d={\left\{{\mathbf{z}}_{dn}\right\}}_{n=1}^{N_d} \); the known information consists of the observed label vector Λ_{d} and the word samples \( {W}_d={\left\{{\mathbf{w}}_{dn}\right\}}_{n=1}^{N_d} \). Their joint distribution is shown in Eq. (11):
Based on the joint distribution, several quantities can be estimated, including p(π, θ, ψ, L, Z | W, Λ, α, λ, β), the posterior distribution of the unknown model parameters and hidden variables. In this paper, we use collapsed Gibbs sampling (CGS) to train the PFTP model. By marginalizing the model parameters (π, θ, ψ) out of the joint distribution (11), the collapsed joint distribution of (L, Z, W) is obtained. The collapsed inference is as follows.
In the joint distribution Eq. (11), the function label weight ψ_{d} only appears in p(ψ_{d} | Λ_{d}, β_{d}) and p(L_{d} | ψ_{d}):
N_{dl} is the number of samples assigned to observed label \( l\in {\mathbb{L}}_d \) of protein d; C_{1} is the constant multinomial coefficient:
Let \( {\widehat{\beta}}_{dl}={\Lambda}_{dl}\left({\beta}_l+{N}_{dl}\right) \) and \( {\widehat{\psi}}_{dl}={\psi}_{dl}{\Lambda}_{dl} \). Integrating ψ_{d} out of Eq. (11) eliminates this parameter, and the marginal distribution of the local hidden variable L_{d} is shown below:
\( {N}_d={\sum}_{l\in {\mathbb{L}}^{\prime }}\kern0em {N}_{dl}{\Lambda}_{dl}={\sum}_{l\in {\mathbb{L}}_d}\kern0em {N}_{dl} \) is the number of observed samples of protein d. The integral of Eq. (14) satisfies probabilistic completeness:
Therefore, from Eq. (14), the predictive probability distribution for the label assignment l_{dn} = l of sample x_{dn} is:
\( {N}_{dl}^{\left(\backslash dn\right)} \) is the number of samples of protein d that are assigned to label l, excluding the current sample x_{dn}.
In the same way, in the joint distribution Eq. (11), the global label parameter π only appears in p(π | α) and p(Z_{d} | L_{d}, π).
N_{lk} represents the number of samples assigned to topic k of global label l; C_{2} is the constant multinomial coefficient:
Let \( {\widehat{\alpha}}_k={\alpha}_k+{N}_{lk} \). Integrating π out of Eq. (17) eliminates this parameter, and the marginal distribution of the local hidden variable Z is shown below:
\( {N}_l={\sum}_{k\in \mathbb{K}}\kern0em {N}_{lk} \) is the number of observed samples assigned to global label l in the protein set. The integral of Eq. (19) satisfies probabilistic completeness:
Therefore, from Eq. (19), the predictive probability distribution for the topic assignment k of sample x_{dn} in label l is:
\( {N}_{lk}^{\left(\backslash dn\right)} \) represents the number of samples assigned to topic k of global label l, excluding the current sample x_{dn}, and \( {N}_l^{\left(\backslash dn\right)}={\sum}_{k\in \mathbb{K}}\kern0em {N}_{lk}^{\left(\backslash dn\right)} \).
The integral over θ in Eq. (11) is the same as in LDA:
Then the predictive probability distribution for the word assignment w of topic k for the observed sample x_{dn} is:
\( {N}_{kw}^{\left(\backslash dn\right)} \) is the number of samples assigned to word w of topic k, excluding the current sample x_{dn}, and \( {N}_k^{\left(\backslash dn\right)}={\sum}_{w\in \mathbb{W}}\kern0em {N}_{kw}^{\left(\backslash dn\right)} \).
Given the above, the collapsed joint distribution of (L, Z, W) is obtained by integrating (π, θ, ψ) out, using Eqs. (14), (19) and (22).
To simplify computation, the Dirichlet prior distributions are taken to be symmetric Dirichlet distributions:
\( {\sum}_{l\in {\mathbb{L}}^{\prime }}\kern0em {\beta}_l{\Lambda}_{dl}={\sum}_{l\in {\mathbb{L}}_d}\kern0em {\beta}_l=\beta {L}_d \), \( {\sum}_{k\in \mathbb{K}}\kern0.1em {\alpha}_k=\alpha K \) and \( {\sum}_{w\in \mathbb{W}}\kern0.1em {\lambda}_w=\lambda W \) can then be substituted into Eq. (24):
The predictive probability distribution of the hidden variables z_{dn} and l_{dn} can then be computed from the collapsed joint distribution and used as the transition probability of the Markov chain over the state space. Through Gibbs sampling iterations, the Markov chain converges to the target stationary distribution after the burn-in period. Finally, by collecting sufficient samples from the converged Markov chain and averaging over them, we obtain posterior estimates of the corresponding parameters.
From Eqs. (16), (21) and (23), the predictive probability distribution for the word assignment w of topic k in label l for sample x_{dn} is:
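Because the conditional combines a label term, a topic term and a word term, one sampling step can be sketched in Python as below. This is an illustrative sketch under stated assumptions, not the implementation used in the experiments: the function name, the count containers and the toy counts are hypothetical, the priors are symmetric, the label-term denominator is dropped because it is the same for every candidate, and the topic term is normalised by the size of the label's own subspace (an assumption that follows from a K_{l}-dimensional Dirichlet prior on π_{l}).

```python
import numpy as np

def sample_label_topic(w, active_labels, subspaces, N_dl, N_lk, N_kw, N_k,
                       alpha, beta, lam, rng):
    """Draw (label, topic) for one word sample; counts must exclude that sample."""
    W = N_kw.shape[1]
    pairs, weights = [], []
    for l in active_labels:                      # only labels with Lambda_dl = 1
        topics = subspaces[l]
        N_l = sum(N_lk[l][k] for k in topics)
        for k in topics:                         # only topics in the subspace of l
            label_term = N_dl[l] + beta
            topic_term = (N_lk[l][k] + alpha) / (N_l + alpha * len(topics))
            word_term = (N_kw[k, w] + lam) / (N_k[k] + lam * W)
            pairs.append((l, k))
            weights.append(label_term * topic_term * word_term)
    weights = np.array(weights)
    return pairs[rng.choice(len(pairs), p=weights / weights.sum())]

# Toy counts for one protein with two active labels and a three-word vocabulary.
rng = np.random.default_rng(0)
subspaces = {1: [1, 2], 2: [0]}                  # label -> topic ids (0-based)
N_dl = {1: 3, 2: 1}                              # samples per label in this protein
N_lk = {1: {1: 2, 2: 1}, 2: {0: 1}}              # samples per (label, topic)
N_kw = np.array([[1, 0, 0], [0, 2, 0], [0, 0, 1]])   # samples per (topic, word)
N_k = N_kw.sum(axis=1)
label, topic = sample_label_topic(1, [1, 2], subspaces, N_dl, N_lk, N_kw, N_k,
                                  alpha=0.5, beta=0.5, lam=0.1, rng=rng)
```

In a full sampler this step would be wrapped in a loop that removes the current sample from all counts, draws a new (label, topic) pair, and adds the sample back under the new assignment.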
Results
Dataset
To investigate the performance of the proposed method, we use two kinds of datasets. The first is the S. cerevisiae dataset (S.C) proposed in [19], and the second is a human dataset that we constructed ourselves.
The S.C dataset contains several sub-datasets constructed from different characteristics of the yeast genome, and each sub-dataset uses two function annotation standards, FunCat and GO. We mainly use the sub-dataset based on the amino acid sequence of proteins and on GO. In addition, to compare the performance of PFTP across different numbers of labels, we construct a dataset named S.CCC from S.C, which only includes GO terms belonging to the cellular component ontology. Thus, two datasets are derived from S.C.
The human dataset is constructed from the Universal Protein Resource (UniProt) databank [2] in a way similar to reference [4]. We construct two human datasets with different word lengths: the maximum word length is two letters in the Human1 dataset and three letters in the Human2 dataset.
Because protein function datasets contain a large number of GO terms, we adopt a label space dimension reduction (LSDR) method to ease the classification task. Boolean Matrix Decomposition (BMD) has recently been studied for LSDR, and it allows the label space to be recovered conveniently after classification. Therefore, the BMD method proposed in reference [20] is applied to the S.C and Human datasets. The statistics of these datasets are displayed in Table 1, where ‘L’ represents the number of GO terms after BMD, ‘D’ denotes the number of proteins in each dataset, and ‘W’ denotes the size of the vocabulary.
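The recovery step of a Boolean decomposition can be illustrated with a short NumPy sketch. It shows only the generic Boolean matrix product that maps predictions in the reduced label space back to the original GO terms; how the factors are obtained is specific to the method of reference [20] and is not reproduced here, and the matrices below are toy values.

```python
import numpy as np

def boolean_product(C, B):
    """Boolean matrix product: entry (d, l) is 1 if any reduced label links d to l."""
    return (C.astype(int) @ B.astype(int) > 0).astype(int)

# Toy factors, illustrative only.
C = np.array([[1, 0], [0, 1], [1, 1]])        # proteins x reduced labels
B = np.array([[1, 1, 0, 0], [0, 0, 1, 0]])    # reduced labels x original GO terms
Y = boolean_product(C, B)                     # proteins x original GO terms
```

A classifier then only has to predict the columns of `C`, and the full annotation matrix is rebuilt from the fixed factor `B` afterwards.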
Parameter settings
The PFTP model involves three parameters: α, λ and K. α and λ are the parameters of the two Dirichlet distributions; the larger the value of λ, the more balanced the word probabilities within a topic. Based on experience, we set α = 50/K and λ = 200/W. The setting and impact of K are explained later.
In the Gibbs sampling process of model training, we run a single Markov chain for a maximum of 2000 iterations, of which the first 1000 are burn-in. On the converged chain, we record the state space every 50 iterations, giving 20 records. In model prediction, we run 1000 iterations; after 500 burn-in iterations, we record the state space every 50 iterations.
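The recording schedule can be checked with a few lines of Python. The sketch assumes that the first record is taken one interval after burn-in ends; the function name `recorded_iterations` is hypothetical.

```python
def recorded_iterations(total, burn_in, interval):
    """Iterations after burn-in at which the chain state is recorded."""
    return list(range(burn_in + interval, total + 1, interval))

training = recorded_iterations(total=2000, burn_in=1000, interval=50)
prediction = recorded_iterations(total=1000, burn_in=500, interval=50)
```

Under this assumption the training schedule gives the 20 records mentioned above (iterations 1050 to 2000), and the prediction schedule gives 10 records (iterations 550 to 1000).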
Evaluation criteria
In all of our experiments, we use three representative multi-label learning evaluation criteria: Hamming loss (HL), Average precision (AP) and One Error. We also use three kinds of area under the Precision-Recall curve proposed in reference [19], namely \( \overline{AUPRC} \), \( AU\left(\overline{PRC}\right) \) and \( \overline{AUPRCw} \). 5-fold cross-validation is adopted to assess the performance of PFTP and the comparison methods, and the average results of the 5 independent rounds are reported in the following sections.
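For readers unfamiliar with the first three criteria, the following NumPy sketch implements their standard multi-label definitions on a toy example. The 0.5 threshold used to binarise scores for Hamming loss and the toy values are assumptions; the evaluation code used in the experiments is not described in this paper.

```python
import numpy as np

def hamming_loss(y_true, y_pred):
    """Fraction of protein-label pairs that are predicted incorrectly."""
    return float(np.mean(y_true != y_pred))

def one_error(y_true, scores):
    """Fraction of proteins whose top-ranked label is not a true label."""
    top = np.argmax(scores, axis=1)
    return float(np.mean(y_true[np.arange(len(y_true)), top] == 0))

def average_precision(y_true, scores):
    """Mean over proteins of the precision at the rank of each true label."""
    per_protein = []
    for truth, s in zip(y_true, scores):
        order = np.argsort(-s)                 # labels from highest to lowest score
        hits = truth[order] == 1
        if not hits.any():
            continue                           # proteins without labels are skipped
        ranks = np.nonzero(hits)[0] + 1        # 1-based ranks of the true labels
        per_protein.append((np.arange(1, len(ranks) + 1) / ranks).mean())
    return float(np.mean(per_protein))

# Two proteins, three labels (toy values).
y_true = np.array([[1, 0, 1], [0, 1, 0]])
scores = np.array([[0.9, 0.2, 0.4], [0.6, 0.3, 0.1]])
y_pred = (scores >= 0.5).astype(int)

hl = hamming_loss(y_true, y_pred)
oe = one_error(y_true, scores)
ap = average_precision(y_true, scores)
```

Hamming loss and One Error are errors, so lower is better, whereas a higher Average precision indicates that true labels are ranked nearer the top.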
The impact of topic number on experimental results
K denotes the number of global topics, and this section analyzes its impact on model performance. According to the description of Section 2, PFTP allocates one or more latent topics to each GO term, so in theory K can range from L to infinity. Specifically, if we allocate only one topic to each GO term (K = L), the model reduces to LabeledLDA, and setting K < L leaves PFTP unable to discover the substructure of functions. In our experiments, every function is assigned the same number of topics for simplicity of computation. For example, if we set K = 3L, each GO term corresponds to a topic set with three topics. For these reasons, the lower bound of K is set to 2L. On the other hand, although in theory a larger K yields a more refined substructure of each label, incorporating more latent topics per function increases the computational load. Reference [18] discussed the impact of K on the effectiveness of the PLDA model on several text collections: as the number of topics grows, the performance of PLDA approaches a fixed value obtained by a non-parametric model. In other words, an ever larger number of topics does not bring ever greater performance, only an unbearable running time. Therefore, we set the upper bound of K to 5L, based on our empirical experience and an acceptable time overhead. In sum, K should be set to an integer between 2L and 5L. The performance of PFTP under different values of K is shown in Fig. 4.
As shown in Fig. 4, all of the evaluation criteria are relatively stable when K is between 2L and 4L. When K is greater than 4L, however, the values of AP, \( \overline{AUPRC} \), \( AU\left(\overline{PRC}\right) \) and \( \overline{AUPRCw} \) decrease as K increases, while Hamming loss and One Error slowly increase. These results suggest that the optimum range of K is 2L to 4L. The likely reason is that a lower K leaves fewer topics allocated to each label, whereas a higher K leaves only small differences in word distribution between topics. Moreover, the problem of a huge number of labels is particularly pronounced in protein function datasets, even after a BMD method has been applied to reduce the label dimension. Therefore, we set K to 3L in our experiments.
Evaluation against widely adopted methods
First, we compare PFTP with LabeledLDA [4] and multi-label k-nearest neighbor (MLKNN) [21] on four datasets. MLKNN is a representative multi-label classifier and is available in an open source tool called Mulan [22]. Figure 5 shows the HL, AP, One Error, \( AU\left(\overline{PRC}\right) \), \( \overline{AUPRC} \) and \( \overline{AUPRCw} \) values of these three models on the S.C, S.CCC, Human1 and Human2 datasets, respectively. For AP, \( AU\left(\overline{PRC}\right) \), \( \overline{AUPRC} \) and \( \overline{AUPRCw} \), the larger the value, the better the performance. Conversely, for HL and One Error, the smaller the value, the better the performance. The red asterisk in Fig. 5 marks the best result on each dataset.
As shown in Fig. 5, PFTP shows more advantages than LabeledLDA and MLKNN on the four datasets. A detailed analysis follows:
For the Human1 dataset, PFTP performs better on all evaluation criteria. On HL, PFTP achieves 9.7% and 2% improvements over LabeledLDA and MLKNN. On One Error, PFTP achieves 80% and 99% improvements over LabeledLDA and MLKNN. On AP, \( AU\left(\overline{PRC}\right) \), \( \overline{AUPRC} \) and \( \overline{AUPRCw} \), PFTP achieves 2.5%, 0.2%, 47% and 18% improvements over LabeledLDA, and 48%, 40%, 43% and 41% improvements over MLKNN. The improvements on \( \overline{AUPRC} \) and \( \overline{AUPRCw} \) over LabeledLDA are clearly larger than that on \( AU\left(\overline{PRC}\right) \).
For the Human2 dataset, PFTP performs better on four evaluation criteria, the exceptions being \( AU\left(\overline{PRC}\right) \) and \( \overline{AUPRC} \). On HL, PFTP achieves 30% and 7.9% improvements over LabeledLDA and MLKNN. On One Error, PFTP achieves 66% and 99% improvements over LabeledLDA and MLKNN. On AP and \( \overline{AUPRCw} \), PFTP achieves 3.3% and 0.2% improvements over LabeledLDA, and 40% and 29% improvements over MLKNN. On \( AU\left(\overline{PRC}\right) \) and \( \overline{AUPRC} \), however, MLKNN and LabeledLDA respectively get better results.
For the S.C dataset, PFTP performs better on four evaluation criteria, the exceptions being HL and One Error. On AP, \( \overline{AUPRC} \) and \( \overline{AUPRCw} \), PFTP achieves 2.8%, 22% and 16% improvements over LabeledLDA, and 48%, 17% and 32% improvements over MLKNN; on \( AU\left(\overline{PRC}\right) \), the results of LabeledLDA and PFTP are almost the same. On HL, however, MLKNN gets better results than PFTP, and on One Error the three methods obtain almost identical results.
For the S.CCC dataset, PFTP performs better on AP, \( \overline{AUPRC} \) and \( \overline{AUPRCw} \). On AP, PFTP achieves 2.6% and 27% improvements over LabeledLDA and MLKNN. On \( \overline{AUPRC} \), PFTP achieves 14% and 32% improvements over LabeledLDA and MLKNN. On \( \overline{AUPRCw} \), PFTP achieves 7.8% and 41% improvements over LabeledLDA and MLKNN.
In addition, we compare PFTP with three hierarchical multi-label classification (HMC) algorithms based on decision trees, namely CLUS-HMC, CLUS-SC (single-label classification) and CLUS-HSC (hierarchical single-label classification) [19]. These three algorithms have been studied on protein function prediction datasets and shown to be multi-label classifiers with strong performance. Since reference [19] reports results for CLUS-HMC/SC/HSC only on the S.C dataset, the comparison with PFTP is also restricted to the S.C dataset and is plotted in Fig. 6.
On \( \overline{AUPRC} \), our method shows a dominant advantage over all three comparison methods: the improvements are 85%, 85% and 84% over CLUS-SC, CLUS-HSC and CLUS-HMC, respectively. On \( AU\left(\overline{PRC}\right) \), PFTP achieves 65%, 51% and 32% improvements over CLUS-SC, CLUS-HSC and CLUS-HMC. On \( \overline{AUPRCw} \), however, CLUS-HMC gets better results than PFTP.
The topics discovered by PFTP
The greatest strength of our protein function topic modeling is that, it can not only provide the function label probability distribution over proteins as an output, but also each function label can be explained as a probability distribution over topic subset, where each topic is represented as the probability distribution over amino acid blocks. To better understand this topic modeling process, we take GO term ‘GO0016020’ as an example, whose corresponding topics are shown in Table 2.
As shown in Table 2, the 2mers BoW is used in this example. For LabeledLDA, the onetoone correspondence between label and word is the key design consideration. Therefore, ‘GO0016020’ only corresponds with a topic numbered 288, and also corresponds with a probability distribution over word. The top 20 words are listed from large to small order.
In the PFTP model, each GO term corresponds to one part of a partition of the global topic set. For the S.CCC dataset, for example, there are 319 function labels and each label is given three local topics, so there are 958 global topics in total (3 × 319 local topics plus one background topic). Each GO term therefore corresponds to four topics: three local topics and the shared background topic. For ‘GO0016020’, these are topics 863, 864, 865 and 1, where topic 1 is the background topic. As before, the top 20 words of each of these four topics are listed in descending order of probability.
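The topic numbers quoted above are consistent with a simple contiguous numbering of the global topic set, which the following Python sketch makes explicit. The numbering scheme itself is an assumption inferred from the reported values (topic 288 under LabeledLDA; topics 863, 864, 865 and 1 under PFTP), and the function names are hypothetical.

```python
def total_topics(num_labels, topics_per_label=3):
    """Size of the global topic set: local topics plus one background topic."""
    return num_labels * topics_per_label + 1


def topics_for_label(label_index, topics_per_label=3, background_topic=1):
    """Return the topic ids associated with a 1-based label index.

    Assumption: topic 1 is the shared background topic and the local
    topics of label 1, label 2, ... follow it in contiguous blocks.
    """
    if label_index < 1:
        raise ValueError("label_index must be 1-based")
    first = background_topic + 1 + (label_index - 1) * topics_per_label
    local_topics = list(range(first, first + topics_per_label))
    return local_topics + [background_topic]


if __name__ == "__main__":
    print(total_topics(319))       # 958 global topics for S.CCC
    print(topics_for_label(288))   # [863, 864, 865, 1]
```

Under this assumed numbering, the label whose LabeledLDA topic is 288 maps to local topics 863 to 865, which matches the topics listed for ‘GO0016020’.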
Discussions
The results in Figs. 5 and 6 indicate that PFTP has a significant advantage over several widely adopted multilabel classifiers.
Compared with traditional multilabel classifiers that are not topic models, our method further improves the accuracy of protein function prediction by introducing topic subsets into a supervised topic model, which allows it to discover a topic representing the common semantics of documents and to reflect the differences between labels and latent topics. The advantage on \( \overline{AUPRC} \) is especially pronounced against CLUSHMC/SC/HSC. We attribute this to the use of the BMD method on the dataset. \( \overline{AUPRC} \) averages accuracy over all function labels, so it is not biased toward the labels that annotate more proteins. GO terms that annotate fewer proteins are removed by BMD processing and recovered after prediction, without reducing prediction accuracy. In other words, the combination of PFTP and BMD can improve the average accuracy of protein function prediction.
Compared with LabeledLDA, PFTP is able to discover a more refined latent substructure of function labels. By introducing a topic subset for each label, PFTP reveals the relationships between functions and various words, and between labels and topics. Therefore, we can anticipate that PFTP is a potential method to reveal a deeper biological explanation for protein functions.
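The difference between the unweighted and the frequency-weighted averages of per-label AUPRC can be illustrated with a small Python sketch. It follows the definitions of \( \overline{AUPRC} \) and \( \overline{AUPRCw} \) in [19]; the label names, AUPRC values and annotation counts below are made up for illustration and are not results from this study.

```python
def mean_auprc(per_label_auprc):
    """Unweighted mean of per-label AUPRC: every label counts equally."""
    return sum(per_label_auprc.values()) / len(per_label_auprc)


def weighted_mean_auprc(per_label_auprc, label_counts):
    """Mean of per-label AUPRC weighted by the number of annotated proteins."""
    total = sum(label_counts[label] for label in per_label_auprc)
    return sum(
        per_label_auprc[label] * label_counts[label]
        for label in per_label_auprc
    ) / total


if __name__ == "__main__":
    # Hypothetical values: one frequent label and two rare labels.
    auprc = {"frequent": 0.9, "rare_a": 0.3, "rare_b": 0.3}
    counts = {"frequent": 80, "rare_a": 10, "rare_b": 10}
    print(round(mean_auprc(auprc), 2))                   # 0.5
    print(round(weighted_mean_auprc(auprc, counts), 2))  # 0.78
```

In this toy case the rare labels pull the unweighted mean down to 0.5, while the weighted mean stays at 0.78, so gains on labels that annotate few proteins show up mainly in the unweighted average.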
The performance on the different datasets is also compared in Fig. 4. For the S.CCC dataset, the values of the six evaluation criteria vary relatively smoothly. This may be because S.CCC has fewer labels, so changing the value of K has little impact on prediction performance. Comparing S.C with S.CCC, we find that AP, \( AU\left(\overline{PRC}\right) \), \( \overline{AUPRC} \) and \( \overline{AUPRCw} \) are lower on S.C than on S.CCC, while OneError and HL are almost equal on the two datasets. This is because the two datasets share the same word space but differ in the number of labels: the fewer labels of S.CCC lead to higher classification performance. Comparing Human1 with Human2, we find that \( \overline{AUPRC} \) and \( \overline{AUPRCw} \) are higher on Human1 than on Human2, AP is lower on Human1 than on Human2, and OneError, HL and \( AU\left(\overline{PRC}\right) \) are almost equal on the two datasets. These results show that the classification performance of PFTP on Human1 and Human2 is almost the same, which suggests that a larger word space does not necessarily yield better classification performance.
Conclusions
In this paper, we introduced an improved multilabel supervised topic model for predicting protein function. In our previous study, the multilabel supervised topic model LabeledLDA was applied to protein function prediction; it associates each label (GO term) directly with a single corresponding topic. This design makes the latent topics completely degenerate and ignores the differences between labels and latent topics. To address this shortcoming, we proposed the Partially Function-to-Topic Prediction (PFTP) model, which introduces a local topic subset corresponding to each function label. PFTP supports not only latent topic subsets within given function labels but also a background topic corresponding to a ‘fake’ function label. In a 5-fold cross-validation experiment on predicting protein function, PFTP significantly outperforms the compared methods. Owing to its more refined modeling of function labels, PFTP shows effectiveness and potential value in predicting protein function in our experimental studies. Several aspects of topic modeling for protein function prediction remain to be improved, such as the introduction of additional protein features and of the hierarchical structure of function labels. Nevertheless, multilabel topic models are a promising approach for many applications in bioinformatics.
Abbreviations
AP: Average precision
BMD: Boolean Matrix Decomposition
BoW: Bag of Words
CGS: Collapsed Gibbs sampling
GO: Gene Ontology
HL: Hamming loss
HMC: Hierarchal Multilabel Classification
HSC: Hierarchical Singlelabel Classification
LDA: Latent Dirichlet Allocation
LSDR: Label Space Dimension Reduction
MLKNN: Multilabel K-nearest neighbor
PFTP: Partially Function-to-Topic Prediction
PLDA: Partially Labeled LDA
PLSA: Probabilistic Latent Semantic Analysis
S.C: S. cerevisiae
SC: Singlelabel Classification
SVM: Support Vector Machine
UniProt: Universal Protein Resource
References
Weaver RF. Molecular biology (WCB Cell & Molecular Biology). 5th ed. New York: McGraw-Hill Education; 2011.
Consortium UP. UniProt: the universal protein knowledgebase. Nucleic Acids Res. 2016;45(D1):D158–69.
Berman HM, Battistuz T, Bhat TN. The protein data Bank. Berlin: Atomic evidence: Springer International Publishing; 2016. p. 218–22.
Liu L, Tang L, He L, Wei Z, Shaowen Y. Predicting protein function via multilabel supervised topic model on gene ontology. Biotechnol Biotechnol Equip. 2017;31(1):1–9.
Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–402.
Gene Ontology Consortium. The gene ontology (GO) database and informatics resource. Nucleic Acids Res. 2004;32(Suppl 1):D258–61.
Cao R, Cheng J. Integrated protein function prediction by mining function associations, sequences, and protein–protein and gene–gene interaction networks. Methods. 2016;93:84–91.
Erdin S, Venner E, Lisewski AM, Lichtarge O. Function prediction from networks of local evolutionary similarity in protein structure. BMC Bioinformatics. 2013;14(3):S6.
Yu G, Rangwala H, Domeniconi C, Zhang G, Zhang Z. Predicting protein function using multiple kernels. IEEE/ACM Trans Comput Biol Bioinf. 2015;12(1):219–33.
Fodeh S, Tiwari A, Yu H. Exploiting PubMed for protein molecular function prediction via NMF based multilabel classification. In: Proceeding of international conference on data mining workshops. 2017 IEEE conference on; 2017. p. 446–51.
However. Orderly roulette selection based ant Colony algorithm for hierarchical multilabel protein function prediction. Math Probl Eng. 2017;2017(2):1–15.
Wang H, Yan L, Huang H, Ding C. From protein sequence to protein function via multilabel linear discriminant analysis. IEEE/ACM Trans Comput Biol Bioinform. 2017;14(3):503–13.
Pinoli P, Chicco D, Masseroli M. Enhanced probabilistic latent semantic analysis with weighting schemes to predict genomic annotations. In: Proceeding of the 13th international conference on bioinformatics and bioengineering (BIBE). 2013 IEEE conference on; 2013. p. 1–4.
Masseroli M, Chicco D, Pinoli P. Probabilistic latent semantic analysis for prediction of gene ontology annotations. In: Proceeding of international joint conference on neural networks (IJCNN). 2012 IEEE conference on; 2012. p. 1–8.
Pinoli P, Chicco D, Masseroli M. Latent Dirichlet allocation based on Gibbs sampling for gene function prediction. In: Proceeding of international conference on computational intelligence in bioinformatics and computational biology. 2014 IEEE conference on; 2014. p. 1–8.
Dumais ST. Latent semantic analysis. Ann Rev Inf Sci Technol. 2004;38(1):188–230.
Blei DM, Ng AY, Jordan MI. Latent Dirichlet allocation. J Mach Learn Res. 2003;3:993–1022.
Ramage D, Manning CD, Dumais S. Partially labeled topic models for interpretable text mining. In: International conference on knowledge discovery and data mining, 2011 ACM conference on; 2011. p. 457–65.
Vens C, Struyf J, Schietgat L, Džeroski S, Blockeel H. Decision trees for hierarchical multilabel classification. Mach Learn. 2008;73(2):185–214.
Sun Y, Ye S, Sun Y, Kameda T. Improved algorithms for exact and approximate Boolean matrix decomposition. In: International conference on data science and advanced analytics, 2015 IEEE conference on; 2015. p. 1–10.
Zhang M, Zhou Z. MLKNN: a lazy learning approach to multilabel learning. Pattern Recogn. 2007;40(7):2038–48.
Tsoumakas G, Katakis I, Vlahavas I. Mining multilabel data. In: Maimon O, Rokach L, editors. Data mining and knowledge discovery handbook. New York: Springer US; 2009. p. 667–85.
Acknowledgements
We would like to thank the researchers at the State Key Laboratory of Conservation and Utilization of Bioresources, Yunnan University, Kunming, China. Their very helpful comments and suggestions have led to an improved version of the paper.
Funding
This research was supported by the National Natural Science Foundation of China (no. 61862067, no. 61363021) and the Doctor Science Foundation of Yunnan Normal University (no. 01000205020503090, no. 2016zb009). Publication costs are funded by the Doctor Science Foundation of Yunnan Normal University (no. 2016zb009).
Availability of data and materials
The data and source code are available upon request.
About this supplement
This article has been published as part of BMC Genomics Volume 19 Supplement 10, 2018: Proceedings of the 29th International Conference on Genome Informatics (GIW 2018): genomics. The full contents of the supplement are available online at https://bmcgenomics.biomedcentral.com/articles/supplements/volume19supplement10.
Author information
Contributions
LT and WZ conceived the study and revised the manuscript. LL analyzed the materials and literature, and drafted the manuscript. LT and MT participated in the literature analysis. All authors have read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Liu, L., Tang, L., Tang, M. et al. A partially functiontotopic model for protein function prediction. BMC Genomics 19, 883 (2018). https://doi.org/10.1186/s1286401852767
Keywords
 Multilabel classification
 Topic model
 Protein function
 Probability distribution