A partially function-to-topic model for protein function prediction

Background: Proteins are macromolecules and the main component of a cell, which makes them the most essential and versatile material of life. Research on protein function is of great significance in decoding the secrets of life. In recent years, researchers have introduced multi-label supervised topic models such as Labeled Latent Dirichlet Allocation (Labeled-LDA) into protein function prediction, which can yield more accurate and more interpretable predictions. However, Labeled-LDA associates each label (GO term) directly with a single corresponding topic, which makes the latent topics completely degenerate and ignores the differences between labels and latent topics. Results: To model function labels more accurately in probabilistic terms, we propose a Partially Function-to-Topic Prediction (PFTP) model, which introduces a local topic subset for each function label. PFTP supports not only a subset of latent topics within each function label but also a background topic corresponding to a 'fake' function label, which represents the common semantics of protein function. The related definitions and the topic modeling process of PFTP are described in this paper. In a 5-fold cross-validation experiment on yeast and human datasets, PFTP significantly outperforms five widely adopted methods for protein function prediction. The impact of model parameters on prediction performance and the latent topics discovered by PFTP are also discussed. Conclusion: The experimental results provide evidence that PFTP is effective and has potential value for predicting protein function. Because it can discover a more refined latent sub-structure of function labels, we anticipate that PFTP can help reveal a deeper biological explanation of protein functions.


Background
Proteins are the main component of a cell and underlie the basic activities of cellular life. Research on protein function is of great significance in elucidating the phenomena of life [1]. Although a large number of protein sequences have been deposited in biological databases in recent years [2,3], only a small percentage of these proteins have experimental function annotations because of the high cost of biochemical experiments. In contrast to biochemical experiments, computational methods predict the functional annotations of proteins from known information, such as sequence, structure, and functional behavior. This reduces time and effort, and such methods have become an important long-standing line of research in the post-genomic era [4].
The earliest computational approaches to predicting protein function, such as BLAST [5], use protein sequence or structure similarity to transfer functional information. With the rapid development of computational algorithms, an increasing variety of algorithms has been introduced into protein function prediction. At present, computational methods for protein function prediction fall into two types: classification-based approaches and graph-based approaches. In classification-based approaches, proteins are viewed as instances to be classified, and function annotations (such as Gene Ontology (GO) [6] terms) are regarded as labels. Each protein has a feature space composed of classification features extracted from the amino acid sequence, textual repositories, and so on. Based on the annotated proteins and their features, a classifier is trained on the training dataset and then used to predict function labels for unannotated proteins. In graph-based approaches, the network structure information of proteins is used to compute the distance between proteins, and closely related proteins are then considered to have similar functional annotations [7,8].
In classification-based approaches, since each protein is annotated with several functions, various multi-label classifiers can be adopted. Yu et al. [9] proposed a multiple kernel method (ProMK) that processes multiple heterogeneous protein data sources to predict protein functions; Fodeh et al. [10] used binary relevance with different classifiers to automatically assign molecular functions to genes; a new ant colony optimization algorithm was proposed in reference [11] and applied to a protein function dataset; Wang et al. [12] applied a new multi-label linear discriminant analysis approach to the protein function prediction problem; Liu et al. [4] introduced a multi-label supervised topic model called Labeled-LDA into protein function prediction, and their experimental results on yeast and human datasets demonstrated the effectiveness of Labeled-LDA for this task. That study was the first effort to apply a multi-label supervised topic model to protein function prediction. In addition, Pinoli et al. [13][14][15] applied two standard topic models, latent Dirichlet allocation (LDA) and probabilistic latent semantic analysis (PLSA) [16,17], to predict GO terms of proteins on the basis of available GO annotations.
In the topic modeling process of reference [4], each protein is viewed as a mixture of 'topics', where each 'topic' is in turn viewed as a mixture of amino acid blocks. In comparison with a discriminative model such as the support vector machine (SVM), a multi-label supervised topic model can transform the word-level statistics of each document into its label-level distribution, and it models all labels simultaneously rather than treating each label independently. In particular, a topic model can provide the function label probability distribution over proteins as an output, and each function label is explained as a probability distribution over amino acid blocks. Nonetheless, in the study of Liu et al. [4], Labeled-LDA associates each label (GO term) directly with a single corresponding topic, which makes the latent topics completely degenerate and ignores the differences between labels and latent topics. Therefore, Labeled-LDA cannot discover a topic that represents the common semantics of protein functions. For interpretable text mining, Ramage et al. [18] proposed partially labeled LDA (PLDA), which associates each label with a topic subset partitioned from the global topic set. PLDA overcame the shortfalls of Labeled-LDA and improved the precision of text classification in their experiments.
Inspired by the application of multi-label topic models to protein function prediction and by the PLDA model, we introduce a Partially Function-to-Topic Prediction model (PFTP). First, we give the related definitions by contrasting text data with protein function data. The topic modeling process of PFTP is then described in detail, including its generative process and parameter estimation. In a 5-fold cross-validation experiment on predicting protein function, PFTP significantly outperforms the five algorithms it is compared with. The experimental results provide evidence that PFTP is effective and has potential value for predicting protein function.

Related definitions and notations
To clarify the objects involved in topic modeling, the correspondence between protein function prediction and multi-label classification of text is first depicted in Fig. 1.
Fig. 1 displays the topic modeling concepts for protein function data on the left and for text data on the right. The text dataset is composed of several documents numbered D1 to Dn, and the protein function dataset is composed of several protein sequences numbered P1 to Pn. Words such as 'table' and 'database' are the main components of a document. A protein sequence, in turn, is treated as a text string defined over a fixed alphabet of 20 amino acids (G, A, V, L, I, F, P, Y, S, C, M, N, Q, T, D, E, K, R, H, W), so amino acid blocks such as 'MS' and 'TS' are the main components of a protein sequence. Moreover, a protein annotated with GO terms is equivalent to a document labeled with tags, so each GO term or tag can be viewed as a label, such as 'GO0003673' and 'language'. There are therefore three equivalence relations between protein function data and text data: protein sequence and document, amino acid block and word, GO term and document tag. In general, the GO terms (document tags), protein sequences (documents) and amino acid blocks (words) are the observable data of a dataset.
As the input to the topic model, the bag of words (BoW) is constructed by computing the word-document matrix, in which each element counts the occurrences of a word in a document. For instance, the word 'table' appears twice in document D1. Likewise, for protein function data, an amino acid block-protein sequence matrix is computed to construct the protein BoW. For example, the amino acid block 'MS' appears once in protein P1. The fixed set of amino acid blocks or words is also called the 'vocabulary'.
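The construction of the protein BoW can be illustrated with a minimal Python sketch. It is not the procedure of reference [4]: the function names are hypothetical, the toy sequences are invented, and the sketch assumes that amino acid blocks are overlapping substrings of length 1 up to a maximum word length.

```python
from collections import Counter


def count_blocks(sequence, max_len=2):
    """Count every contiguous amino acid block of length 1..max_len.

    Assumption: blocks are overlapping substrings (k-mers). The source
    defers the exact construction to reference [4].
    """
    counts = Counter()
    for k in range(1, max_len + 1):
        for i in range(len(sequence) - k + 1):
            counts[sequence[i:i + k]] += 1
    return counts


def build_bow(proteins, max_len=2):
    """Return (vocabulary, matrix); matrix[d][w] counts block w in protein d."""
    per_protein = [count_blocks(seq, max_len) for seq in proteins]
    vocabulary = sorted(set().union(*per_protein))
    matrix = [[counts[block] for block in vocabulary] for counts in per_protein]
    return vocabulary, matrix


# Toy sequences, invented for illustration only.
proteins = ["MSTS", "MSMS"]
vocabulary, matrix = build_bow(proteins)
```

Each row of `matrix` is the BoW vector of one protein, and `vocabulary` plays the role of the fixed set of amino acid blocks.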
In a topic model, a 'topic' is a probability distribution over a fixed vocabulary. For the text data, for example, the probability of the word 'table' under 'topic 1' is 0.05; for the protein data, the probability of the amino acid block 'MS' under 'topic 1' is 0.21. Topics are latent and must be inferred by topic modeling. Finally, to establish the connection between labels and topics, the latent topics discovered by PFTP are divided into several non-overlapping subsets, each of which is associated with a label. As can be seen in Fig. 1, the whole topic set is split into several groups: 'label1' is connected with 'topic 1' to 'topic 3', 'label2' is connected with 'topic 4' to 'topic 5', and so on. It is worth noting that PFTP defines a special type of topic, the background topic. The background topics are split off from the whole latent topic set and are not associated with any observable label; they express the common semantics of documents. On a text dataset, for instance, the background topics may assign high probability to universal words such as 'text' and 'other'. To formalize the above description, the related notation is given below.
Suppose there are D proteins in the protein set, which compose the protein space $\mathcal{D} = \{1, \ldots, D\}$, and the vocabulary of amino acid blocks is the space $\mathcal{W} = \{1, \ldots, W\}$, where W is the size of the vocabulary. The topic space, which includes K topics, is represented by $\mathcal{K} = \{1, \ldots, K\}$ and is shared by the whole protein set. Therefore, $\mathcal{K}$ is also called the global topic space. The protein function label space is expressed as $\mathcal{L} = \{1, \ldots, L\}$.

Fig. 1 The corresponding relationship between protein function prediction and multi-label supervised topic modeling of text. The protein function data is shown on the left side, and the text data is shown on the right side.

In the PFTP model, the global topic space $\mathcal{K}$ is divided into L non-overlapping groups, and each group corresponds to a topic subspace $\mathcal{K}_l$. In addition, there is a background topic subspace $\mathcal{K}_B$.
Each label is then assigned a topic subspace $\mathcal{K}_l$, and the background topic subspace $\mathcal{K}_B$ is associated with a background label $l_B$. The label space is thereby expanded to L + 1 dimensions and denoted $\mathcal{L}'$. As in the topic modeling of text in Labeled-LDA, each topic is represented as a multinomial distribution with parameter $\theta_k = \{\theta_{kw}\}_{w=1}^{W}$ (the equivalent of the topic-word matrix in Fig. 1) over the vocabulary $\mathcal{W}$, and $\theta_k$ follows a Dirichlet prior with hyperparameter $\lambda = \{\lambda_w\}_{w \in \mathcal{W}}$. What distinguishes PFTP is that each label l is represented as a multinomial distribution with parameter $\pi_l = \{\pi_{lk}\}_{k \in \mathcal{K}_l}$ (the equivalent of the label-topic probabilities in Fig. 1) over its topic subspace, where $\pi_{lk}$ is the probability of topic k within the topic subspace $\mathcal{K}_l$ of label l. We assume that $\pi_l$ follows a Dirichlet prior with hyperparameter $\alpha$.
We use a binary vector $\Lambda_d$ to map the global label space $\mathcal{L}'$ to the label set $\mathcal{L}_d$ of protein d; $\Lambda_{d,L+1} = 1$ indicates that the latent background label $l_B$ is assigned to every protein d. The probabilities of the $L_d = |\mathcal{L}_d|$ labels of protein d are represented by the protein-label weight $\psi_d = \{\psi_{dl}\}_{l \in \mathcal{L}_d} = \{\psi_{dl}\Lambda_{dl}\}_{l \in \mathcal{L}'}$, and $\psi_d$ follows a Dirichlet prior with hyperparameter $\beta_d$, which is constrained by $\beta$ and $\Lambda_d$. In this paper, parameters shared by the whole protein set are called global parameters, and parameters specific to one protein are called local parameters.
The topic modeling process of PFTP
Based on the definitions above, the process of PFTP topic modeling is divided into three steps: BoW construction, description of the model (the generative process or graphical model), and parameter estimation (model training and prediction). These steps are depicted in Fig. 2.
As shown in Fig. 2, the PFTP model takes the BoW as input. Because we construct the protein BoW in exactly the same way as reference [4], this step is not repeated here. The topic model can be described in two ways: by its generative process and by its graphical model. Once the model structure is identified, the joint distribution of the whole model is obtained. Based on this joint distribution, we can learn and infer the unknown parameters of the model, which are also the output of PFTP. These unknown parameters correspond to several matrices. For instance, $\theta_k = \{\theta_{kw}\}_{w=1}^{W}$ represents the topic-word matrix in Fig. 2, and $\pi_l = \{\pi_{lk}\}_{k \in \mathcal{K}_l}$ represents the label-topic matrix in Fig. 2.
The second and third steps are discussed in the following sections. It is worth noting that the third step includes two sub-steps that together realize function prediction: model training and prediction. Both sub-steps rely on a learning and inference algorithm to estimate the model parameters, and they are described in more detail below.

The process of model predicting
For unannotated proteins, the unknown local parameter $\psi_d$ and the local hidden variables are updated while the global parameters $\pi_l$ and $\theta_k$ are held fixed at their estimated values. The label probabilities for each protein are then obtained.

The description of PFTP model
According to the above definitions, the whole word sample x is composed of the word samples of every protein in the protein set: there are $N_d$ word samples in protein d, and $x_{dn}$ denotes one of them. Each word sample $x_{dn}$ is associated not only with a word number $w_{dn}$ ($w_{dn} \in \mathcal{W}$) but also with a label number $l_{dn}$ ($l_{dn} \in \mathcal{L}$) and a topic number $z_{dn}$ ($z_{dn} \in \mathcal{K}$).
The generative process of the word samples can be described as follows. The corresponding graphical model is shown in Fig. 3.

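1. For each global label $l \in \mathcal{L}$: sample the multinomial parameter vector $\pi_l$ from a $K_l$-dimensional Dirichlet distribution, $\pi_l \sim \mathrm{Dir}(\alpha)$.
2. For each global topic $k \in \mathcal{K} = \{1, \ldots, K\}$: sample the multinomial parameter vector $\theta_k$ from a W-dimensional Dirichlet distribution, $\theta_k \sim \mathrm{Dir}(\lambda)$.
3. For each local protein $d \in \mathcal{D} = \{1, \ldots, D\}$:
(a) Sample the label weight vector of protein d from an $L_d$-dimensional Dirichlet distribution, $\psi_d \sim \mathrm{Dir}(\beta_d)$, where $\beta_d$ is constrained by $\beta$ and $\Lambda_d$.
(b) For each word sample $x_{dn}$ of protein d:
i. Sample the label number $l_{dn}$ of $x_{dn}$ from the $L_d$-dimensional multinomial distribution with parameter $\psi_d$: $l_{dn} \sim \mathrm{Mult}(\psi_d)$.
ii. Sample the topic number $z_{dn}$ of $x_{dn}$ from the multinomial distribution with parameter $\pi_{l_{dn}}$: $z_{dn} \sim \mathrm{Mult}(\pi_{l_{dn}})$.
iii. Sample the word number $w_{dn}$ of $x_{dn}$ from the W-dimensional multinomial distribution with parameter $\theta_{z_{dn}}$: $w_{dn} \sim \mathrm{Mult}(\theta_{z_{dn}})$.

Parameter estimation
In the PFTP model, the unknown parameters to be estimated are the global label multinomial parameters $\pi = \{\pi_l\}_{l \in \mathcal{L}'} = \{\pi_{lk}\}_{l \in \mathcal{L}', k \in \mathcal{K}_l}$, the global topic multinomial parameters $\theta = \{\theta_k\}_{k \in \mathcal{K}} = \{\theta_{kw}\}_{k \in \mathcal{K}, w \in \mathcal{W}}$ and the local protein label weights $\psi_d = \{\psi_{dl}\}_{l \in \mathcal{L}_d}$. The local hidden variables are the label assignments $L_d = \{l_{dn}\}_{n=1}^{N_d}$ and the topic assignments $Z_d = \{z_{dn}\}_{n=1}^{N_d}$. The joint distribution of the parameters and hidden variables is given by Eq. (11). Based on the joint distribution, several quantities can be estimated, including $p(\pi, \theta, \psi, L, Z \mid W, \Lambda, \alpha, \lambda, \beta)$, the posterior distribution of the unknown model parameters and hidden variables. In this paper, we use collapsed Gibbs sampling (CGS) to train the PFTP model. By marginalizing the model parameters $(\pi, \theta, \psi)$ out of the joint distribution (11), the collapsed joint distribution of (L, Z, W) is obtained. The collapsed inference is derived as follows.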
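For reference, the generative process listed at the start of this section can be simulated with a short Python sketch. It is an illustration under stated assumptions, not the authors' code: the function names, the toy labels and all numeric values are hypothetical, the Dirichlet priors are taken to be symmetric, and topic identifiers are 0-based integers so that they can index a list.

```python
import random


def sample_dirichlet(rng, alphas):
    """Draw from a Dirichlet distribution by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [x / total for x in draws]


def sample_index(rng, probs):
    """Draw one index from a discrete distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate(label_sets, topics_of, n_words, W, alpha, lam, beta, seed=0):
    """Simulate (label, topic, word) triples for every protein.

    label_sets[d]: labels active for protein d (background label included)
    topics_of[l] : global topic ids in the topic subset of label l
    n_words[d]   : number of word samples N_d
    """
    rng = random.Random(seed)
    n_topics = sum(len(ts) for ts in topics_of.values())
    # Step 1: each label gets a multinomial over its own topic subset.
    pi = {l: sample_dirichlet(rng, [alpha] * len(ts))
          for l, ts in topics_of.items()}
    # Step 2: each global topic gets a multinomial over the vocabulary.
    theta = [sample_dirichlet(rng, [lam] * W) for _ in range(n_topics)]
    corpus = []
    for labels, n in zip(label_sets, n_words):
        # Step 3(a): label weights restricted to the protein's own labels.
        psi = sample_dirichlet(rng, [beta] * len(labels))
        samples = []
        for _ in range(n):
            l = labels[sample_index(rng, psi)]          # step 3(b) i
            k = topics_of[l][sample_index(rng, pi[l])]  # step 3(b) ii
            w = sample_index(rng, theta[k])             # step 3(b) iii
            samples.append((l, k, w))
        corpus.append(samples)
    return corpus


# Toy configuration, invented for illustration only.
topics_of = {"background": [0], "GO:A": [1, 2], "GO:B": [3, 4]}
label_sets = [["background", "GO:A"], ["background", "GO:A", "GO:B"]]
corpus = generate(label_sets, topics_of, n_words=[6, 8], W=5,
                  alpha=1.0, lam=0.5, beta=1.0)
```

Because a label can only emit topics from its own subset, every sampled topic in `corpus` belongs to the subset of the label drawn for that word sample.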
In the joint distribution of Eq. (11), consider first the function label weight $\psi_d$. Here $N_{dl}$ is the number of samples assigned to the observed label $l \in \mathcal{L}_d$ of protein d, and $C_1$ is the constant multinomial coefficient. The parameter $\psi_d$ is eliminated by integrating it out of Eq. (11), which gives the marginal distribution of the local hidden variable $L_d$ in Eq. (14). From Eq. (14), the predictive probability distribution for the label assignment $l_{dn} = l$ of sample $x_{dn}$ follows as Eq. (16), where $N_{dl}^{(\neg dn)}$ is the number of samples assigned to label l, not counting the current sample $x_{dn}$.
In the same way, in the joint distribution of Eq. (11), the global label parameter $\pi$ appears only in $p(\pi \mid \alpha)$ and $p(Z_d \mid L_d, \pi)$.
$N_{lk}$ is the number of samples assigned to topic k of global label l, and $C_2$ is the constant multinomial coefficient. Let $\tilde{\alpha}_k = \alpha_k + N_{lk}$. The parameter $\pi$ is eliminated by integrating it out of Eq. (17), which gives the marginal distribution of the local hidden variable Z in Eq. (19). Here $N_l = \sum_{k \in \mathcal{K}} N_{lk}$ is the number of observed samples assigned to global label l in the protein set. The integral in Eq. (19) satisfies probabilistic completeness. Therefore, from Eq. (19), the predictive probability distribution for the topic assignment k of sample $x_{dn}$ in label l is given by Eq. (21). The predictive probability distribution for the word assignment w of topic k for the observed sample $x_{dn}$ is then given by Eq. (23), where $N_{kw}^{(\neg dn)}$ is the number of samples assigned to word w of topic k, not counting the current sample $x_{dn}$, and $N_k^{(\neg dn)} = \sum_{w \in \mathcal{W}} N_{kw}^{(\neg dn)}$. Given the above, the collapsed joint distribution of (L, Z, W) is obtained by integrating out $(\pi, \theta, \psi)$ in Eqs. (14), (19) and (22).
To simplify computation, the Dirichlet priors are taken to be symmetric Dirichlet distributions. The predictive probability distributions of the hidden variables $z_{dn}$ and $l_{dn}$ can then be computed from the collapsed joint distribution and used as the state transition probabilities of the Markov chain. Through Gibbs sampling iterations, the Markov chain converges to the target stationary distribution after the burn-in period. Finally, by collecting sufficient samples from the state space of the converged Markov chain and averaging over them, we obtain posterior estimates of the corresponding parameters.
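The collect-and-average step can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the default schedule mirrors the training settings reported in the Parameter settings section (2000 iterations, 1000 of them burn-in, one record every 50 iterations), taking the first record one interval after burn-in is an assumption, and the smoothed point estimate of θ is the standard collapsed Gibbs estimate under a symmetric prior λ rather than a formula recovered from the source.

```python
def record_iterations(max_iter=2000, burn_in=1000, interval=50):
    """Iterations at which the chain state is recorded.

    Assumption: the first record is taken one interval after burn-in.
    """
    return list(range(burn_in + interval, max_iter + 1, interval))


def estimate_theta(N_kw, lam):
    """Point estimate of topic-word probabilities from one recorded state.

    N_kw[k][w] is the number of samples assigned to word w of topic k.
    """
    W = len(N_kw[0])
    return [[(n + lam) / (sum(row) + W * lam) for n in row] for row in N_kw]


def average_estimates(estimates):
    """Element-wise mean of several equally shaped matrices."""
    n = len(estimates)
    rows, cols = len(estimates[0]), len(estimates[0][0])
    return [[sum(m[i][j] for m in estimates) / n for j in range(cols)]
            for i in range(rows)]


schedule = record_iterations()
# Two toy recorded count matrices (one topic, two words), invented here.
theta_a = estimate_theta([[1, 3]], lam=0.5)
theta_b = estimate_theta([[2, 2]], lam=0.5)
theta_mean = average_estimates([theta_a, theta_b])
```

With the default arguments the schedule contains 20 recording points, which matches the 20 records described for model training.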
From Eqs. (16), (21) and (23), the predictive probability distribution for the word assignment w of topic k in label l for sample $x_{dn}$ is obtained by combining these three predictive distributions.
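One sampling step of this kind can be sketched in Python. The sketch is an assumption-laden illustration: the equation itself is not recoverable from this copy, so the code uses the standard collapsed Gibbs form for models of this family under symmetric priors, namely the product of a label term, a topic term and a word term. The function name, argument names and toy counts are hypothetical.

```python
def label_topic_weights(word, labels, topics_of, N_dl, N_lk, N_kw, N_k,
                        alpha, lam, beta, W):
    """Unnormalized probability of each (label, topic) pair for one sample.

    All counts must already exclude the current sample x_dn.
    N_dl[l]    : samples of the protein assigned to label l
    N_lk[l][k] : samples assigned to topic k of label l
    N_kw[k][w] : samples assigned to word w of topic k
    N_k[k]     : samples assigned to topic k
    """
    N_d = sum(N_dl[l] for l in labels)
    weights = {}
    for l in labels:
        K_l = len(topics_of[l])
        N_l = sum(N_lk[l][k] for k in topics_of[l])
        label_term = (N_dl[l] + beta) / (N_d + len(labels) * beta)
        for k in topics_of[l]:
            topic_term = (N_lk[l][k] + alpha) / (N_l + K_l * alpha)
            word_term = (N_kw[k][word] + lam) / (N_k[k] + W * lam)
            weights[(l, k)] = label_term * topic_term * word_term
    return weights


# Toy state with empty counts, invented for illustration only.
topics_of = {"background": [0], "GO:A": [1, 2]}
labels = ["background", "GO:A"]
N_dl = {"background": 0, "GO:A": 0}
N_lk = {"background": {0: 0}, "GO:A": {1: 0, 2: 0}}
N_kw = [[0] * 4 for _ in range(3)]
N_k = [0, 0, 0]
weights = label_topic_weights(0, labels, topics_of, N_dl, N_lk, N_kw, N_k,
                              alpha=0.1, lam=0.01, beta=0.5, W=4)
total = sum(weights.values())
probs = {pair: v / total for pair, v in weights.items()}
```

A sampler would draw the new (label, topic) pair of the sample from `probs` and then add the sample back into the count tables.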

Dataset
To investigate the performance of the proposed method, we use two types of datasets. The first is the S. cerevisiae dataset (S.C) proposed in [19], and the second is a human dataset that we constructed ourselves.
The S.C dataset contains several sub-datasets constructed from different characteristics of the yeast genome, and each sub-dataset uses two function annotation standards, FunCat and GO. We mainly use the sub-dataset that is based on the amino acid sequences of proteins and on GO. In addition, to compare the performance of PFTP for different numbers of labels, we construct from S.C a dataset named S.C-CC, which includes only the GO terms that belong to the cellular component ontology. This gives two yeast datasets, S.C and S.C-CC.
The human dataset is constructed from the Universal Protein Resource (UniProt) databank [2] in a way similar to reference [4]. We construct two human datasets with different word lengths: the maximum word length is two letters in the Human1 dataset and three letters in the Human2 dataset.
Because of the large number of GO terms in protein function datasets, we adopt a label space dimension reduction (LSDR) method to ease the classification task. Boolean Matrix Decomposition (BMD) has recently been studied for LSDR, and it allows the label space to be recovered conveniently after classification. Therefore, the BMD method proposed in reference [20] was applied to the S.C and Human datasets. The statistics of the datasets above are displayed in Table 1, where 'L' is the number of GO terms after BMD, 'D' is the number of proteins in each dataset, and 'W' is the size of the vocabulary.
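The recovery of the original label space after classification can be illustrated with a Boolean matrix product. The Python sketch below shows only this generic recovery step; it is not the decomposition algorithm of reference [20], and the function name and the toy matrices are hypothetical.

```python
def boolean_product(C, B):
    """Boolean matrix product: out[d][j] = OR over r of (C[d][r] AND B[r][j])."""
    n_cols = len(B[0])
    return [[int(any(c and B[r][j] for r, c in enumerate(row)))
             for j in range(n_cols)]
            for row in C]


# B maps 2 reduced labels back to 4 original GO terms (toy values).
B = [[1, 1, 0, 0],
     [0, 0, 1, 1]]
# C holds the reduced labels predicted for 3 proteins (toy values).
C = [[1, 0],
     [0, 1],
     [1, 1]]
recovered = boolean_product(C, B)
```

A classifier then only needs to predict the columns of `C`, and the full GO annotation matrix is rebuilt from `C` and `B`.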

Parameter settings
The PFTP model involves three parameters: α, λ and K. α and λ are the parameters of the two Dirichlet distributions; the larger the value of λ, the more uniform the word probabilities within a topic. Based on experience, we set α = 50/K and λ = 200/W. The setting and impact of K are explained later.
In the Gibbs sampling process of model training, we use a single Markov chain and a maximum of 2000 iterations, of which the first 1000 are burn-in. On the converged Markov chain we record the state space every 50 iterations, for a total of 20 records. In the process of model prediction, we set the number of iterations to 1000. After 500

Evaluation criteria
In all of our experiments, we use three representative multi-label learning evaluation criteria: Hamming loss (HL), Average precision (AP) and One Error. In addition, we use the three area-under-the-precision-recall-curve measures proposed in reference [19]: AUPRC, AU(PRC) and AUPRCw. 5-fold cross-validation is adopted to assess the performance of PFTP and of the methods it is compared with, and the average results over the 5 independent rounds are reported in the following sections.
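The three multi-label criteria can be computed as in the following Python sketch. It uses the usual textbook definitions, which are assumed here because the source does not spell them out; the function names and toy values are hypothetical, every protein is assumed to have at least one true label, and the three precision-recall measures of reference [19] are not covered.

```python
def hamming_loss(true_sets, pred_sets, n_labels):
    """Fraction of (protein, label) pairs that are predicted wrongly."""
    wrong = sum(len(t ^ p) for t, p in zip(true_sets, pred_sets))
    return wrong / (len(true_sets) * n_labels)


def one_error(true_sets, scores):
    """Fraction of proteins whose top-ranked label is not a true label."""
    misses = 0
    for t, s in zip(true_sets, scores):
        top = max(range(len(s)), key=s.__getitem__)
        if top not in t:
            misses += 1
    return misses / len(true_sets)


def average_precision(true_sets, scores):
    """Mean over proteins of the precision at the rank of each true label."""
    total = 0.0
    for t, s in zip(true_sets, scores):
        ranking = sorted(range(len(s)), key=lambda j: -s[j])
        hits, acc = 0, 0.0
        for rank, label in enumerate(ranking, start=1):
            if label in t:
                hits += 1
                acc += hits / rank
        total += acc / len(t)
    return total / len(true_sets)


# Toy example with 2 proteins and 3 labels, invented for illustration.
true_sets = [{0, 2}, {1}]
scores = [[0.9, 0.8, 0.1], [0.2, 0.7, 0.1]]
pred_sets = [{j for j, v in enumerate(s) if v >= 0.5} for s in scores]

hl = hamming_loss(true_sets, pred_sets, n_labels=3)
oe = one_error(true_sets, scores)
ap = average_precision(true_sets, scores)
```

Lower values are better for `hl` and `oe`, and higher values are better for `ap`, in line with the reading of the figures in the following sections.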

The impact of topic number on experimental results
K denotes the number of global topics, and this section analyzes its impact on model performance. As described in Section 2, PFTP allocates one or more latent topics to each GO term, so in theory the value of K ranges from L to infinity. Specifically, if only one topic is allocated to each GO term (K = L), the model reduces to Labeled-LDA. Obviously, setting K < L leaves PFTP unable to discover the sub-structure of functions. In our experiments, each function is assigned exactly the same number of topics for simplicity of computation. For example, if we set K = 3L, each GO term corresponds to a topic set with three topics. For this reason, the lower bound of K is set to 2L. On the other hand, although in theory a larger K yields a more refined sub-structure of labels, incorporating more latent topics per function increases the computational load. In reference [18], the impact of K on the effectiveness of the PLDA model was discussed for several text collections. As the number of topics grows, the performance of the PLDA model approaches a fixed value that was obtained by a non-parametric model. In other words, an ever larger number of topics does not bring ever greater performance, only an unbearable running time. Therefore, we set the upper bound of K to 5L, based on our empirical experience and an acceptable level of time overhead. In sum, K should be set to an integer between 2L and 5L. The performance of PFTP under different values of K is shown in Fig. 4. As shown in Fig. 4, all of the evaluation criteria are relatively stable when K is set between 2L and 4L. When K is greater than 4L, however, the values of AP, AUPRC, AU(PRC) and AUPRCw decrease as K increases, while the values of Hamming loss and One Error slowly increase.
These results suggest that the optimum range of K is 2L to 4L. This is because a lower K allocates too few topics to each label, whereas a higher K leaves only small differences between the word distributions of topics. Moreover, the problem of a very large label set is particularly pronounced in protein function datasets, even though a BMD method has been applied to reduce the label dimension. Therefore, we set K to 3L in our experiments.
Evaluation against widely adopted methods
First, we compare PFTP with Labeled-LDA [4] and multi-label K-nearest neighbor (MLKNN) [21] on four datasets. MLKNN is a representative multi-label classifier and is available in an open source tool called Mulan [22]. Figure 5 shows the HL, AP, One Error, AU(PRC), AUPRC and AUPRCw values of these three models on the S.C, S.C-CC, Human1 and Human2 datasets. For AP, AU(PRC), AUPRC and AUPRCw, the larger the value, the better the performance. Conversely, for HL and One-Error, the smaller the value, the better the performance. The red asterisk in Fig. 5 marks the best result on each dataset.
As shown in Fig. 5, PFTP shows more advantages than Labeled-LDA and MLKNN on the four datasets. A detailed analysis follows. For the Human1 dataset, PFTP performs better on all evaluation criteria. On HL, PFTP achieves 9.7% and 2% improvements over Labeled-LDA and MLKNN. On One-Error, PFTP achieves 80% and 99% improvements over Labeled-LDA and MLKNN. On AP, AU(PRC), AUPRC and AUPRCw, PFTP achieves 2.5%, 0.2%, 47% and 18% improvements over Labeled-LDA, and 48%, 40%, 43% and 41% improvements over MLKNN. The improvements on AUPRC and AUPRCw are clearly larger than that on AU(PRC).
For the Human2 dataset, PFTP performs better on four evaluation criteria, the exceptions being AU(PRC) and AUPRC. On HL, PFTP achieves 30% and 7.9% improvements over Labeled-LDA and MLKNN. On One-Error, PFTP achieves 66% and 99% improvements over Labeled-LDA and MLKNN. On AP and AUPRCw, PFTP achieves 3.3% and 0.2% improvements over Labeled-LDA, and 40% and 29% improvements over MLKNN. On AU(PRC) and AUPRC, however, MLKNN and Labeled-LDA respectively obtain better results.
For the S.C dataset, PFTP performs better on four evaluation criteria, the exceptions being HL and One-Error. On AP, AUPRC and AUPRCw, PFTP achieves 2.8%, 22% and 16% improvements over Labeled-LDA, and 48%, 17% and 32% improvements over MLKNN; on AU(PRC), the results of Labeled-LDA and PFTP are almost the same. On HL, however, MLKNN obtains better results than PFTP, and on One-Error the three methods obtain almost identical results.
In addition, we compare PFTP with three decision-tree-based hierarchical multi-label classification algorithms, namely CLUS-HMC, CLUS-SC (single-label classification) and CLUS-HSC (hierarchical single-label classification) [19]. These three algorithms have been studied on protein function prediction datasets and shown to be multi-label classifiers with strong performance. Since reference [19] reports results for CLUS-HMC/SC/HSC only on the S.C dataset, the comparison with PFTP is also restricted to the S.C dataset; the results are plotted in Fig. 6. On AUPRC, our method shows a dominant advantage over all three comparison methods: the improvements are 85%, 85% and 84% over CLUS-SC, CLUS-HSC and CLUS-HMC, respectively. On AU(PRC), PFTP achieves 65%, 51% and 32% improvements over CLUS-SC, CLUS-HSC and CLUS-HMC. On AUPRCw, however, CLUS-HMC obtains better results than PFTP.

Fig. 4 The performance comparison of different K settings. For AP, AU(PRC), AUPRC and AUPRCw, the larger the value, the better the performance; for HL and One-Error, the smaller the value, the better the performance. The red background represents the best value range.

The topics discovered by PFTP
The greatest strength of our protein function topic modeling is that it not only provides the function label probability distribution over proteins as an output, but also explains each function label as a probability distribution over a topic subset, where each topic is in turn a probability distribution over amino acid blocks. To illustrate this topic modeling process, we take the GO term 'GO0016020' as an example; its corresponding topics are shown in Table 2.
As shown in Table 2, the 2-mer BoW is used in this example. For Labeled-LDA, the one-to-one correspondence between label and topic is the key design consideration. Therefore, 'GO0016020' corresponds to only one topic, numbered 288, and thus to a single probability distribution over words. The top 20 words are listed in descending order of probability.
For the PFTP model, each GO term corresponds to a partition of the global topic set. For the S.C-CC dataset, for example, the number of function labels is 319 and the number of global topics is three times the number of labels plus one background topic, for a total of 958. Therefore, each GO term corresponds to four topics (three local topics and one background topic). Topics 863, 864, 865 and 1 are the four topics corresponding to 'GO0016020', where topic 1 is the background topic. As before, the top 20 words of each of these four topics are listed in descending order of probability.

Fig. 5 The comparison results with PFTP and Labeled-LDA. For AP, AUPRC, AU(PRC) and AUPRCw, the larger the value, the better the performance; for HL and One-Error, the smaller the value, the better the performance; the red asterisk on a bar represents the best result on each dataset.
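The topic numbering in this example can be reproduced with a small Python sketch. The numbering scheme is an assumption inferred from the numbers quoted above (topic 288 under Labeled-LDA, topics 863 to 865 plus background topic 1 under PFTP): topic 1 is the shared background topic, and the label with index j owns the next consecutive block of topics. The function name is hypothetical.

```python
def topic_subsets(n_labels, topics_per_label):
    """Map each 1-based label index to its 1-based local topic numbers.

    Assumed layout: topic 1 is the background topic, and label j owns the
    consecutive block starting at 2 + (j - 1) * topics_per_label.
    """
    subsets = {}
    for j in range(1, n_labels + 1):
        first = 2 + (j - 1) * topics_per_label
        subsets[j] = list(range(first, first + topics_per_label))
    return subsets


subsets = topic_subsets(n_labels=319, topics_per_label=3)
total_topics = 1 + sum(len(s) for s in subsets.values())
go_0016020_topics = subsets[288] + [1]  # local topics plus background topic
```

Under this layout the label that Labeled-LDA maps to topic 288 receives local topics 863, 864 and 865, and the total number of topics is 958, in agreement with the figures reported for the S.C-CC dataset.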

Discussions
The results in Figs. 5 and 6 indicate that PFTP has a significant advantage over several widely adopted multi-label classifiers.
Compared with traditional multi-label classifiers (non-topic models), our method further improves the accuracy of protein function prediction by introducing topic subsets into a supervised topic model, which makes it possible to discover the topic that represents the common semantics of documents and to reflect the differences between labels and latent topics. In particular, our method shows a dominant advantage on AUPRC over CLUS-HMC/SC/HSC. We attribute this to the use of the BMD method on the dataset. The computation of AUPRC is not biased toward the accuracy of function labels that annotate more proteins; it reflects the average accuracy over all labels. GO terms that annotate fewer proteins are removed by BMD processing and recovered after prediction, yet the prediction accuracy does not decrease. In other words, the combination of PFTP and BMD can improve the average accuracy of protein function prediction.
Compared with Labeled-LDA, PFTP can discover a more refined latent sub-structure of function labels. By introducing a topic subset for each label, PFTP exposes the relationships between functions and various words and between labels and topics. Therefore, we anticipate that PFTP can help reveal a deeper biological explanation of protein functions.
The performance on the different datasets is also compared in Fig. 4. For the S.C-CC dataset, the six evaluation criteria vary relatively smoothly. This may be because S.C-CC has fewer labels, so changing K does not have a large impact on prediction performance. Comparing the S.C and S.C-CC datasets, we find that the values of AP, AU(PRC), AUPRC and AUPRCw are lower on S.C than on S.C-CC, while the values of One-Error and HL are almost equal on the two datasets. This is because the two datasets share the same word space but differ in the number of labels; the smaller number of labels in S.C-CC leads to higher classification performance. Comparing the Human1 and Human2 datasets, we find that the values of AUPRC and AUPRCw are higher on Human1 than on Human2, the value of AP is lower on Human1 than on Human2, and the values of One-Error, HL and AU(PRC) are almost equal on Human1 and Human2. These results show that the classification performance of PFTP on Human1 and Human2 is almost the same, which suggests that a larger word space does not necessarily yield better classification performance.

Fig. 6 The comparison results with PFTP and HMC/SC/HSC. For the three evaluation criteria, the larger the value, the better the performance, and the red asterisk on a bar represents the best result on each dataset.

Conclusions
In this paper, we introduced an improved multi-label supervised topic model for predicting protein function. In our previous study, the multi-label supervised topic model Labeled-LDA was applied to protein function prediction; it associates each label (GO term) directly with a single corresponding topic. This makes the latent topics completely degenerate and ignores the differences between labels and latent topics. To address this shortcoming, we proposed a Partially Function-to-Topic Prediction model, which introduces a local topic subset for each function label. PFTP supports not only latent topic subsets within given function labels but also a background topic corresponding to a 'fake' function label. In a 5-fold cross-validation experiment on predicting protein function, PFTP significantly outperforms the compared methods. Owing to its more refined modeling of function labels, PFTP shows effectiveness and potential value in predicting protein function in our experimental studies. Several aspects of topic modeling for protein function prediction remain to be improved, such as the introduction of additional protein features and of the hierarchical structure of function labels. Nevertheless, the multi-label topic model is a promising method for many applications in bioinformatics.