Semisupervised multilabel collective classification ensemble for functional genomics
BMC Genomics volume 15, Article number: S17 (2014)
Abstract
Background
With the rapid accumulation of proteomic and genomic datasets in terms of genome-scale features and interaction networks through high-throughput experimental techniques, manually predicting the functional properties of proteins has become increasingly cumbersome, and computational methods to automate this annotation task are urgently needed. Most approaches to predicting functional properties of proteins require either identifying a reliable set of labeled proteins whose attribute features are similar to those of the unannotated proteins, or learning from a fully labeled protein interaction network with a large amount of labeled data. However, acquiring such labels can be very difficult in practice, especially for multilabel protein function prediction problems. Learning with only a few labeled data can lead to poor performance, as only limited supervision knowledge can be obtained from similar proteins or from connections between them. To annotate proteins effectively even when labeled data are scarce, it is important to take advantage of all data sources available in this problem setting, including interaction networks, attribute feature information, correlations of functional labels, and unlabeled data.
Results
In this paper, we show that predicting functional properties of proteins from various sources of relational data is, in nature, a typical collective classification (CC) problem in machine learning. The protein function prediction task with limited annotation is then cast into a semisupervised multilabel collective classification (SMCC) framework. We propose a novel generative-model-based SMCC algorithm, called GMSMCC, to effectively compute the label probability distributions of unannotated protein instances and predict their functional properties. To further boost prediction performance, we extend the method in an ensemble manner, called EGMSMCC, by utilizing multiple heterogeneous networks with various latent linkages constructed to explicitly model the relationships among the nodes, so that supervision knowledge is propagated effectively from labeled to unlabeled nodes.
Conclusion
Experimental results on a yeast gene dataset, predicting the functions and localization of proteins, demonstrate the effectiveness of the proposed method. In the comparison, the proposed algorithms perform better than the other algorithms considered.
Background
Advances in biotechnology have enabled high-throughput experiments to generate a wide variety of genomic and proteomic data sources, including genome sequences, protein structures, and protein-protein interaction (PPI) networks.
Each data source provides a comprehensive view of the underlying mechanisms, and is represented as a set of features in a feature space or viewed as a graph structure where each individual is considered as a node. In the field of functional genomics, the process of manual annotation has become increasingly cumbersome with the rapid accumulation of the proteomic and genomic datasets. Computational methods to automate this task are urgently needed. Therefore, various computational methods have been proposed to automatically infer the functional properties of proteins using various data sources available (see [1] for a review).
Previous research in protein (or gene) function prediction can be partitioned into two classes of methods (feature-based approaches and graph-based approaches) according to input data and methodology. Feature-based machine learning algorithms require the instances to have a fixed set of attribute values from a feature space. These approaches involve extracting features that encode the desired properties of a protein and constructing a machine learning model for predicting functional properties. Some commonly used features are characteristics of the amino acid sequence, textual repositories such as MEDLINE, and more biologically meaningful features such as motifs derived from motif analysis of protein sequences, the isoelectric point, and post-translational modifications. Using these constructed attribute features, a predictive model is learnt by training a classifier on annotated proteins, and this model is then used to predict the functions of the proteins [2–5].
On the other hand, graph-based approaches use network structure information to find proteins (or genes) that share similar functional properties. Protein interaction networks are becoming increasingly rich and useful in delineating the biological characteristics of proteins. A review of computational approaches used to measure protein interactions can be found in [6]. For instance, Pearson's correlation coefficient is used to measure pairwise similarity between gene expression profiles. Specifically, protein-protein interaction data can be modeled as a graph by considering individual proteins as the nodes and the existence of an interaction between a pair of proteins as a link. Graph-based or kernel-based classification algorithms are then used for protein classification tasks based upon the protein interaction network [7–10].
Although many efforts have been made to automatically predict the functional properties of proteins, this task still poses several significant challenges. First, existing feature-based and graph-based methods cannot guarantee good accuracy when only a limited number of labeled data are available. Most of them require a sufficiently large number of labeled examples or a fully labeled graph for training. However, acquiring such labels can be very expensive and time-consuming in practical applications, and prediction performance may degrade when the requirement of sufficient labeled data is not met. Furthermore, proteins are generally involved in more than one biological process, and thus they are annotated with multiple functions, which makes functional prediction more difficult. A promising idea for tackling these challenges (label deficiency and multiple-function prediction) is to take advantage of multiple data sources and multiple functions of proteins. To this end, we propose approaches that utilize all data sources available in this problem setting, including interaction networks, protein attribute features, label correlations, and unlabeled data, to enhance the performance of predicting functional properties of proteins.
In this paper, we first show that the learning task underlying protein function prediction from various sources of relational data matches well with the collective classification [11–13] framework. Then, we propose a new generative-model-based semisupervised multilabel collective classification algorithm, called GMSMCC, for predicting proteins with multiple functions utilizing both labeled and unlabeled data in the learning process. To further boost the learning performance, we extend the proposed GMSMCC method in an ensemble manner by constructing multiple latent networks. This approach, called ensemble of GMSMCC models (EGMSMCC), constructs several kinds of latent networks with various latent linkages to explicitly model the relationships among the nodes. We show how to effectively integrate these latent networks in an ensemble framework to improve the performance of protein function prediction.
We study the KDD Cup 2001 tasks of predicting functional properties (protein localization and biological functions) of the protein corresponding to a given yeast gene. Experimental results show that the proposed algorithms (GMSMCC and EGMSMCC) achieve performance superior to the compared feature-based approaches, graph-based approaches, and collective classification algorithms. In summary, the main contributions of this paper are as follows:

1. This article is the first to examine the CC algorithm for protein function prediction using semisupervised learning and multilabel learning techniques to leverage the unlabeled portion of the data and the label correlation information in a partially labeled PPI network that has only a limited number of annotations.

2. The proposed GMSMCC algorithm is able to utilize various data sources for protein function prediction, where the instance features and interactions, as well as the label correlations, can be naturally and explicitly exploited to predict a set of functional labels for an unannotated protein.

3. The proposed EGMSMCC algorithm is a multi-network learning method that integrates multiple constructed latent graphs for protein function prediction using an ensemble framework. Via the constructed latent graphs, supervision knowledge can be propagated effectively from labeled to unlabeled nodes to boost the prediction performance.
Prediction task formalization
The protein functional properties prediction task has been widely explored in the literature. An extensive review of this task can be found in [1]. The approaches to protein function prediction can be divided into two categories, feature-based methods and graph-based methods, in terms of input data and methodology.
Feature-based methods. For these methods, each protein is characterized as a feature vector x_{ i } = <f_{1}, ..., f_{ d }> with a fixed set of feature values. The feature vectors of the data are then taken as input to machine learning algorithms to infer annotation rules for predicting unannotated proteins [14]. Learning algorithms that have been used include SVM [3], neural networks [15], random forest [16], and co-training [14], to name a few. Typically, feature extraction is used to obtain the desired features representing the proteins, and feature selection is then applied in the learning process to select the most useful features for training a classifier. A protein usually performs multiple functions. As such, several approaches handle the prediction problem using the multilabel learning framework. For instance, Barutcuoglu et al. [17] learn an SVM classification model for predicting functions in the Gene Ontology using a hierarchical multilabel structure. Pandey et al. [18] incorporate function correlation for predicting protein functions using a weighted multilabel kNN classifier. Schietgat et al. [19] predict gene function using hierarchical multilabel decision tree ensembles.
Graph-based methods. These methods study protein function in the context of a network. The recent availability of protein interaction networks has spurred the development of computational methods for analyzing such data in order to elucidate the relationships between protein interactions and functional properties. Sharan et al. [9] categorize the methods into two groups: direct annotation schemes, which infer the function of a protein based on its connections in the network; and module-assisted schemes, which first identify modules of related proteins and then annotate each module based on the known functions of its members. Examples of direct annotation algorithms include neighborhood counting [8], graph theoretic methods [20], and Markov random fields [21]. The module-assisted methods, on the other hand, differ mainly in their module (or cluster) detection techniques. Examples of module detection methods include hierarchical clustering-based methods [22] and graph clustering-based methods [23]. Graph-based approaches using the multilabel learning framework for prediction have also been studied [24–26].
Although a broad variety of interesting approaches have been developed, most of them study the scenario in which sufficient labeled data are available in the dataset. In this case, the supervision knowledge can be used effectively in feature-based and graph-based methods to achieve good learning performance. However, such labels are difficult and time-consuming to obtain. In sparsely labeled networks, only a limited number of nodes are labeled, say fewer than 10%, 5% or even 1%. Prediction performance may then degrade due to the lack of annotated proteins [27]. It is thus natural to consider using various data sources of the protein data (both labeled and unlabeled) to improve the prediction performance.
Collective classification. The task of protein function prediction can be cast into the collective classification problem of building a predictive model from networked data. Generally, networked data can be represented by nodes (instances) interconnected with each other by edges reflecting the relation or dependence between the nodes. Information on the nodes is provided as a set of attribute features (e.g., words present in the web page). The class membership of an instance may influence the class membership of a related instance.
Conventional supervised learning methods assume that the instances to be classified are independent of each other, while collective classification jointly classifies interrelated instances by exploiting the interrelations among the instances [28, 29]. For example, consider the task of predicting the topics of hyperlinked web pages. Conventional supervised learning approaches only use the attribute features derived from the content of the pages to classify each page. In contrast, collective classification methods use the link structure to construct additional relational features based on the labels of neighboring pages. We can count the number of different labels of the neighboring pages that are linked to each page as the relational features. Collective classification methods would then explicitly use the attribute features and the relational features together for classification.
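The neighbor-label counting described above can be sketched in a few lines. This is an illustrative toy, not the implementation used in this paper; the function name `relational_features` and the small web graph are assumptions made for the example.

```python
from collections import Counter

def relational_features(adjacency, labels, classes):
    """For each node, count how many labeled neighbors carry each class."""
    features = []
    for row in adjacency:
        counts = Counter()
        for j, linked in enumerate(row):
            # Unlabeled neighbors (labels[j] is None) contribute nothing.
            if linked and labels[j] is not None:
                counts.update(labels[j])  # labels[j] is a set of class names
        features.append([counts[c] for c in classes])
    return features

# Hypothetical graph of four pages; pages 0 and 3 are unlabeled.
adjacency = [[0, 1, 1, 0],
             [1, 0, 0, 1],
             [1, 0, 0, 0],
             [0, 1, 0, 0]]
labels = [None, {"sports"}, {"sports", "news"}, None]
classes = ["news", "sports"]
rel = relational_features(adjacency, labels, classes)
```

A collective classifier would concatenate each row of `rel` with the node's attribute vector before training or inference.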
Formally, the collective classification task is described as follows. Let G = (V, E, X, Y, C) be a graph dataset. V is a set of nodes {v_{1}, ..., v_{ N }}. E is the adjacency matrix, where E(i, j) = 1 if node v_{ i } and node v_{ j } are connected and E(i, j) = 0 otherwise. X ⊂ R^{d} consists of d-dimensional vector instances; each x_{ i } ∈ X is an attribute vector for a node v_{ i } ∈ V. C = {c_{1}, c_{2}, ..., c_{ K }} is the set of K possible labels. Y contains the label set Y_{ i } corresponding to instance x_{ i } for i = 1, ..., N. Each Y_{ i } = [Y_{ i,1 }, ..., Y_{ i,l }, ..., Y_{ i,K }] ∈ {0, 1}^{K}, such that Y_{ i,l } = 1 means that x_{ i } is associated with label c_{ l } and Y_{ i,l } = 0 otherwise. We assume that we have $n'$ labeled data $\{(x_i, Y_i)\}_{i=1}^{n'}$ and $n''$ unlabeled data $\{x_i\}_{i=n'+1}^{n'+n''}$ with $N = n' + n''$. The task is to construct a function that predicts the class labels of unlabeled nodes using the labeled nodes in the graph.
When only a limited number of nodes are labeled in the task of predicting functional properties of proteins, i.e. $n' \ll n''$, most of the proteins may not be connected to labeled ones, which makes the task very challenging. As such, it is natural to consider some form of semisupervised learning, in which both labeled and unlabeled data are used together to improve performance [30].
Methods
In this section, we present the GMSMCC algorithm to address the task of predicting functional properties of proteins. Our approach models the problem as a generative process and learns a probabilistic interpretation of the data in order to estimate the conditional distribution p(c|x), where c is a functional class and x is a protein instance.
GMSMCC
Given the dataset X = {x_{1}, ..., x_{ i }, ..., x_{ N }} with the attribute features W = {w_{1}, ..., w_{ j }, ..., w_{ M }}, we set up a generative model for the attribute features of the protein instances in X (including labeled and unlabeled data) and estimate the conditional distribution P(c|x) using the pLSA model originally developed for latent topic analysis. Unlike other topic models based on latent topics, we adopt the protein functional classes c_{ k } as the latent variables in the pLSA model and fix P(c_{ k }|x_{ i }) for the annotated proteins during learning. The model is given as
$$P(w_j \mid x_i) = \sum_{k=1}^{K} P(w_j \mid c_k)\, P(c_k \mid x_i)$$
where P(c_{ k }|x_{ i }) is the probability that a protein instance x_{ i } is associated with functional class c_{ k }, and P(w_{ j }|c_{ k }) is the probability that attribute feature w_{ j } occurs in a protein with class c_{ k }. For efficient optimization, we work with the log-likelihood. The likelihood function is transformed into:
$$\mathcal{L} = \sum_{i=1}^{N} \sum_{j=1}^{M} n(x_i, w_j) \log \sum_{k=1}^{K} P(w_j \mid c_k)\, P(c_k \mid x_i) \qquad (1)$$
where n(x_{ i }, w_{ j }) is the frequency of w_{ j } occurring in x_{ i }, and N and M are the numbers of proteins and attribute features, respectively.
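As a minimal sketch, the pLSA log-likelihood can be evaluated directly from the count matrix and the two probability tables. The standard pLSA mixture form is assumed here, and the function name and array layout (`n[i][j]`, `p_w_given_c[k][j]`, `p_c_given_x[i][k]`) are illustrative choices rather than the authors' code.

```python
import math

def plsa_log_likelihood(n, p_w_given_c, p_c_given_x):
    """Sum over i, j of n[i][j] * log(sum_k P(w_j|c_k) * P(c_k|x_i))."""
    K = len(p_w_given_c)
    total = 0.0
    for i, counts in enumerate(n):
        for j, n_ij in enumerate(counts):
            if n_ij == 0:
                continue  # zero counts contribute nothing to the sum
            mix = sum(p_w_given_c[k][j] * p_c_given_x[i][k] for k in range(K))
            total += n_ij * math.log(mix)
    return total

# Toy data: 2 proteins, 2 features, 2 classes.
n = [[2, 0], [0, 1]]
p_w_given_c = [[1.0, 0.0], [0.0, 1.0]]
p_c_given_x = [[0.5, 0.5], [0.25, 0.75]]
ll = plsa_log_likelihood(n, p_w_given_c, p_c_given_x)
```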
We exploit knowledge of the network topology of the data to better estimate the conditional probability P(c|x). The basic assumption is that if two nodes x_{ i } and x_{ s } are connected in the network, they tend to share similar class labels, i.e., their conditional distributions P(c|x_{ i }) and P(c|x_{ s }) should be close to each other. Here, we use the Kullback-Leibler (KL) divergence to measure the distance between two distributions. Suppose the distribution P(c_{ k }|x_{ i }) over the classes is represented as a vector z_{ i } = [P(c_{1}|x_{ i }), ..., P(c_{ K }|x_{ i })]^{T}. Then the KL-divergence between z_{ i } and z_{ s } is defined as
$$D(z_i \,\|\, z_s) = \sum_{k=1}^{K} P(c_k \mid x_i) \log \frac{P(c_k \mid x_i)}{P(c_k \mid x_s)}$$
The KL-divergence is not symmetric, and thus we use the following symmetric KL-divergence
$$D(z_i; z_s) = \frac{1}{2}\left( D(z_i \,\|\, z_s) + D(z_s \,\|\, z_i) \right)$$
to measure the distance between two distributions. Here, D(z_{ i }; z_{ s }) is always nonnegative.
As discussed above, our idea is to smooth the distribution P(c|x) over the network. If two proteins are connected by an interaction, then their conditional distributions P(c|x_{ i }) and P(c|x_{ s }) should be close to each other. Such local smoothness with respect to the network topology is explicitly incorporated into the generative model through a network regularizer
$$\mathcal{R} = \sum_{i,s=1}^{N} E_{i,s}\, D(z_i; z_s) \qquad (2)$$
where E is the adjacency matrix representing the network topology, with E_{ i,s } = 1 if v_{ i } and v_{ s } are connected, and E_{ i,s } = 0 otherwise.
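The network regularizer can be illustrated with a short sketch. It assumes that the symmetric divergence averages the two directed KL terms and that the regularizer is the edge-weighted sum over all node pairs; any constant factor would be absorbed into the regularization parameter. The small `eps` guard against zero probabilities is an addition for numerical safety, not part of the model.

```python
import math

def kl(p, q, eps=1e-12):
    """Directed KL divergence between two discrete distributions."""
    return sum(pk * math.log((pk + eps) / (qk + eps)) for pk, qk in zip(p, q))

def sym_kl(p, q):
    """Symmetric KL divergence: average of the two directed divergences."""
    return 0.5 * (kl(p, q) + kl(q, p))

def network_regularizer(E, Z):
    """Edge-weighted sum of symmetric KL divergences between linked nodes."""
    n = len(Z)
    return sum(E[i][s] * sym_kl(Z[i], Z[s]) for i in range(n) for s in range(n))

E = [[0, 1], [1, 0]]
smooth = network_regularizer(E, [[0.7, 0.3], [0.7, 0.3]])  # linked nodes agree
rough = network_regularizer(E, [[0.9, 0.1], [0.1, 0.9]])   # linked nodes disagree
```

The penalty vanishes when linked proteins have identical label distributions and grows as they diverge, which is the smoothness behavior the regularizer is meant to encourage.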
In protein functional properties prediction, proteins are generally involved in multiple biological processes and have multiple functions. Thus, it is crucial to take the label correlations into account to better predict their functional classes. Here, we further generalize the generative model to support this setting. Recall that the network regularizer $\mathcal{R}$ is used to smooth the label probability distribution over the intrinsic network structure. We would like the resulting distribution to be smooth with respect to the class label correlations as well. A natural assumption is that if two class labels c_{ k } and c_{ l } are related, then the distributions P(c_{ k }|x_{ i }) and P(c_{ l }|x_{ i }) over the instances should also be similar to each other.
In particular, we construct a label-to-label affinity graph with K vertices, where each vertex corresponds to one class label. For each pair of vertices, we place an edge between them and compute its weight. There are many ways to define the weight matrix F = [F_{ kl }] on the affinity graph. Specifically, we use the commonly used dot product as follows
$$F_{kl} = Y_k^{T} Y_l$$
where Y_{ k } = [Y_{ 1,k }, ..., Y_{ N,k }]^{T} is the label distribution over the instances, such that Y_{ i,k } is nonzero if x_{ i } belongs to class c_{ k } and the remaining elements are zero. Here, Y_{ k } is normalized to unit length, so the dot product of two vectors is equivalent to their cosine similarity.
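A small sketch of the label affinity computation follows, assuming a binary N-by-K label matrix restricted to annotated proteins. The function name `label_affinity` and the fallback norm of 1.0 for labels that never occur are illustrative assumptions.

```python
import math

def label_affinity(Y):
    """Cosine similarity between the label columns of an N x K matrix Y."""
    N, K = len(Y), len(Y[0])
    cols = [[Y[i][k] for i in range(N)] for k in range(K)]
    # A label that never occurs gets norm 1.0 so its affinities are simply zero.
    norms = [math.sqrt(sum(v * v for v in col)) or 1.0 for col in cols]
    return [[sum(a * b for a, b in zip(cols[k], cols[l])) / (norms[k] * norms[l])
             for l in range(K)] for k in range(K)]

# Four annotated proteins, three functional classes.
Y = [[1, 1, 0],
     [1, 0, 0],
     [0, 0, 1],
     [1, 1, 0]]
F = label_affinity(Y)
```

Classes 0 and 1 co-occur on two proteins and receive a high affinity, while class 2 never co-occurs with either and receives zero.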
Suppose the vector representation of P(c_{ k }|x_{ i }) with respect to different instances is r_{ k } = [P(c_{ k }|x_{1}), ..., P(c_{ k }|x_{ N })]^{T}.
We define the KL-divergence between r_{ k } and r_{ l } for a pair of class labels as follows
$$D(r_k \,\|\, r_l) = \sum_{i=1}^{N} P(c_k \mid x_i) \log \frac{P(c_k \mid x_i)}{P(c_l \mid x_i)}$$
By using the label affinity matrix F and the symmetric KL-divergence defined above, we define the label regularizer
to smooth the distribution P(c|x).
Incorporating the smoothness terms (2) and (3) into the objective function in (1), we have the following new objective function
where α and β are the regularization parameters. When α = 0 and β = 0, maximizing $\mathcal{O}$ is equivalent to performing learning using the original pLSA model.
For the annotated proteins, the probability distributions P(c|x) are fixed in the learning process. Specifically, the probability assignments are defined as a uniform distribution over the known functional class labels as follows
$$P(c_k \mid x_i) = \begin{cases} 1/l_x & \text{if } x_i \text{ is annotated with } c_k \\ 0 & \text{otherwise} \end{cases} \qquad (5)$$
where l_{ x } is the number of functional classes of the annotated protein x_{ i }.
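The uniform assignment for annotated proteins amounts to normalizing the binary label row, as in this minimal sketch; the function name is hypothetical.

```python
def annotated_distribution(y_row):
    """Spread probability uniformly over the known classes of one protein."""
    l_x = sum(y_row)  # number of functional classes the protein is annotated with
    if l_x == 0:
        raise ValueError("protein has no annotated class")
    return [y / l_x for y in y_row]

dist = annotated_distribution([1, 0, 1, 0])
```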
For the unannotated proteins, we maximize the log-likelihood function $\mathcal{O}$ to compute their probability distributions. The resulting distribution P(c|x_{ i }) for a given instance x_{ i } indicates the importance of each function to the protein. We expect the values P(c_{ l }|x_{ i }) of the relevant labels to be close to each other and larger than those of the irrelevant labels. Hence, to make a prediction for x_{ i }, we first rank the labels according to P(c_{ k }|x_{ i }). Then we separate the labels into relevant and irrelevant subsets at the largest change observed across the sorted values. That is, we seek the largest difference between two successive values P(c_{ k }|x_{ i }) and P(c_{ k+1 }|x_{ i }) in sorted order. Their midpoint, t = (P(c_{ k }|x_{ i }) + P(c_{ k+1 }|x_{ i }))/2, is used as the splitting threshold: the relevant set consists of the labels with probabilities larger than t, and the irrelevant set contains the remaining labels.
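The largest-gap thresholding rule described above can be sketched as follows. The function name `split_relevant` is an assumption, ties between equal gaps are broken in favor of the first one, and at least two classes are required.

```python
def split_relevant(probs):
    """Split class indices into relevant/irrelevant at the largest gap
    between successive sorted probabilities. Requires len(probs) >= 2."""
    ranked = sorted(probs, reverse=True)
    gaps = [ranked[r] - ranked[r + 1] for r in range(len(ranked) - 1)]
    cut = max(range(len(gaps)), key=gaps.__getitem__)  # position of largest drop
    t = (ranked[cut] + ranked[cut + 1]) / 2            # midpoint threshold
    relevant = [k for k, p in enumerate(probs) if p > t]
    return relevant, t

relevant, t = split_relevant([0.45, 0.05, 0.40, 0.10])
```

In this toy case the largest drop is between 0.40 and 0.10, so classes 0 and 2 are returned as the predicted label set.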
Model fitting with the EM algorithm
Our proposed approach, GMSMCC, utilizes the generative model with both network and label regularization for protein function prediction, and its parameter estimation differs from that of the original pLSA [31] and of previous work combining pLSA with manifold learning for unsupervised data clustering [32]. Next, we introduce the EM algorithm used in the proposed GMSMCC approach to find maximum likelihood parameter estimates.
In the proposed generative model, we have NK + MK parameters {P(w_{ j }|c_{ k }), P(c_{ k }|x_{ i })}, where the class labels c_{ k } are considered as the latent variables. For convenience, we denote these parameters as Θ. We use the EM algorithm, which alternates between an expectation step (E-step) and a maximization step (M-step), to estimate the parameters of the proposed GMSMCC model.
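E-step
The E-step is the same as in the pLSA model. The posterior probabilities of the latent variables P(c_{ k }|x_{ i }, w_{ j }) are computed as follows
$$P(c_k \mid x_i, w_j) = \frac{P(w_j \mid c_k)\, P(c_k \mid x_i)}{\sum_{l=1}^{K} P(w_j \mid c_l)\, P(c_l \mid x_i)} \qquad (6)$$
M-step
The M-step re-estimation for {P(w_{ j }|c_{ k })} is the same as that in the pLSA model, as follows
$$P(w_j \mid c_k) = \frac{\sum_{i=1}^{N} n(x_i, w_j)\, P(c_k \mid x_i, w_j)}{\sum_{m=1}^{M} \sum_{i=1}^{N} n(x_i, w_m)\, P(c_k \mid x_i, w_m)} \qquad (7)$$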
In the M-step, parameters are updated based on the expected complete-data log-likelihood, which depends on the posterior probabilities computed in the E-step [31]. The expected complete-data log-likelihood of (4) is given by
using the posterior probabilities computed in the E-step.
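The two updates that GMSMCC shares with plain pLSA, namely the posterior computation in the E-step and the re-estimation of the feature distributions in the M-step, can be sketched as below. This covers only the standard pLSA part; the regularized update of P(c|x) is not shown. Function names and array layouts are assumptions for illustration.

```python
def e_step(p_w_given_c, p_c_given_x):
    """Posterior P(c_k | x_i, w_j), returned as post[i][j][k]."""
    K, M = len(p_w_given_c), len(p_w_given_c[0])
    post = []
    for dist in p_c_given_x:
        rows = []
        for j in range(M):
            joint = [p_w_given_c[k][j] * dist[k] for k in range(K)]
            z = sum(joint) or 1.0  # avoid division by zero for impossible pairs
            rows.append([v / z for v in joint])
        post.append(rows)
    return post

def m_step_features(n, post, K):
    """Re-estimate P(w_j | c_k) from expected feature counts."""
    N, M = len(n), len(n[0])
    out = []
    for k in range(K):
        expected = [sum(n[i][j] * post[i][j][k] for i in range(N)) for j in range(M)]
        z = sum(expected) or 1.0
        out.append([v / z for v in expected])
    return out

n = [[2, 1], [0, 3]]
post = e_step([[0.5, 0.5], [0.5, 0.5]], [[1.0, 0.0], [0.0, 1.0]])
p_w_given_c = m_step_features(n, post, K=2)
```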
We need to maximize $\mathcal{Q}(\Theta)$ with respect to the parameters Θ subject to the constraints $\sum_{k=1}^{K} P(c_k \mid x_i) = 1$ and $\sum_{j=1}^{M} P(w_j \mid c_k) = 1$. Therefore, we augment $\mathcal{Q}(\Theta)$ by the appropriate Lagrange multipliers ρ_{ i } to obtain
Maximization of $\mathcal{Q}'$ with respect to P(c_{ k }|x_{ i }) leads to the following set of equations:
where 1 ≤ i ≤ N, 1 ≤ k ≤ K.
We expect that if the attribute features of two proteins x_{ i } and x_{ s } are close (i.e., E_{ is } is large), then the distributions P(c_{ k }|x_{ i }) and P(c_{ k }|x_{ s }) are similar to each other, i.e., P(c_{ k }|x_{ i }) will be close to P(c_{ k }|x_{ s }). We have
Similarly, if two functions c_{ k } and c_{ l } are close (i.e., F_{ kl } is large), then the distributions P(c_{ k }|x_{ i }) and P(c_{ l }|x_{ i }) are similar to each other, i.e., P(c_{ k }|x_{ i }) will be close to P(c_{ l }|x_{ i }).
We have,
By using the approximation
(9) can be written as
where 1 ≤ i ≤ N, 1 ≤ k ≤ K,
and
To obtain the M-step re-estimation for P(c|x), we construct six NK-by-NK matrices: Z, Ω, D, B, U, and R.
First, we construct a K-by-K block diagonal matrix D = [D_{ i,j }] based on the adjacency matrix E, where the (i, j)th block of D is an N-by-N matrix D_{ i,j } = [d_{ i,j,s,t }]_{ s,t=1,...,N }. All the entries of D are equal to 0 except the diagonal entries $d_{i,i,s,s} = \sum_{t} E_{st}$.
Next, we construct another K-by-K block diagonal matrix B = [B_{ i,j }], where the (i, j)th block is also an N-by-N matrix B_{ i,j } = [b_{ i,j,s,t }]_{ s,t=1,...,N }. The entries of B are equal to 0 when i ≠ j; otherwise, if i = j, we have b_{ i,j,s,t } = E_{ st }.
Then, we construct an N-by-N block diagonal matrix U = [U_{ i,j }] based on the label correlation matrix F, where the (i, i)th block of U is a K-by-K matrix U_{ i,i } = [u_{ i,i,s,t }]_{ s,t=1,...,K }. All nondiagonal entries of U are equal to 0, and the diagonal entries are $u_{i,i,s,s} = \sum_{l} F_{sl}$.
The matrix R = [R_{ i,j }] is another N-by-N block matrix, where the (i, j)th block is a K-by-K matrix R_{ i,j } = [r_{ i,j,s,t }]_{ s,t=1,...,N }. Indeed, each R_{ i,j }, for i, j = 1, ..., K, is a diagonal matrix with r_{ i,j,s,s } = F_{ ij }.
The matrix Z is a K-by-1 block vector, where the kth entry Z_{ k } is an N-dimensional vector defined as follows
The matrix Ω is a K-by-K block matrix, where the (i, j)th block is an N-by-N diagonal matrix. All the nondiagonal entries of Ω are equal to 0 and the diagonal entries
Let y denote a K-by-1 block matrix where
The system of equations in (9) is approximated using (10) and can be solved using the following matrix form:
Thus, the M-step re-estimation for P(c|x) is
The E-step (6) and the M-steps (7) and (12) are alternated until the objective function (4) converges.
In the initialization step of the EM algorithm, the values of P(w_{ j }|c_{ k }) and P(c_{ k }|x_{ i }) are initialized based on the class priors according to the annotated proteins. We assume that the features w_{ j } are conditionally independent of each other given the label c_{ k }. Concretely, P(w_{ j }|c_{ k }) is initialized as $P(w_j \mid c_k) = \frac{n(w_j, c_k)}{\sum_{i} n(w_i, c_k)}$, where n(w_{ j }, c_{ k }) is the frequency with which w_{ j } and c_{ k } co-occur. The label distribution P(c_{ k }|x_{ i }) for the unannotated proteins is initialized as $P(c_k \mid x_i) = \frac{\sum_{s} n(c_k, x_s)}{\sum_{l} \sum_{s} n(c_l, x_s)}$, where the sums over s run over the annotated proteins and n(c_{ k }, x_{ s }) = 1 if x_{ s } is associated with c_{ k } and 0 otherwise. In each iteration of the EM algorithm, the probability assignments of P(c|x) for the labeled data are reset according to the known functional class labels as in Eq. (5).
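The initialization can be sketched as follows. It rests on two stated assumptions: the co-occurrence count n(w_j, c_k) is taken to be the total frequency of feature w_j over annotated proteins carrying class c_k, and the class prior is the share of annotations belonging to each class. The uniform fallback for classes without annotated examples is also an addition for the sketch.

```python
def init_p_w_given_c(n, Y, labeled):
    """Initialize P(w_j | c_k) from feature counts of annotated proteins."""
    K, M = len(Y[0]), len(n[0])
    out = []
    for k in range(K):
        counts = [sum(n[i][j] for i in labeled if Y[i][k]) for j in range(M)]
        z = sum(counts)
        # Fall back to a uniform distribution if the class has no examples.
        out.append([c / z for c in counts] if z else [1.0 / M] * M)
    return out

def init_class_prior(Y, labeled):
    """Class prior used to initialize P(c_k | x_i) of unannotated proteins."""
    K = len(Y[0])
    totals = [sum(Y[i][k] for i in labeled) for k in range(K)]
    z = sum(totals)
    return [t / z for t in totals]

n = [[2, 0], [1, 1], [5, 5]]   # feature counts; protein 2 is unannotated
Y = [[1, 0], [1, 1], [0, 0]]
labeled = [0, 1]
p_w = init_p_w_given_c(n, Y, labeled)
prior = init_class_prior(Y, labeled)
```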
EGMSMCC algorithm
The power of the network regularizer in Eq. (4) of our proposed GMSMCC model lies in the fact that the linkages of the network generally exhibit predictable relationships between class labels of linked proteins. Suppose we have an unannotated protein, and we have a good understanding of the relationship between the functions of this protein and the functional properties of its labeled neighbors, then we should be able to make a good prediction of the protein functional properties based on the linkage information.
In the proposed GMSMCC model, we use the autocorrelation in the protein interaction network which may provide some inconsistent linkages between the proteins not sharing similar functional properties. In the studies of functional genomics, if more information is available, one can derive more effective networks for capturing useful relationships between the proteins to propagate the supervision knowledge from labeled nodes to unlabeled nodes.
In the real world, protein data are associated with various data sources. For example, the proteins are associated with attribute features; proteins with similar feature values may also be similar in their associated functions. Also, the proteins are associated with a set of functional labels, which can be represented by label features that are useful for evaluating the pairwise similarity of protein instances. These latent linkages are already embedded in the data. We can exploit this knowledge to construct latent graphs for more effective label prediction.
In this paper, in addition to the PPI network, we introduce two types of latent linkages to construct latent graphs. Based on the latent graphs we constructed, we extend our proposed generative model in an ensemble manner to further boost the prediction performance.
Given the adjacency matrices $\{E^{(i)}\}_{i=1}^{q}$ of q latent graphs, the proposed ensemble algorithm, namely EGMSMCC, is described in Algorithm 1. In the EGMSMCC algorithm, we learn an individual GMSMCC model on each of the constructed latent graphs, and then combine the learned models to obtain a more reliable prediction than that of a model built on a single latent graph.
Algorithm 1 EGMSMCC
Input: ${\left\{{E}^{\left(i\right)}\right\}}_{i=1}^{q}$, X, Y , the parameters α and β
Output: y
Procedure:
1: for i = 1 to q do
2: Learn a GMSMCC model using the constructed latent graph E^{(i)}. In the GMSMCC model, compute the network regularizer $\mathcal{R}$ in Eq. (2) according to E^{(i)};
3: Use EM algorithm to optimize the GMSMCC model to compute the label probability distribution y^{(i)};
4: end for
5: Combine the results of the q learned models y^{(1)}, y^{(2)}, ..., y^{(q)} into an ensemble prediction $y = \frac{1}{q} \sum_{i=1}^{q} y^{(i)}$
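The structure of Algorithm 1 can be sketched as a loop over latent graphs followed by an average. The callable `fit_model` is a hypothetical stand-in for training one GMSMCC model with EM on a given adjacency matrix and returning its N-by-K label probability table; it is stubbed here only to make the example runnable.

```python
def egmsmcc(graphs, fit_model):
    """Fit one model per latent graph and average the label distributions."""
    predictions = [fit_model(E) for E in graphs]
    q = len(predictions)
    n_nodes, n_classes = len(predictions[0]), len(predictions[0][0])
    return [[sum(p[i][k] for p in predictions) / q for k in range(n_classes)]
            for i in range(n_nodes)]

# Stub learner: returns a fixed prediction per graph, keyed by a graph id.
fixed = {"ppi": [[0.8, 0.2]], "walk": [[0.4, 0.6]]}
y = egmsmcc(["ppi", "walk"], lambda graph_id: fixed[graph_id])
```

Because each model outputs a probability distribution per protein, the averaged rows remain valid distributions and can be passed to the same thresholding rule used for a single model.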
The basic idea of constructing latent graphs is to link the protein nodes together such that nodes that are close in the graphs tend to have the same functional labels, and nodes that are disconnected tend to have different functional labels. Via the latent linkages in the constructed graphs, knowledge from labeled nodes can be propagated to unlabeled nodes more effectively, as in the example in Figure 1. Next, we introduce three types of latent linkages for constructing latent graphs that can be easily computed from the data. For each latent graph, we compute a weight E_{ ij } for each entry of its adjacency matrix, where a large E_{ ij } indicates that the two nodes are close together, and a small one that they are far apart.
PPI latent graph: In our ensemble model, we consider the PPI network as a latent graph, and construct the adjacency matrix E^{(1)} of the PPI latent graph as follows
$$E^{(1)}_{ij} = E(i, j)$$
where E(i, j) = 1 if node v_{ i } and node v_{ j } are connected in the PPI network, and E(i, j) = 0 otherwise.
Random walk latent graph: When the underlying autocorrelation of original PPI network is small, i.e., some connected nodes may not share the same class label, the learning method based on the original PPI network might be affected.
It has been observed that a protein and its level-2 neighbors (indirect neighbors in the PPI network) also have a high likelihood of sharing similar characteristics [8]. To this end, we use the idea of even-step random walk with restart (ERWR) [33] to compute the weights of the latent linkages. Intuitively, we assume that linkages to direct neighbors that share the functional class of the target protein typically form triangle structures (see Figure 1(b)). These neighbors (v_{2} and v_{3}) obtain high scores under ERWR because they are well connected in the PPI network. On the other hand, ERWR can avoid immediate neighbors (e.g., v_{1} and v_{2}) with inconsistent linkages that negatively influence the predictions, because they are sparsely connected. ERWR can also exploit indirect neighbors by adding linkages to level-2 neighbors (e.g., v_{4}) that are well connected to level-1 neighbors.
Given the adjacency matrix E of the PPI network, we compute P = EE and normalize its entries with respect to each column to obtain a normalized transition probability matrix P. The ERWR random walker iteratively visits neighboring nodes with the transition probabilities given in P. At each step, it also has probability α (e.g., α = 0.1) of returning to the start node. We define the adjacency matrix E^{(2)} of the random walk latent graph as follows
where $R={\sum}_{t=1}^{T}\alpha {\left(1-\alpha \right)}^{t}{P}^{t}$ is the steady-state probability matrix after T steps.
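The computation of R can be sketched as follows. This is only an illustrative reading of the description above: the function name `erwr_scores`, the number of steps T = 10, and the handling of all-zero columns are assumptions, and how R is finally mapped to E^{(2)} is not shown here.

```python
import numpy as np

def erwr_scores(E, alpha=0.1, T=10):
    """Truncated even-step random walk with restart:
    R = sum_{t=1..T} alpha * (1 - alpha)**t * P**t, with P the column-normalized EE."""
    E = np.asarray(E, dtype=float)
    P = E @ E                                # two-step connectivity, P = EE
    col_sums = P.sum(axis=0, keepdims=True)
    col_sums[col_sums == 0] = 1.0            # assumption: isolated nodes keep zero columns
    P = P / col_sums                         # each column now sums to 1
    R = np.zeros_like(P)
    Pt = np.eye(P.shape[0])
    for t in range(1, T + 1):
        Pt = Pt @ P                          # P raised to the power t
        R += alpha * (1 - alpha) ** t * Pt
    return R

# Hypothetical toy network: a triangle (0, 1, 2) plus a pendant node 3 attached to node 2.
E_toy = np.zeros((4, 4))
for i, j in [(0, 1), (0, 2), (1, 2), (2, 3)]:
    E_toy[i, j] = E_toy[j, i] = 1.0
R_toy = erwr_scores(E_toy, alpha=0.1, T=10)
```

In this toy network, node 3 is only a level-2 neighbor of node 0, yet it receives a nonzero score because the walk moves two hops at a time.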
Prediction similarity latent graph: We also consider the values of class labels of the annotated proteins as input features to build a classifier that predicts all unlabeled proteins. Specifically, we use an SVM classifier with probability outputs, as implemented in the LIBSVM library [34], to compute ${Y}_{i}={\left[P\left({c}_{1}\mid {x}_{i}\right),P\left({c}_{2}\mid {x}_{i}\right),\dots ,P\left({c}_{q}\mid {x}_{i}\right)\right]}^{T}$, where P(c_{ j } | x_{ i }) is the confidence that protein x_{ i } belongs to class c_{ j }. The adjacency matrix E^{(3)} of the latent graph based on the prediction confidences is defined as follows
Here, Y_{ i } and Y_{ j } are normalized to unit length, thus the dot product of the two vectors is equivalent to their cosine similarity.
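The statement that the dot product of unit-length confidence vectors equals their cosine similarity can be illustrated with the sketch below. The confidence values are hypothetical placeholders rather than real LIBSVM outputs, and `prediction_similarity` is a name introduced only for this example.

```python
import numpy as np

def prediction_similarity(Y):
    """Pairwise cosine similarity between rows of Y (one confidence vector per protein)."""
    Y = np.asarray(Y, dtype=float)
    norms = np.linalg.norm(Y, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # guard against all-zero rows
    Y_unit = Y / norms               # normalize each row to unit length
    return Y_unit @ Y_unit.T         # dot products of unit vectors = cosine similarities

# Hypothetical class confidences for four proteins and three classes.
Y_toy = [[0.8, 0.1, 0.1],
         [0.8, 0.1, 0.1],
         [0.0, 1.0, 0.0],
         [0.0, 0.0, 1.0]]
S_toy = prediction_similarity(Y_toy)
```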
In the prediction similarity latent graph, many entries are close to zero, and it may not be necessary to consider them. Therefore, we use a kNN scheme to construct the graph. We connect two nodes v_{ i } and v_{ j } if v_{ j } is among the k-nearest neighbors of v_{ i } or if v_{ i } is among the k-nearest neighbors of v_{ j } [35]. For a fixed k, the number of edges is O(N) and the graph is symmetric. We define a sparse adjacency matrix for the kNN graph as follows
where ${\mathcal{N}}_{k}\left(i\right)$ is the set of k nearest neighbors of v_{ i }. In practice, we find that k does not need tuning. We use k = 10 nearest neighbors for each data set.
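A minimal sketch of the symmetric kNN construction is given below. Whether the retained edges keep their similarity weights or are set to 1 is not recoverable from the text here, so keeping the original weights is an assumption, as is the function name `knn_sparsify`.

```python
import numpy as np

def knn_sparsify(S, k=10):
    """Keep S[i, j] if j is among the k nearest neighbors of i, or i among those of j."""
    S = np.asarray(S, dtype=float)
    n = S.shape[0]
    sim = S.copy()
    np.fill_diagonal(sim, -np.inf)              # a node is not its own neighbor
    k = min(k, n - 1)
    nearest = np.argsort(-sim, axis=1)[:, :k]   # k most similar nodes for each row
    mask = np.zeros((n, n), dtype=bool)
    mask[np.repeat(np.arange(n), k), nearest.ravel()] = True
    mask = mask | mask.T                        # the "or" rule makes the graph symmetric
    return np.where(mask, S, 0.0)               # assumption: retained edges keep their weights

# Hypothetical symmetric similarity matrix for six nodes.
rng = np.random.default_rng(0)
A = rng.random((6, 6))
S_sym = (A + A.T) / 2
W_knn = knn_sparsify(S_sym, k=2)
```

Each node contributes at most k directed selections, so the number of nonzero entries is at most 2kN, which is linear in N for fixed k.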
Experiments
In this section, we present experimental results comparing the performance of our proposed methods with five baselines: SVM, wvRN+RL, ICA, semiICA, and ICML, and show that the proposed methods achieve better performance than these baselines.
Yeast dataset and baselines
We conduct experiments to predict properties of the proteins corresponding to a given yeast gene from KDD Cup 2001 [36]. In particular, we formulate two prediction problems based on the properties of the proteins. Problem (1) is to predict the localization of the proteins encoded by the genes. It is a binary problem, i.e., a protein is localized (or not localized) to the corresponding organelle. Problem (2) is to predict the functions of the proteins, which is a multilabel problem, i.e., a protein can have more than one function. There are 14 functional classes in total in the dataset.
The dataset for these two problems consists of 1,243 protein instances and 1,806 interactions between pairs of proteins. The protein features include attributes that refer to the chromosome on which the gene appears, whether the gene is essential for survival, observable characteristics of the phenotype, the structural category of the protein, the existence of characteristic motifs in the amino acid sequence of the protein, and whether the protein forms larger proteins with others [36, 14].
We evaluate the performance of problem (1) by classification accuracy, and problem (2) by three multilabel learning evaluation metrics, i.e., Coverage, RankingLoss, and MacroF1 [37]. These criteria are defined as follows
Coverage evaluates how far we need, on the average, to go down the list of labels in order to cover all the true labels of an instance:
where rank_{ s }(x_{ i }, c_{ k }) denotes the rank of class label c_{ k } derived from a confidence function s(x_{ i }, c_{ k }), which indicates the confidence for the class label c_{ k } to be a proper label of x_{ i }.
Ranking loss evaluates the average fraction of label pairs that are reversely ordered for the instance:
where ${\mathcal{R}}_{i}=\left\{\left({c}_{1},{c}_{2}\right)\mid h\left({x}_{i},{c}_{1}\right)\le h\left({x}_{i},{c}_{2}\right),\left({c}_{1},{c}_{2}\right)\in {Y}_{i}\times {\bar{Y}}_{i}\right\}$, and ${\bar{Y}}_{i}$ denotes the complementary set of Y_{ i }.
MacroF1 is the harmonic mean between precision and recall, where the average is calculated per label and then averaged across all labels. It is defined as
where p_{ k } and r_{ k } are the precision and recall of the k-th label.
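The three criteria can be sketched in code as follows. The sketch follows the common textbook forms of these metrics, which match the verbal descriptions above; the exact normalization used in the paper (for instance, subtracting 1 in Coverage) is an assumption, and the scores and labels are hypothetical.

```python
import numpy as np

def coverage(scores, Y):
    """Mean of (worst rank among true labels - 1); rank 1 is the most confident label."""
    scores, Y = np.asarray(scores, dtype=float), np.asarray(Y, dtype=bool)
    order = np.argsort(-scores, axis=1)
    ranks = np.empty_like(order)
    rows = np.arange(scores.shape[0])[:, None]
    ranks[rows, order] = np.arange(1, scores.shape[1] + 1)
    depths = [ranks[i, Y[i]].max() - 1 for i in range(len(Y)) if Y[i].any()]
    return float(np.mean(depths))

def ranking_loss(scores, Y):
    """Mean fraction of (true, false) label pairs where the true label does not score higher."""
    scores, Y = np.asarray(scores, dtype=float), np.asarray(Y, dtype=bool)
    losses = []
    for s, y in zip(scores, Y):
        pos, neg = s[y], s[~y]
        if len(pos) == 0 or len(neg) == 0:
            continue                      # the loss is undefined for such instances
        reversed_pairs = (pos[:, None] <= neg[None, :]).sum()
        losses.append(reversed_pairs / (len(pos) * len(neg)))
    return float(np.mean(losses))

def macro_f1(pred, Y):
    """F1 computed per label from its precision and recall, then averaged over labels."""
    pred, Y = np.asarray(pred, dtype=bool), np.asarray(Y, dtype=bool)
    f1 = []
    for k in range(Y.shape[1]):
        tp = np.sum(pred[:, k] & Y[:, k])
        p = tp / max(pred[:, k].sum(), 1)
        r = tp / max(Y[:, k].sum(), 1)
        f1.append(0.0 if p + r == 0 else 2 * p * r / (p + r))
    return float(np.mean(f1))

# Hypothetical confidences and true labels for three instances and three classes.
scores_toy = np.array([[0.9, 0.2, 0.6], [0.1, 0.8, 0.3], [0.3, 0.7, 0.5]])
Y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 0, 0]])
cov = coverage(scores_toy, Y_true)
rl = ranking_loss(scores_toy, Y_true)
mf1 = macro_f1(scores_toy >= 0.5, Y_true)
```

Coverage and RankingLoss are computed from the ranking of confidences, whereas MacroF1 needs hard predictions; thresholding at 0.5 in the usage lines is an arbitrary choice for the example.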
To validate the performance of our proposed algorithms, we compare our approach with five baseline methods:
1. SVM [34]. This baseline is a feature-based method that uses only the attribute features of the proteins for learning, without any network information.
2. wvRN+RL [38]. This algorithm is a relational-only method that uses only the PPI network for prediction. wvRN+RL computes a new label distribution for an unlabeled node by averaging the current estimated distributions of its linked neighbors. This process is repeated until the maximum number of iterations is reached.
3. ICA [28]. This denotes a collective classification algorithm which uses both attribute features and relational features to train a base classifier for prediction. The relational features are constructed based on the labels of neighbors. ICA uses an iterative process whereby the relational features are recomputed in each iteration until a fixed number of iterations is reached. Prior work has found logistic regression (LR) to be superior to other classifiers, such as naive Bayes and kNN, as the base classifier for ICA. Therefore, we use LR as the local classifier for ICA in the experiments.
4. semiICA [39]. This method extends ICA to leverage the unlabeled data using semisupervised learning. There are four semiICA variants (KNOWN-EM, ALL-EM, KNOWN-ONEPASS, ALL-ONEPASS); we run all four variants and choose the best one as the result of semiICA.
5. ICML [13]. This method extends ICA to handle multilabel learning by constructing additional label correlation features, which exploit the dependencies among the labels, as additional input features for learning the base classifier. The ICML algorithm is also based on an iterative framework similar to ICA.
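The neighbor-averaging update described for wvRN+RL in item 2 can be sketched as below. This is a simplified illustration under stated assumptions: the published relaxation labeling procedure in [38] includes further details that are omitted here, and the function name, the initial distributions, and the toy graph are hypothetical.

```python
import numpy as np

def wvrn_relaxation(E, F0, labeled, n_iter=1000):
    """Repeatedly replace each unlabeled node's label distribution by the
    edge-weighted average of its neighbors' current distributions."""
    E = np.asarray(E, dtype=float)
    F = np.asarray(F0, dtype=float).copy()
    labeled = np.asarray(labeled, dtype=bool)
    deg = E.sum(axis=1, keepdims=True)
    update = ~labeled & (deg[:, 0] > 0)         # labeled and isolated nodes stay fixed
    for _ in range(n_iter):
        avg = (E @ F) / np.where(deg > 0, deg, 1.0)
        F[update] = avg[update]
    return F

# Hypothetical toy graph: node 3 is unlabeled and linked to labeled nodes 0, 1 and 2.
E_star = np.zeros((4, 4))
for i in (0, 1, 2):
    E_star[i, 3] = E_star[3, i] = 1.0
F_init = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
F_final = wvrn_relaxation(E_star, F_init, labeled=[True, True, True, False], n_iter=10)
```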
It is generally more difficult to determine the classifier parameter values when the number of labeled data available is smaller (which is the focus of this study). For the SVM classifier, we use LIBSVM [34] with a linear kernel as the base classifier, and simply set the penalty parameter to its default, C = 1.0. The maximum number of iterations for ICA and semiICA is set to 10, and we use logistic regression as their base classifier as in [39, 13], while wvRN+RL uses 1000 iterations. The parameters α and β for our proposed method are set to 3 and 0.1, respectively. The parameter selection issue is discussed in a later section.
Results on protein localization prediction
We first consider problem (1) of KDD Cup 2001, i.e., the protein localization prediction problem. We set α ≠ 0 and β = 0 in our proposed method, and compare GMSMCC with the learning algorithms SVM, wvRN+RL, ICA and semiICA. The performance is measured in terms of classification accuracy.
We compare the performance of the algorithms by varying the percentage of labeled data from 3% to 10% with an interval of 1%. For each percentage, we execute each algorithm for 10 runs, randomly selecting the labeled/unlabeled data split in each run, and report the performance (mean and standard deviation) over the 10 runs. Figure 2 shows the experimental results. As we can see from the figure, the overall picture is clearly in favor of our proposed GMSMCC, which consistently outperforms the other algorithms across different percentages of labeled data. The average accuracies over the different percentages for GMSMCC, semiICA, ICA, SVM and wvRN+RL are 0.845, 0.801, 0.788, 0.788 and 0.666, respectively. GMSMCC performs best, followed by semiICA. ICA and SVM share third place with comparable performance. The relational-only method wvRN+RL performs the worst.
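The evaluation protocol (repeated random labeled/unlabeled splits with the mean and standard deviation reported) can be sketched as follows. The callback `run_once`, the seed, and the rounding of the labeled-set size are assumptions introduced for illustration; the paper does not specify how the splits were drawn beyond random selection.

```python
import numpy as np

def evaluate_over_splits(n_instances, label_ratio, run_once, n_runs=10, seed=0):
    """Run an evaluation callback on n_runs random labeled/unlabeled splits
    and return the mean and standard deviation of its scores."""
    rng = np.random.default_rng(seed)
    n_labeled = max(1, int(round(label_ratio * n_instances)))
    scores = []
    for _ in range(n_runs):
        perm = rng.permutation(n_instances)
        labeled_idx, unlabeled_idx = perm[:n_labeled], perm[n_labeled:]
        # run_once is a hypothetical callback that trains on labeled_idx
        # and returns a score measured on unlabeled_idx.
        scores.append(run_once(labeled_idx, unlabeled_idx))
    return float(np.mean(scores)), float(np.std(scores))

# Dummy callback that only reports the labeled fraction, to show the call pattern.
def labeled_fraction(labeled_idx, unlabeled_idx):
    return len(labeled_idx) / (len(labeled_idx) + len(unlabeled_idx))

mean_score, std_score = evaluate_over_splits(1243, 0.03, labeled_fraction)
```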
We note that a small number of labeled data is the most interesting case for our algorithm, because prediction is unreliable when the supervised knowledge in the labeled dataset is inadequate. It is therefore all the more desirable that other data sources be utilized together to improve the prediction performance. A closer examination of the results in Figure 2 shows that the smaller the percentage of labeled data, the larger the improvement GMSMCC achieves. GMSMCC achieves the largest improvement over the second best method when only 3% of the data are labeled (GMSMCC: 0.82 versus semiICA: 0.75). We also conduct pairwise t-tests at the 0.05 significance level to assess the statistical significance of the differences in performance between GMSMCC and the other test algorithms using 3% of labeled data. The performance of GMSMCC is significantly better than that of the other baseline methods. This result illustrates the advantages of our methods when there is an extremely small number of labeled data. This is consistent with our earlier assertion that our approach can work even in the paucity of annotated proteins by exploiting various data sources, including interaction networks, attribute features, and unlabeled data.
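A paired t-test over the 10 runs could be carried out as sketched below. The per-run accuracies are invented placeholders, not results from the paper, and pairing the runs of the two methods is an assumption about how the test was set up.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(a, b):
    """t statistic of the paired differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

# Two-sided critical value of Student's t at the 0.05 level with 9 degrees of freedom.
T_CRIT_DF9 = 2.262

# Hypothetical per-run accuracies for two methods (illustrative only).
acc_a = [0.82, 0.83, 0.81, 0.82, 0.84, 0.80, 0.83, 0.82, 0.81, 0.82]
acc_b = [0.75, 0.76, 0.74, 0.76, 0.75, 0.73, 0.77, 0.75, 0.74, 0.75]
t_stat = paired_t_statistic(acc_a, acc_b)
is_significant = abs(t_stat) > T_CRIT_DF9
```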
In this study, three types of latent graphs are utilized (see the EGMSMCC section). It is thus interesting to investigate the performance of GMSMCC using a single latent graph, and the performance of EGMSMCC utilizing multiple latent graphs. We test the performance of GMSMCC and EGMSMCC on the KDD Cup 2001 dataset with different label ratios from 3% to 10%. The experimental results are given in Table 1, where GMSMCC1, GMSMCC2 and GMSMCC3 denote the single-graph model using the PPI latent graph (E^{(1)}), the random walk latent graph (E^{(2)}) and the prediction similarity latent graph (E^{(3)}), respectively, while GMSMCCmean denotes the single-graph model using a latent graph constructed by averaging the weight values of E^{(1)}, E^{(2)} and E^{(3)}.
We report the average accuracy and standard deviation of the comparison methods over 10 runs. The numbers in boldface (on each row of the tables) indicate the best results for each label ratio over the methods. From Table 1, we observe that EGMSMCC using multiple latent graphs achieves better performance than the GMSMCC method using a single latent graph. A reasonable explanation for this finding is that the different latent graphs are complementary for prediction, since they are derived from different sources. When complementary models learned from these latent graphs are combined in an ensemble, correct decisions are amplified by the aggregation process. The performance of an ensemble learner is highly dependent on two factors: one is the accuracy of each component learner; the other is the diversity among these components. Examining the results in Table 1 shows that the overall performance of the GMSMCC models generated from the different graphs is reasonably good. This result indicates that each latent graph provides prediction knowledge from a specific aspect, and their combination leads to a more robust prediction.
Results on protein function prediction
We also conduct experiments for problem (2) of KDD Cup 2001, i.e., the multilabel protein function prediction problem. We set α and β to be nonzero so that the network information and label correlation are considered simultaneously. We compare the proposed algorithms with the baseline classifiers SVM, wvRN+RL, ICA, semiICA and ICML. SVM, wvRN+RL, ICA and semiICA are single-label classifiers. For these methods, we decompose the multilabel problem into a set of K binary classification problems using the one-against-all strategy, and train an independent classifier for each single-label problem. This approach is known as the binary relevance (BR) method [40]. The predictions for all K binary classification problems are combined to make the final prediction.
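The binary relevance decomposition can be sketched as below. The nearest-centroid base learner is only a stand-in chosen to keep the example self-contained; in the experiments the base learners are the baseline classifiers themselves, and all names and data here are hypothetical.

```python
import numpy as np

class CentroidClassifier:
    """Tiny stand-in binary classifier (hypothetical; any base learner could be used)."""
    def fit(self, X, y):
        self.pos = X[y == 1].mean(axis=0)   # centroid of positive examples
        self.neg = X[y == 0].mean(axis=0)   # centroid of negative examples
        return self

    def predict(self, X):
        d_pos = np.linalg.norm(X - self.pos, axis=1)
        d_neg = np.linalg.norm(X - self.neg, axis=1)
        return (d_pos < d_neg).astype(int)

def binary_relevance_fit(X, Y, make_clf=CentroidClassifier):
    """Train one independent one-against-all classifier per label column of Y."""
    return [make_clf().fit(X, Y[:, k]) for k in range(Y.shape[1])]

def binary_relevance_predict(models, X):
    """Combine the K binary predictions into one multilabel prediction per instance."""
    return np.column_stack([m.predict(X) for m in models])

# Hypothetical data: label 0 depends on the first feature, label 1 on the second.
X_train = np.array([[0.0, 0.0], [0.0, 5.0], [5.0, 0.0], [5.0, 5.0]])
Y_train = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
br_models = binary_relevance_fit(X_train, Y_train)
Y_pred = binary_relevance_predict(br_models, np.array([[4.0, 1.0], [1.0, 4.0]]))
```

Because each label is learned independently, this decomposition ignores correlations among labels, which is the information the β regularizer is intended to exploit.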
We compare the performance of our proposed GMSMCC approach and the other baseline algorithms with varying percentages of labeled data from 3% to 10%. For each percentage, we execute each algorithm 10 times by randomly selecting the labeled/unlabeled data split from the dataset. Then we report the average results as well as the standard deviation of each compared algorithm over the 10 runs. The results are shown in Figure 3. In order to keep consistency with the Coverage and RankingLoss evaluation metrics, we use 1 − MacroF1 instead of MacroF1. Thus, the smaller the value of the metric, the better the performance of the algorithm. We see from Figure 3 that GMSMCC (the black line) has the best performance (lies under the other curves) across all evaluation metrics and label ratios. SemiICA is the second best method. In the comparison, SVM performs poorly in terms of Coverage. On the other hand, wvRN+RL, ICML and ICA perform poorly in terms of MacroF1. Recent studies [41] have shown that one multilabel learning algorithm rarely outperforms another algorithm on all criteria, because the evaluation measures used in the experiments assess the learning performance from different aspects. In the experiments, we find that GMSMCC consistently outperforms the other algorithms across all label ratios. On average, GMSMCC achieves a Coverage improvement of 0.35 (GMSMCC: 3.90 versus semiICA: 4.25), a RankingLoss improvement of 0.01 (GMSMCC: 0.104 versus semiICA: 0.114), and a 1 − MacroF1 improvement of 0.068 (GMSMCC: 0.640 versus semiICA: 0.708) over the second best method. This result indicates that the proposed GMSMCC algorithm is effective for the multilabel protein function prediction task.
Similar to the experiments for protein localization prediction, we also conduct experiments to examine the effect of the proposed EGMSMCC method (integrating multiple latent graphs) in enhancing the prediction performance over the GMSMCC method using a single latent graph. GMSMCC1, GMSMCC2 and GMSMCC3 denote the single-graph model using E^{(1)}, E^{(2)} and E^{(3)}, respectively. GMSMCCmean denotes the single-graph model using a latent graph constructed by averaging the weight values of E^{(1)}, E^{(2)} and E^{(3)}.
We compare GMSMCC and EGMSMCC with respect to different percentages of labeled data from 3% to 10%. For brevity, we report only Coverage and RankingLoss. The results are given in Figures 4 and 5. The percentage of labeled data is shown on the horizontal axis. According to the figures, EGMSMCC consistently outperforms the GMSMCC algorithms using a single latent graph because more information is utilized. This result demonstrates the effectiveness of our proposed EGMSMCC method for multilabel protein function prediction.
Convergence study
The objective function $\mathcal{O}$ in Eq. (4) is optimized for classification prediction. Here, we investigate how fast the algorithm converges. Figures 6(a) and 6(b) show the convergence curves of the proposed algorithm on problems (1) and (2) (at 5% label ratio), respectively. The x-axis is the iteration number in the process of optimizing the objective value $\mathcal{O}$, and the y-axis is the relative change between successively computed objective values, $\left|\mathcal{O}\left(t+1\right)-\mathcal{O}\left(t\right)\right|/\left|\mathcal{O}\left(t\right)\right|$. We see that the algorithm converges within 10 iterations. The required computational times for problems (1) and (2) are 10.5 seconds and 10.3 seconds, respectively, using our MATLAB implementation.
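The relative-change quantity plotted on the y-axis can also serve as a stopping rule, as in the sketch below. The update `step`, the `objective`, the tolerance and the toy problem are hypothetical stand-ins; the actual update rules for Eq. (4) are not reproduced here.

```python
def run_until_converged(step, objective, state, tol=1e-4, max_iter=100):
    """Iterate `step` until |O(t+1) - O(t)| / |O(t)| falls below `tol`."""
    prev = objective(state)
    history = []                                  # relative changes, one per iteration
    for _ in range(max_iter):
        state = step(state)
        cur = objective(state)
        rel_change = abs(cur - prev) / max(abs(prev), 1e-12)  # guard against O(t) = 0
        history.append(rel_change)
        if rel_change < tol:
            break
        prev = cur
    return state, history

# Hypothetical toy problem: the iteration x <- 0.5 * x + 1 converges to x = 2.
final_x, rel_changes = run_until_converged(
    step=lambda x: 0.5 * x + 1.0,
    objective=lambda x: (x - 2.0) ** 2 + 1.0,
    state=10.0,
)
```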
Parameter sensitivity
In our proposed GMSMCC method, the regularization parameters α and β quantify the importance of the network regularizer and the label regularizer in the objective function (4). These parameters also determine the learning setting. With α ≠ 0 and β = 0, our framework is formulated as single-label collective classification learning, i.e., we solve a single-label learning problem for problem (1). On the other hand, with α ≠ 0 and β ≠ 0, our framework is formulated as multilabel collective classification learning, i.e., we consider the label correlation in the learning process for problem (2).
We examine the parametric sensitivity of our GMSMCC approach with respect to parameter α by fixing β = 0 and varying α on problem (1). Figure 7(a) illustrates the accuracy of GMSMCC with different α values from 0 to 30 on the protein localization prediction task using a 5% label ratio. When α = 0 the accuracy is low, since no network information is used in this case. This also provides evidence of the advantages of the network regularization in the proposed method. When α becomes large, the accuracy increases. The plateau in the accuracy curve from 1 to 30 shows that the proposed GMSMCC achieves fairly stable performance with different values of α. This implies that the method is robust to the choice of α. We find that GMSMCC presents good classification performance when α = 3.
Next, we fix α = 3 and vary β from 0 to 0.4 on problem (2) using 5% label ratio. The result is given in Figure 7(b). We observe that when β = 0 or β = 0.4, the performance is poor. It is evident that the smallest Coverage is achieved at β = 0.1. Therefore, we set α = 3 and β = 0.1 in all the comparisons.
Interaction relations
Our proposed method using the objective function in Eq. (4) is capable of characterizing the interaction relations among the genes that code for proteins, and these proteins tend to localize in various parts of cells in order to perform crucial functions. We construct an extended graph data set ${G}^{\prime}=\left({X}^{\prime},{E}^{\prime}\right)$ for the KDD Cup 2001 data, where ${E}^{\prime}$ is the set of known interactions among the proteins and ${X}^{\prime}$ is the feature set of the proteins. Each ${x}_{i}^{\prime}\in {X}^{\prime}$ is an extended feature vector for the i-th protein/gene, obtained by integrating its attribute features, localization labels and functional labels as follows: ${x}_{i}^{\prime}=\left({x}_{i},{Y}_{i}^{l},{Y}_{i}^{f}\right)$, where x_{ i } is the attribute vector, and ${Y}_{i}^{l}=\left[{Y}_{i1}^{l},{Y}_{i2}^{l}\right]\in {\left\{0,1\right\}}^{2}$ and ${Y}_{i}^{f}=\left[{Y}_{i1}^{f},\cdots ,{Y}_{iK}^{f}\right]\in {\left\{0,1\right\}}^{K}$ are the localization label features and function label features of the i-th instance. Given a new instance $\widehat{x}$, the interaction between $\widehat{x}$ and ${x}_{i}^{\prime}\in {X}^{\prime}$ is estimated by the cosine similarity between their conditional probability vectors obtained from the proposed method. The resulting similarity ranges from 0 to 1, with 0 indicating that two instances are independent, and 1 indicating that two instances are highly interrelated. We apply the cosine similarity measure to evaluate the interaction relations of 5 randomly selected genes (G238510, G234935, G235158, G237021, G234980) to other genes in the KDD Cup 2001 dataset. Table 2 shows the interesting interrelations discovered by previous studies with respect to the evaluated genes. In general, we can see that these interrelated genes tend to have large similarity values. This shows the advantages of using our proposed method to detect the interactions.
Biologists can use the method to identify related genes and to further investigate their interactions.
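The two ingredients of this analysis, the extended feature vector and the cosine-similarity ranking, can be sketched as follows. The probability vectors below are hypothetical placeholders for the conditional probability vectors produced by the method, and the function names are introduced only for this example.

```python
import numpy as np

def extended_feature(x, y_loc, y_func):
    """Concatenate attribute features with localization and function label features."""
    return np.concatenate([x, y_loc, y_func])

def cosine_similarity(p, q):
    """Cosine similarity; it lies in [0, 1] for vectors with non-negative entries."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    denom = np.linalg.norm(p) * np.linalg.norm(q)
    return float(p @ q / denom) if denom > 0 else 0.0

def rank_related(p_new, P_known):
    """Order known instances by decreasing similarity to a new instance."""
    sims = np.array([cosine_similarity(p_new, p) for p in P_known])
    order = np.argsort(-sims)
    return order, sims[order]

# Hypothetical inputs: 2 attributes, 2 localization labels, 3 function labels.
x_ext = extended_feature([1.0, 0.0], [1, 0], [0, 1, 1])
# Hypothetical conditional probability vectors for one new and three known instances.
p_hat = [0.7, 0.2, 0.1]
P_known = [[0.1, 0.1, 0.8], [0.6, 0.3, 0.1], [0.0, 1.0, 0.0]]
order, sims_sorted = rank_related(p_hat, P_known)
```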
Conclusion
In this paper, we first propose GMSMCC, an effective and novel semisupervised multilabel collective classification method for predicting functional properties of proteins. GMSMCC is built on the pLSA generative model with a network regularizer and a label regularizer, which exploit the network linkages and label correlations effectively to compute the label probability distribution for prediction. Then, we extend it in an ensemble manner and develop the EGMSMCC approach, which exploits various kinds of latent linkages in constructing latent graphs to further improve the prediction performance. Experimental results on two tasks of KDD Cup 2001 (the localization prediction task and the protein function prediction task) consistently demonstrate the effectiveness of the proposed methods. The performance of the proposed methods is shown to be better than that of state-of-the-art algorithms, including SVM, wvRN+RL, and three variants of ICA. In the future, we will extend our proposed method to handle heterogeneous biological networks.
References
1. Pandey G, Kumar V, Steinbach M: Computational approaches for protein function prediction: A survey. 2006, Twin Cities: Department of Computer Science and Engineering, University of Minnesota
2. Jensen LJ, Gupta R, Staerfeldt HH, Brunak S: Prediction of human protein function according to gene ontology categories. Bioinformatics. 2003, 19 (5): 635-642. 10.1093/bioinformatics/btg036.
3. Cai C, Han L, Ji ZL, Chen X, Chen YZ: Svm-prot: web-based support vector machine software for functional classification of a protein from its primary sequence. Nucleic acids research. 2003, 31 (13): 3692-3697. 10.1093/nar/gkg600.
4. Lobley AE, Nugent T, Orengo CA, Jones DT: Ffpred: an integrated feature-based function prediction server for vertebrate proteomes. Nucleic acids research. 2008, 36 (suppl 2): 297-302.
5. Shen HB, Chou KC: Ezypred: a top-down approach for predicting enzyme functional classes and subclasses. Biochemical and Biophysical Research Communications. 2007, 364 (1): 53-59. 10.1016/j.bbrc.2007.09.098.
6. Pellegrini M, Haynor D, Johnson JM: Protein interaction networks. Expert review of proteomics. 2004, 1 (2): 239-249. 10.1586/14789450.1.2.239.
7. Vazquez A, Flammini A, Maritan A, Vespignani A: Global protein function prediction from protein-protein interaction networks. Nature biotechnology. 2003, 21 (6): 697-700. 10.1038/nbt825.
8. Chua HN, Sung WK, Wong L: Exploiting indirect neighbours and topological weight to predict protein function from protein-protein interactions. Bioinformatics. 2006, 22 (13): 1623-1630. 10.1093/bioinformatics/btl145.
9. Sharan R, Ulitsky I, Shamir R: Network-based prediction of protein function. Molecular systems biology. 2007, 3 (1):
10. Xiong W, Liu H, Guan J, Zhou S: Protein function prediction by collective classification with explicit and implicit edges in protein-protein interaction networks. BMC bioinformatics. 2013, 14 (Suppl 12): 4
11. Sen P, Namata G, Bilgic M, Getoor L, Gallagher B, Eliassi-Rad T: Collective classification in network data. AI magazine. 2008, 29 (3): 93
12. McDowell LK, Gupta KM, Aha DW: Cautious collective classification. The Journal of Machine Learning Research. 2009, 10: 2777-2836.
13. Kong X, Shi X, Yu PS: Multilabel collective classification. SIAM International Conference on Data Mining (SDM). 2011, 618-629.
14. Krogel MA, Scheffer T: Multirelational learning, text mining, and semisupervised learning for functional genomics. Machine Learning. 2004, 57 (1-2): 61-81.
15. Mooney C, Pollastri G, et al: Sclpred: protein subcellular localization prediction by n-to-1 neural networks. Bioinformatics. 2011, 27 (20): 2812-2819. 10.1093/bioinformatics/btr494.
16. Díaz-Uriarte R, De Andres SA: Gene selection and classification of microarray data using random forest. BMC bioinformatics. 2006, 7 (1): 3. 10.1186/1471-2105-7-3.
17. Barutcuoglu Z, Schapire RE, Troyanskaya OG: Hierarchical multilabel prediction of gene function. Bioinformatics. 2006, 22 (7): 830-836. 10.1093/bioinformatics/btk048.
18. Pandey G, Myers CL, Kumar V: Incorporating functional interrelationships into protein function prediction algorithms. BMC bioinformatics. 2009, 10 (1): 142. 10.1186/1471-2105-10-142.
19. Schietgat L, Vens C, Struyf J, Blockeel H, Kocev D, Džeroski S: Predicting gene function using hierarchical multilabel decision tree ensembles. BMC bioinformatics. 2010, 11 (1): 2. 10.1186/1471-2105-11-2.
20. Nabieva E, Jim K, Agarwal A, Chazelle B, Singh M: Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps. Bioinformatics. 2005, 21 (suppl 1): 302-310. 10.1093/bioinformatics/bti1054.
21. Deng M, Tu Z, Sun F, Chen T: Mapping gene ontology to proteins based on protein-protein interaction data. Bioinformatics. 2004, 20 (6): 895-902. 10.1093/bioinformatics/btg500.
22. Arnau V, Mars S, Marín I: Iterative cluster analysis of protein interaction data. Bioinformatics. 2005, 21 (3): 364-378. 10.1093/bioinformatics/bti021.
23. Adamcsek B, Palla G, Farkas IJ, Derényi I, Vicsek T: Cfinder: locating cliques and overlapping modules in biological networks. Bioinformatics. 2006, 22 (8): 1021-1023. 10.1093/bioinformatics/btl039.
24. Yu G, Domeniconi C, Rangwala H, Zhang G, Yu Z: Transductive multilabel ensemble classification for protein function prediction. Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2012, 1077-1085.
25. Jiang JQ, McQuay LJ: Predicting protein function by multilabel correlated semisupervised learning. Computational Biology and Bioinformatics, IEEE/ACM Transactions on. 2012, 9 (4): 1059-1069.
26. Wu Q, Ng MK, Ye Y, Li X, Shi R, Li Y: Multilabel collective classification via markov chain based learning method. Knowledge-Based Systems. 2014, 63: 1-14.
27. Mostafavi S, Morris Q: Fast integration of heterogeneous data sources for predicting gene function with limited annotation. Bioinformatics. 2010, 26 (14): 1759-1765. 10.1093/bioinformatics/btq262.
28. Neville J, Jensen D: Iterative classification in relational data. Proc AAAI-2000 Workshop on Learning Statistical Models from Relational Data. 2000, 13-20.
29. Wu Q, Ye Y, Ng MK, Ho SS, Shi R: Collective prediction of protein functions from protein-protein interaction networks. BMC bioinformatics. 2014, 15 (Suppl 2): 9. 10.1186/1471-2105-15-S2-S9.
30. Shi R, Wu Q, Ye Y, Ho SS: A generative model with network regularization for semisupervised collective classification. Proceedings of the 2014 SIAM International Conference on Data Mining. 2014
31. Hofmann T: Unsupervised learning by probabilistic latent semantic analysis. Machine learning. 2001, 42 (1-2): 177-196.
32. Cai D, Wang X, He X: Probabilistic dyadic data analysis with local and global consistency. Proc of the 26th Annual International Conference on Machine Learning. 2009, 105-112.
33. Gallagher B, Tong H, Eliassi-Rad T, Faloutsos C: Using ghost edges for classification in sparsely labeled networks. Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2008, 256-264.
34. Chang CC, Lin CJ: Libsvm: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST). 2011, 2 (3): 27
35. Von Luxburg U: A tutorial on spectral clustering. Statistics and computing. 2007, 17 (4): 395-416. 10.1007/s11222-007-9033-z.
36. Cheng J, Hatzis C, Hayashi H, Krogel MA, Morishita S, Page D, Sese J: Kdd cup 2001 report. ACM SIGKDD Explorations Newsletter. 2002, 3 (2): 47-64. 10.1145/507515.507523.
37. Madjarov G, Kocev D, Gjorgjevikj D, Džeroski S: An extensive experimental comparison of methods for multilabel learning. Pattern Recognition. 2012, 45 (9): 3084-3104. 10.1016/j.patcog.2012.03.004.
38. Macskassy SA, Provost F: Classification in networked data: A toolkit and a univariate case study. The Journal of Machine Learning Research. 2007, 8: 935-983.
39. McDowell L, Aha D: Semisupervised collective classification via hybrid label regularization. Proc of the 29th International Conference on Machine Learning. 2012, 975-982.
40. Zhang ML, Zhou ZH: A review on multilabel learning algorithms. IEEE Transactions on Knowledge and Data Engineering. 2013, 99 (PrePrints): 1
41. Read J, Pfahringer B, Holmes G, Frank E: Classifier chains for multilabel classification. Machine learning. 2011, 85 (3): 333-359. 10.1007/s10994-011-5256-5.
Acknowledgements
Y. Ye's research was supported in part by National Key Technology R&D Program of MOST China under Grant No. 2012BAK17B08, and NSFC under Grant No.61272538. S.S. Ho's research was supported in part by AcRF Grant RG41/12 and NTUSUG. S. Zhou's research was supported in part by National Natural Science Foundation of China (NSFC) under grant No. 61272380. Publication costs for this article were funded by grants of the corresponding author.
This article has been published as part of BMC Genomics Volume 15 Supplement 9, 2014: Thirteenth International Conference on Bioinformatics (InCoB2014): Computational Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/15/S9.
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
Q. Wu participated in designing the algorithm and drafted the manuscript. Y. Ye, S.S. Ho and S. Zhou revised and finalized the paper. All authors read and approved the final manuscript.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.
The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Wu, Q., Ye, Y., Ho, SS. et al. Semisupervised multilabel collective classification ensemble for functional genomics. BMC Genomics 15, S17 (2014). https://doi.org/10.1186/1471-2164-15-S9-S17
Keywords
 Protein function prediction
 protein interaction networks
 collective classification
 semisupervised learning
 multilabel learning