Semi-supervised multi-label collective classification ensemble for functional genomics
- Qingyao Wu^{1, 2},
- Yunming Ye^{1},
- Shen-Shyang Ho^{2} and
- Shuigeng Zhou^{3}
https://doi.org/10.1186/1471-2164-15-S9-S17
© Wu et al.; licensee BioMed Central Ltd. 2014
Published: 8 December 2014
Abstract
Background
With the rapid accumulation of proteomic and genomic datasets in terms of genome-scale features and interaction networks through high-throughput experimental techniques, manually predicting the functional properties of proteins has become increasingly cumbersome, and computational methods to automate this annotation task are urgently needed. Most approaches to predicting functional properties of proteins require either identifying a reliable set of labeled proteins with attribute features similar to those of unannotated proteins, or learning from a fully-labeled protein interaction network with a large amount of labeled data. However, acquiring such labels can be very difficult in practice, especially for multi-label protein function prediction problems. Learning with only a few labeled data can lead to poor performance, as only limited supervision knowledge can be obtained from similar proteins or from connections between them. To effectively annotate proteins even in the paucity of labeled data, it is important to take advantage of all data sources that are available in this problem setting, including interaction networks, attribute feature information, correlations of functional labels, and unlabeled data.
Results
In this paper, we show that the underlying nature of predicting functional properties of proteins using various data sources of relational data is a typical collective classification (CC) problem in machine learning. The protein functional prediction task with limited annotation is then cast into a semi-supervised multi-label collective classification (SMCC) framework. As such, we propose a novel generative model based SMCC algorithm, called GM-SMCC, to effectively compute the label probability distributions of unannotated protein instances and predict their functional properties. To further boost the prediction performance, we extend the method in an ensemble manner, called EGM-SMCC, by utilizing multiple heterogeneous networks with various latent linkages constructed to explicitly model the relationships among the nodes, so as to effectively propagate the supervision knowledge from labeled to unlabeled nodes.
Conclusion
Experimental results on a yeast gene dataset, covering prediction of both the functions and the localization of proteins, demonstrate the effectiveness of the proposed method. In the comparison, the proposed algorithms perform better than the other algorithms tested.
Background
Advances in biotechnology have enabled high-throughput experiments to generate a wide variety of genomic and proteomic data sources, including genome sequences, protein structure, and protein-protein interaction (PPI) networks.
Each data source provides a comprehensive view of the underlying mechanisms, and is represented as a set of features in a feature space or viewed as a graph structure where each individual is considered as a node. In the field of functional genomics, the process of manual annotation has become increasingly cumbersome with the rapid accumulation of the proteomic and genomic datasets. Computational methods to automate this task are urgently needed. Therefore, various computational methods have been proposed to automatically infer the functional properties of proteins using various data sources available (see [1] for a review).
Previous research in protein (or gene) function prediction can be partitioned into two classes of methods (feature-based approaches and graph-based approaches) according to input data and methodology. Feature-based machine learning algorithms require the instances to have a fixed set of attribute values from a feature space. These approaches involve extracting features that encode the desired properties of a protein and constructing a machine learning model for predicting functional properties. Commonly used features include characteristics of the amino acid sequence, textual repositories such as MEDLINE, and more biologically meaningful features such as motifs derived from motif analysis of protein sequences, the isoelectric point, and post-translational modifications. Using these attribute features, a predictive model is learnt by training a classifier on annotated proteins, and this model is then used to predict the functions of other proteins [2–5].
On the other hand, graph-based approaches use the network structure information to exploit proteins (or genes) sharing similar functional properties. Protein interaction networks are becoming increasingly rich and useful in delineating the biological characteristics of proteins. A review of computational approaches that are being used to measure protein interactions can be found in [6]. For instance, Pearson's correlation coefficient is used to measure pairwise similarity between gene expression profiles. Specifically, the protein-protein interaction data can be modeled as a graph by considering individual proteins as the nodes and the existence of an interaction between a pair of proteins as a link. Graph-based or kernel-based classification algorithms are then used for protein data classification tasks based upon the protein interaction network [7–10].
Although many efforts have been made to automatically predict the functional properties of proteins, this task still poses several significant challenges. First of all, existing feature-based and graph-based methods cannot guarantee good accuracy when only a limited number of labeled data are available. Most of these methods require a sufficiently large number of labeled examples or a fully-labeled graph for training. However, acquiring such labels can be very expensive and time-consuming in practical applications, and prediction performance may degrade when the requirement of sufficient labeled data is not met. Furthermore, proteins are generally involved in more than one biological process and are therefore annotated with multiple functions, which makes functional prediction more difficult. A promising idea to tackle these challenges (label deficiency and multiple function prediction) is to take advantage of multiple data sources and multiple functions of proteins. To this end, we propose effective approaches that utilize all data sources that are available in this problem setting, including interaction networks, protein attribute features, label correlations, and unlabeled data, to enhance the performance of predicting functional properties of proteins.
In this paper, we first show that the learning task underlying protein function prediction using various data sources of relational data matches well with the collective classification [11–13] framework. Then, we propose a new generative model based semi-supervised multi-label collective classification algorithm, called GM-SMCC, for predicting proteins with multiple functions utilizing both labeled and unlabeled data in the learning process. To further boost the learning performance, we extend our proposed GM-SMCC method in an ensemble manner by constructing multiple latent networks. This approach, called ensemble of GM-SMCC model (EGM-SMCC), constructs various kinds of latent networks with various latent linkages to explicitly model the relationships among the nodes. We show how to effectively integrate these latent networks in an ensemble framework to improve the performance of protein function prediction.
1. This article is the first to examine the CC algorithm for protein function prediction using semi-supervised learning and multi-label learning techniques to leverage the unlabeled portion of the data and label correlation information in a partially-labeled PPI network that has only a limited number of annotations.
2. The proposed GM-SMCC algorithm is able to utilize various data sources for protein function prediction, where the instance features and interactions, as well as the label correlations, can be naturally and explicitly exploited to predict a set of functional labels for an unannotated protein.
3. The proposed EGM-SMCC algorithm is a multi-network learning method which integrates multiple constructed latent graphs for protein function prediction using an ensemble framework. Via the multiple latent graphs constructed, the supervised knowledge can be propagated from labeled to unlabeled nodes effectively to boost the prediction performance.
Prediction task formalization
The protein functional properties prediction task has been widely explored in the literature. An extensive review on this task is found in [1]. The approaches of protein function prediction can be categorized into two categories, feature-based methods and graph-based methods, in terms of input data and methodology.
Feature-based methods. For these methods, each protein is characterized as a feature vector x_{ i } = <f_{1}, ..., f_{ d }> with a fixed set of feature values. The feature vectors of the data are then taken as input to machine learning algorithms to infer annotation rules for predicting unannotated proteins [14]. Learning algorithms that have been used include SVM [3], neural networks [15], random forest [16], and co-training [14], to name a few. Typically, feature extraction is used to obtain the desired features that represent information about the proteins. Feature selection is then used in the learning process to select the most useful features to train a classifier. A protein usually performs multiple functions. As such, several approaches handle the prediction problem using the multi-label learning framework. For instance, Barutcuoglu et al. [17] learn an SVM classification model for predicting functions in the Gene Ontology using a hierarchical multi-label structure. Pandey et al. [18] incorporate function correlation for predicting protein functions using a weighted multi-label kNN classifier. Schietgat et al. [19] predict gene function using hierarchical multi-label decision tree ensembles.
Graph-based methods. These methods study protein function in the context of a network. The recent availability of protein interaction networks has spurred the development of computational methods for analyzing such data in order to elucidate the relationships between protein interactions and functional properties. Sharan et al. [9] categorize the methods into two groups: direct annotation schemes, which infer the function of a protein based on its connections in the network; and module-assisted schemes, which first identify modules of related proteins and then annotate each module based on the known functions of its members. Examples of direct annotation algorithms include neighborhood counting [8], graph theoretic methods [20], and Markov random field [21]. On the other hand, the module-assisted methods differ mainly in their module (or cluster) detection techniques. Examples of module detection methods include hierarchical clustering-based methods [22] and graph clustering-based methods [23]. Graph-based approaches using a multi-label learning framework for prediction have also been studied [24–26].
Although a broad variety of interesting approaches have been developed, most of the methods mainly study the scenario where sufficient labeled data are available in the dataset. In this case, the supervision knowledge can be effectively used in the feature-based and graph-based methods to achieve good learning performance. However, such labels are difficult and time-consuming to obtain. In sparsely-labeled networks, one has only a limited number of labeled nodes, say fewer than 10%, 5% or even 1%. The prediction performance might be degraded due to the lack of annotated proteins [27]. It is thus natural to consider using various data sources of the protein data (both labeled and unlabeled) to improve the prediction performance.
Collective classification. The task of protein function prediction can be cast into the collective classification problem of building a predictive model from networked data. Generally, networked data can be represented by nodes (instances) interconnected with each other by edges reflecting the relation or dependence between the nodes. Information on the nodes is provided as a set of attribute features (e.g., words present in the web page). The class membership of an instance may influence the class membership of a related instance.
Conventional supervised learning methods assume that the instances to be classified are independent of each other, while collective classification jointly classifies interrelated instances by exploiting the interrelations among the instances [28, 29]. For example, consider the task of predicting the topics of hyperlinked web pages. Conventional supervised learning approaches only use the attribute features derived from the content of the pages to classify each page. In contrast, collective classification methods use the link structure to construct additional relational features based on the labels of neighboring pages. We can count the number of different labels of the neighboring pages that are linked to each page as the relational features. Collective classification methods would then explicitly use the attribute features and the relational features together for classification.
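The neighbor-label counting described above can be sketched in a few lines of Python. This is only an illustration of the idea: the function name `relational_features`, the toy page identifiers, and the topic names are all hypothetical and do not come from the paper.

```python
from collections import Counter

def relational_features(neighbors, labels, classes):
    """Count, for each node, how many labeled neighbors carry each class label."""
    features = {}
    for node, nbrs in neighbors.items():
        # Unlabeled neighbors contribute nothing to the counts.
        counts = Counter(labels[n] for n in nbrs if n in labels)
        features[node] = [counts.get(c, 0) for c in classes]
    return features

# Hypothetical toy web graph: page -> pages it is linked to.
neighbors = {"p1": ["p2", "p3", "p4"], "p2": ["p1"], "p3": ["p1"], "p4": ["p1"]}
labels = {"p2": "sports", "p3": "sports", "p4": "news"}  # p1 is unlabeled
classes = ["news", "sports"]

features = relational_features(neighbors, labels, classes)
print(features["p1"])  # counts of neighbor labels in the order given by `classes`
```

A collective classifier would concatenate these counts with the attribute features of each page before training.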
Formally, the collective classification task is described as follows. Let G = (V, E, X, Y, C) be a graph dataset. V is a set of nodes {v_{1}, . . . , v_{ N }}. E is the adjacency matrix, where E(i, j) = 1 if node v_{ i } and node v_{ j } are connected and E(i, j) = 0 otherwise. X ⊂ R^{ d } consists of d-dimensional vector instances. Each x_{ i } ∈ X is an attribute vector for a node v_{ i } ∈ V. C = {c_{1}, c_{2}, ..., c_{ K }} is the set of K possible labels. Y contains the label sets Y_{ i } corresponding to the instances x_{ i } for i = 1, . . . , N. Each Y_{ i } = [Y_{ i,1 }, . . . , Y_{ i,l }, . . . , Y_{ i,K }] ∈ {0, 1}^{K} such that Y_{ i,l } = 1 means that x_{ i } is associated with c_{ l } and Y_{ i,l } = 0 otherwise. We assume that we have ${n}^{\prime}$ labeled data ${\left\{\left({x}_{i},{Y}_{i}\right)\right\}}_{i=1}^{{n}^{\prime}}$ and ${n}^{\prime\prime}$ unlabeled data ${\left\{\left({x}_{i}\right)\right\}}_{i={n}^{\prime}+1}^{{n}^{\prime}+{n}^{\prime\prime}}$ with $N={n}^{\prime}+{n}^{\prime\prime}$. The task is to construct a function to predict the class labels of the unlabeled nodes using the labeled nodes in the graph.
When there is only a limited number of labeled nodes in the task of predicting functional properties of proteins, i.e. ${n}^{\prime}\ll {n}^{\prime\prime}$, most of the proteins may not be connected to labeled ones, which makes the task very challenging. As such, it is natural to consider some form of semi-supervised learning. In the semi-supervised setting, one utilizes both labeled and unlabeled data together to improve the performance [30].
Methods
In this section, we present the GM-SMCC algorithm to address the task of predicting functional properties of proteins. Our approach is to model the problem as a generative process and learn a probabilistic interpretation of the data in order to estimate the conditional distribution p(c|x), where c is a functional class and x is a protein instance.
GM-SMCC
where n(x_{ i },w_{ j }) is the frequency of w_{ j } occurring in x_{ i }, and N,M are the number of proteins and attribute features, respectively.
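For orientation, the quantities defined here are those of the standard pLSA log-likelihood, in which the functional classes c_{ k } play the role of the latent variables. The display below is a reconstruction from the standard pLSA model and the notation of this section, not a quotation of the paper's own numbered equation, which is not reproduced in this text.

```latex
\mathcal{L} \;=\; \sum_{i=1}^{N} \sum_{j=1}^{M} n(x_i, w_j)\,
      \log \sum_{k=1}^{K} P(w_j \mid c_k)\, P(c_k \mid x_i)
```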
to measure the distance of two distributions. Here, D(z_{ i }; z_{ s }) is always nonnegative.
where E is the adjacency matrix to represent the network topology, E_{ i,s } = 1 if v_{ i } and v_{ s } are connected, and E_{ i,s } = 0 otherwise.
In protein functional properties prediction, proteins are generally involved in multiple biological processes and have multiple functions. Thus, it is crucial to take the label correlations into account to better predict their functional classes. Here, we further generalize the generative model to support this general setting. Recall that the network regularizer $\mathcal{R}$ is used to smooth the label probability distribution over the intrinsic network structure. One hopes that the resulting distribution can also be smoothed with respect to the class label correlations. A natural assumption here is that if two class labels c_{ k } and c_{ l } are related, then the distributions P(c_{ k }|x_{ i }) and P(c_{ l }|x_{ i }) with respect to different instances should also be similar to each other.
where Y_{ k } = [Y_{1,k}, · · ·, Y_{ N,k }]^{ T } is the label distribution over the instances, such that Y_{ i,k } is nonzero if x_{ i } belongs to class c_{ k } and the remaining elements are zero. Here, Y_{ k } is normalized to unit length, so the dot product of two such vectors is equivalent to their cosine similarity.
Suppose the vector representation of P(c_{ k }|x_{ i }) with respect to different instances is r_{ k } = [P(c_{ k }|x_{1}), · · ·, P(c_{ k }|x_{ N })]^{ T } .
to smooth the distribution P (c|x).
where α and β are the regularization parameters. When α = 0 and β = 0, maximizing $\mathcal{O}$ is equivalent to performing learning using the original pLSA model.
where l_{ x } is the number of functional classes for an annotated protein x_{ i }.
For the unannotated proteins, we maximize the log-likelihood function $\mathcal{O}$ to compute their probability distributions. The resulting probability distribution P(c|x_{ i }) with respect to a given instance x_{ i } indicates the importance of each function to the protein. One hopes that the values P(c_{ l }|x_{ i }) of the relevant labels are close to each other and larger than those of the irrelevant labels. Hence, to make a prediction for x_{ i }, we first rank the labels according to P(c_{ k }|x_{ i }). Then we separate the set of labels into relevant and irrelevant subsets according to the largest change observed across the sorted P(c_{ k }|x_{ i }). That is, we seek the largest change between two successive values P(c_{ k }|x_{ i }) and P(c_{ k+1 }|x_{ i }) in sorted order. Their midpoint, t = (P(c_{ k }|x_{ i }) + P(c_{ k+1 }|x_{ i }))/2, is used as the splitting threshold: the relevant set consists of the labels with probabilities larger than t, and the irrelevant set contains the remaining labels.
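The largest-gap thresholding rule just described can be sketched as follows. The function name `split_relevant_labels` and the example probabilities are hypothetical, and ties between equally large gaps are broken in favor of the first one, which is a choice made here rather than something the text specifies.

```python
def split_relevant_labels(probs):
    """Split label indices into relevant/irrelevant sets by the largest-gap rule.

    probs[k] is the estimated P(c_k | x_i); at least two labels are assumed.
    """
    # Label indices sorted by decreasing probability.
    order = sorted(range(len(probs)), key=lambda k: probs[k], reverse=True)
    ranked = [probs[k] for k in order]
    # Drop in probability between each pair of successive ranked labels.
    gaps = [ranked[r] - ranked[r + 1] for r in range(len(ranked) - 1)]
    cut = max(range(len(gaps)), key=lambda r: gaps[r])  # first largest gap
    threshold = (ranked[cut] + ranked[cut + 1]) / 2.0   # midpoint of the two values
    relevant = sorted(order[: cut + 1])
    irrelevant = sorted(order[cut + 1:])
    return relevant, irrelevant, threshold

relevant, irrelevant, t = split_relevant_labels([0.05, 0.40, 0.35, 0.10, 0.10])
print(relevant, irrelevant, t)
```

In this toy input the largest drop lies between 0.35 and 0.10, so the two highest-scoring labels form the relevant set.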
Model fitting with the EM algorithm
Our proposed approach, GM-SMCC, utilizes the generative model with both network and label regularization for protein function prediction, so its parameter estimation differs from that of the original pLSA [31] and from previous work utilizing pLSA with manifold learning for unsupervised data clustering [32]. Next, we introduce the EM algorithm used in the proposed GM-SMCC approach for finding maximum likelihood parameter estimates.
In the proposed generative model, we have NK + MK parameters {P(w_{ j }|c_{ k }), P(c_{ k }|x_{ i })}, where the class labels c_{ k } are considered as the latent variables. For convenience, we denote these parameters as Θ. We use the EM algorithm, which alternates between an expectation step (E-step) and a maximization step (M-step), to estimate the parameters in the proposed GM-SMCC model.
E-step
M-step
using the posterior probabilities computed in the E-step.
where 1 ≤ i ≤ N, 1 ≤ k ≤ K.
We have,
To obtain the M-step re-estimation for P (c|x), we construct six N K-by-N K matrices: Z, Ω, D, B, U, and R.
First, we construct a K-by-K block diagonal matrix D = [D_{ i,j }] based on the adjacency matrix E, where the (i, j)th block of D is an N-by-N matrix D_{ i,j } = [d_{ i,j,s,t }]_{s,t=1,...,N}. All the entries of D are equal to 0 except the diagonal entries ${d}_{i,i,s,s}={\displaystyle \sum _{t}}{E}_{st}$.
Next, we construct another K-by-K block diagonal matrix B = [B_{ i,j }] where its (i, j)th block is also an N-by-N matrix B_{ i,j } = [b_{ i,j,s,t }]_{s,t=1,...,N}. The entries of B are equal to 0 when i ≠ j; otherwise, if i = j, then we have b_{ i,j,s,t } = E_{ st }.
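The two block diagonal matrices can be written compactly with a Kronecker product. The display below is a restatement under one assumption: the diagonal entries of D are read as node degrees, that is, row sums of E. The symbols I_K (the K-by-K identity) and 1_N (the all-ones vector of length N) are introduced here for illustration only.

```latex
D = I_K \otimes \operatorname{diag}(E\,\mathbf{1}_N), \qquad
B = I_K \otimes E, \qquad
D - B = I_K \otimes \bigl(\operatorname{diag}(E\,\mathbf{1}_N) - E\bigr)
```

Under this reading, D − B consists of K copies of the graph Laplacian of the network, one per class label.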
Then, we construct an N-by-N block diagonal matrix U = [U_{ i,j }] based on the label correlation matrix F, where the (i, i)th block of U is a K-by-K matrix U_{ i,i } = [u_{ i,i,s,t }]_{s,t=1,...,K}. All non-diagonal entries of U are equal to 0 and the diagonal entries are ${u}_{i,i,s,s}={\displaystyle \sum _{l}}{F}_{sl}$.
The matrix R = [R_{ i,j }] is another block matrix, with K-by-K blocks, where its (i, j)th block is an N-by-N matrix R_{ i,j } = [r_{ i,j,s,t }]_{s,t=1,...,N}. Indeed, each R_{ i,j }, for i, j = 1, ..., K, is a diagonal matrix with r_{ i,j,s,s } = F_{ ij }.
The E-step (6) and M-steps (7) and (12) are alternated until the objective function (4) converges.
In the initialization step of the EM algorithm, the values of P(w_{ j }|c_{ k }) and P(c_{ k }|x_{ i }) are initialized based on the class priors according to the annotated proteins. We assume that the features w_{ j } are conditionally independent of each other given the label c_{ k }. Concretely, P(w_{ j }|c_{ k }) is initialized as $P\left({w}_{j}|{c}_{k}\right)=\frac{n\left({w}_{j},{c}_{k}\right)}{{\sum}_{i}n\left({w}_{i},{c}_{k}\right)}$, where n(w_{ j }, c_{ k }) is the frequency of w_{ j } and c_{ k } co-occurring. The label distribution P(c_{ k }|x_{ i }) for unannotated proteins is initialized as $P\left({c}_{k}|{x}_{i}\right)=\frac{{\sum}_{s}n\left({c}_{k},{x}_{s}\right)}{{\sum}_{l}{\sum}_{s}n\left({c}_{l},{x}_{s}\right)}$, where the sums over s run over the annotated proteins and n(c_{ k }, x_{ s }) = 1 if x_{ s } is associated with c_{ k } and 0 otherwise. In each iteration of the EM algorithm, the probability assignments of P(c|x) for labeled data are reset according to the known functional class labels as in Eq. (5).
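To make the alternation concrete, here is a minimal sketch of the special case α = 0 and β = 0, which the text states is equivalent to the original pLSA model, combined with the initialization and per-iteration reset of labeled proteins described above. It is not the regularized M-step of GM-SMCC. The function name `em_plsa_clamped`, the smoothing constant `eps`, the way co-occurrences are counted, and the uniform reset over a protein's known labels are all assumptions made for this illustration.

```python
def em_plsa_clamped(counts, known, K, iters=50, eps=1e-12):
    """Unregularized EM (alpha = beta = 0) with labeled rows reset each iteration.

    counts[i][j]: frequency of feature w_j in protein x_i.
    known: dict mapping a labeled protein index to its list of class indices.
    Returns (p_w_c, p_c_x): p_w_c[k][j] = P(w_j|c_k), p_c_x[i][k] = P(c_k|x_i).
    """
    N, M = len(counts), len(counts[0])

    def clamp(i):
        # Assumed form of the reset: uniform mass over the known labels of x_i.
        labs = known[i]
        return [1.0 / len(labs) if k in labs else 0.0 for k in range(K)]

    # Initialization from the annotated proteins (eps is smoothing added here).
    co = [[eps] * M for _ in range(K)]
    prior = [eps] * K
    for i, labs in known.items():
        for k in labs:
            prior[k] += 1.0
            for j in range(M):
                co[k][j] += counts[i][j]
    p_w_c = [[v / sum(row) for v in row] for row in co]
    total = sum(prior)
    p_c_x = [clamp(i) if i in known else [p / total for p in prior]
             for i in range(N)]

    for _ in range(iters):
        new_w = [[eps] * M for _ in range(K)]
        new_c = [[eps] * K for _ in range(N)]
        for i in range(N):
            for j in range(M):
                if counts[i][j] == 0:
                    continue
                # E-step: posterior P(c_k | x_i, w_j) by Bayes' rule.
                joint = [p_w_c[k][j] * p_c_x[i][k] for k in range(K)]
                z = sum(joint) or eps
                for k in range(K):
                    r = counts[i][j] * joint[k] / z
                    new_w[k][j] += r  # accumulates the M-step sum for P(w_j|c_k)
                    new_c[i][k] += r  # accumulates the M-step sum for P(c_k|x_i)
        # M-step: normalize the accumulated expected counts.
        p_w_c = [[v / sum(row) for v in row] for row in new_w]
        p_c_x = [clamp(i) if i in known else [v / sum(row) for v in row]
                 for i, row in enumerate(new_c)]
    return p_w_c, p_c_x

# Hypothetical toy data: proteins 0 and 1 are annotated, 2 and 3 are not.
counts = [[5, 4, 0, 0], [0, 0, 4, 5], [3, 5, 0, 1], [0, 1, 5, 3]]
p_w_c, p_c_x = em_plsa_clamped(counts, {0: [0], 1: [1]}, K=2, iters=20)
print(p_c_x[2], p_c_x[3])
```

In the toy data, protein 2 shares most of its features with protein 0 and protein 3 with protein 1, so their estimated distributions lean toward the corresponding classes.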
EGM-SMCC algorithm
The power of the network regularizer in Eq. (4) of our proposed GM-SMCC model lies in the fact that the linkages of the network generally exhibit predictable relationships between class labels of linked proteins. Suppose we have an unannotated protein, and we have a good understanding of the relationship between the functions of this protein and the functional properties of its labeled neighbors, then we should be able to make a good prediction of the protein functional properties based on the linkage information.
In the proposed GM-SMCC model, we use the autocorrelation in the protein interaction network which may provide some inconsistent linkages between the proteins not sharing similar functional properties. In the studies of functional genomics, if more information is available, one can derive more effective networks for capturing useful relationships between the proteins to propagate the supervision knowledge from labeled nodes to unlabeled nodes.
In the real world, protein data are associated with various data sources. For example, the proteins are associated with attribute features; proteins with similar feature values may also be similar in their associated functions. Also, the proteins are associated with a set of functional labels, which can be represented by label features that are useful for evaluating the pairwise similarity of protein instances. These latent linkages are already embedded in the data. We can exploit this knowledge to construct the latent graphs for more effective label prediction.
In this paper, in addition to the PPI network, we introduce two types of latent linkages to construct latent graphs. Based on the latent graphs we constructed, we extend our proposed generative model in an ensemble manner to further boost the prediction performance.
Given the adjacency matrices ${\left\{{E}^{\left(i\right)}\right\}}_{i=1}^{q}$ of q latent graphs, the proposed ensemble algorithm, namely EGM-SMCC, is described in Algorithm 1. In the EGM-SMCC algorithm, we learn an individual GM-SMCC model on each of the constructed latent graphs, and then combine the learned models to obtain a more reliable prediction than that of a model on a single latent graph.
Algorithm 1 EGM-SMCC
Input: ${\left\{{E}^{\left(i\right)}\right\}}_{i=1}^{q}$, X, Y , the parameters α and β
Output: y
Procedure:
1: for i = 1 to q do
2: Learn a GM-SMCC model using the constructed latent graph E^{(i)}. In the GM-SMCC model, compute the network regularizer $\mathcal{R}$ in Eq. (2) according to E^{(i)};
3: Use EM algorithm to optimize the GM-SMCC model to compute the label probability distribution y^{(i)};
4: end for
5: Combine the results of the q learned models y^{(1)}, y^{(2)}, ..., y^{(q)} into an ensemble prediction as $y=\frac{1}{q}{\displaystyle \sum _{i=1}^{q}}{y}^{\left(i\right)}$
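The combination step of Algorithm 1 amounts to an element-wise average of the q label distributions. A minimal sketch follows, in which `fit_model` is a hypothetical placeholder for training GM-SMCC on one latent graph and the stub outputs are made up for illustration.

```python
def ensemble_predict(graphs, fit_model):
    """Average the label distributions produced by one model per latent graph.

    fit_model is a hypothetical callable standing in for GM-SMCC training: it
    takes one latent graph and returns an N-by-K list of label probabilities.
    """
    outputs = [fit_model(E) for E in graphs]
    q = len(outputs)
    N, K = len(outputs[0]), len(outputs[0][0])
    # y = (1/q) * sum_i y^(i), computed entry by entry.
    return [[sum(y[i][k] for y in outputs) / q for k in range(K)]
            for i in range(N)]

# Stub "models": precomputed outputs for two graphs, one protein, two labels.
stub_outputs = {"graph_a": [[0.2, 0.8]], "graph_b": [[0.6, 0.4]]}
y = ensemble_predict(["graph_a", "graph_b"], stub_outputs.get)
print(y)
```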
where E(i, j) = 1 if node v_{ i } and node v_{ j } are connected in the PPI network, and E(i, j) = 0 otherwise.
Random walk latent graph: When the underlying autocorrelation of the original PPI network is small, i.e., some connected nodes may not share the same class label, the learning method based on the original PPI network might be affected.
It is observed that proteins that interact with level-2 neighbors (indirect neighbors in the PPI network) also have a great likelihood of sharing similar characteristics [8]. To this end, we use the idea of even-step random walk with restart (ERWR) [33] to compute the weights of the latent linkages. Intuitively, we assume that linkages to direct neighbors with the same function class as the target protein of interest typically have triangle structures (see Figure 1(b)). These neighbors (v_{2} and v_{3}) are able to obtain high scores using ERWR because they are well-connected in the PPI network. On the other hand, ERWR can avoid the immediate neighbors (e.g., v_{1} and v_{2}) with inconsistent linkages that negatively influence the predictions because they are sparsely-connected. ERWR can also exploit the indirect neighbor data by adding linkages to level-2 neighbors (e.g., v_{4}) that are well-connected to level-1 neighbors.
where $R={\sum}_{t=1}^{T}\alpha {\left(1-\alpha \right)}^{t}{p}^{t}$ is the steady-state probability matrix after T steps.
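The truncated sum in the formula for R can be evaluated directly. The sketch below follows the formula exactly as printed, so it sums over every step t = 1, ..., T and does not implement the even-step restriction of ERWR [33]. It also assumes that p denotes the row-normalized adjacency matrix, which the text does not state explicitly; the function name `random_walk_scores` is hypothetical.

```python
def random_walk_scores(E, alpha, T):
    """Compute R = sum_{t=1..T} alpha * (1 - alpha)^t * P^t for adjacency E."""
    n = len(E)
    # Assumption: P is the row-normalized adjacency matrix
    # (isolated nodes keep an all-zero row).
    P = []
    for row in E:
        deg = sum(row)
        P.append([v / deg if deg else 0.0 for v in row])

    def matmul(A, B):
        return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
                for i in range(n)]

    R = [[0.0] * n for _ in range(n)]
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # P^0
    for t in range(1, T + 1):
        power = matmul(power, P)             # power now holds P^t
        weight = alpha * (1.0 - alpha) ** t
        for i in range(n):
            for j in range(n):
                R[i][j] += weight * power[i][j]
    return R

# Two connected nodes: odd steps land on the neighbor, even steps return home.
R = random_walk_scores([[0, 1], [1, 0]], alpha=0.5, T=2)
print(R)
```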
Here, Y_{ i } and Y_{ j } are normalized to unit length, thus the dot product of the two vectors is equivalent to their cosine similarity.
where ${\mathcal{N}}_{k}\left(i\right)$ is the set of k nearest neighbors of v_{ i }. In practice, we find that k does not need tuning. We use k = 10 nearest neighbors for each data set.
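A k-nearest-neighbor latent graph of this kind might be built as in the sketch below. The similarity measure and the linking rule are assumptions: cosine similarity is used because the text relies on it for unit-length vectors, and an edge is added whenever either endpoint lists the other among its k nearest neighbors. The function names `cosine` and `knn_graph` and the toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_graph(vectors, k):
    """Adjacency matrix linking each node to its k most similar nodes."""
    n = len(vectors)
    E = [[0] * n for _ in range(n)]
    for i in range(n):
        sims = sorted(((cosine(vectors[i], vectors[j]), j)
                       for j in range(n) if j != i), reverse=True)
        for _, j in sims[:k]:
            E[i][j] = 1
            E[j][i] = 1  # assumption: the latent graph is kept symmetric
    return E

vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
E_knn = knn_graph(vectors, k=1)
print(E_knn)
```

With k = 10, as used in the paper, the same routine would apply unchanged to the full set of protein vectors.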
Experiments
In this section, we discuss the extensive experimental results to compare the performance of our proposed methods with the other baselines: SVM, wvRN+RL, ICA, semi-ICA, and ICML, and show that the proposed methods are able to achieve better performance against these baselines.
Yeast dataset and baselines
We conduct experiments to predict properties of the proteins corresponding to a given yeast gene from KDD Cup 2001 [36]. In particular, we formulate two prediction problems based on the properties of the proteins. Problem (1) is to predict the localization of the proteins encoded by the genes. It is a binary problem, i.e., a protein is localized (or not localized) to the corresponding organelle. Problem (2) is to predict the functions of the proteins, which is a multi-label problem, i.e., a protein can have more than one function. There are 14 functional classes in total in the dataset.
The dataset for these two problems consists of 1,243 protein instances and 1,806 interactions between pairs of proteins. The protein features include attributes describing the chromosome on which the gene appears, whether the gene is essential for survival, observable characteristics of the phenotype, the structural category of the protein, the existence of characteristic motifs in the amino acid sequence of the protein, and whether the protein forms larger proteins with others [36, 14].
We evaluate the performance of problem (1) by classification accuracy, and problem (2) by three multi-label learning evaluation metrics, i.e., Coverage, RankingLoss, and MacroF1 [37]. These criteria are defined as follows
where rank_{ s }(x_{ i }, c_{ k }) denotes the rank of class label c_{ k } derived from a confidence function s(x_{ i }, c_{ k }) which indicates the confidence for the class label c_{ k } to be a proper label of x_{ i }.
where ${\mathcal{R}}_{i}=\left\{\left({c}_{1},{c}_{2}\right)|h\left({x}_{i},{c}_{1}\right)\le h\left({x}_{i},{c}_{2}\right),\left({c}_{1},{c}_{2}\right)\in {Y}_{i}\times {\bar{Y}}_{i}\right\}$, and ${\bar{Y}}_{i}$ denotes the complementary set of Y_{ i }.
where p_{ k } and r_{ k } are the precision and recall of the k-th label.
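The three multi-label criteria can be computed as in the sketch below. It follows the commonly used definitions of Coverage, RankingLoss, and MacroF1, which agree with the symbol descriptions given here; since the defining equations are not reproduced in this text, exact agreement with the paper's formulas is an assumption. The function names and toy data are hypothetical, and every instance is assumed to have at least one relevant and one irrelevant label.

```python
def coverage(scores, truth):
    """Average depth in the ranking needed to cover all true labels, minus one."""
    total = 0.0
    for s, y in zip(scores, truth):
        order = sorted(range(len(s)), key=lambda k: s[k], reverse=True)
        rank = {k: r + 1 for r, k in enumerate(order)}  # rank 1 = most confident
        total += max(rank[k] for k in y) - 1
    return total / len(scores)

def ranking_loss(scores, truth):
    """Average fraction of (relevant, irrelevant) label pairs ordered wrongly."""
    total = 0.0
    for s, y in zip(scores, truth):
        rel = set(y)
        irr = [k for k in range(len(s)) if k not in rel]
        bad = sum(1 for a in rel for b in irr if s[a] <= s[b])
        total += bad / (len(rel) * len(irr))
    return total / len(scores)

def macro_f1(pred, truth, K):
    """Mean over labels of 2 * p_k * r_k / (p_k + r_k)."""
    f1s = []
    for k in range(K):
        tp = sum(1 for p, y in zip(pred, truth) if k in p and k in y)
        n_pred = sum(1 for p in pred if k in p)
        n_true = sum(1 for y in truth if k in y)
        p_k = tp / n_pred if n_pred else 0.0
        r_k = tp / n_true if n_true else 0.0
        f1s.append(2 * p_k * r_k / (p_k + r_k) if p_k + r_k else 0.0)
    return sum(f1s) / K

# Toy example: two instances, three labels.
scores = [[0.9, 0.2, 0.6], [0.1, 0.8, 0.3]]   # confidence per label
truth = [[0, 2], [1]]                          # true label indices
pred = [[0, 2], [1, 2]]                        # predicted label sets
print(coverage(scores, truth), ranking_loss(scores, truth), macro_f1(pred, truth, 3))
```

Lower values are better for Coverage and RankingLoss, while higher values are better for MacroF1.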
To validate the performance of our proposed algorithms, we compare our approach with five baseline methods:
1. SVM [34]. This baseline is a feature-based method that uses only the attribute features of the proteins for learning, without any network information.
2. wvRN+RL [38]. This algorithm is a relational-only method using only the PPI network for prediction. wvRN+RL computes a new label distribution for an unlabeled node by averaging the current estimated distributions of its linked neighbors. This process is repeated until reaching the maximum iteration number.
3. ICA [28]. This denotes a collective classification algorithm which uses both attribute features and relational features to train a base classifier for prediction. The relational features are constructed based on the labels of neighbors. ICA uses an iterative process whereby the relational features are recomputed in each iteration until a fixed number of iterations is reached. Prior work has found logistic regression (LR) to be superior to other classifiers, such as naive Bayes and kNN, as the base classifier for ICA. Therefore, we use LR as the local classifier for ICA in the experiments.
4. semi-ICA [39]. This method extends ICA to leverage the unlabeled data using semi-supervised learning. There are four semi-ICA variants (KNOWN-EM, ALL-EM, KNOWN-ONEPASS, ALL-ONEPASS); we run all four and report the best one as the result of semi-ICA.
5. ICML [13]. This method extends ICA to handle multi-label learning by constructing additional label correlation features that exploit the dependencies among the labels as additional input features to learn the base classifier. The ICML algorithm is also based on an iterative framework similar to ICA.
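The neighbor-averaging update described for the wvRN+RL baseline (item 2) can be illustrated with the simplified sketch below. It is not the implementation from [38]: unweighted links, a uniform starting distribution for unlabeled nodes, and synchronous updates are all assumptions made here, and the function name `wvrn_relaxation` is hypothetical.

```python
def wvrn_relaxation(neighbors, known, K, iters=1000):
    """Repeatedly set each unlabeled node's distribution to its neighbors' mean."""
    uniform = [1.0 / K] * K
    dist = {v: list(known.get(v, uniform)) for v in neighbors}
    for _ in range(iters):
        new = {}
        for v, nbrs in neighbors.items():
            if v in known or not nbrs:
                new[v] = dist[v]  # labeled or isolated nodes are kept fixed
                continue
            new[v] = [sum(dist[u][k] for u in nbrs) / len(nbrs)
                      for k in range(K)]
        dist = new
    return dist

# Toy network: a and c are labeled; b sits between them, d hangs off a.
neighbors = {"a": ["b", "d"], "b": ["a", "c"], "c": ["b"], "d": ["a"]}
known = {"a": [1.0, 0.0], "c": [0.0, 1.0]}
dist = wvrn_relaxation(neighbors, known, K=2, iters=10)
print(dist["b"], dist["d"])
```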
It is generally more difficult to determine the classifier parameter values when fewer labeled data are available (which is the focus of this study). For the SVM classifier, we use LibSVM [34] with a linear kernel as the base classifier, and simply set the penalty parameter C = 1.0 as the default. The maximum number of iterations for ICA and semi-ICA is set to 10, and we use logistic regression as their base classifier as in [39, 13]. wvRN+RL uses 1000 iterations. The parameters α and β for our proposed method are set to 3 and 0.1, respectively. The parameter selection issue is discussed in a later section.
Results on protein localization prediction
We first consider problem (1) of KDD Cup 2001, i.e., the protein localization prediction problem. We set α ≠ 0 and β = 0 in our proposed method, and compare GM-SMCC with the learning algorithms SVM, wvRN+RL, ICA and semi-ICA. The performance is measured in terms of classification accuracy.
We note that the setting with few labeled data is the most interesting case for our algorithm, because prediction from the labeled data alone is unreliable when the supervised knowledge they provide is inadequate. It is therefore desirable to utilize other data sources together to improve the prediction performance. A closer examination of the results in Figure 2 shows that the smaller the percentage of labeled data, the larger the improvement GM-SMCC achieves. GM-SMCC achieves the largest improvement over the second-best method when only 3% of the data are labeled (GM-SMCC: 0.82 versus semi-ICA: 0.75). We also conduct a pairwise t-test at the 0.05 significance level to assess the statistical significance of the differences in performance between GM-SMCC and the other test algorithms using 3% of labeled data. The performance of GM-SMCC is significantly better than that of the other baseline methods. This result illustrates the advantage of our method when there is an extremely small number of labeled data. This is consistent with our earlier assertion that our approach can work even in the paucity of annotated proteins by exploring various data sources, including interaction networks, attribute features, and unlabeled data.
Table 1. Accuracy (mean ± standard deviation) of GM-SMCC and EGM-SMCC at different label ratios on problem (1) of KDD Cup 2001.
label ratio | GM-SMCC-1 | GM-SMCC-2 | GM-SMCC-3 | GM-SMCC-mean | EGM-SMCC |
---|---|---|---|---|---|
3% | 0.827 ± 0.031 | 0.805 ± 0.009 | 0.771 ± 0.008 | 0.789 ± 0.007 | **0.834 ± 0.020** |
4% | 0.833 ± 0.021 | 0.813 ± 0.026 | 0.805 ± 0.016 | 0.800 ± 0.006 | **0.845 ± 0.027** |
5% | **0.843 ± 0.008** | 0.802 ± 0.018 | 0.804 ± 0.024 | 0.803 ± 0.016 | **0.843 ± 0.012** |
6% | 0.846 ± 0.004 | 0.807 ± 0.023 | 0.790 ± 0.017 | 0.818 ± 0.003 | **0.849 ± 0.013** |
7% | 0.846 ± 0.002 | 0.827 ± 0.018 | 0.812 ± 0.019 | 0.845 ± 0.005 | **0.868 ± 0.013** |
8% | 0.852 ± 0.002 | 0.813 ± 0.011 | 0.817 ± 0.030 | 0.845 ± 0.002 | **0.860 ± 0.008** |
9% | 0.857 ± 0.020 | 0.831 ± 0.014 | 0.826 ± 0.022 | 0.853 ± 0.004 | **0.872 ± 0.011** |
10% | 0.858 ± 0.017 | 0.831 ± 0.014 | 0.846 ± 0.012 | 0.855 ± 0.007 | **0.874 ± 0.006** |
We report the average accuracy and standard deviation of the comparison methods over 10 runs. The numbers in boldface (on each row of the tables) indicate the best results for each label ratio over the methods. From Table 1, we observe that EGM-SMCC, which uses multiple latent graphs, achieves better accuracy than GM-SMCC using a single latent graph at every label ratio except 5%, where it ties with GM-SMCC-1. A reasonable explanation for this finding is that the latent graphs, which are derived from different sources, are complementary for prediction. When complementary models learned from these latent graphs are combined in an ensemble, correct decisions are amplified by the aggregation process. The performance of an ensemble learner depends heavily on two factors: the accuracy of each component learner and the diversity among the components. The results in Table 1 show that the GM-SMCC models generated from the different graphs all perform reasonably well. This indicates that each latent graph provides prediction knowledge from a specific aspect, and that their combination leads to a more robust prediction.
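The amplification effect described above can be illustrated with a small sketch. This section does not spell out the aggregation rule used by EGM-SMCC, so the example assumes the simplest possible one, an unweighted average of the label probability distributions produced by the per-graph models. The function names and the probability values are hypothetical.

```python
def average_distributions(per_model_probs):
    """Average the label probability distributions of several models for one instance."""
    n_models = len(per_model_probs)
    n_labels = len(per_model_probs[0])
    return [sum(p[k] for p in per_model_probs) / n_models for k in range(n_labels)]


def predict_label(probs):
    """Return the index of the most probable label."""
    return max(range(len(probs)), key=lambda k: probs[k])


# Hypothetical distributions over three labels from three single-graph models
# for the same protein (illustrative values only).
probs_graph1 = [0.50, 0.45, 0.05]
probs_graph2 = [0.30, 0.60, 0.10]
probs_graph3 = [0.35, 0.55, 0.10]

combined = average_distributions([probs_graph1, probs_graph2, probs_graph3])
print(predict_label(probs_graph1))  # the first model alone narrowly prefers label 0
print(predict_label(combined))      # the averaged distribution prefers label 1
```

In this toy case one model narrowly prefers a different label, and averaging lets the two agreeing models outweigh it, which is the behaviour the accuracy and diversity argument relies on.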
Results on protein function prediction
We also conduct experiments for problem (2) of KDD Cup 2001, i.e., the multi-label protein function prediction problem. We set both α and β to non-zero values so that network information and label correlations are considered simultaneously. We compare the proposed algorithms with the following baseline classifiers: SVM, wvRN+RL, ICA, semi-ICA and ICML. SVM, wvRN+RL, ICA and semi-ICA are single-label classifiers. For these methods, we decompose the multi-label problem into a set of K binary classification problems using the one-against-all strategy, and train an independent classifier for each single-label problem. This approach is known as the binary relevance (BR) method [40]. The predictions for all K binary classification problems are combined to make the final prediction.
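The binary relevance decomposition can be sketched as follows. The code only shows how the multi-label annotations are split into K binary targets and how K binary prediction vectors are merged back; the base classifier is left out, and the function names and toy label sets are hypothetical.

```python
def binary_relevance_targets(label_sets, num_labels):
    """Turn multi-label annotations into one 0/1 target vector per label."""
    return [
        [1 if k in labels else 0 for labels in label_sets]
        for k in range(num_labels)
    ]


def combine_binary_predictions(per_label_predictions):
    """Merge K binary prediction vectors into one predicted label set per instance."""
    n_instances = len(per_label_predictions[0])
    return [
        {k for k, preds in enumerate(per_label_predictions) if preds[i] == 1}
        for i in range(n_instances)
    ]


# Three instances annotated with subsets of K = 3 labels (toy data).
label_sets = [{0, 2}, {1}, set()]
targets = binary_relevance_targets(label_sets, num_labels=3)
print(targets)  # one binary problem per label

# If every binary classifier reproduced its target exactly,
# the combination step would recover the original label sets.
print(combine_binary_predictions(targets))
```

Because each binary problem is trained independently, this decomposition ignores correlations between labels, which is the information the label regularizer is meant to exploit.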
As in the experiments on protein localization prediction, we also examine whether the proposed EGM-SMCC method, which integrates multiple latent graphs, improves prediction performance over the GM-SMCC method using a single latent graph. GM-SMCC-1, GM-SMCC-2 and GM-SMCC-3 denote the single-graph models using E^{(1)}, E^{(2)} and E^{(3)}, respectively. GM-SMCC-mean denotes the single-graph model using a latent graph constructed by averaging the edge weights of E^{(1)}, E^{(2)} and E^{(3)}.
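As a minimal sketch of how the averaged graph behind GM-SMCC-mean could be built, the code below represents each latent graph as a dictionary from node pairs to edge weights and averages the weights edge by edge. Treating an edge that is absent from a graph as having weight 0 is an assumption of this sketch, as are the function name and the toy weights.

```python
def average_edge_weights(graphs):
    """Average edge weights over several graphs defined on the same node set.

    Each graph is a dict mapping an edge (u, v) to its weight. An edge missing
    from a graph contributes weight 0 (an assumption of this sketch).
    """
    edges = set().union(*graphs)  # all edges that occur in at least one graph
    return {
        edge: sum(g.get(edge, 0.0) for g in graphs) / len(graphs)
        for edge in edges
    }


# Toy stand-ins for E^(1), E^(2) and E^(3) (illustrative weights only).
e1 = {("a", "b"): 0.9, ("b", "c"): 0.3}
e2 = {("a", "b"): 0.6}
e3 = {("a", "b"): 0.3, ("a", "c"): 0.6}

e_mean = average_edge_weights([e1, e2, e3])
for edge in sorted(e_mean):
    print(edge, round(e_mean[edge], 3))
```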
Convergence study
Parameter sensitivity
In our proposed GM-SMCC method, the regularization parameters α and β quantify the importance of the network regularizer and the label regularizer in the objective function (4). These parameters also determine the learning setting. With α ≠ 0 and β = 0, the framework performs single-label collective classification, which is how we solve problem (1). With α ≠ 0 and β ≠ 0, it performs multi-label collective classification, i.e., label correlations are taken into account in the learning process, as required for problem (2).
Next, we fix α = 3 and vary β from 0 to 0.4 on problem (2) using a 5% label ratio. The result is given in Figure 7(b). We observe that the performance is poor when β = 0 or β = 0.4, and that the smallest Coverage is achieved at β = 0.1. Therefore, we set α = 3 and β = 0.1 in all the comparisons.
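For readers unfamiliar with the Coverage measure, the sketch below follows its usual definition in the multi-label literature: for each instance, labels are ranked by decreasing score, and Coverage is the average number of steps down the ranking needed to reach all relevant labels, minus one. Smaller values are better. That this matches the exact variant used in our experiments is an assumption; the function name and toy scores are hypothetical, and every instance is assumed to have at least one relevant label.

```python
def coverage(scores, relevant):
    """Average depth in the ranked label list needed to cover all relevant labels.

    scores:   one list of label scores per instance
    relevant: one non-empty set of relevant label indices per instance
    """
    total = 0
    for s, rel in zip(scores, relevant):
        # Rank labels by decreasing score; the top label gets rank 1.
        order = sorted(range(len(s)), key=lambda k: -s[k])
        rank = {label: pos + 1 for pos, label in enumerate(order)}
        total += max(rank[k] for k in rel) - 1
    return total / len(scores)


# Toy example with two instances and three labels.
scores = [[0.9, 0.1, 0.5], [0.2, 0.7, 0.1]]
relevant = [{0, 2}, {0}]
print(coverage(scores, relevant))
```

In the first instance the relevant labels occupy ranks 1 and 2, and in the second the single relevant label is ranked second, so each instance contributes 1 to the average.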
Interaction relations
Selected interrelated genes and their similarity computed by the proposed GM-SMCC method.
GeneID | GeneID | Similarity |
---|---|---|
G238510 | G239467 | 0.99706 |
G238510 | G239178 | 0.95597 |
G238510 | G235250 | 0.8347 |
G234935 | G234445 | 0.9178 |
G234935 | G239966 | 0.92039 |
G234935 | G235763 | 0.95516 |
G234935 | G235329 | 0.95938 |
G235158 | G234735 | 0.98431 |
G235158 | G234074 | 0.9788 |
G235158 | G234177 | 0.90675 |
G235158 | G235216 | 0.96184 |
G237021 | G234486 | 0.85557 |
G237021 | G234065 | 0.88554 |
G237021 | G239804 | 0.96585 |
G237021 | G239266 | 0.92513 |
G234980 | G235439 | 0.98653 |
G234980 | G235231 | 0.99427 |
G234980 | G234914 | 0.99755 |
G234980 | G235780 | 0.96058 |
Conclusion
In this paper, we first propose GM-SMCC, an effective and novel semi-supervised multi-label collective classification method for predicting functional properties of proteins. GM-SMCC builds on the pLSA generative model with a network regularizer and a label regularizer, which exploit the network linkages and label correlations to compute the label probability distribution used for prediction. We then extend it in an ensemble manner and develop the EGM-SMCC approach, which exploits various kinds of latent linkages in constructing latent graphs to further improve prediction performance. Experimental results on two tasks of KDD Cup 2001 (the localization prediction task and the protein function prediction task) consistently demonstrate the effectiveness of the proposed methods. The proposed methods are shown to perform better than state-of-the-art algorithms, including SVM, wvRN+RL, and three variants of ICA. In future work, we will extend our proposed method to handle heterogeneous biological networks.
Declarations
Acknowledgements
Y. Ye's research was supported in part by National Key Technology R&D Program of MOST China under Grant No. 2012BAK17B08, and NSFC under Grant No. 61272538. S.S. Ho's research was supported in part by AcRF Grant RG-41/12 and NTU-SUG. S. Zhou's research was supported in part by National Natural Science Foundation of China (NSFC) under Grant No. 61272380. Publication costs for this article were funded by grants of the corresponding author.
This article has been published as part of BMC Genomics Volume 15 Supplement 9, 2014: Thirteenth International Conference on Bioinformatics (InCoB2014): Computational Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/15/S9.
References
1. Pandey G, Kumar V, Steinbach M: Computational approaches for protein function prediction: A survey. 2006, Twin Cities: Department of Computer Science and Engineering, University of Minnesota.
2. Jensen LJ, Gupta R, Staerfeldt HH, Brunak S: Prediction of human protein function according to gene ontology categories. Bioinformatics. 2003, 19 (5): 635-642. 10.1093/bioinformatics/btg036.
3. Cai C, Han L, Ji ZL, Chen X, Chen YZ: Svm-prot: web-based support vector machine software for functional classification of a protein from its primary sequence. Nucleic acids research. 2003, 31 (13): 3692-3697. 10.1093/nar/gkg600.
4. Lobley AE, Nugent T, Orengo CA, Jones DT: Ffpred: an integrated feature-based function prediction server for vertebrate proteomes. Nucleic acids research. 2008, 36 (suppl 2): 297-302.
5. Shen HB, Chou KC: Ezypred: a top-down approach for predicting enzyme functional classes and subclasses. Biochemical and Biophysical Research Communications. 2007, 364 (1): 53-59. 10.1016/j.bbrc.2007.09.098.
6. Pellegrini M, Haynor D, Johnson JM: Protein interaction networks. Expert review of proteomics. 2004, 1 (2): 239-249. 10.1586/14789450.1.2.239.
7. Vazquez A, Flammini A, Maritan A, Vespignani A: Global protein function prediction from protein-protein interaction networks. Nature biotechnology. 2003, 21 (6): 697-700. 10.1038/nbt825.
8. Chua HN, Sung WK, Wong L: Exploiting indirect neighbours and topological weight to predict protein function from protein-protein interactions. Bioinformatics. 2006, 22 (13): 1623-1630. 10.1093/bioinformatics/btl145.
9. Sharan R, Ulitsky I, Shamir R: Network-based prediction of protein function. Molecular systems biology. 2007, 3 (1).
10. Xiong W, Liu H, Guan J, Zhou S: Protein function prediction by collective classification with explicit and implicit edges in protein-protein interaction networks. BMC bioinformatics. 2013, 14 (Suppl 12): 4.
11. Sen P, Namata G, Bilgic M, Getoor L, Gallagher B, Eliassi-Rad T: Collective classification in network data. AI magazine. 2008, 29 (3): 93.
12. McDowell LK, Gupta KM, Aha DW: Cautious collective classification. The Journal of Machine Learning Research. 2009, 10: 2777-2836.
13. Kong X, Shi X, Yu PS: Multi-label collective classification. SIAM International Conference on Data Mining (SDM). 2011, 618-629.
14. Krogel MA, Scheffer T: Multi-relational learning, text mining, and semi-supervised learning for functional genomics. Machine Learning. 2004, 57 (1-2): 61-81.
15. Mooney C, Pollastri G, et al: Sclpred: protein subcellular localization prediction by n-to-1 neural networks. Bioinformatics. 2011, 27 (20): 2812-2819. 10.1093/bioinformatics/btr494.
16. Díaz-Uriarte R, De Andres SA: Gene selection and classification of microarray data using random forest. BMC bioinformatics. 2006, 7 (1): 3. 10.1186/1471-2105-7-3.
17. Barutcuoglu Z, Schapire RE, Troyanskaya OG: Hierarchical multi-label prediction of gene function. Bioinformatics. 2006, 22 (7): 830-836. 10.1093/bioinformatics/btk048.
18. Pandey G, Myers CL, Kumar V: Incorporating functional inter-relationships into protein function prediction algorithms. BMC bioinformatics. 2009, 10 (1): 142. 10.1186/1471-2105-10-142.
19. Schietgat L, Vens C, Struyf J, Blockeel H, Kocev D, Džeroski S: Predicting gene function using hierarchical multi-label decision tree ensembles. BMC bioinformatics. 2010, 11 (1): 2. 10.1186/1471-2105-11-2.
20. Nabieva E, Jim K, Agarwal A, Chazelle B, Singh M: Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps. Bioinformatics. 2005, 21 (suppl 1): 302-310. 10.1093/bioinformatics/bti1054.
21. Deng M, Tu Z, Sun F, Chen T: Mapping gene ontology to proteins based on protein-protein interaction data. Bioinformatics. 2004, 20 (6): 895-902. 10.1093/bioinformatics/btg500.
22. Arnau V, Mars S, Marín I: Iterative cluster analysis of protein interaction data. Bioinformatics. 2005, 21 (3): 364-378. 10.1093/bioinformatics/bti021.
23. Adamcsek B, Palla G, Farkas IJ, Derényi I, Vicsek T: Cfinder: locating cliques and overlapping modules in biological networks. Bioinformatics. 2006, 22 (8): 1021-1023. 10.1093/bioinformatics/btl039.
24. Yu G, Domeniconi C, Rangwala H, Zhang G, Yu Z: Transductive multi-label ensemble classification for protein function prediction. Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2012, 1077-1085.
25. Jiang JQ, McQuay LJ: Predicting protein function by multi-label correlated semi-supervised learning. Computational Biology and Bioinformatics, IEEE/ACM Transactions on. 2012, 9 (4): 1059-1069.
26. Wu Q, Ng MK, Ye Y, Li X, Shi R, Li Y: Multi-label collective classification via markov chain based learning method. Knowledge-Based Systems. 2014, 63: 1-14.
27. Mostafavi S, Morris Q: Fast integration of heterogeneous data sources for predicting gene function with limited annotation. Bioinformatics. 2010, 26 (14): 1759-1765. 10.1093/bioinformatics/btq262.
28. Neville J, Jensen D: Iterative classification in relational data. Proc AAAI-2000 Workshop on Learning Statistical Models from Relational Data. 2000, 13-20.
29. Wu Q, Ye Y, Ng MK, Ho SS, Shi R: Collective prediction of protein functions from protein-protein interaction networks. BMC bioinformatics. 2014, 15 (Suppl 2): 9. 10.1186/1471-2105-15-S2-S9.
30. Shi R, Wu Q, Ye Y, Ho SS: A generative model with network regularization for semi-supervised collective classification. Proceedings of the 2014 SIAM International Conference on Data Mining. 2014.
31. Hofmann T: Unsupervised learning by probabilistic latent semantic analysis. Machine learning. 2001, 42 (1-2): 177-196.
32. Cai D, Wang X, He X: Probabilistic dyadic data analysis with local and global consistency. Proc of the 26th Annual International Conference on Machine Learning. 2009, 105-112.
33. Gallagher B, Tong H, Eliassi-Rad T, Faloutsos C: Using ghost edges for classification in sparsely labeled networks. Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2008, 256-264.
34. Chang CC, Lin CJ: Libsvm: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST). 2011, 2 (3): 27.
35. Von Luxburg U: A tutorial on spectral clustering. Statistics and computing. 2007, 17 (4): 395-416. 10.1007/s11222-007-9033-z.
36. Cheng J, Hatzis C, Hayashi H, Krogel MA, Morishita S, Page D, Sese J: Kdd cup 2001 report. ACM SIGKDD Explorations Newsletter. 2002, 3 (2): 47-64. 10.1145/507515.507523.
37. Madjarov G, Kocev D, Gjorgjevikj D, Džeroski S: An extensive experimental comparison of methods for multi-label learning. Pattern Recognition. 2012, 45 (9): 3084-3104. 10.1016/j.patcog.2012.03.004.
38. Macskassy SA, Provost F: Classification in networked data: A toolkit and a univariate case study. The Journal of Machine Learning Research. 2007, 8: 935-983.
39. McDowell L, Aha D: Semi-supervised collective classification via hybrid label regularization. Proc of the 29th International Conference on Machine Learning. 2012, 975-982.
40. Zhang ML, Zhou ZH: A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering. 2013, 99 (PrePrints): 1.
41. Read J, Pfahringer B, Holmes G, Frank E: Classifier chains for multi-label classification. Machine learning. 2011, 85 (3): 333-359. 10.1007/s10994-011-5256-5.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.