Biomarker discovery across annotated and unannotated microarray datasets using semi-supervised learning
© Harris and Ghaffari. 2008
Published: 16 September 2008
The growing body of DNA microarray data has the potential to advance our understanding of the molecular basis of disease. However, annotating microarray datasets with clinically useful information is not always possible, as this often requires access to detailed patient records. In this study we introduce GLAD, a new Semi-Supervised Learning (SSL) method for combining independent annotated datasets and unannotated datasets with the aim of identifying more robust sample classifiers.
In our method, independent models are developed using subsets of genes for the annotated and unannotated datasets. These models are evaluated according to a scoring function that incorporates terms for classification accuracy on annotated data, and relative cluster separation in unannotated data. Improved models are iteratively generated using a genetic algorithm feature selection technique.
Our results show that adding unannotated data to training significantly improves classifier robustness.
The introduction of DNA microarray technology in 1995 has likely resulted in a huge volume of as yet undiscovered and potentially medically useful knowledge within gene expression profiles. This new bank of information has motivated researchers to develop new techniques for extracting this knowledge and relating it to externally obtained sample information. For experiments aimed at answering a clinical question, such information might include patient disease stage or response to a particular drug. The cost of producing adequately annotated datasets has been a barrier to the widespread application of microarray technology in medicine.
Depending on the nature of the datasets, a variety of machine learning techniques have been applied, including supervised learning algorithms such as classification and unsupervised learning algorithms such as clustering. Clustering techniques assign samples to groups based solely on similar expression levels. Supervised algorithms, on the other hand, classify samples according to their externally determined class.
Neither the standard supervised nor the standard unsupervised techniques are appropriate for datasets in which only some samples are labeled; semi-supervised algorithms can address these situations.
Blum and Mitchell introduced the co-training algorithm for improving sample classification performance when there are few labeled samples and many unlabeled samples. The co-training algorithm assumes that two independent sets of features are available, such that each feature set is sufficient to train a good classifier. The algorithm iteratively classifies samples from the unlabeled data using two naive Bayes classifiers built from the independent feature sets. In a demonstration of their technique aimed at web page classification, the addition of unlabeled samples decreased classification error relative to classification using only labeled data.
In a subsequent study, Nigam and Ghani further examined the performance of the co-training algorithm, specifically its sensitivity to the independence of the feature sets. Their results confirm that when there is a natural split of the feature sets, co-training outperforms other approaches such as expectation-maximization (EM). When no such split is available, a random assignment of features into two sets still performs better than using only one feature set. They also introduced the co-EM algorithm, a hybrid that iteratively updates the unlabeled data labels using EM.
Li et al. proposed a Semi-Supervised Learning (SSL) algorithm for heterogeneous datasets having both labeled and unlabeled samples. Their example data comprised DNA microarray expressions and phylogenic reconstructions, with class labels corresponding to gene function. Their work may be considered a form of co-training in that two distinct datasets from a common set of samples (genes) are equivalent to a single dataset with two distinct sets of features. As with the above approaches, independent models are developed for each dataset. They show that minimizing the disagreement in predictions between these models leads to improved accuracy, and they introduced a co-updating technique for iteratively improving prediction concordance.
Recently, Qi et al. introduced a Bayesian semi-supervised approach termed BGEN (Bayesian GENeralization). The BGEN method trains a kernel classifier using both labeled and unlabeled data. Their example data consisted of expression profiles of wild-type and mutant C. elegans embryos and identified enriched genes, with a small subset of genes labeled according to involvement in development of cell lineage. BGEN predictions were more accurate than predictions from either K-means clustering or SVM classification.
In this paper we propose the Genetic Learning Across Datasets concept (GLAD), and demonstrate an implementation that enables feature selection across unlabeled and labeled datasets. GLAD algorithms are distinct from previous approaches to semi-supervised learning in that the datasets analyzed may have very different statistical distributions, such as would arise in datasets collected independently by labs using different measurement technology. Additionally, a subset of labeled examples is not required for each dataset. As many available datasets will not have the desired annotation for any samples, this method extends the usability of the limited number of adequately annotated microarray datasets.
We conducted three experiments, each addressing a different cancer diagnostic problem: ALL/AML differential diagnosis, prediction of response to imatinib in CML, and prediction of outcome in DLBCL. In each experimental group, two microarray gene expression datasets were selected. If available, labels were removed from one of the component datasets, thus creating a combined dataset with both labeled and unlabeled subsets.
Datasets used in the experiments. Columns: dataset; classes (sample counts in parentheses); genes before mapping; genes after mapping.
AML-ALL 1: Train: 1-ALL (27), 2-AML (11); Test: 1-ALL (20), 2-AML (14)
AML-ALL 2: 3-MLL (20: deleted)
CML 1: 1-no cytogenetic response to imatinib (15); 2-cytogenetic response to imatinib (30)
CML 2
DLBCL 1: 1-DLBCL (32: cured, 26: fatal or refractory); 2-FL (19: deleted)
DLBCL 2: 2-MLBCL (34: deleted)
For this study we implemented a GLAD algorithm as a wrapper technique for feature selection. A Genetic Algorithm (GA) is used to generate a population of relevant feature subsets. For a given subset, one model is computed from the labeled data and a separate model from the unlabeled data. Linear Discriminant Analysis (LDA) was used for the labeled data and K-means clustering (K = 2) for the unlabeled data. A two-term scoring function was derived to score the labeled and unlabeled data models independently. An overall score is computed as a weighted average of the two terms, as shown below.
$$\mathrm{Score} = w \times \mathrm{Score}_{\mathrm{labeled}} + (1 - w) \times \mathrm{Score}_{\mathrm{unlabeled}}$$
We defined the labeled data model score as the standard leave-one-out cross-validation accuracy on the labeled training samples.
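The weighted score and the leave-one-out term can be illustrated with a minimal sketch. It is not the authors' implementation: a nearest-centroid classifier stands in for LDA to keep the example dependency-free, and the function names, the toy expression values, and the weight `w = 0.7` are all assumptions made for illustration.

```python
import math

def nearest_centroid_predict(train_x, train_y, query):
    """Predict the class whose mean vector is closest to `query`.
    Stand-in for LDA (an assumption, not the classifier used in the paper)."""
    sums, counts = {}, {}
    for x, y in zip(train_x, train_y):
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, v in enumerate(x):
            acc[j] += v
        counts[y] = counts.get(y, 0) + 1
    best_label, best_dist = None, math.inf
    for y, acc in sums.items():
        centroid = [v / counts[y] for v in acc]
        d = math.dist(centroid, query)
        if d < best_dist:
            best_label, best_dist = y, d
    return best_label

def loocv_accuracy(xs, ys):
    """Leave-one-out cross-validation accuracy on the labeled samples."""
    hits = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        hits += nearest_centroid_predict(train_x, train_y, xs[i]) == ys[i]
    return hits / len(xs)

def overall_score(score_labeled, score_unlabeled, w):
    """Weighted average of the two terms: w * labeled + (1 - w) * unlabeled."""
    return w * score_labeled + (1.0 - w) * score_unlabeled

# Toy expression values for a hypothetical 2-gene subset (made-up numbers).
xs = [[1.0, 1.2], [0.9, 1.0], [1.1, 0.8], [3.0, 3.1], [3.2, 2.9], [2.8, 3.0]]
ys = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
score_labeled = loocv_accuracy(xs, ys)
# The unlabeled score of 0.5 is a placeholder value, not a computed result.
print(overall_score(score_labeled, score_unlabeled=0.5, w=0.7))
```

Setting `w = 1` reduces the score to purely supervised feature selection, which corresponds to running on labeled data only.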
$C_i$ ≡ centroid of cluster $i$
$\pi_i$ ≡ proportion of data in cluster $i$
≡ expected proportion in cluster $i$
≡ number of datapoints in cluster $i$
$n_c$ ≡ number of clusters
The cluster separation term is given by a modified ratio of the inter-cluster distance to the mean cluster size. The consistent proportion term is defined as the RMS difference between the sorted actual and expected class priors. The class priors may be estimated from the labeled data, or may be available externally.
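The two unlabeled-data terms described above might be sketched as follows. The exact "modified" ratio and the way the two terms are combined into the unlabeled score are not given in the text, so this sketch assumes the plain ratio of inter-centroid distance to mean within-cluster distance and leaves the terms uncombined. All function names and the toy data are hypothetical.

```python
import math

def kmeans(points, k=2, iters=50):
    """Plain Lloyd's k-means; seeds centroids with the first k points."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

def cluster_separation(points, labels, centroids):
    """Mean inter-centroid distance over mean cluster size (assumed form)."""
    k = len(centroids)
    pair_d = [math.dist(centroids[a], centroids[b])
              for a in range(k) for b in range(a + 1, k)]
    inter = sum(pair_d) / len(pair_d)
    sizes = []
    for c in range(k):
        d = [math.dist(p, centroids[c])
             for p, lab in zip(points, labels) if lab == c]
        if d:
            sizes.append(sum(d) / len(d))
    mean_size = sum(sizes) / len(sizes)
    return inter / mean_size if mean_size > 0 else math.inf

def proportion_mismatch(labels, expected_priors):
    """RMS difference between sorted actual and sorted expected proportions."""
    k = len(expected_priors)
    actual = sorted(labels.count(c) / len(labels) for c in range(k))
    expected = sorted(expected_priors)
    return math.sqrt(sum((a - e) ** 2 for a, e in zip(actual, expected)) / k)

# Toy unlabeled samples over a hypothetical 2-gene subset (made-up numbers).
points = [[1.0, 1.0], [5.0, 5.0], [1.2, 0.9], [4.8, 5.1],
          [0.9, 1.1], [5.2, 4.9], [1.1, 1.0]]
labels, centroids = kmeans(points, k=2)
print(cluster_separation(points, labels, centroids))
print(proportion_mismatch(labels, expected_priors=[0.6, 0.4]))
```

Sorting both proportion vectors before comparing them makes the mismatch independent of the arbitrary cluster numbering, which is why no cluster-to-class assignment is needed.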
For each experiment, we did the following:
1. Iterate the GLAD algorithm on labeled training data only
2. Iterate the GLAD algorithm on labeled and unlabeled training data
3. Compare model accuracy on test data across the generated populations of models
In these experiments, GLAD was run for 100 iterations with a population size of 5000, and a subset size of 3 features.
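A wrapper-style genetic search over fixed-size gene subsets could look roughly like the sketch below. The selection, crossover, and mutation operators are not described in the text, so the ones used here (keep the top half, sample a child from the union of two parents, occasionally swap one gene) are assumptions, as are the function names, the `mutation_rate` parameter, and the toy scoring function. The defaults are far smaller than the population of 5000 and 100 iterations reported above.

```python
import random

def run_ga(n_features, score_fn, subset_size=3, pop_size=50, iterations=20,
           mutation_rate=0.2, seed=0):
    """Search for a high-scoring feature subset of fixed size.
    Operators are illustrative assumptions, not the paper's GA."""
    rng = random.Random(seed)
    population = [tuple(sorted(rng.sample(range(n_features), subset_size)))
                  for _ in range(pop_size)]
    for _ in range(iterations):
        # Selection: keep the better-scoring half as parents (elitism).
        ranked = sorted(population, key=score_fn, reverse=True)
        parents = ranked[:max(2, pop_size // 2)]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Crossover: draw the child's genes from the union of both parents.
            pool = sorted(set(a) | set(b))
            child = rng.sample(pool, subset_size)
            # Mutation: replace one gene with a gene not already in the child.
            if rng.random() < mutation_rate:
                candidates = [g for g in range(n_features) if g not in child]
                child[rng.randrange(subset_size)] = rng.choice(candidates)
            children.append(tuple(sorted(child)))
        population = parents + children
    return max(population, key=score_fn)

# Hypothetical scoring function: counts how many of three "informative"
# gene indices a subset contains. In GLAD this would be the weighted
# labeled/unlabeled score of the models built on the subset.
INFORMATIVE = {2, 7, 11}

def toy_score(subset):
    return len(INFORMATIVE & set(subset))

best = run_ga(n_features=20, score_fn=toy_score, seed=1)
print(best, toy_score(best))
```

Because the parents are carried over unchanged into the next generation, the best score in the population never decreases between iterations.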
Improvements by adding unlabeled samples for AML-ALL
Improvements by adding unlabeled samples for CML
Improvements by adding unlabeled samples for DLBCL
In this study we proposed a new technique for concurrently mining labeled and unlabeled datasets. This method supplements standard supervised learning with clustering of data lacking clinical annotation to estimate the predictive power of gene subsets. The performance of our algorithm was evaluated in comparison with supervised learning only on microarray data from three different cancer types. Our results show that adding unlabeled samples can increase the accuracy of classification significantly.
We thank Exagen Diagnostics for support in conducting this research and presenting these results.
This article has been published as part of BMC Genomics Volume 9 Supplement 2, 2008: IEEE 7th International Conference on Bioinformatics and Bioengineering at Harvard Medical School. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/9?issue=S2
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.