A jackknife-like method for classification and uncertainty assessment of multi-category tumor samples using gene expression information
© Zhang et al; licensee BioMed Central Ltd. 2010
Received: 10 March 2009
Accepted: 29 April 2010
Published: 29 April 2010
The use of gene expression profiling for the classification of human cancer tumors has been widely investigated. Previous studies were successful in distinguishing several tumor types in binary problems. As there are over a hundred types of cancers, and potentially even more subtypes, it is essential to develop multi-category methodologies for molecular classification for any meaningful practical application.
A jackknife-based supervised learning method called paired-samples test algorithm (PST), coupled with a binary classification model based on linear regression, was proposed and applied to two well known and challenging datasets consisting of 14 (GCM dataset) and 9 (NCI60 dataset) tumor types. The results showed that the proposed method improved the prediction accuracy of the test samples for the GCM dataset, especially when the t-statistic was used in the primary feature selection. For the NCI60 dataset, the application of PST improved prediction accuracy when the number of genes used was relatively small (100 or 200). These improvements made the binary classification method more robust to the gene selection mechanism and the number of genes used. The overall prediction accuracies were competitive with the most accurate results obtained by several previous studies on the same datasets and with other methods. Furthermore, the relative confidence R(T) provided a unique insight into the sources of the uncertainty shown in the statistical classification and the potential variants within the same tumor type.
We proposed a novel bagging method for the classification and uncertainty assessment of multi-category tumor samples using gene expression information. The strengths were demonstrated in the application to two benchmark datasets.
The use of gene expression profiling for the classification of human cancers has been widely investigated. Previous works were successful in predicting tumor types in the context of binary problems. Many algorithms for feature extraction and sample classification have been proposed [1–6]. More recently, a method for addressing the potential mislabeling in the training set was proposed for binary classification of cancer samples. As there are over a hundred types of cancers, and potentially even more subtypes, it is essential to develop multi-category methodologies for molecular classification for any practical application.
Multi-category prediction can be achieved using binary classification algorithms via the one-versus-one (OVO) and/or one-versus-rest (OVR) partition of the training data set. However, in cancer type prediction, multi-category problems have proved to be more challenging than simple binary problems, and the reported results were less than satisfactory [3, 10]. On one hand, when the available resource is limited and the sample size of a given category (class) is small, classifiers based on the OVR partition of the data set potentially suffer from severe over-fitting, leading to low predictive ability and robustness. Furthermore, the substantial noise introduced by implementing the numerous classifiers under an OVO scheme and the asymmetric training sets caused by OVR partitioning of the data will inevitably weaken the classification system. On the other hand, the effects of biological and technical noise together with the genetic heterogeneity of samples within a clinically defined tumor class decrease the predictive power in a multi-category setting.
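As a rough illustration of the two partition schemes, the sketch below counts the binary classifiers each scheme requires and shows the class balance seen by a single OVR classifier. The helper names are hypothetical; the figures of 14 tumor types and 8 training samples per type follow the GCM setup used in this study.

```python
from itertools import combinations

def ovr_partitions(types):
    """One-versus-rest: one (target, rest) split per tumor type."""
    return [(t, [u for u in types if u != t]) for t in types]

def ovo_partitions(types):
    """One-versus-one: one split per unordered pair of tumor types."""
    return list(combinations(types, 2))

n_types = 14            # tumor types in the GCM dataset
samples_per_type = 8    # training samples per type used in this study
types = [f"type_{k}" for k in range(n_types)]

n_ovr = len(ovr_partitions(types))   # k classifiers
n_ovo = len(ovo_partitions(types))   # k * (k - 1) / 2 classifiers

# Class balance seen by any single OVR classifier
ovr_positive = samples_per_type
ovr_negative = samples_per_type * (n_types - 1)

print(n_ovr, n_ovo, ovr_positive, ovr_negative)
```

OVO needs many more classifiers, each trained on very few samples, while every OVR classifier faces a strongly asymmetric training set.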
The jackknife is a well known, non-parametric method often used for estimating the sampling distribution of a statistic. Given a sample dataset and a desired statistic (e.g., the mean), the jackknife works by computing the desired statistic with an element (or a group of elements) removed from the dataset. The process is repeated for each element in the dataset. The application in cancer classification with gene expression profiling has been reported in the context of binary problems. In that study, the individual maximum difference subsets (MDSSs) of genes identified from a set of jackknife subsets of samples were aggregated to generate the "overall MDSS" in order to return the expected classification. In other words, jackknife was used for feature selection rather than for training multiple sub-classifiers.
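The leave-one-out procedure described above can be sketched in a few lines. This is a generic jackknife for an arbitrary statistic, not the PST itself; the function name and the toy data are illustrative assumptions.

```python
import numpy as np

def jackknife(data, statistic):
    """Recompute `statistic` with each element removed in turn.

    Returns the leave-one-out replicates and the jackknife standard error.
    """
    data = np.asarray(data, dtype=float)
    n = len(data)
    replicates = np.array([statistic(np.delete(data, i)) for i in range(n)])
    se = np.sqrt((n - 1) / n * np.sum((replicates - replicates.mean()) ** 2))
    return replicates, se

values = [1.0, 2.0, 3.0, 4.0]          # toy data
replicates, se = jackknife(values, np.mean)
print(replicates, se)
```

For the mean, the jackknife standard error coincides with the usual standard error of the mean, which gives a convenient sanity check.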
In this study, a new learning method called paired-samples test algorithm (PST), which is based on the jackknife method, was used to classify multiple tumor types using gene expression data. The proposed method is designed for solving multi-category problems under an OVR scheme with a very limited training data set, and it is similar to the bootstrap aggregating (bagging) procedure, which proved to be helpful in improving weak classifiers [13, 14]. In order to get a relative measurement of uncertainty in the prediction of a sample category (class), the training sample being removed (validation sample) each time was predicted together with the training samples. The procedure was implemented in a parsimonious way, making its integration with a computationally intensive algorithm, such as the stochastic, regulation-based binary regression, feasible. The performance of the proposed method was evaluated under several scenarios of gene selection criteria using two well known and challenging datasets: the GCM and NCI60 datasets containing 14 and 9 cancer tumor types, respectively.
Results and Discussion
Determination of the optimum number of genes (features) to be used by the classification algorithm is usually a difficult task that depends on several factors, including the classification algorithm and the complexity of the data set. For the binary regression algorithm used here, previous studies have shown that a feature set of one to two hundred top genes is adequate for a simple two-category problem [6, 7]. In this study, the size of the feature set used was 200, 300, 500 or 1000 genes for the GCM dataset and 100, 200, 300 or 5000 genes for the NCI60 dataset.
It should be noted that while the largest improvements were seemingly coming from the weaker gene selection mechanisms, the application of PST made the binary regression algorithm more robust in relation to the gene selection methods and the size of the gene set to be used.
Table: Comparison between the incorrectly classified samples obtained using the proposed method and in previous studies using the same data set. Surviving entries: 3 (LU, LU, BL); 3 (LE, PA, UT).
It is possible that the superiority of the proposed method over SVM and other learning algorithms could be related to the difference between the gene selection methods used in this study and those used by Ramaswamy et al (2001) and Bagirov et al (2003) [11, 15]. However, our preliminary work, as well as readily available information [10, 18] demonstrating that SVM in general outperformed k-NN, NN (neural network), PNN (probabilistic neural network) and the decision tree, does not support such a claim. In fact, the highest accuracies obtained using SVM occurred when 200-1000 genes were selected based on FC, t-statistics, penalized t-statistics and non-parametric ANOVA, ranging from 72.2% to 75.9%. These were well below the results obtained using our approach.
As indicated in Table 1, it seems that some tumor types are easily predicted. For example, LY, UT, ME and CNS tumors had 100% prediction accuracy using all three methods. Meanwhile, other types, such as BR, had a high misclassification rate ranging from 50 to 75%, indicating potential excess heterogeneity. Additionally, the profile of misclassified samples was very different between the four studies. In fact, among the four BR tumors, two were misclassified as OV and PA in Ramaswamy et al (2001), three were misclassified as LU, LU and BL in Bagirov et al (2003), and three were misclassified as LE, PA, and UT in the current study.
Prediction uncertainty
In cancer type predictions, multi-category problems have proven to be more challenging than binary cases, not only in the classification accuracy but also in the assessment of uncertainty. In this paper, a jackknife-like classification method, called paired-samples test algorithm (PST), was proposed and applied to two benchmark datasets of multiple tumor types [11, 20]. The results showed that the proposed method improved the prediction accuracy of test samples in the GCM dataset, especially when t-statistics were used for primary feature selection. For the NCI60 dataset, improvement was observed only when the number of genes used was relatively small. These improvements made the binary regression algorithm more robust to gene selection and the number of genes used.
The core idea of the proposed method is to repeatedly test a certain known tumor type with a blind test sample while withholding an associated training sample; in this way, not only can the prediction be made but also the relative confidence R(T) of the prediction can be assessed by measuring the difference between the prediction probability of the test sample and the corresponding value of the withheld training sample. R(T) provided insight into the sources of the uncertainty in the statistical classification by revealing the loss in confidence due to the utilization of weak classifiers or heterogeneity in a given tumor type. It is possible to combine the measurements F(T) and R(T) to make a better score for type determination. Our continuing work will consider this possibility with regard to penalizing a negative R(T) value.
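The withhold-and-compare idea can be sketched as follows. This is only an illustration under stated assumptions: a simple centroid-based classifier stands in for the binary regression model, and the per-type scores are taken as plain averages over the withheld samples, which may differ from the exact definitions of F(T) and R(T) used in the study. All function names are hypothetical.

```python
import numpy as np

def fit_centroid_classifier(X, y):
    """Hypothetical stand-in binary classifier (NOT the binary regression)."""
    mu_pos = X[y == 1].mean(axis=0)
    mu_neg = X[y == 0].mean(axis=0)
    w = mu_pos - mu_neg
    b = -0.5 * (mu_pos + mu_neg) @ w
    return w, b

def predict_prob(model, x):
    """Logistic score in (0, 1) for membership in the target type."""
    w, b = model
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

def paired_samples_test(X_train, labels, x_test, tumor_type):
    """Withhold each training sample of `tumor_type` in turn (OVR setup).

    Each sub-classifier predicts both the blind test sample and the withheld
    training sample. Returns (mean test probability, mean difference between
    the test and withheld probabilities); averaging is an assumption.
    """
    y = (labels == tumor_type).astype(int)
    p_test, p_withheld = [], []
    for i in np.flatnonzero(y == 1):
        keep = np.arange(len(y)) != i
        model = fit_centroid_classifier(X_train[keep], y[keep])
        p_test.append(predict_prob(model, x_test))
        p_withheld.append(predict_prob(model, X_train[i]))
    p_test = np.array(p_test)
    p_withheld = np.array(p_withheld)
    return p_test.mean(), (p_test - p_withheld).mean()

# Synthetic data: three well-separated types, 8 training samples each
rng = np.random.default_rng(1)
centers = {"A": [3, 0, 0, 0, 0], "B": [0, 3, 0, 0, 0], "C": [0, 0, 3, 0, 0]}
labels = np.repeat(list(centers), 8)
X_train = np.vstack([rng.normal(centers[t], 0.5, size=(8, 5)) for t in centers])
x_test = np.array(centers["A"], dtype=float)   # blind sample resembling type A

scores = {t: paired_samples_test(X_train, labels, x_test, t) for t in centers}
print(scores)
```

A difference near zero indicates that the test sample is predicted about as confidently as a genuine but unseen member of the type, whereas a clearly negative value signals reduced confidence.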
Paired-samples test algorithm
When the distribution of the data is complex and/or the training set is small compared to the feature dimension, the combined decision of an ensemble of multiple classifiers can be used to improve the performance of a single classification rule. The bagging procedure is one such technique widely used to establish multiple classifiers. It consists of training a set of classifiers, each being based on a bootstrap replicate of the training set, and aggregating them according to relative credits or weights. In the process of training classifiers for tumor prediction using microarray data, a feature selection step is usually performed in order to decrease noise. Therefore, the bagging technique could improve the robustness of the prediction mainly due to the fact that each classifier has its specific training set and group of selected features. However, for multi-category problems, the application of bagging techniques is subject to some limitations due to the partition of the training set as described in the following paragraph.
Assume the original training set includes 10 tumor types labeled with letters from A to J, and each type has 8 samples. In establishing an OVR classifier for separating type A from others, the training data will be divided into two groups, one containing the A samples (8 samples) and another containing the remaining 72 samples (B to J). Although the training set is not small in size, it is extremely asymmetric. Theoretically, in a bootstrap replicate of the same size n, the probability of a given sample being included is 1 - (1 - 1/n)^n ≈ 0.632. Thus, the number of A samples in some replicates may be very small, leading to an inaccurate classifier. Furthermore, a valid bagging technique requires a large number of replicates. Consequently, combining a bagging technique with computationally intensive classification algorithms and gene extraction methods may become impractical due to high computational cost. In order to overcome these shortcomings, we propose a paired-samples testing algorithm, a parsimonious jackknife-like method.
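The bootstrap argument can be checked numerically for the 10-type example. The sketch below uses the standard result that a given sample appears in a bootstrap replicate of size n with probability 1 - (1 - 1/n)^n, and treats the number of draws landing on type A as binomial; the function names and the threshold of 4 draws are illustrative choices.

```python
from math import comb

def inclusion_probability(n):
    """Chance that a given sample appears at least once in a bootstrap
    replicate of size n drawn with replacement from n samples."""
    return 1.0 - (1.0 - 1.0 / n) ** n

def prob_at_most(k, n_draws, p):
    """P(X <= k) for X ~ Binomial(n_draws, p)."""
    return sum(comb(n_draws, j) * p**j * (1.0 - p) ** (n_draws - j)
               for j in range(k + 1))

n_total = 80    # 10 tumor types x 8 samples
n_type_a = 8    # samples of type A

p_incl = inclusion_probability(n_total)
# Probability that at most 4 of the 80 draws land on a type A sample
p_few_a = prob_at_most(4, n_total, n_type_a / n_total)

print(round(p_incl, 3), round(p_few_a, 3))
```

A non-negligible share of replicates therefore contains only a handful of type A draws, some of them repeats of the same sample, which is the imbalance problem described above.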
Paired-samples test algorithm (PST)
Binary classification algorithm
A binary classification algorithm was nested in PST and was performed to establish the series of sub-classifiers and calculate the classification probabilities, such as CLA_{-i} and p(T → A | CLA_{-i}), as indicated in Figure 7. Prior to each implementation, the genes were selected using the information of the involved training samples and the methods described in section 2.3.
where Φ (·) was the standard normal cumulative distribution function.
Due to the fact that the number of genes was larger than the number of samples, a dimension reduction technique called singular value decomposition (SVD) was performed on the gene expression matrix X. The resulting model is equivalent to the one in equation (3) but with "eigen-genes" as the exploratory factors. The reduced regression model had the dimension of its parameter vector γ equal to the number of samples in the training set. The parameter vector γ was estimated using Gibbs sampling, and β in equations (2) and (3) was obtained by a linear transformation of the estimates of the reduced model [6, 7]. The technical details of the implementation of SVD and Gibbs sampling can be found in West et al (2001) and Zhang et al (2006) [6, 7]. We also tried using ICA (Independent Component Analysis) to replace SVD for dimension reduction; however, no improvement in the prediction performance was achieved (results not shown).
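The equivalence between the full and the reduced model can be illustrated with a small numerical sketch. It assumes X is stored as samples by genes and uses a random vector in place of the Gibbs estimate of γ; the variable names and the specific back-transformation β = Vγ are assumptions for illustration and may differ from the parameterization in West et al (2001).

```python
from math import erf, sqrt
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_genes = 10, 200
X = rng.normal(size=(n_samples, n_genes))    # synthetic expression matrix

# Thin SVD: X = U diag(s) Vt, with at most n_samples non-zero components
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Z = U * s                                    # eigen-gene scores, n_samples x n_samples

gamma = rng.normal(size=n_samples)           # placeholder for the Gibbs estimate
beta = Vt.T @ gamma                          # linear map back to gene-level coefficients

eta_reduced = Z @ gamma                      # linear predictor, reduced model
eta_full = X @ beta                          # linear predictor, full model

# Probit link: Phi is the standard normal cumulative distribution function
prob = np.array([0.5 * (1.0 + erf(v / sqrt(2.0))) for v in eta_reduced])

print(Z.shape, np.allclose(eta_reduced, eta_full))
```

The reduced model has only as many parameters as training samples, yet yields the same linear predictor, so sampling can be carried out in the small space.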
Feature selection steps were required for cancer classification with gene expression profiling. In this study, four gene ranking methods were combined with PST. All the calculations were based on log2 transformed gene expression data.
Fold change (FC)
It is calculated with the formula FC = |M_o - M_r|. In the OVR setup, M_o represents the mean of the training samples in a single tumor type to be separated from the others, and M_r represents the mean of the training samples in all other cancer types.
where S_p is the pooled standard deviation, N_o and N_r are the numbers of the training samples in the two groups, respectively, and M_o and M_r are the same as defined above. Gene ranking is based on the absolute values.
In this study, δ was set to a value equal to the ninetieth percentile of the distribution of the standard deviations S_p for all genes in the array.
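The three mean-based ranking criteria can be sketched together. FC follows the formula given above. The t-statistic is written in the standard pooled two-sample form, which the symbol definitions suggest but which is an assumption here. The placement of δ in the penalized statistic (added to S_p in the denominator) is likewise an assumption, not a form confirmed by the text. The function name and the synthetic data are hypothetical.

```python
import numpy as np

def rank_genes(X, is_target, delta_percentile=90.0):
    """Rank genes (columns of X, log2 scale) for one OVR split.

    Returns, per criterion, gene indices ordered from most to least informative.
    """
    X = np.asarray(X, dtype=float)
    is_target = np.asarray(is_target, dtype=bool)
    Xo, Xr = X[is_target], X[~is_target]
    n_o, n_r = len(Xo), len(Xr)
    m_o, m_r = Xo.mean(axis=0), Xr.mean(axis=0)

    fc = np.abs(m_o - m_r)                                  # fold change
    s_p = np.sqrt(((n_o - 1) * Xo.var(axis=0, ddof=1)
                   + (n_r - 1) * Xr.var(axis=0, ddof=1)) / (n_o + n_r - 2))
    scale = np.sqrt(1.0 / n_o + 1.0 / n_r)
    t = (m_o - m_r) / (s_p * scale)                         # assumed pooled t
    delta = np.percentile(s_p, delta_percentile)            # 90th percentile of S_p
    t_pen = (m_o - m_r) / ((s_p + delta) * scale)           # assumed penalized form

    return {name: np.argsort(-np.abs(stat))
            for name, stat in (("fc", fc), ("t", t), ("t_pen", t_pen))}

# Synthetic example: 12 samples, 5 genes, gene 0 shifted in the target type
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(12, 5))
target = np.arange(12) < 4
X_demo[target, 0] += 10.0
orders = rank_genes(X_demo, target)
print(orders)
```

In use, the leading entries of each ordering (for example the top 200 genes) would be passed to the binary classifier.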
Kruskal-Wallis non-parametric test (NP-ANOVA)
where R_i is the sum of ranks in group i, n_i is the number of observations in the i-th group, and n is the sample size. There are e distinct values, with v_1 equal to the smallest, v_2 equal to the next smallest and so on. In the OVR setup, the test reduces to the two-sided Mann-Whitney test. For gene ranking, only the statistic W is required.
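A rank-based statistic of this kind can be computed as sketched below. The code uses the textbook Kruskal-Wallis form 12 / (n(n + 1)) · Σ R_i² / n_i − 3(n + 1) with average ranks for tied values and no further tie correction; whether this matches the exact statistic W used in the study is an assumption, and the function names are hypothetical.

```python
import numpy as np

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    sorted_vals = values[order]
    ranks = np.empty(len(values))
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and sorted_vals[j + 1] == sorted_vals[i]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0   # mean of ranks i+1 .. j+1
        i = j + 1
    return ranks

def kruskal_wallis_statistic(groups):
    """Textbook Kruskal-Wallis statistic for a list of 1-D groups."""
    sizes = [len(g) for g in groups]
    n = sum(sizes)
    ranks = average_ranks(np.concatenate([np.asarray(g, float) for g in groups]))
    total, start = 0.0, 0
    for n_i in sizes:
        r_i = ranks[start:start + n_i].sum()          # rank sum R_i of group i
        total += r_i ** 2 / n_i
        start += n_i
    return 12.0 / (n * (n + 1)) * total - 3.0 * (n + 1)

# OVR example for one gene: target type versus all other types
stat = kruskal_wallis_statistic([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
print(stat)
```

With two groups, ranking genes by this statistic gives the same ordering as the two-sided Mann-Whitney test, consistent with the OVR remark above.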
The proposed method was applied to two well-known and challenging datasets that have been analyzed previously by several groups.
The GCM dataset consisted of 144 training and 54 testing samples, representing 14 tumor types. These tumor types included BR (breast adenocarcinoma), PR (prostate adenocarcinoma), LU (lung adenocarcinoma), CO (colorectal adenocarcinoma), LY (lymphoma), BL (bladder transitional cell carcinoma), ML (melanoma), UT (uterine adenocarcinoma), LE (leukemia), RE (renal cell carcinoma), PA (pancreatic adenocarcinoma), OV (ovarian adenocarcinoma), ME (pleural mesothelioma) and CNS (central nervous system). All samples were primary tumors with the exception of eight metastatic tumors in the test set. Expression data was generated using Affymetrix high-density oligonucleotide microarrays containing 16,043 known human genes or expressed sequence tags (EST). Cancer types LY, CNS and LE have more training samples in the original dataset. Based on the literature [11, 15] and our primary analysis of the data, samples for these three types were consistently predicted with very high accuracy (95-100%). In the current study, in order to remove the concern that the high accuracy was related to the larger training sets, only 8 training samples for each of the tumor types were used for the prediction of test samples.
The NCI60 dataset consisted of 60 cell lines derived from tumors: 8 breast, 5 central nervous system, 7 colon, 6 leukemia, 8 melanoma, 9 non-small-cell lung carcinoma (NSCLC), 6 ovarian, 2 prostate, and 8 renal. Because of their small class size, the two prostate cell lines were excluded from our analysis. Expression data was generated using Affymetrix high-density oligonucleotide microarrays containing 6,817 human genes.
For both datasets, the expression intensities for each gene were calculated using Affymetrix GENECHIP analysis software. In the current study, some preprocessing was conducted on the data provided in literature [11, 20]. It consisted of threshold treatment of the expression intensities with 20 for GCM data (1 for NCI60 data) and 16,000 as the lower and upper limit, respectively, after which the log2 transformation was applied. Further, genes with the highest transformed intensity smaller than two times the minimum expression across all samples of each dataset were deleted.
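The preprocessing steps can be sketched as follows. The thresholds (20 or 1 as the lower limit, 16,000 as the upper limit) are taken from the text. The filtering rule is implemented under one literal reading, namely that a gene is dropped when its highest log2 intensity is below twice the minimum log2 intensity in the dataset; this reading, the genes-by-samples layout and the function name are assumptions.

```python
import numpy as np

def preprocess(intensities, floor, ceiling=16000.0):
    """Threshold, log2-transform and filter a genes x samples matrix.

    floor: lower limit (20 for GCM, 1 for NCI60 according to the text).
    Returns the filtered log2 matrix and a boolean mask of retained genes.
    """
    clipped = np.clip(np.asarray(intensities, dtype=float), floor, ceiling)
    logged = np.log2(clipped)
    cutoff = 2.0 * logged.min()               # assumed reading of the filter rule
    keep = logged.max(axis=1) >= cutoff
    return logged[keep], keep

# Toy matrix: 3 genes x 3 samples, GCM-style lower limit of 20
raw = np.array([[5.0, 10.0, 30.0],
                [20.0, 500.0, 20000.0],
                [100.0, 50.0, 25.0]])
filtered, keep = preprocess(raw, floor=20.0)
print(keep, filtered)
```

Under this reading, genes whose expression never rises clearly above the floor are removed before gene ranking.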
The authors are grateful to the two reviewers for their constructive comments.
1. Furey TS, Cristianini N, Duffy N, Bednarski DW, Schummer M, Haussler D: Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics. 2000, 16: 906-914. 10.1093/bioinformatics/16.10.906.
2. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999, 286: 531-537. 10.1126/science.286.5439.531.
3. Dudoit S, Fridlyand J, Speed T: Comparison of discrimination methods for classification of tumors using gene expression data. J Am Stat Ass. 2002, 97: 77-87. 10.1198/016214502753479248.
4. Khan J, Wei JS, Ringner M, Saal LH, Ladanyi M, Westermann F, Berthold F, Schwab M, Antonescu CR, Peterson C, Meltzer PS: Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat Med. 2001, 7: 673-679. 10.1038/89044.
5. Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci. 2001, 98: 5116-5121. 10.1073/pnas.091062498.
6. West M, Blanchette C, Dressman H, Huang ER, Ishida S, Spang R, Zuzan H, Olson JA, Marks JR, Nevins JR: Predicting the clinical status of human breast cancer by using gene expression profiles. Proc Natl Acad Sci. 2001, 98: 11462-11467. 10.1073/pnas.201162998.
7. Zhang W, Rekaya R, Bertrand JK: A method for predicting disease subtypes in presence of misclassification among training samples using gene expression: application to human breast cancer. Bioinformatics. 2006, 22: 317-325. 10.1093/bioinformatics/bti738.
8. Hanahan D, Weinberg R: The hallmarks of cancer. Cell. 2000, 100: 57-70. 10.1016/S0092-8674(00)81683-9.
9. Yeang CH, Ramaswamy S, Tamayo P, Mukherjee S, Rifkin RM, Angelo M, Reich M, Lander E, Mesirov J, Golub T: Molecular classification of multiple tumor types. Bioinformatics. 2001, 17 (suppl.1): S316-S322.
10. Statnikov A, Aliferis CF, Tsamardinos I, Hardin D, Levy S: A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics. 2005, 21: 631-643. 10.1093/bioinformatics/bti033.
11. Ramaswamy S, Tamayo P, Rifkin R, Mukherjee S, Yeang CH, Angelo M, Ladd C, Reich M, Latulippe E, Mesirov JP: Multiclass cancer diagnosis using tumor gene expression signatures. Proc Natl Acad Sci. 2001, 98: 15149-15154. 10.1073/pnas.211566398.
12. Lyons-Weiler J, Patel S, Bhattacharya S: A classification-based machine learning approach for the analysis of genome-wide expression data. Genome Res. 2003, 13: 503-512. 10.1101/gr.104003.
13. Skurichina M, Duin RPW: Bagging, boosting and the random subspace method for linear classifiers. Pattern Anal Appl. 2002, 5: 121-135. 10.1007/s100440200011.
14. Statnikov A, Wang L, Aliferis CF: A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinformatics. 2008, 9: 319. 10.1186/1471-2105-9-319.
15. Bagirov AM, Ferguson B, Ivkovic S, Saunders G, Yearwood J: New algorithms for multi-class diagnosis using tumor gene expression signatures. Bioinformatics. 2003, 19: 1800-1807. 10.1093/bioinformatics/btg238.
16. Antonov AV, Tetko IV, Mader MT, Budczies J, Mewes HW: Optimization models for cancer classification: extracting gene interaction information from microarray expression data. Bioinformatics. 2004, 20: 644-652. 10.1093/bioinformatics/btg462.
17. Shen L, Tan EC: Reducing multiclass cancer classification to binary by output coding and SVM. Comput Biol Chem. 2006, 30 (1): 63-71. 10.1016/j.compbiolchem.2005.10.008.
18. Li T, Zhang C, Ogihara M: A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression. Bioinformatics. 2004, 20: 2429-2437. 10.1093/bioinformatics/bth267.
19. Ross DT, Scherf U, Eisen MB, Perou CM, Rees C, Spellman P, Iyer V, Jeffrey SS, Van de Rijn M, Waltham M: Systematic variation in gene expression patterns in human cancer cell lines. Nat Genet. 2000, 24: 227-235. 10.1038/73432.
20. Staunton JE, Slonim DK, Coller HA, Tamayo P, Angelo MJ, Park J, Scherf U, Lee JK, Reinhold WO, Weinstein JN, Mesirov JP, Lander ES, Golub TR: Chemosensitivity prediction by transcriptional profiling. Proc Natl Acad Sci. 2001, 98: 10782-10794. 10.1073/pnas.191368598.
21. Breiman L: Bagging predictors. Machine Learning. 1996, 24 (2): 123-140.
22. Albert J, Chib S: Bayesian analysis of binary and polychotomous response data. J Am Stat Ass. 1993, 88: 669-679. 10.2307/2290350.
23. Johnson VE, Albert JH: Ordinal Data Modeling. 1999, Springer New York
24. Wall ME, Rechtsteiner A, Rocha LM: Singular value decomposition and principal component analysis. A Practical Approach to Microarray Data Analysis. Edited by: Berrar DP, Dubitzky W, Granzow M. 2003, Kluwer: Norwell, 91-109.
25. Huang D, Zheng C: Independent component analysis-based penalized discriminant method for tumor classification using gene expression data. Bioinformatics. 2006, 22: 1855-1862. 10.1093/bioinformatics/btl190.
26. Efron B, Tibshirani R, Storey D, Tusher V: Empirical Bayes analysis of a microarray experiment. J Am Stat Ass. 2001, 96: 1151-1160. 10.1198/016214501753382129.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.