Novel methods to identify biologically relevant genes for leukemia and prostate cancer from gene expression profiles
© Chen et al; licensee BioMed Central Ltd. 2010
Received: 26 November 2009
Accepted: 30 April 2010
Published: 30 April 2010
High-throughput microarray experiments now permit researchers to screen thousands of genes simultaneously and determine the different expression levels of genes in normal or cancerous tissues. In this paper, we address the challenge of selecting a relevant and manageable subset of genes from a large microarray dataset. Currently, most gene selection methods focus on identifying a set of genes that can further improve classification accuracy. Few or none of these small sets of genes, however, are biologically relevant (i.e. supported by medical evidence). To deal with this critical issue, we propose two novel methods that can identify biologically relevant genes concerning cancers.
In this paper, we propose two novel techniques, entitled random forest gene selection (RFGS) and support vector sampling technique (SVST). Compared with results from six other methods developed in this paper, we demonstrate experimentally that RFGS and SVST can identify more biologically relevant genes in patients with leukemia or prostate cancer. Among the top 25 genes selected using the SVST method, 15 were biologically relevant in patients with leukemia and 13 were biologically relevant in patients with prostate cancer. The RFGS method, while less effective than SVST, still identified an average of 9 biologically relevant genes in both leukemia and prostate cancers. Traditional statistical methods, by contrast, identified fewer than 8 such genes in patients with leukemia and fewer than 8 in patients with prostate cancer, so our methods yield significantly better results.
Our proposed SVST and RFGS methods are novel approaches that can identify a greater number of biologically relevant genes. These methods have been successfully applied to both leukemia and prostate cancers. Research in the fields of biology and medicine should benefit from the identification of biologically relevant genes by confirming recent discoveries in cancer research or suggesting new avenues for exploration.
The completion of the Human Genome Project (HGP) has been recognized as a great achievement in the study of biomedicine; the project not only provided comprehensive information on the human genome but also inspired new ways to study human diseases such as cancers. Concurrent with the advancement of the HGP, several high-throughput and rapid gene function analysis techniques were developed. Among them, microarray may be the most mature technique, and it has become a major data resource in gene function research [1–3]. Over the past few years, microarray-based gene expression profiling has proven to be a promising approach in predicting cancer classification and prognosis outcomes [4–6]. In most cases, cancer diagnosis depends on using a complex combination of clinical and histopathological data. However, it is often difficult or impossible to recognize tumor types in atypical instances. To translate microarray data into functional physiological information, a set of genes with the maximum amount of information and a minimum amount of noise is needed. For example, diagnostic tests that measure the abundance of a given protein in serum may be derived from a small subset of biologically relevant genes.
In cancer classification, one of the reasons one may wish to select a minimum set of genes is to avoid the over-fitting problem caused by attempting to apply a large number of genes to a small number of samples. Several statistical and machine learning techniques have been applied in selecting informative genes, such as t-Test, k-nearest neighbors, clustering methods, self-organizing maps (SOM), genetic algorithm, back-propagation neural network [11–13], probabilistic neural network, decision tree, random forest, and support vector machines (SVM) [16, 17]. Although these methods can select a smaller set of informative genes, only a small percentage of these so-called "informative" genes are biologically relevant as proved by medical experiments. Our goal in this paper, therefore, is to best identify biologically relevant genes from a small set of genes using our proposed methods. We present a novel approach that addresses different considerations, including: (1) the identification of quality samples, (2) the selection of a small set of informative genes from these samples, (3) the comparison of these genes with medical literature, and (4) the interpretation of their biological relevance.
Prostate cancer and leukemia are very common cancers in the United States. In 2007 alone, approximately 24 800 new cases and 12 320 deaths among males were attributed to leukemia. Among males age 40 and below, leukemia is the most common fatal cancer. Meanwhile, 19 440 new cases and 9 470 deaths among females were attributed to leukemia, and it is the leading cause of cancer death among females below age 20. Acute lymphocytic leukemia (ALL) is the most common cancer in children age 14 and below. Prostate cancer, on the other hand, in 2007 accounted for almost 29% (218 890) of incidents in males. For men age 80 and older, prostate cancer is the second most common cause of cancer death. Based on cases diagnosed between 1996 and 2002, an estimated 91% of these new cases are expected to be diagnosed at the local or regional level, for which the 5-year relative survival rate approaches 100% [18, 19]. Therefore, the identification of biologically relevant genes is of fundamental and practical interest. The examination of these genes may be useful in confirming recent discoveries in cancer research or suggesting new methods for exploration.
In this paper, we examine eight methods for identifying biologically relevant genes. Among them are six statistics methods [20, 21] and two machine learning methods. The statistics methods include three parametric methods: Signal-to-noise ratio (SNR) [22–24], t-Test [23, 25], and Least Significant Difference (LSD) [13, 26]. They also include three nonparametric methods: Threshold Number of Misclassification (TNoM), Minimum Distance to Modal Ranking (MDMR) [27, 28], and Weighted Punishment on Overlap (WEPO) [29, 30]. In addition to these six statistics methods, we propose two new methods using machine learning approaches: Random forest gene selection (RFGS) and Support Vector Sampling technique (SVST). For each of these, we first introduce the underlying theory and the process of computation. Then, we apply these methods to both leukemia and prostate cancer datasets. We compare the top 25 genes identified by each method with those reported in current medical literature, thus pinpointing the genes most related to leukemia and prostate cancer. The results show that our proposed SVST method is significantly better than statistical methods for identifying biologically relevant genes in leukemia and prostate cancer.
The remainder of this paper is organized as follows: Section 2 discusses the various statistics-based gene selection methods considered in the paper. Section 3 describes our two proposed machine learning methods. Section 4 describes the experiment results and discusses leukemia and prostate cancer. Finally, Section 5 presents the conclusions of our study.
Statistics-Based Gene Selection Methods
Gene selection is widely used to select target genes in the diagnosis of cancers. One of the primary goals of gene selection is to avoid the over-fitting problems caused by the high dimensions and relatively small number of samples of microarray data. Theoretically, in cancer classification, only informative genes which are highly related to particular classes (or subtypes) should be selected. In microarray data analysis, the challenge is to select informative genes that clearly differentiate the classes. Since the number of informative genes is very small compared to the total number of genes in each experiment, utilizing a better search technique is critical. We divide such techniques into two main categories: statistics-based methods and machine learning-based methods. In this section, we discuss the statistics methods; the machine learning-based methods are addressed in the next section.
The statistics methods rank or score the discriminability of each gene based on its own gene expression patterns. Both parametric and nonparametric approaches for estimations of discriminability have been proposed. The parametric estimation approach assesses the discriminability of genes using a variety of statistical analyses, including Signal-to-noise ratio (SNR), t-Test, and Least Significant Difference (LSD). Parametric estimation depends on exact expression levels and the number of replicate samples. The statistical criteria are based on the assumption that the data comes from some kind of distribution. Each parametric approach puts different weights on the variance and number of samples of the criteria. In this study, we use three parametric methods: Signal-to-noise ratio (SNR), t-Test, and Least Significant Difference (LSD). A gene is considered more informative if it possesses a larger corresponding score.
Signal-to-Noise Ratio (SNR)
The μ and σ characters represent the mean and the standard deviation of samples in each class (either + 1 or -1) individually. We rank these genes by F score and then select the top 25 gene sets as the features.
t-Test
where M+ and M- are the sample sizes and μ and σ are the respective mean and standard deviation of samples in each class (either +1 or -1). We rank these genes with a T score and then select the top 25 gene sets as the features.
Least Significant Difference (LSD)
where μ and σ are the respective mean and standard deviation of samples in each class (either + 1 or -1). We rank these genes by F score and then select the top 25 gene sets as the features.
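As an illustration of the parametric scoring described in this section, the following minimal Python sketch ranks genes by an SNR score and by a t score. The exact formulas are assumptions: SNR is taken in its commonly used form |μ+ − μ−| / (σ+ + σ−), and the t score in the unequal-variance form |μ+ − μ−| / sqrt(σ+²/M+ + σ−²/M−). The function names, the use of the sample standard deviation, and the data layout (one row per sample) are hypothetical choices made for this example. The LSD score is omitted.

```python
import math
from statistics import mean, stdev


def snr_score(pos, neg):
    """Assumed SNR form: |mu+ - mu-| / (sigma+ + sigma-)."""
    return abs(mean(pos) - mean(neg)) / (stdev(pos) + stdev(neg))


def t_score(pos, neg):
    """Assumed t form: |mu+ - mu-| / sqrt(sigma+^2 / M+ + sigma-^2 / M-)."""
    se = math.sqrt(stdev(pos) ** 2 / len(pos) + stdev(neg) ** 2 / len(neg))
    return abs(mean(pos) - mean(neg)) / se


def rank_genes(X, y, score, top_n=25):
    """Return indices of the top_n genes, largest score first.

    X is a list of samples (each a list of expression values, one per gene);
    y holds the class label (+1 or -1) of each sample. Each class needs at
    least two samples and some spread, otherwise the scores are undefined.
    """
    scored = []
    for g in range(len(X[0])):
        pos = [row[g] for row, label in zip(X, y) if label == 1]
        neg = [row[g] for row, label in zip(X, y) if label == -1]
        scored.append((score(pos, neg), g))
    # A larger score means a more informative gene; ties break on gene index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [g for _, g in scored[:top_n]]
```

Calling `rank_genes(X, y, snr_score)` or `rank_genes(X, y, t_score)` returns the 25 highest-scoring gene indices under the respective criterion.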
In contrast to the parametric approach, nonparametric approaches rank samples of each gene using their expression level and punish the disorders that damage a perfect sample split. The less the punishment, the smaller the score a gene receives. This means that a gene is more informative if it has a smaller corresponding score. In this study, we use three nonparametric methods: Threshold Number of Misclassification (TNoM), Minimum Distance to Modal Ranking (MDMR), and Weighted Punishment on Overlap (WEPO).
Threshold Number of Misclassification (TNoM)
We then rank these genes with a TNoM score and select the top 25 gene sets as the features.
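The TNoM idea can be sketched in a few lines of Python, under the assumption that the score of a gene is the smallest number of samples misclassified by any single-threshold rule on that gene's expression values, in either direction. The function name and input layout are hypothetical, and the paper's exact formulation may differ.

```python
def tnom_score(values, labels):
    """Fewest misclassifications achievable by one threshold on one gene.

    values: expression levels of the gene across samples.
    labels: class label (+1 or -1) of each sample.
    A smaller score means a more informative gene.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    n_pos = sum(1 for _, label in pairs if label == 1)
    # Rule A predicts -1 at or below the cut and +1 above it. With the cut
    # below every sample, rule A mislabels exactly the -1 samples.
    errors_a = n - n_pos
    best = min(errors_a, n - errors_a)
    i = 0
    while i < n:
        j = i
        # Move the cut past all samples sharing this expression value.
        while j < n and pairs[j][0] == pairs[i][0]:
            errors_a += 1 if pairs[j][1] == 1 else -1
            j += 1
        # n - errors_a is the error count of the mirrored rule.
        best = min(best, errors_a, n - errors_a)
        i = j
    return best
```

A perfectly separating gene scores 0, while a gene whose classes interleave scores higher.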
Minimum Distance to Modal Ranking (MDMR)
We then rank these genes with an MDMR score and select the top 25 genes for the study.
Weighted Punishment on Overlap (WEPO)
Machine Learning-Based Gene Selection Methods
Identifying biologically relevant genes, such as cancer-related genes, from microarray gene expression data is one of the most important areas in modern medical research. In addition to the six statistical methods described in the previous section, we also propose two machine learning-based gene selection methods: Random Forest Gene Selection (RFGS) and Support Vector Sampling Technique (SVST).
Random Forest Gene Selection (RFGS)
Random forest is an algorithm for classification developed by Leo Breiman that uses an ensemble of classification trees. Each of the classification trees is built using a bootstrap sample of the data, and at each split the candidate set of variables is a random subset of the variables. Thus, random forest uses both bagging and random variable selection for tree building.
This approach is displayed in the following pseudo code, where X is the cancer's gene expression data (containing S samples and G genes) and Y_S is the label of each sample.
The Pseudo Code of the Random Forest Gene Selection Method
s = number of samples, g = number of genes
Output: n top genes
for i = 1 to S
    do normalize X
for I = 1 to N (N = 100 used here)
    while (All genes assigned completely)
        Randomly assign all genes into M groups (M = 1000 used here)
    for J = 1 to M
        Build up a decision tree on each group
        Mark the root of each group
Rank genes by the number of marks for every gene
Select the top 25 genes from the ranking list
Confirm the genes with biological evidence from public resources
Calculate the average number of biological genes found in the top 25 genes
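The marking loop of the pseudo code above can be sketched in plain Python as follows. This is one possible reading, not the authors' implementation. It assumes that the root of a decision tree built on a group is the gene whose best single threshold gives the lowest weighted Gini impurity (the usual CART criterion). It omits normalization and the literature check, and all function names are hypothetical.

```python
import random
from collections import Counter


def best_split_gini(values, labels):
    """Lowest weighted Gini impurity over all thresholds on one gene."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    total = Counter(labels)
    left = Counter()
    best = float("inf")
    for k in range(n - 1):
        left[pairs[k][1]] += 1
        if pairs[k][0] == pairs[k + 1][0]:
            continue  # no threshold fits between equal values
        n_left, n_right = k + 1, n - k - 1
        g_left = 1.0 - sum((c / n_left) ** 2 for c in left.values())
        g_right = 1.0 - sum(((total[c] - left[c]) / n_right) ** 2 for c in total)
        best = min(best, (n_left * g_left + n_right * g_right) / n)
    return best


def root_gene(X, y, group):
    """Gene of `group` that a CART-style tree would place at its root."""
    return min(group, key=lambda g: best_split_gini([row[g] for row in X], y))


def rfgs(X, y, n_rounds=100, n_groups=1000, top_n=25, seed=0):
    """Rank genes by how often they become the root of a group's tree."""
    rng = random.Random(seed)
    n_genes = len(X[0])
    m = min(n_groups, n_genes)
    marks = Counter()
    for _ in range(n_rounds):
        genes = list(range(n_genes))
        rng.shuffle(genes)  # random assignment of all genes to M groups
        for group in (genes[k::m] for k in range(m)):
            marks[root_gene(X, y, group)] += 1
    return [g for g, _ in marks.most_common(top_n)]
```

With the settings reported in the pseudo code, the call would be `rfgs(X, y, n_rounds=100, n_groups=1000, top_n=25)`, where `X` is a list of samples and `y` the list of class labels.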
Support Vector Sampling Technique (SVST)
In the ongoing effort to improve the accuracy of cancer classification, many machine learning methods have been developed over the past few years. Among them, SVM is arguably one of the best methods. Although the SVM classification method has been widely used in the machine learning domain, there is little research focused on the actual support vectors. These support vectors have several computational and learning theoretic consequences. Gene selection is a common way to avoid the high dimensional feature problem; however, the majority of past research has applied gene selection algorithms using all available samples. The accuracy of SVM is largely dependent on a hyperplane that can clearly separate different classes, and many samples may be outliers or may be separated incorrectly. Thus, using all samples could cause some degree of inaccuracy in classification performance.
In this paper, we develop a new method to identify biologically relevant genes using only quality samples which are located on support vectors. We assume that the use of support vectors is critical in eliminating irrelevant tissue composition-related genes. We call this method the support vector sampling technique (SVST). Our hypothesis is that by using samples located only on support vectors, we have a higher probability of identifying more relevant genes. To verify this hypothesis experimentally, we compared SVST with other statistical methods using two cancer datasets. SVST is a two-step process: it first selects the support vector samples and then applies the SNR gene selection method to them. This approach allows us to narrow the field to only the most relevant samples in order to select the most biologically relevant genes.
The process is displayed in the following pseudo code. X is the cancer's gene expression data, containing S samples and G genes, and Y_S is the label of each sample.
The Pseudo Code of the SVST Method
s = number of samples, g = number of genes
Output: n top genes
for i = 1 to S
    do normalize X
Set K = linear function
do train SVM(K(X_S), Y_S)
sv = extract support vectors from training SVM
for i = 1 to S
    svs = extract support vector samples by sv from all samples
for i = 1 to G
    r-genes = do SNR scoring function(svs)
rank r-genes by SNR score
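The two steps of the pseudo code above (train a linear SVM, then score genes by SNR on the support vector samples only) can be sketched as follows. This is a self-contained illustration under several assumptions, not the authors' implementation. The SVM is trained with a simple dual coordinate descent solver rather than a library, the bias is modelled by an appended constant feature, the input is assumed to be normalized already, and the SNR score uses the assumed form |μ+ − μ−| / (σ+ + σ−) with a small constant guarding against zero spread. All names are hypothetical.

```python
import random
from statistics import mean, stdev


def train_linear_svm(X, y, C=1.0, epochs=200, seed=0):
    """Dual coordinate descent for a linear soft-margin SVM; returns alphas."""
    rng = random.Random(seed)
    rows = [list(x) + [1.0] for x in X]  # constant feature acts as the bias
    n, d = len(rows), len(rows[0])
    alpha, w = [0.0] * n, [0.0] * d
    q = [sum(v * v for v in r) for r in rows]  # diagonal of the Gram matrix
    order = list(range(n))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            grad = y[i] * sum(wj * xj for wj, xj in zip(w, rows[i])) - 1.0
            new = min(max(alpha[i] - grad / q[i], 0.0), C)  # clip to [0, C]
            step = (new - alpha[i]) * y[i]
            if step:
                w = [wj + step * xj for wj, xj in zip(w, rows[i])]
                alpha[i] = new
    return alpha


def snr(pos, neg, eps=1e-12):
    """Assumed SNR form; eps avoids division by zero with very few samples."""
    spread = (stdev(pos) if len(pos) > 1 else 0.0) + (stdev(neg) if len(neg) > 1 else 0.0)
    return abs(mean(pos) - mean(neg)) / (spread + eps)


def svst(X, y, top_n=25, C=1.0, tol=1e-8):
    """Return (top gene indices, support vector sample indices)."""
    alpha = train_linear_svm(X, y, C=C)
    sv = [i for i, a in enumerate(alpha) if a > tol]  # samples with alpha > 0
    pos_rows = [X[i] for i in sv if y[i] == 1]
    neg_rows = [X[i] for i in sv if y[i] == -1]
    if not pos_rows or not neg_rows:
        raise ValueError("support vectors must include both classes")
    scores = [
        snr([r[g] for r in pos_rows], [r[g] for r in neg_rows])
        for g in range(len(X[0]))
    ]
    ranked = sorted(range(len(scores)), key=lambda g: -scores[g])
    return ranked[:top_n], sv
```

In practice an established SVM library would replace `train_linear_svm`. The point of the sketch is only that gene scoring sees the support vector samples and nothing else.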
Theoretical basis of the SVST
When α_i = 0, L_D = 0 in formula (2). In this case, the i-th sample has no influence on the hyperplane; this sample is therefore correctly classified by the hyperplane (such as point A in Figure 2).
Therefore, L_D = α_i, and under this circumstance the i-th sample has a degree of influence on the hyperplane (such as point B in Figure 2).
L_D is negative, and therefore the i-th sample is incorrectly classified by the hyperplane (such as point C in Figure 2). Each α_i determines the degree to which each training example influences the SVM function. Because the majority of the training examples do not affect the SVM function, most of the α_i are 0. We can then infer that the support vectors should contain the desired strong classification information. By extracting only the samples (such as point B) located on the hyperplane, we can run a gene selection algorithm that better identifies biologically relevant genes.
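For orientation, the standard textbook form of the soft-margin SVM dual and its optimality (KKT) conditions is given below. It is offered as background only and is assumed, not confirmed, to correspond to the paper's formula (2), whose notation may differ. Here C is the penalty parameter, K the kernel, and f the decision function. Under this reading, the three cases correspond roughly to points A, B and C in Figure 2.

```latex
\begin{aligned}
L_D &= \sum_{i=1}^{S} \alpha_i
      - \frac{1}{2} \sum_{i=1}^{S} \sum_{j=1}^{S}
        \alpha_i \alpha_j y_i y_j K(x_i, x_j),
\qquad 0 \le \alpha_i \le C, \quad \sum_{i=1}^{S} \alpha_i y_i = 0 \\[4pt]
\alpha_i = 0 \;&\Rightarrow\; y_i f(x_i) \ge 1
  \quad \text{(correctly classified, no influence on the hyperplane)} \\
0 < \alpha_i < C \;&\Rightarrow\; y_i f(x_i) = 1
  \quad \text{(support vector lying on the margin)} \\
\alpha_i = C \;&\Rightarrow\; y_i f(x_i) \le 1
  \quad \text{(inside the margin or misclassified)}
\end{aligned}
```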
Parameter settings in SVM for SVST method.
Gamma [default: 1/(number of genes)]: 1/7200 for leukemia; 1/12600 for prostate cancer
Results and Discussion
In this paper, we experiment using two cancer gene expression microarray datasets: leukemia and prostate cancer. We chose this data not only out of concern for the potential influence on human beings but also for the data's characteristics. Leukemia microarray data is easily classified; many cancer classification researchers consider this data as a performance comparison standard. Prostate cancer microarray data, however, is not easily classified. Therefore, utilizing both datasets provides a measurable way to demonstrate the benefits of our proposed methods.
Application to the leukemia microarray dataset
The original gene expression data was downloaded from http://www.genome.wi.mit.edu/MPR/. The data contains 72 bone marrow or peripheral blood samples with either acute myeloid leukemia (AML) or acute lymphoblastic leukemia (ALL). The data set provides 7129 human genes produced by Affymetrix high-density oligonucleotide microarrays. The intensity of gene expression is rescaled to normalize overall intensities for each microarray. Even though this data provides a plethora of genetic information, its feature dimension is too high for practical analysis. We need a selection method that can reduce this feature dimension.
Identifying biologically relevant leukemia genes
The biologically relevant genes found in leukemia.
Functions of the biologically relevant genes found in leukemia.
Adhesion plaque protein. Binds alpha-actinin and the CRP protein. May be a component of a signal transduction pathway that mediates adhesion-stimulated changes in gene expression.
Heterodimers between TCF3 and tissue-specific basic helix-loop-helix (bHLH) proteins play major roles in determining tissue-specific cell fate during embryogenesis, like muscle or early B-cell differentiation. Binds to the kappa-E2 site in the kappa immunoglobulin gene enhancer.
In the immune response, may act as an inhibitory receptor upon ligand induced tyrosine phosphorylation by recruiting cytoplasmic phosphatase(s).
This antigen is associated with early stages of melanoma tumor progression. May play a role in growth regulation. Lysosome membrane; Multi-pass membrane protein. Late endosome membrane; Multi-pass membrane protein. Note = Also found in Weibel-Palade bodies of endothelial cells. Located in platelet dense granules. melanomas, hematopoietic cells, tissue macrophages.
T cell receptor alpha-chain.
Fodrin, which seems to be involved in secretion, interacts with calmodulin in a calcium-dependent manner.
Part of the host defense system of polymorphonuclear leukocytes. It is responsible for microbicidal activity against a wide range of organisms.
As an inhibitor of cysteine proteinases, this protein is thought to serve an important physiological role as a local regulator of this enzyme activity.
Sequence-specific transcription factor which is part of a developmental regulatory system that provides cells with specific positional identities on the anterior-posterior axis.
Required in cooperation with CD79B for initiation of the signal transduction cascade activated by binding of antigen to the B-cell antigen receptor complex.
May be involved in coupling the protein kinase C and calmodulin signal transduction systems.
Essential for the control of the cell cycle at the G1/S (start) transition. Potentiates the transcriptional activity of ATF5.
The proteasome is a multicatalytic proteinase complex which is characterized by its ability to cleave peptides with Arg, Phe, Tyr, Leu, and Glu adjacent to the leaving group at neutral or slightly basic pH. The proteasome has an ATP-dependent proteolytic activity. This subunit is involved in antigen processing to generate class I binding peptides.
Augments natural killer cell activity in spleen cells and stimulates interferon gamma production in T-helper type I cells.
Interacting selectively with one or more specific sites on a receptor molecule, a macromolecule that undergoes combination with a hormone, neurotransmitter, drug or intracellular messenger to initiate a change in cell function.
In this section, we individually examine these 15 genes for relevance in the diagnosis of leukemia. All 15 genes have some relevance to leukemia and deserve a more detailed analysis to understand their role in the cancer's development. The role of some of these biologically relevant genes can be easily explained because they code for proteins whose role in leukemia has been long identified and widely studied. Such is the case of the HoxA9 gene, where Hoxa9 collaborates with other genes to produce highly aggressive acute leukemic disease. Another example is the Macmarcks gene, where tumor necrosis factor-alpha rapidly stimulates Macmarcks gene transcription in human promyelocytic leukemia cells. The presence of some of the other genes in our list can be explained by recently published studies. For example, CD33 is a myeloid cell surface antigen that is expressed on blast cells in acute myeloid leukemia (AML) in the majority of patients regardless of age or subtype of disease.
The role of the 15 genes in Table 3 is described as follows. The ZYX gene: ZYX encodes zyxin, a LIM domain protein localized at focal contacts in adherent erythroleukemia cells. The TCF3 gene: The t(1;19)(q23;p13.3) is one of the most common chromosomal abnormalities in B-cell precursor acute lymphoblastic leukemia and usually gives rise to the TCF3-PBX1 fusion gene. The TCF3 gene has been shown to be involved in the majority of cases with a cytogenetically visible t(1;19) translocation, while the remaining TCF3-negative ALLs demonstrated breakpoint heterogeneity. The CD63 gene: In the rat basophilic leukemia cell line, an antibody against CD63 (AD1) inhibited immunoglobulin E (IgE)-mediated histamine release, suggesting a role for CD63 in events associated with mediator release. The TCRA gene: T-cell prolymphocytic leukemia is a sporadic, mature T-cell disorder in which there is usually an aberrant T-cell receptor alpha (TCRA) rearrangement that activates the TCL1 or MTCP1-B1 oncogenes. The SPTAN1 gene: In a human chronic myelogenous leukemia cell line with the Ph1 chromosome, K562, SPTAN1 mapped centromeric to the translocation breakpoint, indicating that the alpha-fodrin gene is not translocated to the Ph1 chromosome in this cell line. The MPO gene: The tumour cells were positive for CD68 (KP1), CD68 (PGM1), lysozyme and CD45. They were negative for MPO, CD15, CD163, TdT, CD117, T and B cell markers. The CST3 gene: Sun Y explores differentially expressed genes in leukemia gene expression profiles and identifies main related genes in acute leukemia. The results show that in four patient/donor pairs with ALL, 5 up-regulated genes (RIZ, STK-1, T-cell leukemia/lymphoma 1A, Cbp/p300, Op18) and 1 down-regulated gene (hematopoietic proteoglycan core protein) were identified. In five patient/donor pairs with AML, 6 up-regulated genes (STAT5B, ligand p62 for the Lck SH2, CST3, LTC4S, myeloid leukemia factor 2 and epb72) and 1 down-regulated gene (CCR5) were identified.
The CD79A gene: Expression of the CD79A (MB-1) chain has been studied in leukemia and is shown to be present in most B lineage acute lymphoblastic leukemia. The CCND3 gene: A 51-bp deletion was detected in CCND3 in a patient with normal karyotype acute myeloid leukemia. The PSMB9 gene: PSMB9 (LMP2) is expressed both in normal EBV latency and EBV-associated pathologies. EBV is associated with a variety of haematopoietic cancers such as African Burkitt's lymphoma, Hodgkin's, and adult T-cell leukemia. The IL18 gene: IL18 (IGIF), proposed to be designated as IL-18, selectively up-regulates ICAM-1 expression in KG-1 cells, a human myelomonocytic cell line; human IL-18 was also measurable in the plasma of leukemia patients. The STOM gene: STORP is homologous to the STOM (Epb72) gene coding for the erythrocyte band 7 integral membrane protein, or stomatin. The STORP gene is positioned 2 kb upstream of the promyelocytic leukemia gene in a head-to-head configuration.
Application to the prostate cancer microarray dataset
Prostate cancer dataset
The original gene expression data for prostate cancers is available at http://www.genome.wi.mit.edu/cgi-bin/cancer/datasets.cgi. The dataset contains expression levels for 52 prostate tumor samples and 50 normal samples. Each sample contains 12600 genes measured using Affymetrix oligonucleotide arrays. We set the tumor samples to (-1) and the normal samples to (+1), and we then merged these data sets together for the 8 methods.
Identifying biologically relevant prostate cancer genes
The biologically relevant genes found in prostate cancer.
Functions of the biologically relevant genes found in prostate cancer.
Plays an essential role in cell growth and maintenance of cell morphology.
S100 calcium binding protein A4.
Intracellular transport of retinol.
Appears to play a crucial role in mediating reciprocal interactions between the endothelium and surrounding matrix and mesenchyme.
Type IV collagen is the major structural component of glomerular basement membranes (GBM), forming a 'chicken-wire' meshwork together with laminins, proteoglycans, and entactin/nidogen.
Chicken nel-like 2 homolog with a wide and weak expression, expressed in adult and fetal brain and hemopoietic cells (nucleated peripheral blood cells) but not in B cells.
Conjugation of reduced glutathione to a wide number of exogenous and endogenous hydrophobic electrophiles.
It is likely to play important roles in both maturation and maintenance of the central nervous system and male reproductive system.
Transmembrane receptor activity.
Lim domain only 3.
Essential for providing the brain with appropriate levels of T3 (3,5,3'-triiodothyronine) during the critical period of development.
May play a role in the regulation of mRNA stability.
Induces apoptosis. Its activity may be modulated by binding to the decoy receptors TNFRSF10C/TRAILR3, TNFRSF10D/TRAILR4 and TNFRSF11B/OPG that cannot induce apoptosis.
We also list the roles of the rest of the biological genes shown in Table 5. The TNFSF10 gene: the FOXO family of forkhead transcription factors is implicated in TNFSF10 transcriptional activation in prostate carcinoma cells. The S100A4 gene: S100A4 protein is expressed in neither benign nor malignant prostatic epithelium nor in LNCaP and Du145 cells. The mechanism underlying absent S100A4 expression in prostatic epithelium and cell lines may involve methylation. The RBP1 gene: Altered CRBP1 expression and promoter hypermethylation occur in several tumours; these changes were investigated in prostate tumorigenesis. The COL4A6 gene: COL4A6 expression is missing in nearly all cancerous tissues as evidenced by the Boolean function. The PTGDS gene: Lipocalin-type prostaglandin D synthase (L-PGDS) and prostaglandin D2 (PGD2) metabolites produced by normal prostate stromal cells inhibited tumor cell growth through a peroxisome proliferator-activated receptor gamma (PPARgamma)-dependent mechanism. The SERBP1 gene: The expression of hepsin, uPA, PAI-RBP1 (SERBP1), PAI-1, and factor XIII may influence fibrinolysis and is regulated by the tumour microenvironment. The LMO3 gene: The protein encoded by this gene is a LIM-only protein (LMO), which is involved in cell fate determination. This gene has been noted to be up-regulated in the prostate cancer samples. The DIO2 gene: Subtype II tumours represent the second clinically aggressive tumour subclass, and the gene expression feature that characterizes this subgroup includes several genes identified in supervised analysis to be associated with both high grade and advanced stage cancer, such as HDAC9 and DIO2. The TARP gene: TARP is exclusively expressed in the prostate in males and is up-regulated by androgen in LNCaP cells, an androgen-sensitive prostate cancer cell line.
The HPN gene: Xu L has identified a pair of robust marker genes (HPN and STAT6) by integrating microarray datasets from three different prostate cancer studies.
Comparison of related methods and results.
Ben-Tor et al. 
4/137 (Among the top 137 genes, 8 are cancer-related genes. 4 genes (GAPDH, SLPI, HE4 and keratin 18) are ovarian genes.)
Covell et al. 
1/5 (1 out of the top 5 genes is a Bladder gene)
Up-regulated in tumor cells and down-regulated in normal cells
1/3 (1 out of the top 3 genes is a Breast gene)
5/62 (5 out of the top 62 genes are CNS genes)
2/37 (2 out of the top 37 genes are Colorectal genes)
11/68 (11 out of the top 68 genes are Leukemia genes)
1/4 (1 out of the top 4 genes is a Lung gene)
7/33 (7 out of the top 33 genes are Lymphoma genes)
3/12 (3 out of the top 12 genes are melanoma genes)
0/49 (0 out of the top 49 genes is a Mesothelioma gene)
2/9 (2 out of the top 9 genes are Pancreas genes)
6/36 (6 out of the top 36 genes are Prostate genes)
4/26 (4 out of the top 26 genes are Renal genes)
1/42 (1 out of the top 42 genes is a Uterine gene)
Statistically sound performance comparison among these 8 methods
As Ambroise and McLachlan point out, the performance of a classification method may be overestimated when using the leave-one-out method. In this study, therefore, we verified our experiment using a random average 3-fold method. This method randomly separates the dataset into 3 folds and chooses one subset among the three as the validation set used to verify the model. The remaining two subsets are used as the model's training sets. The cross validation process is repeated 3 times, with each of the three subsets used once for validation. This whole process is then repeated 100 times in order to gain a statistically impartial performance result for our model. In order to compare the classification performance of the 8 methods used in the paper, we used the SVM classifier with the linear kernel function and with default parameter settings.
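The repeated random 3-fold procedure described above can be sketched as follows. The `evaluate` callback is a hypothetical, caller-supplied function that trains a classifier on the training indices and returns its accuracy on the validation indices. The function name, the fold construction by striding over a shuffled index list, and the fixed seed are assumptions made for this example.

```python
import random


def repeated_kfold_accuracy(n_samples, evaluate, k=3, repeats=100, seed=0):
    """Average accuracy over `repeats` random k-fold partitions.

    evaluate(train_idx, test_idx) must return the accuracy obtained on
    test_idx by a model fitted on train_idx only.
    Returns (mean accuracy, list of all per-fold accuracies).
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)  # a fresh random partition on every repetition
        folds = [idx[f::k] for f in range(k)]
        for f in range(k):
            test_idx = folds[f]  # each fold is the validation set once
            train_idx = [i for g in range(k) if g != f for i in folds[g]]
            scores.append(evaluate(train_idx, test_idx))
    return sum(scores) / len(scores), scores
```

If gene selection is part of the model, performing it inside `evaluate` on the training indices only keeps the validation samples from influencing which genes are chosen.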
Statistically sound performance comparison for the leukemia dataset.
.90(.87 to 1)    .93(.87 to .99)  .94(.89 to 1)    .95(.87 to .99)  .94(.88 to 1)    .96(.85 to 1)
.88(.67 to 1)    .91(.66 to .99)  .91(.69 to .99)  .91(.65 to 1)    .92(.69 to .99)  .92(.64 to 1)
.85(.50 to 1)    .88(.53 to .95)  .89(.51 to .94)  .89(.52 to 1)    .87(.54 to .97)  .89(.54 to 1)
.73(.67 to .91)  .73(.65 to .90)  .73(.66 to .91)  .73(.67 to .90)  .76(.69 to .92)  .75(.67 to .92)
.91(.79 to 1)    .93(.74 to .98)  .93(.72 to .96)  .94(.78 to .98)  .94(.76 to 1)    .94(.79 to .99)
.64(.46 to .79)  .61(.51 to .79)  .60(.50 to .76)  .67(.52 to .81)  .69(.50 to .85)  .73(.53 to .86)
.86(.75 to .95)  .85(.76 to .98)  .85(.75 to .94)  .86(.75 to .95)  .88(.78 to .99)  .86(.73 to .97)
.95(.88 to 1)    .98(.87 to .99)  .97(.85 to 1)    .98(.87 to 1)    .98(.88 to .99)  .97(.87 to 1)
Statistically sound performance comparison for the prostate cancer dataset.
.86(.82 to .95)  .86(.82 to .95)  .85(.80 to .97)  .86(.83 to .95)  .83(.80 to .93)  .84(.82 to .96)
.80(.67 to .94)  .82(.66 to .92)  .82(.67 to .90)  .81(.67 to .93)  .81(.68 to .93)  .80(.69 to .95)
.79(.65 to .94)  .81(.63 to .93)  .81(.62 to .95)  .81(.64 to .95)  .81(.67 to .94)  .82(.64 to .93)
.65(.53 to .80)  .65(.51 to .78)  .63(.50 to .79)  .65(.53 to .80)  .65(.52 to .78)  .63(.51 to .81)
.87(.76 to .95)  .84(.75 to .97)  .86(.76 to .98)  .86(.75 to .97)  .87(.78 to .95)  .87(.74 to .98)
.56(.43 to .70)  .57(.44 to .69)  .67(.53 to .74)  .70(.55 to .79)  .68(.52 to .75)  .73(.64 to .86)
.80(.65 to .91)  .81(.68 to .92)  .78(.63 to .91)  .82(.68 to .92)  .79(.65 to .90)  .81(.67 to .92)
.92(.85 to .95)  .90(.83 to .96)  .91(.84 to .95)  .92(.87 to .94)  .92(.82 to .95)  .93(.81 to .97)
Preliminary study of gene-gene interaction of biologically relevant leukemia genes identified by the SVST method
Due to the superior characteristics of our SVST method (i.e. identifying a greater number of biologically relevant genes and yielding better classification accuracy rates), we would like to further investigate the possible gene-gene interactions among these biologically relevant genes. Our hypothesis is that the gene-gene interactions among these biologically relevant genes, if present, may provide additional benefits with regards to the diagnosis of cancers. As a preliminary study, we ran the experiment using 15 biologically relevant genes selected from a leukemia dataset. At first, we screened several protein-protein interaction (PPI) websites, and we found the IPIR (integrated protein interaction resource, http://ymbc.ym.edu.tw/ipir/) to be an excellent tool for building PPI graphs of leukemia gene products. The IPIR is a powerful web tool which retrieves protein-protein interaction information from BIND, DIP, HPRD, MINT, MIPS, and IntAct databases.
The gene-gene interaction among identified leukemia genes.
Number of interacted
Bridge gene between gene1 and gene2
There are several sub-networks among these genes. For instance, the sub-network links ZYX with TCF3, CST3, and SPTAN1 via NEDD9, ATXN1, and TES, respectively (marked in yellow). The sub-network links TCF3 with ZYX and HOXA9 via NEDD9 and CREBBP, respectively. The sub-network links CD33 with CD79A and SPTAN1 via PTPN6 and SRC, respectively. The sub-network links CD63 with TCRA via HLADRA. The sub-network links TCRA with CD63 and MPO via HLADRA and HSPA5, respectively. The sub-network links SPTAN1 with MPO, IL18, ZYX, and CD33 via ACTB, CASP3, TES, and SRC, respectively. The sub-network links MPO with SPTAN1 and TCRA via ACTB and HSPA5, respectively. The sub-network links CST3 with ZYX via ATXN1. The sub-network links HOXA9 with TCF3 via CREBBP. The sub-network links CD79A with CD33 via PTPN6. The sub-network links IL18 with SPTAN1 via CASP3.
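The links listed above can be collected into a small graph to make the structure easier to inspect. The following is a minimal sketch, not part of the original analysis: it assumes each bridged pair can be treated as one undirected edge labelled with its bridge gene, and the names `BRIDGED_PAIRS`, `build_graph`, and `connected_components` are hypothetical helpers introduced only for illustration.

```python
from collections import defaultdict, deque

# (gene1, gene2, bridge gene) triples transcribed from the sub-network
# description above. Each pair is listed once; the text states every link
# from both ends (e.g. the CD33-SPTAN1 link via SRC is confirmed by the
# sentence describing SPTAN1).
BRIDGED_PAIRS = [
    ("ZYX", "TCF3", "NEDD9"),
    ("ZYX", "CST3", "ATXN1"),
    ("ZYX", "SPTAN1", "TES"),
    ("TCF3", "HOXA9", "CREBBP"),
    ("CD33", "CD79A", "PTPN6"),
    ("CD33", "SPTAN1", "SRC"),
    ("CD63", "TCRA", "HLADRA"),
    ("TCRA", "MPO", "HSPA5"),
    ("SPTAN1", "MPO", "ACTB"),
    ("SPTAN1", "IL18", "CASP3"),
]


def build_graph(pairs):
    """Return an undirected adjacency map: gene -> {partner: bridge gene}."""
    graph = defaultdict(dict)
    for gene1, gene2, bridge in pairs:
        graph[gene1][gene2] = bridge
        graph[gene2][gene1] = bridge
    return dict(graph)


def connected_components(graph):
    """Group genes into connected components using breadth-first search."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        queue, component = deque([start]), set()
        seen.add(start)
        while queue:
            gene = queue.popleft()
            component.add(gene)
            for partner in graph[gene]:
                if partner not in seen:
                    seen.add(partner)
                    queue.append(partner)
        components.append(component)
    return components


graph = build_graph(BRIDGED_PAIRS)
degree = {gene: len(partners) for gene, partners in graph.items()}
components = connected_components(graph)

# List genes from most to fewest bridged partners.
for gene in sorted(degree, key=lambda g: (-degree[g], g)):
    print(gene, degree[gene], sorted(graph[gene].items()))
print("connected components:", len(components))
```

Under this transcription, SPTAN1 has the most bridged partners (four), followed by ZYX (three), and the eleven genes named in the paragraph form a single connected component. The bridge genes are kept as edge labels rather than nodes; modelling them as nodes would be an equally reasonable choice.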
Whether the identified PPI graph is the key mechanism behind better classification performance remains unproven and is beyond the scope of this paper. However, our SVST method is able to identify a group of biologically relevant leukemia genes with a significant gene-gene interaction relationship. We believe this finding merits further study.
It is difficult in cancer research to identify sensitive and specific gene markers. To overcome the problems caused by high-dimensional input spaces, accurate and efficient gene selection methods are critical. Traditional selection approaches, however, do not consider the quality of the samples they analyze, which affects the selection of biologically relevant genes.
In this paper, we have proposed two novel gene selection algorithms, the SVST and the RFGS methods. Both identify more biologically relevant genes concerning leukemia and prostate cancer. The proposed RFGS method is capable of searching for a globally optimal or near-optimal subset of genes due to its randomized characteristics. The proposed SVST method first extracts quality samples (i.e. support vector samples located only on support vectors) and thereby avoids selecting incorrect genes. These quality samples are then used to form an optimal subset of genes that have a better chance of being biologically relevant.
We demonstrate experimentally that our proposed RFGS and SVST methods identify more genes relevant to cancers. The RFGS method identifies an average of 9 biologically relevant genes out of the top 25 genes in both leukemia and prostate cancer. The SVST method produces the best results among all 8 methods: of the top 25 genes selected using the SVST method, 15 are biologically relevant in patients with leukemia and 13 are biologically relevant in patients with prostate cancer. In contrast to traditional statistical methods, which identify only 8 or fewer genes in patients with leukemia and 8 or fewer genes in patients with prostate cancer, our methods yield significantly better results. The significance of identifying biologically relevant genes cannot be overstated; research in the fields of biology and medicine can benefit substantially from the identification of biologically relevant genes to confirm recent discoveries in cancer research or suggest new avenues for exploration.
The authors thank the National Science Council for their financial support regarding project NSC 98-2221-E-320-005.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.