Transcriptome classification reveals molecular subtypes in psoriasis
- Chrysanthi Ainali†1, 2,
- Najl Valeyev†3,
- Gayathri Perera†2,
- Andrew Williams2,
- Johann E Gudjonsson4,
- Christos A Ouzounis1, 5, 6,
- Frank O Nestle2 (corresponding author) and
- Sophia Tsoka1 (corresponding author)
© Ainali et al.; licensee BioMed Central Ltd. 2012
Received: 4 May 2012
Accepted: 29 August 2012
Published: 12 September 2012
Psoriasis is an immune-mediated disease characterised by chronically elevated pro-inflammatory cytokine levels, leading to aberrant keratinocyte proliferation and differentiation. Although certain clinical phenotypes, such as plaque psoriasis, are well defined, it is currently unclear whether there are molecular subtypes that might impact on prognosis or treatment outcomes.
We present a pipeline for patient stratification through a comprehensive analysis of gene expression in paired lesional and non-lesional psoriatic tissue samples, compared with controls, to establish differences in RNA expression patterns across all tissue types. Ensembles of decision tree predictors were employed to cluster psoriatic samples on the basis of gene expression patterns and reveal gene expression signatures that best discriminate molecular disease subtypes. This multi-stage procedure was applied to several published psoriasis studies and a comparison of gene expression patterns across datasets was performed.
Overall, classification of psoriasis gene expression patterns revealed distinct molecular sub-groups within the clinical phenotype of plaque psoriasis. Enrichment for TGFb and ErbB signaling pathways, noted in one of the two psoriasis subgroups, suggested that this group may be more amenable to therapies targeting these pathways. Our study highlights the potential biological relevance of using ensemble decision tree predictors to determine molecular disease subtypes, in what may initially appear to be a homogeneous clinical group. The R code used in this paper is available upon request.
Psoriasis is one of the most prevalent chronic inflammatory disorders, caused by an interplay of genetic and environmental factors against the background of a dysregulated immune system. The disease affects 2-3% of the population worldwide and can be variable in morphology, severity and distribution. There are several clinical variants of psoriasis; the most common, plaque psoriasis, is characterised by chronic, symmetrical, silvery-scaled, sharply circumscribed plaques [1, 3]. It can begin in childhood and late adolescence (Type 1) or in adulthood (Type 2), with a predilection for elbows, knees and the scalp.
Although the cause of psoriasis remains unknown, it is thought to be a complex and multifactorial disorder brought about by the combination of multiple susceptibility genes [4–6], a dysregulated immune system [7, 8] and environmental factors. Through genome-wide association studies (GWAS) [10, 11], a number of genetic variants have been identified as contributing towards psoriasis pathogenesis. A unifying model that integrates genetic, environmental and immunological aspects of skin inflammation has been proposed.
In recent years, progress has been made in understanding the pathogenesis and treatment of psoriasis. Pathogenesis is mainly linked to activation of several types of leukocytes that control cellular immunity and to a T-cell-dependent inflammatory process in skin that accelerates the growth of epidermal and vascular cells in psoriasis lesions. Current therapeutic approaches use proteins or antibodies that target either specific inflammatory co-activators or, more generally, immune cells. While there is now increasing insight into the genes conferring disease susceptibility, much less is known about the regulatory networks of expressed genes that define the molecular signature of the disease.
The first large-scale and detailed gene expression studies of psoriasis identified various differentially expressed genes by comparing non-lesional and lesional skin against normal tissue [13–16]. Recent studies have attempted to elucidate the molecular pathways underlying psoriasis [17–20]. However, determining genes that contribute to complex human disorders through analysis of microarray data is challenging due to the large number of gene predictors, their possible interactions, and the small number of samples. This is termed the “small n, large p” problem, and it implies that classical statistical methods cannot be applied directly in functional genomics approaches for the identification of diagnostic or prognostic biomarkers. In this respect, decision trees have proven to be a sensible non-parametric method for classification and variable selection. A random forest (RF) classifier is an ensemble of CART decision trees and has been found to outperform other machine learning techniques for analysis of microarray data [23–26].
In this study, a computational methodology based on decision tree predictors is developed to discover molecular sub-groups from gene expression data and illustrate gene signatures associated with each group. The random forest (RF) algorithm is used here to (i) cluster psoriasis samples into subgroups on the basis of their transcript profiles and (ii) discriminate between disease phenotypes and generate gene signatures that best differentiate them. RF has been shown to be robust to noisy data, to avoid over-fitting in cases where the number of features is larger than the number of observations and to be particularly suitable for the feature selection process [24, 27, 28].
More specifically, we first analysed gene expression profiles in normal and disease skin tissues, so as to define common differentially expressed genes. This core gene set was then used to group psoriatic tissue samples through RF clustering of real and synthetic data, as previously developed. This step resulted in dividing psoriatic tissues into two subgroups according to similarity of gene expression patterns. Finally, RF classification was used to derive gene signatures able to discriminate between normal and disease phenotypes, including the above-proposed new psoriatic subgroups. Such gene signatures are discussed in the following sections with respect to their effect on defining distinct molecular characteristics and were validated through comparisons with other psoriasis gene expression studies.
Molecular profiling of psoriatic phenotypes followed by classification of tissue samples into appropriate disease classes has the potential to derive clusters of similar transcription responses from the entire repertoire of profiles generated. Especially in the case of a homogeneous clinical patient group, such as plaque-type psoriasis, the classification of transcriptional patterns into appropriate sub-groups may reveal distinct molecular mechanisms that may operate within this group and may explain variability in response and options of disease treatment. Overall, given the predictive nature of the decision model employed, such patient categorisations can lead to significant insights into disease mechanisms and novel, targeted therapeutic approaches.
Results and Discussion
Gene expression patterns define a core set of dysregulated genes among normal, non-lesional and lesional skin
Unsupervised hierarchical clustering was carried out on the set of 206 core genes to explore and visualise the patterns of gene expression from normal (NN) to non-lesional (PN) and then to lesional (PP) skin samples. Figure 2b shows an overview of gene expression for the core probe sets, clustered according to similarity of expression across NN, PN and PP samples. This visualisation represents a striking outline of the varying transcriptional patterns at each disease phase, progressing gradually from generally non-differentiated gene expression in non-inflamed tissues (NN, PN), to markedly differentiated genes in lesional samples (PP).
Principal component analysis (PCA) was used to assess the clustering of samples when progressing from un-inflamed to inflamed skin. There was a clear distinction between lesional (PP) and non-lesional (NN and PN) phenotypes (Figure 2c), manifested as distinct clusters of samples from normal to the involved phenotype through non-involved skin. Normal and psoriatic un-involved samples (NN and PN) co-clustered away from involved cases (PP), in agreement with previously published analyses [16, 19]. This demonstrated the changes in gene expression profiles across NN, PN and PP skin and revealed a marked difference between inflamed (PP) skin and un-inflamed (PN and NN) phenotypes.
Among the strongly dysregulated genes in the core gene set (Additional file 1: Table S1), several of the under-expressed genes were found to encode proteins involved in fibrotic processes and immune responses. For example, FN1, PDGFC and MYH10 are involved in the regulation of the actin cytoskeleton, which participates in fundamental processes such as the regulation of cell shape, motility and adhesion. DIXDC1, CGNL1 and SSPN encode cell adhesion and junction proteins. Betacellulin (BTC), IL1F7, CD81, FN1, PDGFC and SCARB2 are immune response genes. In addition, MEGF9, BTC, FN1 and PHF2 belong to the family of growth factors that activate the epidermal growth factor receptor, EGFR (ErbB1), and according to a previous study BTC plays an important role in skin morphogenesis. Among the over-expressed genes, several participate in keratinocyte proliferation and differentiation (EREG, KLK8 and PPARD). Of note is KLK8, which is potentially involved in the modulation of hyperkeratosis in psoriatic lesions and may be implicated in preventing excessive keratinocyte proliferation, resulting in increased shedding of corneocytes. This is clinically reflected in the copious quantities of scale shed by psoriasis patients. Genes LTB4R2 and PPARD are also involved in keratinocyte migration. Finally, a group of up-regulated genes (SNRPG, SNRPD1, SNRPD3, SNRPA1, SNRPC, SF3B14 and SFRS9) is involved in spliceosomal assembly. Overall, most dysregulated genes were found to be consistent with current knowledge.
Distinctive gene expression patterns between lesional and non-lesional tissues (PP vs. PN)
Following the general patterns of psoriatic tissue differentiation, the use of decision tree ensembles was explored to classify samples into PN and PP classes and derive the major gene patterns able to discriminate the psoriatic phenotypes (see Figure 1, step 1). We used 74 tissue samples from psoriasis patients, each characterised by a vector of core gene expression values, and a random forest (RF) classifier was applied to assign samples to the lesional (PP) and non-lesional (PN) phenotypes. The classifier employed 1000 trees, with training of each tree performed on 2/3 of samples and testing on the remaining 1/3 (see Methods and Additional file 2: Supplementary Methods). The prediction accuracy of the classifier was high (accuracy 97.3%, out-of-bag (OOB) error rate 2.7%).
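The classification step above can be illustrated with a minimal Python sketch. It is not the authors' R code: it assumes scikit-learn's `RandomForestClassifier`, and the expression matrix is simulated (the 20 shifted "genes" are hypothetical). Each tree is grown on a bootstrap sample and evaluated on the samples it did not see (roughly one third), which yields the out-of-bag (OOB) accuracy.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def oob_accuracy(X, y, n_trees=1000, seed=0):
    """Fit a random forest and return (model, out-of-bag accuracy)."""
    rf = RandomForestClassifier(
        n_estimators=n_trees,
        bootstrap=True,      # each tree sees a bootstrap sample of the rows
        oob_score=True,      # score each sample only with trees that did not see it
        random_state=seed,
    )
    rf.fit(X, y)
    return rf, rf.oob_score_

# Simulated stand-in for the 74 x 206 matrix of core gene expression values.
rng = np.random.default_rng(0)
n_per_class, n_genes = 37, 206
X_pn = rng.normal(0.0, 1.0, size=(n_per_class, n_genes))
X_pp = rng.normal(0.0, 1.0, size=(n_per_class, n_genes))
X_pp[:, :20] += 2.0          # hypothetical genes that are shifted in lesional skin
X = np.vstack([X_pn, X_pp])
y = np.array(["PN"] * n_per_class + ["PP"] * n_per_class)

rf, acc = oob_accuracy(X, y, n_trees=200)
print(f"OOB accuracy: {acc:.3f}, OOB error: {1 - acc:.3f}")
```

The OOB estimate avoids a separate hold-out set, which matters when only a few dozen samples are available.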
Identification of molecular sub-types within psoriatic tissue samples
In addition to key patterns that defined disease outcome in psoriatic tissues above, we used random forest in unsupervised mode, as a clustering platform to group lesional psoriasis samples based on their gene expression properties (see Figure 1, step 2). The aim was to generate two sub-groups among disease tissues (PP), before further classification runs could identify molecular differences among them (Figure 1, step 3, discussed later). As described previously, synthetic data were first generated by randomly sampling the gene expression observations. Then, a random forest predictor was built to distinguish the real from the synthetic data (see Methods) and define a similarity measure between the psoriatic cases in the form of the random forest proximity measure. Finally, CLARA clustering of the proximity matrix partitioned the psoriatic cases into two groups, named PP01 and PP02 (Figure 1, step 2). The adjusted Rand index to indicate the difference between the two identified sub-groups was -0.0269.
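For readers unfamiliar with the adjusted Rand index, the following pure-Python sketch shows how it is computed from two labelings of the same samples. The text does not state which two labelings were compared, so the example labels are purely illustrative. Values near 0 indicate agreement no better than chance, 1 indicates identical partitions, and negative values indicate less agreement than expected by chance.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    n = len(labels_a)
    if n < 2:
        raise ValueError("need at least two items")
    # Pairs of items placed together in both labelings, in A only, in B only.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)   # agreement expected by chance
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        return 1.0                          # degenerate case: both partitions trivial
    return (sum_cells - expected) / (max_index - expected)

# Illustrative labelings only (not data from the study).
print(adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 1]))
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```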
The RF-derived proximity measure can be used to generate a multi-dimensional scaling (MDS) plot, where dissimilarities between samples return a set of points in low-dimensional Euclidean space, similar to principal component analysis. The MDS plot projects data into a 2D space giving the similarities among patients and their respective classes. The distinction of samples in two groups, PP01 (red circles) and PP02 (black circles), is shown through the MDS plot in Additional file 3: Figure S1. Similar clustering of PP phenotypes in two clusters has been noted through hierarchical clustering (data not shown) and was used as a means of determining the optimum number of psoriatic sub-groups.
The relationship between these sub-groups and clinically measurable parameters was assessed. Psoriasis Area and Severity Index (PASI), Body Mass Index (BMI), Age of Onset, Age and Body Surface Area (BSA) were evaluated against subgroups PP01 and PP02. Of these, age was found to differ significantly between the two subgroups (p-value 0.0184, Wilcoxon signed-rank test). It is emphasised here that plaque-type psoriasis constitutes a homogeneous clinical group, distinct from other forms of psoriasis. Therefore, it is not surprising that such coarse-grained clinical parameters cannot capture the subtle differences in plaque psoriasis sub-groups (PP01, PP02). Instead, our focus here is to distinguish the underlying biological mechanisms, in terms of distinct biochemical pathways and interactions that act in these subgroups, as we report in the following sections.
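The projection described above can be sketched with classical (Torgerson) MDS in NumPy. This is an illustrative implementation, not the authors' code; how the proximity is converted into a dissimilarity (for example 1 − proximity) is an assumption left to the caller. The demonstration uses Euclidean distances between random 2D points, for which the embedding reproduces the input distances.

```python
import numpy as np

def classical_mds(D, k=2):
    """Embed an n x n dissimilarity matrix D into k dimensions (classical MDS)."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centring matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centred Gram matrix
    vals, vecs = np.linalg.eigh(B)               # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:k]           # keep the k largest
    vals_k = np.clip(vals[order], 0.0, None)     # guard against small negative values
    return vecs[:, order] * np.sqrt(vals_k)

def pairwise_distances(points):
    """Euclidean distance matrix between the rows of `points`."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

rng = np.random.default_rng(1)
points = rng.normal(size=(10, 2))
D = pairwise_distances(points)
coords = classical_mds(D, k=2)
print(coords.shape)
```

With a random forest dissimilarity the matrix is generally not Euclidean, so the 2D plot is an approximation of the sample relationships rather than an exact map.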
Table 1. Functional annotation for most informative genes
- Chromosomal region: chr4q13-q21
  GO-BP: cell proliferation, positive regulation of cell proliferation
  GO-MF: epidermal growth factor receptor binding, growth factor activity
  GO-CC: extracellular region, soluble fraction, plasma membrane, integral to membrane
  Pathway: ErbB signalling, ERK signaling
- Chromosomal region: chr19q13.2
  GO-CC: cornified envelope, cytoplasm
- Description: Protein C20orf11 (two hybrid-associated protein 1 with RanBPM) (Twa1)
  Chromosomal region: chr20q13.33
  GO-MF: protein binding
- Description: budding uninhibited by benzimidazoles 3 homolog (yeast)
  Chromosomal region: chr10q26
  GO-BP: mitosis, cell proliferation, anaphase-promoting complex-dependent proteasomal ubiquitin-dependent protein catabolic process, negative regulation of ubiquitin-protein ligase activity during mitotic cell cycle
  GO-MF: protein binding
  GO-CC: kinetochore, nucleus, cytosol
  Pathway: Cell cycle role of APC in cell cycle regulation
- Description: interleukin 1 family, member 7 (zeta)
  Chromosomal region: chr2q12-q14.1
  GO-BP: immune response
  GO-MF: cytokine activity, interleukin-1 receptor binding, interleukin-1 receptor antagonist activity
  GO-CC: extracellular region
  Pathway: Systemic lupus erythematosus signaling, role of cytokines in mediating communication between immune cells, graft-versus-host disease signaling, p38 MAPK signaling, atherosclerosis signaling
Table 2. Pathway enrichment in the PP01 psoriatic group
- NOTCH1 Intracellular Domain Regulates Transcription
- Signaling by NOTCH1
- Signaling by NOTCH
- NOTCH1 Intracellular Domain Regulates Transcription
- Synthesis of very long-chain fatty acyl-CoAs
- Fatty Acyl-CoA Biosynthesis
- PI3K events in ERBB4 signaling
- PI3K events in ERBB2 signaling
- Signaling by ERBB4
- Signaling by ERBB2
- AKT phosphorylates targets in the nucleus
- Signaling by TGF beta
- SHC1 events in ERBB4 signaling
- GRB2 events in ERBB2 signaling
- Signaling by BMP
- SHC1 events in ERBB2 signaling
- PIP3 activates AKT signaling
- Nuclear signaling by ERBB4
Table 3. Pathway enrichment in the PP02 psoriatic group
- Transport of Glycerol from Adipocytes to the Liver by Aquaporins
- Transport by Aquaporins
- Signaling by TGF beta
- Signaling by BMP
- Respiratory electron transport
- Respiratory electron transport, ATP synthesis by chemiosmotic coupling, and heat production by uncoupling proteins
- The citric acid (TCA) cycle and respiratory electron transport
Identification of key genes associated with disease sub-classes and comparison with other studies
The pipeline outlined above was replicated with two other psoriatic datasets (the Gudjonsson dataset and the Yao dataset). Skin samples were grouped into sub-types according to their gene expression patterns as for the GAIN dataset, using similarities derived from the proximity matrix through random forest (an MDS plot for the Gudjonsson and Yao data is shown in Additional file 7: Figure S4). The circular representation of the most important genes was also followed here and the 19 most informative genes from the Gudjonsson dataset and 27 from the Yao dataset are shown (Additional file 8: Figure S5 and Additional file 9: Figure S6). By comparing across the three datasets and the relevant gene signatures, the importance of specific genes was noted. BTC, CNFN and IL1F7 were important discriminant genes in the GAIN data, while SNRPC and SMURF2 played a greater role in the Yao and Gudjonsson datasets. Generating a consistent outcome of gene signatures across all datasets is challenging, as patient cohorts may vary significantly. Although the Yao data seem difficult to reproduce, considerable similarity exists between the other two datasets. Specifically, one of the disease subgroups in these datasets points to pathways related to NOTCH, ErbB and TGF beta signaling, suggesting that this group may be more amenable to related therapeutic options (see Tables 2, 3, Additional file 10: Table S3 and Additional file 11: Table S4).
We note that psoriasis transcriptomes have been evaluated elsewhere and the observed low reproducibility across various studies was attributed to wide variability in clinical protocols, platforms and sample handling among different datasets. It is envisaged that the application of the present and similar strategies for predictive modelling and stratification of expression patterns, as well as the availability of larger patient studies, will bridge the disparities between various studies and yield a sharper picture of gene contributions to this complex disorder.
Large-scale genome characterisations, through the analysis of gene sequence and expression data, are gaining increasing interest and have the potential to greatly improve our understanding of the physiological and molecular mechanisms underlying disease pathogenesis and progression. Various models of data stratification and identification of patient groups through data mining protocols are used to support decision making in biomedicine. Data mining procedures based on algorithms such as support vector machines (SVM), neural networks, decision tree algorithms and mathematical programming approaches have been used to select sets of genes for diagnostic purposes and to identify molecular roles which are as yet unknown. Here we have illustrated the use of random forest to partition psoriatic tissues into appropriate disease groups and generate estimates of relevant gene predictors.
Psoriasis is a common, complex immuno-genetic inflammatory disease of primarily the skin. The underlying genetics of the disease are complex with numerous implicated susceptibility genes, where replication of single loci has been confirmed for only a handful of these genes. Patients suffering from psoriasis can exhibit a host of different clinical phenotypes and response to therapy is varied and unpredictable, even within a similar clinical phenotype, suggesting underlying transcriptional differences between and within the clinical groups. The ability to investigate the underlying immuno-genomic components of these clinical sub-phenotypes has not been a possibility until now. Identification of different transcriptional signatures and their associated molecular pathways contributes toward defining a set of biomarkers, which could serve as diagnostic and therapeutic responder tools. We have outlined a computational strategy to identify molecular sub-types and corresponding putative biomarkers that may be crucial in the understanding and prediction of disease pathogenesis. Of the 206 common differentially expressed genes identified between normal, psoriatic lesional and psoriatic non-lesional groups, 130 genes (63.1%) were up-regulated and 76 genes (36.9%) were down-regulated. Dysregulated genes discovered in this study were involved in epidermal cell modulation, cell cycling and immune responses.
Microarray analysis of gene expression has been widely used to differentiate lesional and non-lesional skin of psoriatic patients [38, 39]. Recently, large-scale analyses using whole genome array platforms on numerous patients per sample group have been undertaken with the aim of identifying gene expression profiles associated with a specific psoriatic phenotype [5, 6, 10, 40]. In this work, we present a method for identifying sub-phenotypes of lesional skin from psoriasis patients based on patterns of gene expression that characterise each group and differ significantly from normal human skin. This approach is based on a decision tree analysis of gene expression data, the extraction of associations among gene expression patterns and the identification of functional annotations and molecular signatures.
The random forest decision tree model was applied to the lesional skin group to derive patient sub-groups (PP01 and PP02), which are characterised by specific differentially expressed genes. The PP01 group was defined by the up-regulation of HLA-E, which is the inhibitory ligand for innate NK cells. HLA-E takes part in processing and presenting antigen to innate immune cells. The PP02 group had more up-regulated genes related to the cells of the adaptive immune system, such as CTLA-4 (associated with modulation of T helper responses), IFI30 (involved in MHC Class II antigen processing), IL4I1 (immunomodulatory enzyme produced by dendritic cells), PTPN2 (associated with autoimmune disorders such as type 1 diabetes mellitus and Crohn’s disease) and, most interestingly, SERPINB8, which has been identified through genome-wide association studies (GWAS) as a new psoriasis susceptibility locus in the Chinese population.
With regard to mechanistic details on the pathways that operate in psoriatic sub-groups, the ErbB signaling pathway has been identified for subgroup PP01 (Table 2). This pathway consists of a family of four related receptor tyrosine kinases (ErbB1-4) which, when activated, trigger many different signal transduction pathways leading to increased proliferation, survival, motility, and invasiveness. All of these responses are important aspects of wound healing, and psoriasis has many elements in common with wound healing. The main clinical feature of psoriasis relates to the thickened epidermis as a result of what may initially have been an epidermal barrier insult. An attempt to restore epidermal integrity is reflected in the activation of the ErbB signaling pathway. However, in psoriasis it is possible that this pathway, along with other signaling pathways, is dysregulated.
Other signaling pathways seem to be in effect in psoriasis sub-group PP02 (Table 3), for example signaling by BMP. Bone morphogenetic proteins (BMP) are members of the transforming growth factor-beta (TGF beta) superfamily and regulate a large variety of biological responses in different cells and tissues. It has been reported that BMPs are implicated in a variety of pathobiologic processes in skin, including wound healing, psoriasis, and carcinogenesis.
In our analysis, when several patient clinical variables were compared across the two classes (PP01 and PP02), we found age to differ significantly between these subgroups, indicating that this is an important factor in disease manifestations. It is worth noting that although the differences between the PP01 and PP02 groups are quite marked on a transcriptional level, the groups are clinically difficult to distinguish. This observation may help explain why some patients have a different disease course to others and why some respond better to therapy than others within a given clinical phenotype. The ability to generate molecular sub-types provides putative biomarkers, which with further refinement and replication could prove to be useful in predicting disease severity, progression and response to therapy in an individualised manner.
Random forest has become a popular tool for analysing high-throughput genomic data. Due to the large number of variables associated with characterisation of clinical samples through gene expression measurements, reduction of dimensionality through feature selection or prioritisation is critical in disease property prediction. Here, we use random forest for (i) disease classification through gene expression patterns and analysis of variable importance to generate potential disease biomarkers, and (ii) clustering of gene expression measurements to derive disease subgroups. Despite some limitations in reproducibility across different psoriasis datasets, we believe that through our study there is an emerging picture of important gene predictors in psoriasis, as well as differentiation of disease in patient subgroups. Future work based on richer datasets that profile larger patient cohorts, with stringent clinical phenotyping, will have the potential to draw clearer conclusions about this complex autoimmune skin disease.
In this study, we generated biologically meaningful phenotypic classes using a ‘core’ of the most highly differentially expressed genes and then addressed the molecular variations among the groups responsible for lesional psoriasis. This might uncover subtle differences in disease pathogenesis, allowing the emergence of new treatments for psoriatic individuals and further facilitating the development of personalized treatments for the disease. To the best of our knowledge, this is the first analysis identifying substantial phenotypic groups in psoriasis, based on patient gene expression profiles and using a classification pipeline. Further analysis and discovery of patterns and associations of transcripts of different cell types (such as T-cells, dendritic cells, keratinocytes) must be done to shed light on the contribution that different cell types make towards the pathogenesis of psoriasis. We would then gain a better insight into this unique skin disease and, hopefully, resolve some of the outstanding issues related to its pathogenesis and treatment.
Microarray data on psoriatic gene expression were obtained from the Genetic Association Information Network (GAIN) Database [10, 45], available through the NCBI database of Genotypes and Phenotypes (dbGaP). These experiments describe tissue samples from 71 individuals, of which 34 were healthy controls (NN) and 37 were patients affected by chronic plaque psoriasis. Paired samples from lesional (PP) and non-lesional (PN) tissues were extracted and gene expression was measured by microarray experiments on the Affymetrix HU133 Plus 2.0 platform. Raw data were normalized using quantile normalization and expression estimates were computed using the Robust Multichip Average (RMA) method.
Analyses performed on the above gene expression dataset were validated through comparison with gene expression datasets GSE14905 and GSE13355 from the ArrayExpress database. The first study consisted of 21 biopsies from healthy donors and 26 paired non-lesional and lesional plaque-type psoriatic patients, and the second dataset comprised 64 normal samples and 58 psoriatic tissues [10, 18]. Both studies were conducted on hgu133plus2 Affymetrix chips.
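As an illustration of the quantile normalization step only, the NumPy sketch below forces every array (column) to share the same empirical distribution. It is not the RMA implementation used in the study: RMA additionally performs background correction and probe-set summarisation, which are omitted here, and tied values are ranked arbitrarily rather than averaged. The intensity matrix is simulated.

```python
import numpy as np

def quantile_normalize(X):
    """Quantile-normalise the columns (arrays) of a probes x samples matrix."""
    X = np.asarray(X, dtype=float)
    order = np.argsort(X, axis=0)                 # per-column sort order
    ranks = np.argsort(order, axis=0)             # rank of each value within its column
    reference = np.sort(X, axis=0).mean(axis=1)   # mean of the k-th smallest values across arrays
    return reference[ranks]                       # replace each value by the reference quantile

# Simulated log-intensities for 1000 probes on 5 arrays with different offsets and scales.
rng = np.random.default_rng(2)
raw = rng.normal(loc=[6.0, 6.5, 7.0, 6.2, 6.8], scale=[1.0, 1.2, 0.8, 1.1, 0.9], size=(1000, 5))
normalized = quantile_normalize(raw)
print(normalized.mean(axis=0))   # column means are identical after normalisation
```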
Differential expression analysis
In order to define a ‘core’ dataset of differentially expressed genes in the psoriatic phenotypes examined, pairwise comparisons between the 34 normal (NN), 37 lesional (PP) and 37 non-lesional (PN) gene expression vectors were performed. The differential expression between pairs of sample groups (PP vs. NN, PN vs. NN, PP vs. PN) was assessed using GenePattern. Significance scores were assigned to each probe (p-value < 0.05), multiple hypothesis testing correction was applied with FDR < 0.05 to reduce false positives, and the top-ranked 5000 probes were extracted for each comparison. Of those, the set with the most common expression alteration among the three pairwise comparisons was selected. Probes that mapped to the same gene were averaged and the average intensity across all corresponding genes was used. A core set of 228 probes common to all three pairwise comparisons was established. Of these, a total of 206 unique known genes were derived, yielding 130 up-regulated and 76 down-regulated genes (Additional file 1: Table S1).
Hierarchical clustering and principal component analysis (PCA) were implemented to identify distinct patterns of gene expression within the ‘core’ set of 206 differentially expressed genes. The PCA procedure was implemented as part of the PCA package in R (http://www.r-project.org). Unsupervised hierarchical clustering heat-maps were generated in R based on Euclidean distance. Z-scores were calculated from the normalized expression levels of the 206 genes according to the mean and standard deviation of a reference set (control samples, NN).
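The selection logic described above (per-probe test, FDR control, top-ranked probes, intersection across comparisons) can be sketched in Python as follows. The specific test applied within GenePattern is not stated in the text; Welch's t-test from SciPy is used here purely as an assumption, the paired design of PP vs. PN is ignored for simplicity, and all data are simulated.

```python
import numpy as np
from scipy import stats

def benjamini_hochberg(p):
    """Benjamini-Hochberg adjusted p-values (FDR)."""
    p = np.asarray(p, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(adjusted_sorted, 0.0, 1.0)
    return adjusted

def differential_probes(group_a, group_b, alpha=0.05, top_n=5000):
    """Indices of probes with adjusted p < alpha, ranked by p-value (at most top_n)."""
    _, p = stats.ttest_ind(group_a, group_b, axis=0, equal_var=False)  # Welch's t-test (assumed)
    q = benjamini_hochberg(p)
    ranked = np.argsort(p)
    return ranked[q[ranked] < alpha][:top_n]

# Simulated samples x probes matrices; probes 0-9 are altered in both disease groups,
# with a larger shift in PP than in PN so that all three comparisons detect them.
rng = np.random.default_rng(3)
nn = rng.normal(size=(34, 500))
pn = rng.normal(size=(37, 500)); pn[:, :10] += 2.0
pp = rng.normal(size=(37, 500)); pp[:, :10] += 4.0

hits = [differential_probes(a, b) for a, b in ((pp, nn), (pn, nn), (pp, pn))]
core = np.intersect1d(np.intersect1d(hits[0], hits[1]), hits[2])
print(core)
```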
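The z-score and PCA steps can be illustrated with a short NumPy sketch. This is not the R code used in the study: PCA is computed here through a singular value decomposition of the centred matrix, and the data and group sizes are simulated stand-ins.

```python
import numpy as np

def zscore_against_reference(X, X_ref):
    """Z-score each gene (column) of X using the mean and SD of the reference samples."""
    mu = X_ref.mean(axis=0)
    sd = X_ref.std(axis=0, ddof=1)
    return (X - mu) / sd

def pca(X, k=2):
    """Return (scores, components, explained variance ratio) for the first k components."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = S ** 2 / np.sum(S ** 2)
    return U[:, :k] * S[:k], Vt[:k], explained[:k]

# Simulated samples x genes matrices for controls (NN) and lesional skin (PP).
rng = np.random.default_rng(4)
nn = rng.normal(5.0, 1.0, size=(34, 206))
pp = rng.normal(5.0, 1.0, size=(37, 206))
pp[:, :50] += 2.0                                 # hypothetical dysregulated genes

Z = zscore_against_reference(np.vstack([nn, pp]), nn)
scores, components, explained = pca(Z, k=2)
print(scores.shape, explained)
```

Scaling against the control group means that a z-score of 0 corresponds to the average level in normal skin, so the heat-map colours read directly as deviation from the healthy reference.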
Decision tree classification model
An ensemble of decision trees model was built according to the random forest (RF) classifier using a deterministic algorithm (Classification and Regression Tree Algorithm, CART). Given a gene expression matrix, an RF classifier was constructed to classify tissue samples into relevant disease classes (NN, PN, PP) based on gene expression measurements (variables). Details on the classification strategy are given in Supplementary Methods and a small example of the classification process is shown in Supplementary Information (Additional file 12: Figure S7). Variable importance measures were implemented through mean decrease in accuracy and the Gini Index (GI), to find the genes that best discriminate between the different disease phenotypes. Both measures were tested and have been found to correlate well (Additional file 13: Figure S8). The Gini Index was adopted to express the relative effects of gene predictors in determining the relevant disease classes. To estimate the empirical p-value for GI, 1000 permutations of the tissue samples were implemented and the importance values were recalculated for the permuted data set. The maximum Gini Index over all the genes in every permutation was recorded and thereby an empirical distribution of the maximum importance was estimated, as in similar analyses [49, 50].
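The permutation scheme above can be sketched in Python with scikit-learn. This is an illustrative approximation rather than the authors' implementation: scikit-learn's `feature_importances_` is the mean decrease in Gini impurity normalised to sum to 1 (a scaling that may differ from the R output), the add-one correction in the p-value is an assumption, and the number of permutations is kept small so the example runs quickly.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def gini_importance(X, y, n_trees=200, seed=0):
    """Gini-based importance of each feature from a fitted random forest."""
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    rf.fit(X, y)
    return rf.feature_importances_

def max_importance_pvalues(X, y, n_perm=1000, n_trees=200, seed=0):
    """Empirical p-values against the null distribution of the maximum importance."""
    rng = np.random.default_rng(seed)
    observed = gini_importance(X, y, n_trees, seed)
    null_max = np.empty(n_perm)
    for b in range(n_perm):
        y_perm = rng.permutation(y)               # break the link between samples and labels
        null_max[b] = gini_importance(X, y_perm, n_trees, seed + b + 1).max()
    exceed = (null_max[None, :] >= observed[:, None]).sum(axis=1)
    return observed, (1 + exceed) / (n_perm + 1)  # add-one correction (assumed)

# Simulated data: feature 0 separates the two classes, the rest are noise.
rng = np.random.default_rng(5)
X = rng.normal(size=(60, 10))
y = np.repeat([0, 1], 30)
X[y == 1, 0] += 3.0

observed, pvals = max_importance_pvalues(X, y, n_perm=20, n_trees=50)
print(np.round(pvals, 3))
```

Comparing each gene with the maximum importance per permutation, rather than with its own permuted importance, controls for the number of genes tested.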
Clusters of disease sample sub-groups through decision tree classification
A procedure to generate clusters of disease samples from gene expression measurements through the use of RF is described here. The random forest proximity measure, defined through the number of times a pair of samples falls in the same terminal node across the trees, is used as a means to express the similarity between samples from gene expression observations (Additional file 2: Supplementary Methods). Psoriatic microarray data were used to generate molecular sub-types. Synthetic data were generated by randomly sampling the empirical marginal distributions of variables. RF classification was applied to distinguish the 37 psoriatic samples from the synthetic data and the dissimilarity matrix was used to indicate distances between psoriatic samples, as previously described. Through multi-dimensional scaling, samples were represented as points before clustering through CLARA. This procedure was implemented in R. Statistical significance of disease clusters with respect to clinical variables was assessed through the Wilcoxon signed-rank test, and the clinical variables tested were Psoriasis Area and Severity Index (PASI), Body Mass Index (BMI), Age of Onset, Age and Body Surface Area (BSA).
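The unsupervised procedure above can be sketched in Python under several stated assumptions. scikit-learn's `RandomForestClassifier.apply` is used to obtain terminal-node indices; the dissimilarity is taken as sqrt(1 − proximity), which is one common convention and not confirmed by the text; and a small k-medoids routine stands in for CLARA, which is not available in scikit-learn (for a few dozen samples the two behave similarly). The data are simulated.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def synthetic_from_marginals(X, rng):
    """Sample each column independently from its empirical marginal distribution."""
    n, p = X.shape
    idx = rng.integers(0, n, size=(n, p))
    return X[idx, np.arange(p)]

def rf_proximity(X, n_trees=500, seed=0):
    """Proximity = fraction of trees in which two real samples share a terminal node."""
    rng = np.random.default_rng(seed)
    data = np.vstack([X, synthetic_from_marginals(X, rng)])
    labels = np.r_[np.ones(len(X), dtype=int), np.zeros(len(X), dtype=int)]  # real vs synthetic
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=seed).fit(data, labels)
    leaves = rf.apply(X)                                   # (n_samples, n_trees) leaf ids
    return (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)

def k_medoids(D, k=2, n_iter=100, seed=0):
    """Simple alternating k-medoids on a dissimilarity matrix (stand-in for CLARA)."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(D), size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size:
                within = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return np.argmin(D[:, medoids], axis=1), medoids

# Simulated lesional samples containing two hypothetical sub-groups.
rng = np.random.default_rng(6)
X = rng.normal(size=(40, 10))
X[20:, :5] += 3.0
P = rf_proximity(X, n_trees=200)
D = np.sqrt(np.clip(1.0 - P, 0.0, None))                   # assumed dissimilarity transform
groups, medoids = k_medoids(D, k=2)
print(np.bincount(groups))
```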
Network analysis and functional enrichment
Pairwise Pearson's correlation coefficients were estimated for the 206 differentially expressed genes that were common in all tissues. A similarity matrix was calculated for each skin sub-type and a co-expression network was visualised using the Cytoscape software. The Markov Cluster Algorithm (MCL) was used to generate the interacting groups (clusters) via genes sharing higher-order connectivity in their local neighborhoods. To assess statistically significant enriched pathways involved in the four different skin groups, p-values were calculated using the hypergeometric statistical test and the False Discovery Rate (FDR < 0.05) was used to correct for multiple comparisons. The default background distribution is considered to be the whole genome. Pathway enrichment analysis was performed using the ReactomePA package in Bioconductor [52, 53].
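The hypergeometric test behind the enrichment analysis can be written out in a few lines of standard-library Python. This is a worked illustration of the statistic only, not a reimplementation of ReactomePA; the argument names and the numbers in the example are hypothetical. The p-value is the probability of observing at least k pathway genes in a gene list of the given size drawn at random from the background.

```python
from math import comb

def hypergeom_enrichment_p(k, n_query, n_pathway, n_background):
    """P(X >= k) when n_query genes are drawn from n_background genes,
    of which n_pathway belong to the pathway."""
    upper = min(n_query, n_pathway)
    total = comb(n_background, n_query)
    tail = sum(
        comb(n_pathway, i) * comb(n_background - n_pathway, n_query - i)
        for i in range(k, upper + 1)
    )
    return tail / total

# Hypothetical example: 8 of 60 cluster genes fall in a 150-gene pathway,
# against a background of 20000 genes.
p_value = hypergeom_enrichment_p(k=8, n_query=60, n_pathway=150, n_background=20000)
print(f"{p_value:.3e}")
```

The resulting p-values, one per pathway, would then be adjusted for multiple comparisons before applying the FDR < 0.05 threshold.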
Collaborative Association Study of Psoriasis
Support for genotyping of samples was provided through the Genetic Association Information Network (GAIN). The dataset used for the analyses described in this manuscript was obtained from the database of Genotypes and Phenotypes (dbGaP) found at http://www.ncbi.nlm.nih.gov/gap through dbGaP accession number phs000019.v1.p1. For samples and associated phenotype data, we kindly acknowledge the Collaborative Association Study of Psoriasis and Profs. J.T. Elder, J. Ding, W. Swindel, G. Abecasis, P. Stuart and R. Nair.
CA acknowledges financial support from the Alexander S. Onassis Public Benefit Foundation. FON acknowledges funding from the Wellcome Trust. ST and FON acknowledge funding from the EU (grant 261366). ST acknowledges support from the Leverhulme Trust (RPG-2012-686).
- Nestle FO, Kaplan DH, Barker J: Psoriasis. N Engl J Med. 2009, 361 (5): 496-509.
- Lebwohl M: Psoriasis. Lancet. 2003, 361 (9364): 1197-1204.
- Lowes MA, Bowcock AM, Krueger JG: Pathogenesis and therapy of psoriasis. Nature. 2007, 445 (7130): 866-873.
- Capon F, Di Meglio P, Szaub J, Prescott NJ, Dunster C, Baumber L, Timms K, Gutin A, Abkevic V, Burden AD, et al: Sequence variants in the genes for the interleukin-23 receptor (IL23R) and its ligand (IL12B) confer protection against psoriasis. Hum Genet. 2007, 122 (2): 201-206.
- Liu Y, Helms C, Liao W, Zaba LC, Duan S, Gardner J, Wise C, Miner A, Malloy MJ, Pullinger CR, et al: A genome-wide association study of psoriasis and psoriatic arthritis identifies new disease loci. PLoS Genet. 2008, 4 (3): e1000041.
- Zhang XJ, Huang W, Yang S, Sun LD, Zhang FY, Zhu QX, Zhang FR, Zhang C, Du WH, Pu XM, et al: Psoriasis genome-wide association study identifies susceptibility variants within LCE gene cluster at 1q21. Nat Genet. 2009, 41 (2): 205-210.
- Chung Y, Dong C: Don't leave home without it: the IL-23 visa to T(H)-17 cells. Nat Immunol. 2009, 10 (3): 236-238.
- Volpe E, Servant N, Zollinger R, Bogiatzi SI, Hupe P, Barillot E, Soumelis V: A critical function for transforming growth factor-beta, interleukin 23 and proinflammatory cytokines in driving and modulating human T(H)-17 responses. Nat Immunol. 2008, 9 (6): 650-657.
- Krueger JG: The immunologic basis for the treatment of psoriasis with new biologic agents. J Am Acad Dermatol. 2002, 46 (1): 1-23.
- Feng BJ, Sun LD, Soltani-Arabshahi R, Bowcock AM, Nair RP, et al: Multiple Loci within the Major Histocompatibility Complex Confer Risk of Psoriasis. PLoS Genet. 2009, 8 (8): e1000606.
- Strange A, Capon F, Spencer CC, Knight J, Weale ME, Allen MH, Barton A, Band G, Bellenguez C, Bergboer JG, et al: A genome-wide association study identifies new psoriasis susceptibility loci and an interaction between HLA-C and ERAP1. Nat Genet. 2010, 42 (11): 985-990.
- Valeyev NV, Hundhausen C, Umezawa Y, Kotov NV, Williams G, Clop A, Ainali C, Ouzounis C, Tsoka S, Nestle FO: A systems model for immune cell interactions unravels the mechanism of inflammation in human skin. PLoS Comput Biol. 2010, 6 (12): e1001024.
- Bowcock AM, Shannon W, Du F, Duncan J, Cao K, Aftergut K, Catier J, Fernandez-Vina MA, Menter A: Insights into psoriasis and other inflammatory diseases from large-scale gene expression studies. Hum Mol Genet. 2001, 10 (17): 1793-1805.
- Haider AS, Duculan J, Whynot JA, Krueger JG: Increased JunB mRNA and protein expression in psoriasis vulgaris lesions. J Invest Dermatol. 2006, 126 (4): 912-914.
- Oestreicher JL, Walters IB, Kikuchi T, Gilleaudeau P, Surette J, Schwertschlag U, Dorner AJ, Krueger JG, Trepicchio WL: Molecular classification of psoriasis disease-associated genes through pharmacogenomic expression profiling. Pharmacogenomics J. 2001, 1 (4): 272-287.
- Zhou X, Krueger JG, Kao MC, Lee E, Du F, Menter A, Wong WH, Bowcock AM: Novel mechanisms of T-cell and dendritic cell activation revealed by profiling of psoriasis on the 63,100-element oligonucleotide array. Physiol Genomics. 2003, 13 (1): 69-78.
- Gudjonsson JE, Aphale A, Grachtchouk M, Ding J, Nair RP, Wang T, Voorhees JJ, Dlugosz AA, Elder JT: Lack of evidence for activation of the hedgehog pathway in psoriasis. J Invest Dermatol. 2009, 129 (3): 635-640.
- Gudjonsson JE, Ding J, Johnston A, Tejasvi T, Guzman AM, Nair RP, Voorhees JJ, Abecasis GR, Elder JT: Assessment of the psoriatic transcriptome in a large sample: additional regulated genes and comparisons with in vitro models. J Invest Dermatol. 2010, 130 (7): 1829-1840.
- Gudjonsson JE, Ding J, Li X, Nair RP, Tejasvi T, Qin ZS, Ghosh D, Aphale A, Gumucio DL, Voorhees JJ, et al: Global gene expression analysis reveals evidence for decreased lipid biosynthesis and increased innate immunity in uninvolved psoriatic skin. J Invest Dermatol. 2009, 129 (12): 2795-2804.
- Suarez-Farinas M, Lowes MA, Zaba LC, Krueger JG: Evaluation of the psoriasis transcriptome across different studies by gene set enrichment analysis (GSEA). PLoS One. 2010, 5 (4): e10247.
- Strobl C, Boulesteix AL, Zeileis A, Hothorn T: Bias in random forest variable importance measures: illustrations, sources and a solution. BMC Bioinform. 2007, 8: 25.
- Breiman L, Friedman JH, Olshen RA: Classification and Regression Trees. 1984, New York: Chapman and Hall
- Bureau A, Dupuis J, Falls K, Lunetta K, Hayward B, Keith PT, Van Eerdewegh P: Identifying SNPs predictive of phenotype using random forests. Genet Epidemiol. 2005, 28: 171.
- Diaz-Uriarte R, Alvarez de Andres S: Gene selection and classification of microarray data using random forest. BMC Bioinform. 2006, 7: 3.
- McKinney BA, Reif DM, Ritchie MD, Moore JH: Machine learning for detecting gene-gene interactions: a review. Appl Bioinform. 2006, 5 (2): 77-88.
- Heidema A, Boer JM, Nagelkerke N, Mariman EC, DL VdA, Feskens EJ: The challenge for genetic epidemiologists: how to analyze large numbers of SNPs in relation to complex diseases. BMC Genet. 2006, 7: 23-38.
- Jiang R, Tang W, Wu X, Fu W: A random forest approach to the detection of epistatic interactions in case–control studies. BMC Bioinform. 2009, 10 (1): 65.
- Hastie T, Tibshirani R, Friedman J: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2009, New York: Springer, Second Edition
- Shi T, Seligson D, Belldegrun AS, Palotie A, Horvath S: Tumor classification by tissue microarray profiling: random forest clustering applied to renal cell carcinoma. Mod Pathol. 2005, 18 (4): 547-557.
- Roffers-Agarwal J, Xanthos JB, Miller JR: Regulation of actin cytoskeleton architecture by Eps8 and Abi1. BMC Cell Biol. 2005, 6: 36.
- Schneider MR, Antsiferova M, Feldmeyer L, Dahlhoff M, Bugnon P, Hasse S, Paus R, Wolf E, Werner S: Betacellulin regulates hair follicle development and hair cycle induction and enhances angiogenesis in wounded skin. J Invest Dermatol. 2008, 128 (5): 1256-1265.
- Kishibe M, Bando Y, Terayama R, Namikawa K, Takahashi H, Hashimoto Y, Ishida-Yamamoto A, Jiang YP, Mitrovic B, Perez D, et al: Kallikrein 8 is involved in skin desquamation in cooperation with other kallikreins. J Biol Chem. 2007, 282 (8): 5834-5841.
- Wang M, Chen X, Zhang H: Maximal conditional chi-square importance in random forests. Bioinform. 2010, 26 (6): 831-837.
- Enright AJ, Van Dongen S, Ouzounis CA: An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. 2002, 30 (7): 1575-1584.
- Becker KG, Hosack DA, Dennis G, Lempicki RA, Bright TJ, Cheadle C, Engel J: PubMatrix: a tool for multiplex literature mining. BMC Bioinformatics. 2003, 4: 61.
- Yao Y, Richman L, Morehouse C, de los Reyes M, Higgs BW, Boutrin A, White B, Coyle A, Krueger J, Kiener PA, et al: Type I interferon: potential therapeutic target for psoriasis? PLoS One. 2008, 3 (7): e2737.
- Statnikov A, Wang L, Aliferis CF: A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinform. 2008, 9: 319.
- Nomura I, Gao B, Boguniewicz M, Darst MA, Travers JB, Leung DY: Distinct patterns of gene expression in the skin lesions of atopic dermatitis and psoriasis: a gene microarray analysis. J Allergy Clin Immunol. 2003, 112 (6): 1195-1202.
- Lee E, Trepicchio WL, Oestreicher JL, Pittman D, Wang F, Chamian F, Dhodapkar M, Krueger JG: Increased expression of interleukin 23 p19 and p40 in lesional skin of patients with psoriasis vulgaris. J Exp Med. 2004, 199 (1): 125-130.
- Swindell WR, Xing X, Stuart PE, Chen CS, Aphale A, Nair RP, Voorhees JJ, Elder JT, Johnston A, Gudjonsson JE: Heterogeneity of inflammatory and cytokine networks in chronic plaque psoriasis. PLoS One. 2012, 7 (3): e34594.
- Sun LD, Cheng H, Wang ZX, Zhang AP, Wang PG, Xu JH, Zhu QX, Zhou HS, Ellinghaus E, Zhang FR, et al: Association analyses identify six new psoriasis susceptibility loci in the Chinese population. Nat Genet. 2010, 42 (11): 1005-1009.
- Elder J, Kansra S, Stoll S: Autocrine regulation of keratinocyte proliferation. J Clin Ligand Assay. 2004, 27: 137-142.
- Citri A, Yarden Y: EGF-ERBB signalling: towards the systems level. Nat Rev Mol Cell Biol. 2006, 7 (7): 505-516.
- Botchkarev V: Bone Morphogenetic Proteins and Their Antagonists in Skin and Hair Follicle Biology. J Invest Dermatol. 2003, 120: 36-47.
- Nair RP, Duffin KC, Helms C, Ding J, Stuart PE, Goldgar D, Gudjonsson JE, Li Y, Tejasvi T, Feng BJ, et al: Genome-wide scan reveals association of psoriasis with IL-23 and NF-kappaB pathways. Nat Genet. 2009, 41 (2): 199-204.
- Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003, 4: 249-264.
- Brazma A, Parkinson H, Sarkans U, Shojatalab M, Vilo J, Abeygunawardena N, Holloway E, Kapushesky M, Kemmeren P, Lara GG, et al: ArrayExpress–a public repository for microarray gene expression data at the EBI. Nucleic Acids Res. 2003, 31 (1): 68-71.
- Reich M, Liefeld T, Gould J, Lerner J, Tamayo P, Mesirov JP: GenePattern 2.0. Nat Genet. 2006, 38 (5): 500-501.
- McDonough CW, Hicks PJ, Lu L, Langefeld CD, Freedman BI, Bowden DW: The influence of carnosinase gene polymorphisms on diabetic nephropathy risk in African-Americans. Hum Genet. 2009, 126 (2): 265-275.
- Sohn I, Owzar K, George SL, Kim S, Jung SH: A permutation-based multiple testing method for time-course microarray experiments. BMC Bioinform. 2009, 10: 336.
- Kaufman L, Rousseeuw PJ: Finding groups in data: an introduction to cluster analysis. 1990, New York: Wiley (Wiley series in probability and mathematical statistics. Applied probability and statistics)
- Yu G: ReactomePA. 101, R package version: Reactome Pathway Analysis
- Yu G, Wang LG, Han Y, He QY: clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS. 2012, 16 (5): 284-287.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.