An integrative approach to characterize disease-specific pathways and their coordination: a case study in cancer
© Xu et al.. 2008
Published: 20 March 2008
Skip to main content
© Xu et al.. 2008
Published: 20 March 2008
The most common application of microarray technology in disease research is to identify genes differentially expressed in disease versus normal tissues. However, it is known that, in complex diseases, phenotypes are determined not only by genes, but also by the underlying structure of genetic networks. Often, it is the interaction of many genes that causes phenotypic variations.
In this work, using cancer as an example, we develop graph-based methods to integrate multiple microarray datasets to discover disease-related co-expression network modules. We propose an unsupervised method that take into account both co-expression dynamics and network topological information to simultaneously infer network modules and phenotype conditions in which they are activated or de-activated. Using our method, we have discovered network modules specific to cancer or subtypes of cancers. Many of these modules are consistent with or supported by their functional annotations or their previously known involvement in cancer. In particular, we identified a module that is predominately activated in breast cancer and is involved in tumor suppression. While individual components of this module have been suggested to be associated with tumor suppression, their coordinated function has never been elucidated. Here by adopting a network perspective, we have identified their interrelationships and, particularly, a hub gene PDGFRL that may play an important role in this tumor suppressor network.
Using a network-based approach, our method provides new insights into the complex cellular mechanisms that characterize cancer and cancer subtypes. By incorporating co-expression dynamics information, our approach can not only extract more functionally homogeneous modules than those based solely on network topology, but also reveal pathway coordination beyond co-expression.
The recent development of microarray technology has significantly facilitated the identification of disease-related genes [1–4]. However, many disease phenotypes are determined not by individual genes, but by the coordinated effect of many genes. Insight into the structure and coordination of disease-related pathways is crucial to understanding the pathophysiology of complex diseases. However, it has proved difficult to infer pathways from microarray data by deriving modules of multiple related genes, rather than individual genes. The major challenges are: (1) Genes involved in a pathway may exhibit complex expression relationships beyond co-expression, which may be overlooked by standard microarray analysis methods such as clustering . (2) Pathways are dynamic and the current static annotation of pathways may not serve as a good template. In fact, pathways are manual dissections of the underlying dynamic gene regulatory network. Under different conditions, different segments of the ensemble network will be activated, leading to condition-specific activation of pathways .
To measure co-expression dynamics, we use second-order expression similarity, which we proposed previously . Briefly, if we define first-order expression similarity as the expression similarity of two genes from one dataset, then second-order similarity measures whether two gene pairs simultaneously exhibit either high or low expression similarity across multiple datasets. In general, high first-order similarity suggests the existence of a functional link between two genes, and clustering based on the second-order similarity captures multiple functional links always activated and deactivated under similar conditions. Such functional links are likely to comprise a functional module. Interestingly, genes in a second-order cluster may not always have high first-order similarity (see an example in Figure 1b); therefore, second-order analysis allows us to identify functional modules that are inaccessible to co-expression analysis.
Given multiple gene pairs sharing high second-order similarity, we further divide them into network components based on their connectivity on the differential co-expression graph (see an example in Figure 1c). We observe that genes within a connected network component are more likely to participate in the same specific pathway than those between different components (see Supplementary document in Additional file 1), which, in turn, are likely to be involved in different relevant pathways. This may reveal high-order cross-pathway coordination. In fact, hierarchical clustering of differentially co-expressed gene pairs based on their second-order similarity results in a hierarchical modularity in terms of relevance of functional links. We designed a linear scaling model to select modules by considering both module size (number of edges) and within-module second-order similarity. Then, given selected modules, we can further infer datasets (phenotypic conditions) in which a module is activated, i.e. in which genes in the module coordinate.
Applying our methods to 32 cancer-related microarray datasets, and 23 non-cancer related datasets, we derived 162 second-order clusters consisting of 224 network modules, activated either in cancer or in specific cancer subtypes. In particular, we identified a breast cancer specific network module that involved in tumor suppression via platelet-derived growth factor (PDGF)-like signaling, more importantly, a hub gene PDGFRL that may play an important role in this tumor suppressor module.
We curated 32 human microarray datasets (1,764 expression profiles in total) measuring cancers of 12 tissues, and 23 datasets (1,158 expression profiles) not related to cancer (e.g. normal tissues, chronic granulomatous disease, Huntington's disease, inflammatory response). For details of the datasets refer to the Supplementary Table 1 in Additional file 2. We first identify gene pairs which consistently demonstrate higher correlation in cancer versus non-cancer datasets based on a robust correlation estimator, the normalized Percentage Bend correlation (for details see Methods). In following sections, if not specified, the term correlation will by default refer to the normalized Percentage Bend correlation. These criteria result in 6,035 gene pairs covering 1,967 genes. The 6,035 gene pairs, each representing a potential conditional functional link, can be represented as a differential co-expression network. In this network, each gene is represented as a node and each differential co-expression relationship is represented as an edge.
It has been reported that co-expression networks follow a scale-free node degree distribution . We observed that the differential co-expression network also follows such a topology, where only a small number of nodes act as “highly connected hubs” (see node degree distribution in Supplementary Figure 1 in Additional file 1). This indicates that most gene-gene co-expression relationships differing between cancer and other phenotypes are associated with only a few “hub” genes. Such hub genes exhibit a high degree of coordination with many other genes in neoplastic states, and are therefore likely to play important roles in carcinogenesis and cancer progression. In fact, most hub genes fall into two main functional categories: 1) core processes of neoplastic states such as cell division and chromosome organization; or 2) dynamic interactions between cancer cells and their microenvironment such as angiogenesis, immune response, and cell adhesion (see Supplementary Table 2 in Additional file 2). For those hub genes with unknown functions, we can predict their cancer-related functions based on their neighbor genes. For example, the 16 out of the 33 interacting partners of the ADP-ribosylation factor-like 6 interacting protein (ARL6IP) are involved in cell division (hypergeometric test p-value 1.6 × 10-24). Thus, ARL6IP is likely to be involved in cell proliferation, consistent with its initial characterization as an interaction partner of the Ras superfamily member ARL6 . As another example, while microfibrillar-associated protein 2 (MFAP2) has long been known to bind to various components of the elastic extracellular matrix , it has not been clear whether it serves more than a mechanical function. We found that 6 out of its 24 neighbor genes are involved in cell adhesion (p-value 7.7 × 10-5). In fact, a recent study found that MFAP2 binds to a neighbor gene Notch1 and activates it .
The differential co-expression network provides a summary of co-expression links frequently active across all types of cancers. However, it does not provide clues as to which set of links tend to be simultaneously active and inactive under which types of cancer. That is, the edges of a differential co-expression network may not be active in the same subset of datasets. In fact, the largest connected component of the differential co-expression network contains 5944 edges, which comprises 98% of all the edges in the network. Thus, based on connectivity alone we cannot break the network into functionally coherent and cancer-subtype specific modules.
To dissect the networks, we integrate two types of information, the co-expression dynamics and the network connectivity, to extract cancer-subtype specific network modules. First, we employ the second-order clustering approach to utilize the co-expression dynamics information. This includes two steps: (i) for any two genes connected with an edge in the differential co-expression network, we calculate the expression correlation in each of the 32 cancer microarray datasets and store it in a vector, termed the first-order expression correlation profile of the genes; (ii) we then perform hierarchical clustering of all the gene pairs based on the Euclidean distance between the first-order expression correlation profiles. Unlike commonly used clustering approaches, the unit of the second-order clustering is a gene pair instead of a gene, and the distance between units is computed based on the first-order expression correlation profiles instead of the original gene expression profiles, hence the term “second-order” clustering . Since each edge represents a frequently occurring co-expression relationship in multiple cancer datasets, it likely represents a functional link. If a cluster of gene pairs follows the same co-expression pattern across multiple cancer datasets, it represents a module of functional links being turned on or off simultaneously across different cancer phenotypes.
One main feature of our approach is that it can simultaneously discover network modules and the types of cancer in which the modules are activated. Figure 2a shows a module that is activated in most of the cancer datasets. The genes of the module are mostly involved in cell division and genetic stability, representing a cell proliferation signature, a key feature of cancer. Figure 2c shows a network module which tends to be activated in only solid tumors. The genes of the module are mostly involved in cell adhesion and organogenesis, which is specific to solid tumor versus blood cancers or neoplastic cell lines. In the next section, we will detail two network modules which are activated predominately in breast cancer data sets.
Our analysis resulted in a second-order cluster containing two connected network modules that tend to be more active in all seven breast tumor datasets relative to the rest of the datasets. The average correlation of these modules in breast tumor and other cancer datasets are 0.49 and 0.23 respectively (the t-test of co-expressions between breast tumor and the rest of cancer datasets gives a p-value of 1.56 × 10-95).
Most interestingly, we found one main function of this network module appears to be tumor suppression via the inhibition of PDGF superfamily signaling. A hub gene with high degree of this module is the gene PDGF receptor-like (PDGFRL) (degree 11). While its precise biological function is not known, PDGFRL encodes a 375aa product with significant sequence similarity to the extracellular domain of PDGFR. Indeed, mutations in PDGFRL have been found in individual cancer samples [12–15]. PDGFRL is located in chromosome 8p22-8p21.3, where multiple studies have suggested the existence of a putative breast cancer tumor suppressor gene , , . Recently, an in-depth study of the region using microcell-mediated chromosome transfer found that indeed PDGFRL expression is decreased in the majority of breast cancer cells . Many genes in this network module have been found to be involved in PDGF superfamily signaling. For example, Cysteine-rich protein (SPARC) binds to PDGF-AB and PDGF-BB dimers, inhibiting the binding of these growth factors to their cell surface receptors  and inhibiting PDGF-induced vascular smooth muscle proliferation . Also, connective tissue growth factor (CTGF) has structural similarities with PDGF . Recently, two new PDGF ligands were discovered that have a N-terminal complement subcomponent C1R/C1S, UEGf, BMP1 (CUB) domain . Interestingly both C1R and C1S are members of this network module. Also, LRP1 is a physiological modulator of the PDGF signaling pathway . In total, we found direct Pubmed literature support for 19 out of the 52 member genes to be involved in (or related to) the PDGF-superfamily signalling. Furthermore, it is known that zinc finger protein (ZNF148) binds to the PDGFR  gene promoter. We screened the promoter regions of the 52 member genes, and found that the binding sites of ZNF148 are significantly enriched (hypergeometric p-value 0.016). In addition, it has been reported that PDGFR positively regulates collagen production (for example , ). Our module showed the breast tumor specific co-expression between PDGFRL and collagens COL3A1, COL5A2 and COL6A3. Many of the above evidences support the hypothesis that PDGFRL has an agonistic function to PDGFR signaling.
Although abundant evidence suggests the individual involvement of many of these genes in tumor suppression, their coordinated function has never been elucidated. Here by adopting a network perspective, we have identified their interrelationships and a gene (PDGFRL) that may play a central role in this tumor suppressor network.
Another identified breast cancer-specific network module (Figure 3b) may be involved in the coordination of the inflammatory response to cancer pathology. This network consists of 5 genes: tumor necrosis factor receptor superfamily member 1B (TNFRSF1B); vascular cell adhesion molecule 1 (VCAM1); leukocyte-associated immunoglobulin-like receptor 1 (LAIR1); and Cathepsin L1 (CSTL). They are arranged around the macrophage-associated antigen (CD163). Several lines of evidence have implicated these genes individually in breast cancer, for example: increased plasma levels of VCAM1 is associated with advanced breast cancer ; genetic variation in TNFRSF1B may predict the late onset of breast carcinoma, and relapse and death for patients with breast carcinoma ; finally, the breast cancer cell line exhibiting the highest in vitro invasiveness also expressed the highest amount of CTSL1 splice variant L-A3 .
Most of the genes of this module are related to tumor necrosis factor (TNF), an inflammatory cytokine. It has been reported that activation of rat CD163 on peritoneal macrophages induces the production of pro-inflammatory mediators including TNF ; TNF directly interacts with TNFRSF1B  and is a mediator of TNF function in the mouse ovary . Finally, transcription of VCAM1 in endothelial cells can be induced by TNF .
A transcription factor, octamer-binding transcription factor-1 (POU2F1, also known as Oct-1) has been predicted to bind promoter region of VCAM1, LAIR1 and CTSL (hypergeometric p-value 0.012). The binding of POU2F1 to VCAM1 promoter is indeed supported by the literature [35–37]. Also, POU2F1 has been found to bind SP3 , which is reported to activate the transcription of CSTL .
The tight and coordinated expression of the genes in this network module reveals an induced inflammatory response that may be important in breast tumor progression .
A major advantage of second-order clustering is that it can identify functionally related genes beyond co-expression, as illustrated in Figure 1b. We elaborate on this point in this section. To allow readers to easily assess the magnitude of the correlation, only in this section we use the Pearson correlation to measure the co-expression level.
Based on the definition of second-order clustering, connected network components within the same second-order cluster show coordinated activities, which implies their functional relevance. In the example of the two modules in the general cancer cluster in Figure 2a and 2b, each module may play different roles in the regulation of specific biological processes -- cell division and nucleic acid metabolism, respectively; the latter is clearly required for cell division. Given the fact that these processes belong to the same second-order cluster, they may represent facets of the same underlying neoplastic process. However, member genes of the two modules exhibit distinct expression patterns: The average Pearson correlation between genes of the two modules is only 0.13.
As another example, the two breast cancer modules described in last section are also related. Indeed, the collaboration of PDGF signalling and TNF have long been known to be required for tissue repairing , and their abnormal expression play important but partially defined roles in breast tumor development and progression . On the other hand, member genes of the two modules exhibit relatively distinct expression patterns. For example, two hub genes of modules, PDGFRL and CD163, also show very weak expression similarities across all seven breast cancer datasets: the average Pearson correlation between these two genes is only 0.24.
Besides the above examples, we found coordinated modules within the majority of identified second-order clusters. Overall, from the total 162 second-order clusters, 25% give rise to more than one connected network module. To estimate the amount of cross module co-expression within second-order clusters, for each second order cluster, we first determined the active cancer datasets, in which the average Pearson correlation of gene pairs in the cluster is greater than 0.5. For 72% of those module pairs within the same second-order cluster, the average gene pairwise Pearson correlation between modules in the active datasets is less than 0.5 (normalized Percentage Bend correlation approximately < 0.35), and for 30% module pairs the cross-module average gene pairwise Pearson correlation is less than 0.3 (normalized Percentage Bend correlation approximately < 0.19).
Furthermore, even genes in the same network module are not necessarily highly co-expressed when the module is active, despite their high degree of functional homogeneity, as discussed previously. We found that in 32 of 224 (14 %) modules, the average pairwise Pearson correlation of any two genes in the module is < 0.5 in the corresponding active datasets. Such modules could therefore easily be overlooked by traditional clustering methods.
The rapid accumulation of microarray data provides unprecedented opportunities to study the molecular mechanisms underlying disease pathogenesis and progression. Although many studies utilized multiple microarray datasets to derive consistent lists of genes specific for (subtypes of) cancer [43–45], little attention has been paid to derive genetic networks characterizing different types of cancer. Segal et al.  used predefined biologically meaningful gene sets including known biological pathways, and have successfully identified activated or repressed biological modules in a wide variety of neoplastic conditions. The approach, however, relies on the knowledge of pre-defined biological modules and has limited use in the discovery of novel association between genes. A recent study by Choi et al.  compared the two co-expression networks summarized from 10 tumor and normal datasets, respectively, and have identified functional differences between normal growth and cancer in terms of gene coexpression changes in broad areas of physiology. However, due to the multifaceted nature of cancer, interactions in such a derived summary network may not be simultaneously active in individual datasets, i.e. specific cancer conditions.
In this study, we propose an unsupervised method that integrates both co-expression dynamics and network topology information to characterize cancer (subtype) specific network modules. The identified modules, such as modules activated across all cancer subtypes or only in solid tumors, are novel, but consistent with known molecular mechanisms. Importantly, we have discovered a potential tumor suppressor network particularly active in breast tumors, and provide compelling evidence that the hub gene PDGFRL is a true tumor suppressor gene. Compared to commonly used differential or co-expression analysis, our approach has the following advantages: (1) our unsupervised approach simultaneously discovers network modules and the conditions (e.g. cancer subtypes) in which they are activated, thus providing new insights into the complex cellular mechanisms that characterize cancer and cancer subtypes. (2) Compared to existing approaches [47–51] which can only identify densely connected network modules based solely on network topology information, our approach incorporating co-expression dynamics information (second-order similarity) can extract more diverse types of modules regardless of network density. It is known that many biological pathways do not necessarily form densely connected modules. (3) Our approach can reveal coordination of pathways beyond co-expression. (4) Our method can be applied to any types of molecular networks beyond co-expression network, when data of multiple networks under different conditions are available.
In the current framework, the selection of biologically meaningful modules still need certain amount of manual intervention from biology expert. We are looking for more systematic ways for module selection, especially by putting the framework into the context of network statistics to improve the robustness. For example, although the scaling model we constructed is based on direct observation of the distribution of network properties S and D, their log-linear relationship suggests that it should be straightforward to make use of the exponential random graph models that have been used in recent years to study statistical aspects of networks . In essence, such models linearly combine network properties and assign the probability of observed networks as the exponential of such linear combinations. Integrating this work with the exponential family probabilistic models may provide both better estimation of the coefficients in the linear scaling model and more accurate selection of network modules via hypothesis testing. We intend to explore this direction in future work.
The choice of datasets depends on the research question. Ideally there should be a balanced and sufficient sampling of different phenotypes (in particular different tissues for this cancer study). Particularly, a paired Wilcoxon test between cancer and normal samples of the same tissue would significantly eliminate the amount of tissue-specific co-expressions. However, due to the limited amount and the heterogeneity of existing data it is currently impractical to achieve this goal. Using a weighted sampling scheme could potentially bypass the imbalance effects. We aim to investigate this strategy, and use statistical models of the correlation values to determine the weighting factors.
We curated 32 cancer and 23 non-cancer human gene expression datasets mainly from the Stanford Microarray Database (SMD) and Gene Expression Omnibus (GEO) databases, each containing more than 15 microarrays, on either Affymetrix or cDNA platforms. In each dataset, if there are multiple probes that correspond to the same gene, we choose the one that contains the least amount of missing values. For datasets containing absolute expression measurements, we convert all values <= 10 to 10, then perform a base 2 log transform.
We used Percentage Bend Correlation  (with β=0.05) to obtain a robust correlation estimate. Percentage Bend Correlation first detects outliers in expression values of each gene then reduces the effects of those outliers in the correlation calculation. Only gene pairs with a large number of valid samples m>= 15 are used to calculate correlation. To make the correlation estimates comparable across different datasets of variable sample sizes and among different gene pairs of different amount of missing expression measures within the same dataset, we performed Fisher's z-transform  to reduce sample size effect. Given a correlation estimate r and sample size n, the Fisher's Z scores (divided by its theoretical standard deviation) is calculated as , which theoretically has an asymptotically standard normal distribution. Note that sample size n may be different from gene pair to gene pair due to missing values, and from dataset to dataset. In reality, we observed that the distributions of the z-score are still different from dataset to dataset: we therefore normalized z-scores to enforce the standard normal distribution. After that, standardized correlations r' are obtained by inverting the z-score with a fixed n of 30.
We define a gene pair to be differentially co-expressed between cancer and non-cancer if it satisfies the following two criteria: (1) their expression correlation in cancer datasets is sufficiently strong (can be either positively or negatively high). This is done by setting threshold for average summed square of correlations in cancer datasets. i.e, for the gene pair i, where there are c valid correlation estimations (c=32 if there are no missing values) and k is dataset index corresponding to all valid correlation estimations; and (2) the Wilcoxon ranksum test of correlations between cancer and non-cancer datasets gives a p-value <= 0.01.
We hierarchically clustered the differentially co-expressed gene pairs based on their expression correlation profiles using the CLUSTER program  with complete linkage and Euclidean distance. The Euclidean distance is averaged to provide a simple estimation given the existence of missing correlations (due to the missing value problem). where r i (k) and r j (k) are the correlations of gene pairs i and j in the dataset k, respectively, and c is number of valid correlations. In the hierarchical tree, each leaf node represents a gene pair, and each inner node corresponds to a second-order cluster of gene pairs (edges) which may comprise zero, one, or more connected network components. In cases where the size of the differential network is too big to be processed using hierarchical clustering (HC), the gene pairs were first separated using k-means clustering then processed the smaller clusters separately by hierarchical clustering. As for our experience, the biologically meaningful modules normally contain less than a few hundred edges, thus k-means clustering will keep most of modules intact.
The functional enrichment analysis is done by the hypergeometric test on genes. We selected 419 Gene Ontology (GO) functions (i.e. biological process terms) which are 4 levels below the root in the GO hierarchy. Each gene may be directly or indirectly associated with some of these functions. A set of genes will be considered to have a enriched function when (1) the functional homogeneity modeled by the hypergeometric distribution  is significant at a significance level 0.01 and (2) there are at least 2 genes in the set are associated with the function.
10kb upstream sequences for each gene were obtained from NCBI Gene database. After applying RepeatMasker , we used the MATCH program of TRANSFAC  (version 9.2) to scan the sequences for the presence of transcription factor binding sites based on position weight matrices. We used vertebrate-specific matrices, and chose cut-offs to minimize the sum of false positives and false negatives. We kept only the top 3,000 hits per matrix, sorted by the matrix similarity score. Altogether we obtained 349,178 predicted transcription factor target relationships of 180 transcription factors. A hypergeometric test was performed for each network module to search for over-represented transcription factor binding sites.
We thank Xiaotu Ma, Reza Alaghband and Adrian Dobra for helpful advices on method design, Yu Huang, Haiyan Hu and Mike Mehan for providing technical help, Jen-Tsan Chi and Chao Cheng for advise on result interpretation. We also thank anonymous reviewers for their suggestions. This work is supported by the NIH grants U54CA112952, R01GM074163, and the NSF grant 0515936.
This article has been published as part of BMC Genomics Volume 9 Supplement 1, 2008: The 2007 International Conference on Bioinformatics & Computational Biology (BIOCOMP'07). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/9?issue=S1.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.