Since the introduction of the Illumina HumanMethylation27 BeadChip platform, which measures the methylation of over 27,000 CpG sites across the human genome, several studies have reported genomic sites with aberrant methylation in cancers. The publicly available datasets from these studies, including several generated by The Cancer Genome Atlas (TCGA), now allow for an integrative analysis of DNA methylation across multiple cancer types. We took a pathway-level approach to this integrative analysis, illustrating the use of our newly developed web-based application for gene set enrichment testing, LRpath (
The identification of predefined sets of biologically related genes enriched with differentially expressed genes is used routinely in the analysis and interpretation of data from microarrays, RNA-Seq, and other high-throughput methods. The most commonly used approach to identifying enriched sets of genes is based on counting the number of differentially expressed genes in a particular biological concept. A biological concept is a predefined, biologically related set of genes, derived from any one of a number of different annotation sources. In particular, such a focus on biological concepts rather than individual genes has proven useful in cancer research. Several groups have developed tools that examine changes in groups of genes sharing the same functions or regulatory modules, as detailed in Furney et al., where additional resources for cancer genomic and epigenomic studies can be found. Enrichment analysis is not limited to transcriptomic data; pathway analysis using epigenetic changes can also provide valuable information, as demonstrated by a lymphoma study in which inflammatory signalling, especially the tumor necrosis factor α network, was found to be differently dysregulated between two tumor subtypes. For the analyses conducted in this manuscript, we used genes harbouring differentially methylated CpG sites near their promoters, rather than differentially expressed genes, in multiple cancer types. The statistical significance of such overlap between genes of interest and a particular concept is often established using Fisher’s exact test. A number of tools that use this or a very similar approach have been developed, such as DAVID/EASE [4, 5], Onto-Express [6, 7], ConceptGen, the GOstats package of Bioconductor [9, 10], and FuncAssociate.
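The overlap test described above can be illustrated with a minimal sketch. It computes a one-sided Fisher's exact test P-value as the upper tail of the hypergeometric distribution, using only the Python standard library. The function name `enrichment_p_value`, its parameters, and all counts are hypothetical and chosen for illustration; this is not the implementation used by any of the tools listed.

```python
from math import comb


def enrichment_p_value(n_universe, n_concept, n_selected, n_overlap):
    """One-sided Fisher's exact test P-value for over-representation.

    Returns P(overlap >= n_overlap) when n_selected genes are drawn at
    random, without replacement, from n_universe genes of which
    n_concept belong to the concept (hypergeometric upper tail).
    """
    max_overlap = min(n_concept, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_concept, k) * comb(n_universe - n_concept, n_selected - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical counts: 14,000 assayed genes, a 100-gene concept,
# 500 genes called differentially methylated, 12 of them in the concept.
p = enrichment_p_value(14000, 100, 500, 12)
print(f"enrichment P-value: {p:.3g}")
```

Note that the result depends on `n_selected` and `n_overlap`, both of which change with the significance cut-off used to call genes differentially methylated.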
As all of these programs require a list of differentially expressed genes as input, the analytical results are influenced by the significance cut-off selected by the user. Several alternative methods have therefore been proposed that do not require a significance cut-off. Gene Set Enrichment Analysis (GSEA) uses the differential expression statistics of all genes, without categorizing them as differentially or non-differentially expressed, together with a non-parametric method to identify enriched gene sets. Our recently published LRpath method uses logistic regression to relate the odds of gene set membership to the significance of differential expression and calculates adjusted P-values as a measure of statistical significance. An alternative interpretation of how LRpath works comes from the random sets method; that is, LRpath tests whether the significance levels of a particular set of genes are significantly higher (or lower) than those of a randomly chosen set of genes of the same size.
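The logistic regression idea can be sketched as follows for a single gene set. This is a simplified illustration under stated assumptions, not the LRpath implementation: the significance score is assumed to be −log10 of each gene's P-value, the slope is assessed with a Wald test, and no multiple-testing adjustment is applied. The function names `fit_logistic` and `lr_enrichment` and the example data are hypothetical.

```python
import math


def fit_logistic(x, y, max_iter=50, tol=1e-10):
    """Fit logit P(y = 1) = b0 + b1 * x by Newton-Raphson.

    Returns (b0, b1, se_b1), where se_b1 is the Wald standard error of b1.
    """
    b0 = b1 = 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p          # score (gradient) for b0
            g1 += (yi - p) * xi   # score (gradient) for b1
            h00 += w              # entries of the 2x2 information matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information matrix) * step = score
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if abs(step0) + abs(step1) < tol:
            break
    se_b1 = math.sqrt(h00 / det)  # sqrt of the b1 entry of the inverse information
    return b0, b1, se_b1


def lr_enrichment(p_values, in_set):
    """Return (slope, two-sided Wald P-value) for one gene set.

    Assumption: significance is scored as -log10(P); a positive slope
    means more significant genes are more likely to be set members.
    """
    scores = [-math.log10(p) for p in p_values]
    _, b1, se = fit_logistic(scores, in_set)
    z = b1 / se
    return b1, math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical per-gene P-values and 0/1 membership in one gene set.
p_values = [0.9, 0.6, 0.4, 0.2, 0.08, 0.03, 0.01, 0.004, 0.001, 0.0002]
in_set = [0, 0, 1, 0, 0, 1, 0, 1, 1, 1]
slope, p_enrich = lr_enrichment(p_values, in_set)
print(f"slope = {slope:.3f}, Wald P = {p_enrich:.3g}")
```

Because every gene contributes its own significance score, no cut-off is needed to split genes into differentially and non-differentially expressed groups. The sketch does not handle perfectly separated data, for which the maximum likelihood slope is not finite.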
We recently developed a web-based application for LRpath with greatly expanded and novel gene set annotations, including metabolite, transcription factor, and microRNA target sets as well as literature-derived annotations. The application also includes clustering analysis functionality, allowing one to identify and compare biological concept signatures across multiple studies. LRpath is particularly suitable for such an integrative study because it performs well with both small and large sample sizes, as it does not depend on non-parametric resampling of samples to assess the significance of enrichment. Additional benefits of using the LRpath program include (1) the ability to perform both “directional” and “non-directional” enrichment tests, which offer two different perspectives to enhance interpretation, and (2) the ability to easily compare and visualize results across multiple studies using the LRpath clustering functionality.
Epigenetic mechanisms such as DNA methylation and histone modifications play essential roles in cell differentiation and transcriptional regulation and have been identified as key mediators of cancer progression. For example, transcription of a number of tumor suppressor genes, such as p16, BRCA1, p53, and MLH1, has been demonstrated to be silenced by promoter hypermethylation. Furthermore, genomic instability associated with the hypermethylation of the DNA mismatch repair enzyme gene MLH1 may deregulate not only critical genes involved in the initial stages of carcinogenesis, but also those involved in the later invasion and metastasis stages of transformation.
In cancer, recurrent patterns of aberrant DNA methylation are evident, especially in promoter regions, implicating specific pathways that are altered through methylation change. For example, DNA hypermethylation of gene promoters commonly marks disease progression and the silencing of putative tumor suppressor genes. Conversely, DNA hypomethylation occurs most commonly in a genome-wide manner, especially within repeat elements such as LINE1, Alu, and PG4s (potentially G-quadruplex-forming sequences) [17–19], and is associated with genomic instability [20, 21]. Recently, the hypomethylation of PG4-dense regions was reported in cancer, indicating a role of DNA methylation in genomic stability through a structural change in G4 formation, resulting in DNA breakpoint hotspots. In general, demethylation of the genome can lead to (1) the reactivation of transposable elements, thereby altering the transcription of adjacent genes, (2) the activation of oncogenes such as H-RAS, and (3) the biallelic expression of imprinted loci (e.g., loss of IGF2 imprinting) [22–24]. Studies of aberrant DNA methylation can benefit diagnostic and prognostic marker discovery by identifying frequent methylation targets and can also provide new insights for improved classification, diagnosis, therapy, and prognosis.
The relative contribution of epigenetic mechanisms to multiple cancer types is not well understood, in particular the extent to which epigenetic mechanisms target the same genes and pathways as somatic mutations. Here, we hypothesize that during the pathogenesis of cancer, certain pathways or biological gene groups are commonly dysregulated via DNA methylation across cancer types. To test this hypothesis, we applied LRpath and clustering analysis to data from ten tumor-versus-normal DNA methylation studies to identify the pathways and other biological concepts commonly altered across multiple cancers. The ability of the method employed by LRpath to implicate important biological pathways and groupings has previously been demonstrated. In this paper, we describe the first example of pathway analysis coupled with the DNA methylome of various tumor types.