Discovery of age-associated DNA methylation signatures
To identify robust age-associated DNAm signatures, we first searched and downloaded various DNAm profiles from diverse studies available in the Gene Expression Omnibus (GEO) database (http://www.ncbi.nlm.nih.gov/geo/; Figure 1). We then excluded studies without age information or with small numbers of samples (<10). We also excluded samples of diseased tissues other than cancer. It is known that technical bias exists across different array platforms [17], so we considered only the Illumina Infinium HumanMethylation27 Bead Chip array, which was the most widely used among the downloaded profiles. Consequently, we collected DNAm profiles of 2149 samples (1537 disease-free normal and 612 cancer samples) from eight studies available in the GEO. Additionally, we downloaded 1844 publicly available DNAm profiles (275 normal and 1569 cancer samples) of five cancer types (breast, ovarian, glioblastoma, kidney, and colon) evaluated on the same platform from The Cancer Genome Atlas (TCGA) consortium [18–22]. In total, we gathered DNAm profiles of 1812 normal healthy and 2181 cancer samples. These samples included diverse tissue types and exhibited a wide range of ages from 0 to 91 years (Additional file 1: Table S1). We next normalized DNAm levels (range from 0 to 1) using a single beta-score measure that indicates conceptually the normalized levels of DNAm (Methods). The normalized DNAm levels were well correlated between normal or cancer samples, but had higher correlation scores between normal samples (Figure 2A,B), and for both normal and cancer tissues, the DNAm levels in CGI regions of individual samples were much lower than those in non-CGIs (Additional file 2: Figure S1). Moreover, the DNAm levels of normal and cancer tissue showed different patterns depending on the genomic regions. In CGIs, for example, the average DNAm levels were generally higher in cancer than those in normal tissue, except for ovarian cancer samples (Additional file 2: Figure S1).
Next, using various regression methods, we identified an aging signature associated with the normalized DNAm levels. Because some studies have reported that methylation levels can change most dramatically during childhood [12], we applied nonlinear regression models as well as linear regression models. After finding statistically significant age-associated DNAm sites using single studies or combined multiple studies, we examined the characteristics and biological meaning of the sites through analysis of the associations of DNAm patterns with age, gene ontology, sequence conservation, and protein networks. We also compared the age-associated DNA loci across different studies or tissue types. To identify tissue type-invariant age-associated DNAm signatures, we integrated data from all samples of normal tissues, cancers, or both, after removing noisy samples (Methods). We compared the methylation patterns of the normal signature to those in cancer according to different genomic regions. We finally checked the potential for age prediction using the methylation levels of the age-associated signature, regardless of tissue type.
CpG loci are widely associated with age in disease-free normal samples
The distribution of DNAm levels showed a significant disparity between normal and cancer tissues depending on the genomic region. We first checked age-associated DNAm sites separately in normal or cancer samples using a linear regression model. For example, CG23854009 (located at 62802940 on chr19; correlation coefficient R = 0.83) and CG00888479 (19141824 on chr20; R = 0.81) sites showed linear hypermethylation patterns according to age in normal samples from the GSE32393 study [23] (Figure 2C). In contrast, hypomethylation patterns were observed at CG23124451 (37878077 on chr20; R = -0.76) and CG25256723 (167822568 on chr1; R = -0.78) in normal samples from the GSE41037 study [7] (Figure 2D). We checked the number of significantly age-associated CpG loci in each study (P < 0.0001 by linear regression). The numbers of significant CpG loci were quite different between studies (Figure 2E), mainly because of different ranges of age and/or different numbers of samples across studies (Additional file 2: Figure S2). Although the numbers of age-associated loci in normal tissues varied between studies, the numbers were significant in all studies included (P < 0.05 using a Z test of 100 age-permutation tests; Figure 2E). In cancer samples, however, the numbers of age-associated loci were not significant in some studies, including GSE26126 and GSE30760.
Individual studies included samples of various tissue types with different age ranges (Additional file 1: Table S1). Therefore, the average DNAm levels per CpG site were quite different between studies in both disease-free normal samples (P < 2.2e–300 using a Kruskal–Wallis test) and cancer samples (P < 2.2e–300) (Figure 3A). Similar results were also observed for the average DNAm levels per sample unit (Figure 3B). However, most study pairs with normal samples showed significantly greater degrees of overlap of age-associated CpG loci than would be expected by chance (Figure 3C). Moreover, the results of hierarchical clustering of the P-values of the degrees of overlap demonstrated that common age-associated CpG loci were independent of tissue or cell type. In the case of cancer samples, the degrees of overlap of age-associated CpG loci between study pairs were also significant, but less so than for normal samples (Figure 3D).
Age-associated DNA methylation signatures in normal and cancer
To identify tissue type-invariant age-associated CpG loci, we integrated the normalized DNAm levels of all the disease-free normal samples after removing noisy samples. Using this integrated data set, we first identified CpG loci with a linear relationship with age. For example, the CG19722847 site was linearly hypomethylated (30740381 on chr12; R = -0.65) and CG22736354 was linearly hypermethylated (18230698 on chr6; R = 0.8) according to age, regardless of tissue type (Figure 4A). However, DNAm levels of some loci showed nonlinear patterns according to age (Figure 4B). For these nonlinear relationships, we also observed more rapid changes in DNAm levels at younger ages, which is consistent with previous studies [9, 12]. We also observed similar phenomena for gene-level DNAm levels (Additional file 2: Figure S3A, B). We thus identified tissue type-invariant age-associated DNA methylation signatures using second- and third-degree nonlinear regression models in addition to the linear model. For the threshold, we used three measures, including false discovery rate (FDR) (<0.01), correlation coefficient (≥0.55), and residual error (<0.15), to reduce tissue-type variations. We identified 127 unique CpG loci in the combined normal samples, which we termed a “tissue type-invariant age-associated DNAm signature”. Among these, 80 CpG loci had a linear relationship with age and the other 47 loci were identified using nonlinear models (Additional file 1: Table S2). Seventy-seven loci were hypermethylated and 50 were hypomethylated with age. We also applied a similar approach to examine the DNAm levels of the combined 2181 cancer samples. Compared with normal samples, only 26 age-associated CpG loci were identified (Figure 4C). Interestingly, there was no CpG locus common to the normal and cancer age-associated signatures (Figure 4D). These epigenetic phenomena were also observed with the gene-level DNAm values (Additional file 2: Figure S3C, D). In case of the combined normal and cancer samples, only 18 CpG loci were identified as age-associated. We examined the positions on human chromosomes of the age-associated CpG loci of each of the signatures by separating hypomethylated (blue) and hypermethylated (green) loci, in normal (Figure 4E), cancer (Figure 4F), or combined samples (Additional file 2: Figure S4). Generally, the 127 age-associated loci in normal tissue were distributed throughout the human genome, except for chromosomes 18 and 21. In contrast to a previous study using male pediatric samples [12], the X chromosome had the largest number of age-associated loci. This difference may be caused by differences in sex, age range, and tissue types. We checked the significance of the numbers of loci by chromosome using hypergeometric tests (green bars for hypermethylation and blue bars for hypomethylation with age in Figure 4E). Chromosomes X (P = 8.1E–08), 22 (1.3E–03), 12 (1.7E–02), 1 (4.0E–02), and 16 (4.9E–02) were preferentially enriched for hypermethylated loci with age, whereas chromosomes Y (P = 3.4E–05), X (9.4E–04), 3 (9.5E–03), and 11 (4.3E–02) were enriched for hypomethylated loci. Thus, the sex chromosomes, especially X, were enriched for age-associated CpG loci in disease-free normal tissues. In cancer samples, chromosomes 3 (P = 0.03), 5 (1.7E–03), 6 (0.03), 7 (0.02), 10 (0.01), 11 (1.8E–04), and 21 (0.01) were enriched with age-associated hypomethylated loci (Figure 4F). Interestingly, there were no age-associated loci on the X or Y chromosomes in cancer samples. With the normal and cancer samples matched in age distribution, we also observed similar trends such as no overlap in signature between the normal and cancer samples (Additional file 2: Figure S5).
It was previously suggested that a difference in methylation variation might exist with gender [8]. We therefore identified age-associated signatures for males and females separately (Additional file 2: Figure S6). We found 560 (87 hypermethylated and 473 hypomethylated) and 152 (103 hypermethylated and 49 hypomethylated) CpG loci in the male and female samples, respectively. Even though the number of age-associated loci differed between male and female, their ratio distributions by chromosome were similar (P = 0.64 using a Wilcoxon rank-sum test; Additional file 2: Figure S6A). Moreover, the number of hyper- and hypomethylated loci on the X chromosome were similar between males and females (P = 0.22 using a Fisher’s test; Additional file 2: Figure S6B).
We built a feasible age-prediction model to see whether the normal 127-site signature could be used as a tissue-invariant age predictor. We applied a multiple linear regression model after identifying a feasible subset of the signature using a genetic algorithm (Methods). The selected age-prediction model was composed of 20 CpG loci of the signature (see “Predicted age” column in Additional file 1: Table S2). The correlation between the actual ages of the combined normal samples and their predicted ages using the model was highly significant (R = 0.91, P = 0.002 from 10,000 random selection tests using all loci in the platform; Figure 5A), which indicates that the DNAm levels of the age-associated signature sites can be used to predict the age of tissues, regardless of tissue type. We next compared the age-associated normal signature with those identified in previous studies (Additional file 1: Table S3). Most previous studies identified age-associated loci using a FDR threshold in a linear model. Thus, we compared the loci resulting from only linear regression (FDR < 0.01) and found that 430 age-associated CpG loci were age-associated in the integrated normal samples. For instance, a recent study using the Illumina Human 450 K platform and a linear regression model identified 137993 CpG loci associated with age in blood cells of 421 healthy subjects aged from 14 to 94 years [11]. Of these 137993 loci, the 6696 CpG loci present on the Illumina 27 K overlapped 73% with our 430 age-associated loci. Another study by Day et al. [13] found that 4747 CpG loci correlated with age in four tissue types, including brain samples, using a linear regression method, and the degree of overlap with our loci was 47%. Notably, we observed higher degrees of overlap of CpGs with previous studies that used only normal samples than with other studies that included diseased samples (Figure 5B and Additional file 1: Table S3). Sixteen of our 127 age-associated loci were not identified in the previous studies. Interestingly, 13 of these 16 loci were located on the X chromosome (see “Unique CpG” column in Additional file 1: Table S2).
Our collected DNAm profiles included diverse tissue types in the normal samples. We next identified tissue-type-specific age-associated signatures (Additional file 2: Figure S7). Although the number of age-associated loci in normal samples varied from one (for prostate) to 2713 (for peripheral whole blood) loci across tissue types (Additional file 2: Figure S7A), most tissue-type pairs showed significant degrees of overlap of the age-associated CpG loci compared with random expectation, except for the prostate tissue samples (Additional file 2: Figure S7B). In the cancer samples, we could not find significant tissue-type-specific age-associated loci with the threshold we used in most cases.
Characteristics of tissue type-invariant age-associated DNA methylation signature
We investigated the genomic locations of the loci in the age-associated DNAm signatures in normal or cancer samples. Of the 127 loci in normal samples, 78 were located in CGI regions and the others in non-CGI regions (Additional file 1: Table S4); whereas in cancer, 22 loci were located in CGIs and four in non-CGIs. Thus, while the normal signature was enriched in CGI regions, the cancer signature was even more enriched in CGI regions (Fisher’s exact test, P = 0.02). In the normal age-associated signature, hypermethylated loci were enriched in CGI regions (41 age-hypermethylated and 37 age-hypomethylated), whereas hypomethylated loci were enriched in non-CGIs (40 age-hypomethylated and 9 age-hypermethylated) (Chi-square test, P = 0.0002; Figure 5C). Interestingly, the cancer signature had only age-hypomethylated loci in both CGI and non-CGI regions (Figure 5D). Similar results were also detected for gene-level DNAm patterns (Additional file 2: Figure S8). We next checked the changing rates of DNA methylation in the normal signature within CGI or non-CGI regions. Age-hypermethylated loci increased more rapidly in CGI regions (Figure 5E). In contrast, hypomethylated loci decreased more rapidly in non-CGI regions (Figure 5F). The cancer signature of only hypomethylated loci showed much smaller changes with age than did the normal signature (Additional file 2: Figure S9).
Next, we investigated the nucleotide composition surrounding the 127 CpG loci in the normal signature (Additional file 2: Figure S10A). The sequences surrounding the normal signature showed significant overrepresentation of thymine (T) residues at -6, -3, +1, +6, +8, and +10 bases from the CpG loci (10,000 random selection tests of all CpG loci in the platform, P = 0.031, 0.036, 0.008, 0.036, 0.002, and 0.002, respectively), adenine (A) residues at +3 and +7 bases (P = 0.023, 0.035), and guanine (G) residues at -1 base (P = 0.04). Interestingly, the sequence motifs surrounding age-associated CpG loci were quite different between hypermethylated and hypomethylated loci. Sequences surrounding age-hypomethylated loci presented AT-rich sequences (Figure 5G), whereas the sequences surrounding age-hypermethylated loci were enriched for GC-rich sequences (Figure 5H); these are also enriched in CGI regions [24]. These phenomena were also observed in the cancer signature (Additional file 2: Figure S10B).
Analysis of gene ontology descriptors for the age-associated DNA methylation signature in normal samples indicated that the aging-related terms such as regulation of protein kinase activity (P = 0.01), metabolic processes (P = 0.04), immune system processes (P = 0.04), and neuron differentiation (P = 0.04) were significantly enriched in the CGI regions (Additional file 1: Table S5A). Genes related to age-associated loci in non-CGI regions (n = 48) also carried aging-related ontology terms including protein maturation (P = 0.04), and negative regulation of cell proliferation (P = 0.07) (Additional file 1: Table S5B). In cancer, the aging-related terms such as neuron apoptosis (P = 0.02) and muscle organ development (P = 0.03) were significantly enriched (Additional file 1: Table S6). We compared the normal signature with bivalent chromatin domain regions. Previously, human aging-associated DNA hypermethylation was found to occur preferentially at bivalent chromatin domains in ES cells [25]. Interestingly, we found that our normal age-associated hypermethylated loci overlapped significantly with the previously reported bivalent regions (P = 3.08E–31 using a Z-test of 10,000 random selection tests; Additional file 2: Figure S11).
Disruption of age-associated DNA methylation signature in cancer
For several of the 127 normal age-associated DNAm loci, multiple CpG sites were identified in a single gene. For example, CG13697378 (68285433 on chr1), CG09118625 (68285559 on chr1) and CG24871743 (68285238 on chr1) are located in the DIRAS family, GTP-binding RAS-like 3 (DIRAS3) gene. DIRAS3 is known as a tumor-suppressor gene that is expressed in normal ovarian or breast epithelial cells, but is rarely expressed in tumors [26]. These three loci are in CGI regions that show a positive correlation between DNAm levels and age (R = 0.57, 0.62, or 0.67, respectively) in normal tissues (Figure 6A). In cancer, however, no correlation with age that observed and methylation levels were generally high regardless of age. Interestingly, the age-associated DNAm increases or decreases of the normal signature were aberrantly accelerated in cancer samples, indicating that abnormal acceleration in age-associated DNAm change might induce tumorigenesis. In another example, CG19235307 (130642844 on chr3) and CG18303397 (130642825 on chr3), which are in non-CGI regions, are situated in MBD4, methyl-CpG-binding domain protein 4, which is associated with histone-modifying and chromatin-remodeling complexes [27]. These two loci showed a negative correlation between DNAm levels and age in normal tissues. However, the correlation with age disappeared in cancer samples and the DNAm levels of the loci were aberrantly lower (Figure 6B). Similar phenomena were also observed in the multiple age-associated CpG loci in MYF5 (myogenic factor 5) (Figure 6C) and PRR34 (Figure 6D).
Of the normal signature, eight genes were common with the tumor suppressor genes and PTTG1 was common with the proto-oncogenes listed in the UniProt Knowledgebase (http://www.uniprot.org/uniprot/; see “Tumor related” columns in Additional file 1: Table S2) [28]. The degree of overlap between the normal signature and the tumor suppressor genes was significant (P = 0.02; hypergeometric test). DNAm levels of some tumor suppressor genes including RRP22 (CG16612562 located at 28042045 on chr22), CDKN2B (CG18979223 located at 21995769 on chr9), and DIRAS3 increased with age; whereas those of SEMA3B (CG02657721 located at 50280835 on chr3) decreased (Figure 6A,E). Those tumor suppressor genes also showed abnormal acceleration in methylation changes in cancer samples.
Interaction network and sequence conservation analysis of the age-associated DNA methylation signature
We examined the human protein interaction network of the 127 normal age-associated loci, mapped to 122 unique genes (Figure 7A). For this human protein interaction network analysis, we integrated protein interactions from a number of open databases (Methods). We found that a protein interaction subnetwork that included the first neighbors of 122 gene products under the integrated network included 1163 proteins and 12,620 interactions between them. Analysis of the number of interacting neighbors revealed that the age-associated gene products in the normal signature had relatively more interacting partners than the average for all proteins in the prepared interaction network, and the overall distributions differed significantly between them (Wilcoxon rank-sum test, P = 0.0043; Figure 7B). Furthermore, the 122 unique gene products tended not to interact directly with each other: only one interaction existed among the 122 gene products (Z test, P = 0.0038; Figure 7C). This indicates that the age-associated gene products cover a large portion of the human protein interaction network. We also analyzed the DNA sequence conservation scores of the 127 CpG positions using the average phyloP [29] (Figure 7D). The results showed that 15 CpG positions (“Conserved CpG” in Additional file 1: Table S2) had significantly higher conservation scores (average phyloP score > 1.3) and that this number was significantly higher than random expectations (P = 3.56E–09 from a Z test using the phyloP distribution of 10,000 random selection tests). Gene ontology term analysis of the highly conserved loci showed that aging-associated terms such as metabolic processes (P = 1.40E–06), muscle system processes (P = 8.68E–05), and cell proliferation (P = 1.87E–02) were enriched (Additional file 1: Table S7). Of highly conserved loci, MYF5 and MYF6 are associated with myogenic regulation, which is related with a decrease of muscle in mass, strength, and contraction in aging [30]. The sequences surrounding MYF5 and MYF6 in CGI are enriched guanine and cytosine nucleotides (Figure 7E).