Ethics statement
All participants gave written informed consent. This study was approved by Medical Research Ethics Committee (MREC) of the University Medical Center Utrecht, The Netherlands.
Pre-processing of genotype, methylation and expression data
Genotype, methylation and expression data were collected for different numbers of samples. For the 148 healthy subjects eventually analyzed in this paper, data was available for all three layers of genetic information after quality control, as described below. Our final data set consisted of 72 males and 76 females with a mean age of 52 (range: 19–88); all subjects were of Dutch ancestry with at least three of the four grandparents born in The Netherlands.
Genotype SNP data
Genotype data for subjects was generated on two different array platforms, 105 individuals on Illumina CytoSNP (299,173 SNPs) and 96 on Illumina 300k chips (300,299 SNPs). For each SNP platform, quality control procedures were initially performed separately using PLINK[24]. Subjects were excluded based on > 5% missing genotypes and gender errors (Additional file11). We used linkage disequilibrium (LD) based SNP pruning to select the most informative SNPs (R2<0.2), only for subsequent quality control steps. This resulted in ~60k SNPs for both sets to assess heterozygosity (F<3 Standard Deviation (SD)), homozygosity (F>3SD) and relatedness by pairwise identity by descent (IBD) values (pihat > 0.1). Datasets were merged with Hapmap Phase 3 individuals to check ethnicity (Additional file12) (ethnic outliers detected by visual inspection). After these QC procedures on subjects (excluding in total 8 individuals) quality control on SNPs was performed as follows. All SNPs were filtered on missingness (> 2%) and Hardy Weinberg (p>1e-6) before merging the two datasets. 84,367 SNPs were shared between the two datasets. No related samples were detected in the merged datasets (according to criteria described above). We imputed the merged dataset with Hapmap2, release 24 using Beagle[40]. SNPs with an imputation score > 0.8 and present originally in one or both datasets were extracted and 417,708 SNPs remained for all further analyses.
DNA methylation data
Methylation data was obtained using Illumina HumanMethylation 27 beadchips for two batches of 105 and 96 healthy subjects. The assay detects methylation status at CpG sites after bisulfate conversion, by means of probes designed for either methylated or unmethylated sequence. Methylation probes were classified into 3 different categories depending on the location of the probe with respect to a CpG island. Based on the UCSC Table browser (http://genome.ucsc.edu/;[41]), NCBIbuild36, categories were defined as CpG island, CpG island shore (sequences up to 2kb from an island), or outside CpG islands/shores. Ethnical outliers and samples with gender errors in genotype data were removed from the methylation data. Gender was checked by hierarchical clustering of X-chromosomal probes, excluding four individuals. Another three individuals were removed based on detection p-values (> 0.01 for > 1% of probes) and 3,027 of 27,578 probes were excluded based on detection values (p>0.01 for > 1% of the samples). Both channels of the methylation array were quantile normalized independently. Beta values of a probe were calculated by dividing the methylated signal by the sum of the methylated and unmethylated signal. Next, five potential array outliers were removed in an unbiased fashion. Specifically, we used the SampleNetwork R function package[42] to calculate the Interarray based sample connectivity score Z.k. We removed samples with a Z.k value less than −3 since their connectivity is 3 standard deviations below the mean value. Batch effects of dataset, plate, array and position were removed using ComBat[43]. After these procedures, 24,561 probes remained and were mapped to the human genome using the UCSC Human BLAT Search function. In total, 25 probes did not map to the human genome, whereas 338 probes did not map uniquely (mapped more than once), and both these probes have been removed. Moreover, 904 probes that contained a SNP, based on Hapmap release 27, with a minor allele frequency (MAF) > 1% have been removed as well, leaving a total of 23,294 probes for analyses.
Gene expression data
Gene expression data was generated in two batches, one on Illumina H8 beadchip (26 healthy subjects) and one on Illumina H12 beadchip (147 healthy subjects). BeadStudio© software version 3.2.3 was used to generate background-corrected gene expression data. Data was normalized, transformed and filtered separately before merging and batch effect removal. Specifically, the datasets were separately quantile normalized and log2 transformed using the Lumi package for R[44]. Probes were filtered based on detection values generated by BeadStudio©. The detection p-value threshold was set at 0.01. This resulted in 17,433 expression probes overlapping between both batches. Batch effects resulting from the use of different arrays at different time points were removed using ComBat[43]. An unbiased analysis based on interarray correlations identified 16 samples from batch 2 as potential outliers, which were subsequently removed from the analysis. Of 17,433 probes, 15,983 mapped to a single genomic location, based on a previous study[45]. In addition, 465 probes contained a SNP, based on Hapmap release 27, with a MAF > 1% and have been removed, leaving 15,983 probes for analyses.
DNA methylation and gene expression data have been processed using the same blood sample, excluding possible batch effects, such as the effect of different time points.
Identifying cis and trans effects between DNA methylation and gene expression
We called a methylation probe cis acting with respect to a given gene expression probe if there was a significant association (as defined below) within a 500kb interval between the probes. A methylation probe was called trans acting if it was significantly associated with the expression probe (as defined below) outside the 500kb interval.
To determine whether a significant association exists between expression and methylation levels we used a multivariate linear regression model for regressing the gene expression level (dependent variable) on the methylation level (independent variable) with age and gender as covariates. We took methylation levels as independent variable since we are interested in the epigenetic control of gene expression levels. Associations can be positive (DNA methylation levels and gene expression levels both increase or decrease) or negative (increased methylation level corresponds with a decrease in gene expression level and vice versa). The Wald test p-value for the association between methylation and expression was used as significance level. Correction of the significance level for multiple testing was performed separately for identifying cis acting methylation probes (FDR correction) and trans acting methylation probes (Bonferroni correction).
Identification of cis-and trans-acting SNPs
Expression levels and methylation levels that were significantly associated with each other were tested for association with SNPs to identify cis-and trans-acting genetic variations. For this analysis, the real and imputed (imputation score > 0.8) genotypes were used, and a MAF threshold of 5% for these SNPs was set.
Analogous to our previous definition, a SNP significantly associated with a given gene expression or DNA methylation probe was called cis-acting with respect to the probe if the SNP and the probe were within 500kb of each other, and trans-acting if they were more than 500kb apart.
To determine whether a significant relationship exists between a SNP and a methylation or expression level we again used a multivariate linear regression model for regressing the methylation or expression level (dependent variable) on the SNP (independent variable) with age and gender as covariates. The regressions were performed using the PLINK software[24]. Correction for multiple testing was performed separately for cis-acting SNPs (0.05 divided by the number of probes) and trans-acting SNPs (0.05 divided by the number of possible combinations (p<0.05/(#probes*417,708).
Evaluating causal relationships using local edge orienting scores of observed cis effects
To evaluate the fit of different causal models involving 3 variables (i.e., a cis-acting SNP, a cis-acting methylation probe, and a corresponding expression probe), we calculated the single marker local edge orienting score (LEO.NB.SingleMarker) as described elsewhere[19, 32]. In short, a SNP can be used as causal anchor for evaluating the causal relationships between methylation and expression levels if the SNP is associated with at least one of them. We use the SNP as causal anchor for calculating the LEO score since genotypes are fixed at each locus as opposed to variable methylation and expression levels[19]. In this case, one can evaluate the fit of the following five models describing the causal relationships between a SNP (denoted S), a methylation probe (M) and an expression probe (E): model 1: S→M→E; model 2: S→E→M; model 3: M←S→E; model 4: S→E←M; model 5: S→M←E. For each causal model a chi-square test based model fitting p-value was calculated with the structural equation modelling (SEM) R package[46]. The relative fit of causal model 1 (SNP→Methylation→Expression) was assessed using the single anchor local edge orienting score (LEO.SingleMarker), which is the logarithm (base 10) of the ratio of the model fitting p-value divided by that of the next best fitting alternative model[19]. Thus a positive LEO.SingleMarker score indicates that the causal model S→M→E fits the data better than all other competing models. As significance threshold we used the LEO threshold of 0.8, as recommended in[19] based on extensive simulations as well as empirical studies. We decided to focus on local cis effects since there is considerable debate in the literature whether trans relationships are reproducible[29, 30]. Since we were interested in causal direction for predetermined three-way associations, we only selected SNPs associated with both the methylation and expression levels in cis. To protect the causal analysis from biases due to age and gender, we utilized residuals of methylation and expression levels corrected for age and gender in the causal analysis using a linear regression by Limma in R[47].
Weighted correlation network analysis of gene expression and methylation data
A detailed description of our correlation module based analyses can be found in Additional file5. Here we provide a terse summary. Weighted correlation network analysis implemented in the WGCNA R package[26, 27] was first applied to the expression data to identify co-expression modules. Co-expresssion modules correspond to clusters of interconnected genes defined as branches of a hierarchical cluster tree. Since modules are defined without respect to gene ontology information they are initially labelled by arbitrary integers and coded by colours. Next WGCNA was applied to the methylation data to find co-methylation modules. For easier interpretation of the relationships between expression and methylation modules, we use the same module labels for modules that show significant overlap. The matching of module labels was performed using the function matchLabels from the WGCNA R package; it is based on significance of module overlaps quantified using Fisher’s exact test. Weighted networks have the advantage of preserving the continuous nature of co-expression and co-methylation information, which is particularly useful when studying module preservation. To assess the preservation of expression and methylation modules in the corresponding complementary data set, we use the network module preservation statistics described in[28] and implemented in the function modulePreservation in the WGCNA R package. Network module preservation statistics assess whether the density and connectivity patterns of modules defined in a reference data set are preserved in a test data set. Network preservation statistics do not require that modules be identified in the test data set and hence independent of the ambiguities associated with module identification in the test data set. The permutation test of the modulePreservation function leads to a composite module preservation statistic referred to as Zsummary. The Zsummary statistic of a given module summarizes the evidence that the network connections of the module are more significantly preserved than those of random set of genes of equal size. We adopted the following recommended significance thresholds for Zsummary[26–28]: Zsummary<2 implies no evidence that the module is preserved, 2<Zsummary<10 implies weak to moderate evidence, and Zsummary>10 implies strong evidence for module preservation. Thus, we report Z summary for each expression and methylation module in the methylation and expression test data sets, respectively.
Since modules group together highly correlated variables, it is advantageous to summarize the variable profiles using a single representative. We use the module eigengene E, defined as the first principal component of the standardized matrix containing the variables in the module. The module eigengene can be intuitively understood as a weighted average of the variable profiles in the module.