A total of 610 individuals from the BSGS data set had both gene expression and DNA methylation measures as well as high density genotypes [14]. After low-level QC and normalization, expression and methylation probes values were corrected for batch, sex and age effects (see Methods). We removed probes with SNPs in them [2, 15] and probes on sex chromosomes. The final dataset consisted of 16,659 and 303,078 expression and methylation probes respectively that survived all filtering and QC steps, resulting in ≈5 × 109 possible pairwise comparisons (see Fig. 1 for the analysis work flow).
Cellular composition of the whole blood drives correlation structure between expression and methylation values
The pairwise Pearson correlation (ρ) was calculated between all pairs of expression and methylation probes. Assuming bivariate normal distribution for expression-methylation values for each probe pair, the Fisher Z transformation of the correlation coefficient allows us to obtain asymptotic p-values (H0: ρ = 0, H1: ρ ≠ 0) based solely on the correlation coefficient and sample size [16, 17]. The number of probe pairs that survived a Bonferroni correction threshold of 0.05/(16,659 × 303,078) was 17,995,980. In order to simplify initial analysis of the correlation structure, we restricted our attention to probe pairs that pass Bonferroni threshold and have a correlation |ρ|≥0.5. Given a maximum sample size of 610 individuals (ignoring potential missing values for some of the probe pairs) any correlation coefficient |ρ|≥0.27. was deemed significant. The correlation structure among those probes was visualized as a graph where the nodes denote expression/methylation probes and the edges corresponds to a correlation of |ρ|≥0.5 between the probes (Additional file 1: Figure S1). By following the edges from one node to another we can define subsets of connected nodes,known as graph components [18]. The resulting graph consisted of 24 components, with the largest component containing 5099 nodes and 23 small components with median number of nodes equal 2.
We utilized gene expression and DNA methylation data from FACS sorted blood cell types to access the cell-type specificity of expression and methylation probes from the largest graph component. Those purified cell-types data sets consist of DNA methylation data from Reinius et al. [19] and gene expression data from Primary Cell Atlas [20]. The methylation dataset consists of CD19+ B cells, CD4+ T cells, CD8+ T cells, CD56+ NK cells, monocytes, eosinophils, and neutrophils cell types and lacks data for basophils in comparison to the cell types that are measured in the BSGS dataset. Multiple lines of evidence were used to show the cell type specificity of the methylation and expression probes from the largest correlation graph component. Firstly, it was observed that 158 out of 500 hematopoietic cell-type specific methylation probes from Houseman et al. [21] are in the largest graph component. We also performed hierarchical clustering of the purified blood cell [19] samples based on the methylation probes from the largest graph component, and this grouped the samples according to their cellular identity (Additional file 2: Figure S2). Similarly, the cell type specificity of the expression probes from the largest correlation graph component was addressed with hierarchical clustering of purified blood cell [20] samples (gene expression levels for B-cells, CD4+ T cells, CD8+ T cells, NK cells, monocytes and neutrophils were available in the dataset) based on the expression probes from the component and these also grouped samples according to their cellular identity (Additional file 3: Figure S3).
The observation that the largest graph component represented the majority of the probes in the correlation graph, their cell type specificity (i.e. ability to separate purified cell samples into a clusters according to their cellular identity) led to the hypothesis that the differential cell counts among individuals are responsible for the majority of the observed correlation structure. Adjustment of expression and methylation values for nucleated cell proportions was performed in the 422 out of 610 samples in our study that had blood cellular composition measured. The correlations were recalculated based on the adjusted values, which dramatically shifted the mean correlation of the top probe pairs (|ρ|≥0.5 before adjustment) towards zero (Fig. 2).
Looking at a expression-methylation correlations from a previously published study with similar experimental setup [11], we were able to match 2,016 out of their 2,650 and 568 out of their 798 significantly correlated trans and cis probe pairs respectively in our dataset, where probes were defined as being in cis if located within 0.5Mbp of each other on the same chromosome and in trans otherwise. The authors did not have access to cell counts, but recognized possibility of cell counts biasing correlation estimates. Indeed, for those probe pairs in our dataset, the majority of their observed correlations shifted towards zero upon adjustment for cellular composition (Additional file 4: Figure S4). The mean of the absolute value of the shift between the correlations before and after the adjustment is 0.23 and 0.34 for cis and trans probe pairs respectively. Interestingly, a small proportion of the correlations between probes in cis (Additional file 4: Figure S4) and same chromosome probe pairs (Fig. 2) were robust to the correction for cell proportions in our data.
Correcting for cellular composition using predicted cell counts
Whole blood cell counts were only available for the twins and their siblings in the BSGS dataset and not their parents. In order to overcome the reduced sample size due to the availability of blood cellular composition, we predicted cell proportions via previously published algorithm that uses DNA methylation measurements [21] for all 610 samples (see Additional file 5: Figure S5, Additional file 6: Table S1 and Methods). The original method predicts 6 cell types (B cells, CD4 T cells, CD8 T cells, NK cells, monocytes and granulocytes). To obtain finer grained representation of blood composition we re-trained the method on the data from Reinius et al. [19] which allows us to predict 7 cell types (the 8 cell types measured in the 422 subsample minus basophiles, see Methods). Figure 3 shows a comparison of the correlation coefficients corrected for observed (422 sample size) and predicted (610 sample size) cell proportions. Probe pairs for Fig. 3 are selected based on p-values (Bonferroni threshold of 9.9 × 10-12) from the data corrected on predicted cell proportions.
The shift of the correlations corrected on observed cellular proportions towards zero indicates that using the predicted cellular composition was unable to remove all of the bias in the observed correlations due to differences in cellular composition. In addition we observed large graph component (11,251 probes) in the corrected-on-predicted-proportions graph (Additional file 7: Figure S6), with the majority of probes in that component being once again hematopoietic cell type specific, thus demonstrating an unaccounted for bias due to differential cell counts. This is not surprising given that there will always be some variance that is not accounted for by the predictor. Given the range of correlations between the predicted and observed cellular proportions r ≈ 0.75 − 0.95 (Additional file 6: Table S1), we estimated that 10 to 44 % of the original correlation coefficient (no adjustment) remains when using the predicted cellular proportions to adjust expression and methylation values (see Methods). We decided to avoid a high level of false positive calls at the expense of a reduced sample size by restricting our analysis to the 422 individual subset.
Phenotypically correlated expression and methylation probes
In the final 422 individual subset where all the gene expression and DNA methylation measurements are corrected for the observed cell type proportions, there are 3,321 probe pairs that passed the Bonferroni threshold (Additional file 8: Table S4), which map to 232 and 1,922 unique expression (Additional file 9: Table S5) and methylation (Additional file 10: Table S6) probes respectively. Of these probe pairs, 614 are located on the same chromosome and 2,707 otherwise. Again, the graph representation of correlation structure can be split into one largest component (1,554 nodes) and 144 components with median number of nodes equal 4 (Fig. 4). The majority of probes in each probe pair from the largest graph component are located on different chromosomes. This component contains expression probes that related to inflammation and cytotoxic T-cells (e.g. GZMH, CCL5, GPR56) (Additional file 11: Table S2), suggesting another not accounted for confounder such as the inflammation status of an individual.
Shared QTLs
To avoid potential confounding of unobserved factors such as inflammation, we searched for probes sharing expression and methylation QTLs and thus are likely to be correlated due to common genetic control underlying their variation. We previously estimated heritabilities of expression and methylation probes in the BSGS dataset as well as performed genome wide QTL mapping [2, 14]. Probes in the final list (3,321 probe pairs) are heritable, with the mean heritability of the expression and methylation probes equal to 0.29 and 0.48 respectively (Additional file 12: Figure S7 and Additional file 8: Table S4). For each probe in a probe pair located on the same chromosome, we selected all SNPs with association p-value <10-5 for both methylation and expression levels. Of 614 same chromosome probe pairs 458 share at least one QTL at this threshold (Additional file 13: Figure S8, Additional file 14: Figure S9, Additional file 15: Figure S10 and Additional file 16: Table S7). These 458 probe pairs map to 135 genomic regions, of which 125 pairs contain expression probe(s) tagging a single gene (Additional file 17: Figure S11). The methylation probes from the probe pairs that do not share QTL(s) tend to be less heritable, with a mean heritability of 0.42 compared to 0.69 for probes with a shared QTL (Additional file 18: Figure S12). Whilst expression probes from both shared and non-shared QTL pairs have mean heritability equal 0.30. A clearer picture is obtained by looking at variance explained by the best m/e SNPs for each probe. The best eSNP(s) (Additional file 19: Table S8) explain 29.4 and 5.4 % of variance on average for the expression probes from probe pairs with and without shared QTL(s) respectively. Likewise, the best mSNP(s) (Additional file 20: Table S9) expain 44.1 and 4.1 % of variance on average for the methylation probes from probe pairs with and without shared QTL(s) respectively (Additional file 21: Figure S13).
The majority of the probe pairs that do not share a QTL contribute to the largest component (144 out of 156 probe pairs), as do probe pairs located on different chromosomes (2,668 out of 2,707 probe pairs). In contrast, same chromosome probe pairs with shared QTL(s) mainly contribute to the small components of the correlation graph (447 out of 458 probe pairs).
Of the probe pairs that share QTL(s), 95.5 % are located within 1Mbp of each other, whilst only 11.5 % of pairs that do not share a QTL located within this distance. Probe pairs with and without shared QTL(s) have a mean absolute value Pearson corelation of 0.41 and 0.34 respectively (Fig. 5).
We observed great variability in the number and proportion of shared SNP(s) associated with the expression and methylation levels of a pair which likely represents LD structure at each genomic location (Additional file 13: Figures S8 and Additional file 22: Figure S14). There was 93 probe pairs (47 genomic regions) that had the same best SNP from m and e QTL association mappings. The majority of the probe pairs with shared SNP(s) (95.6 %) have their best m and e SNPs within 1 Mbp window, unlike the probes without a shared QTL (3.2 %, Fig. 6).
Concordance in genetic control of expression and methylation levels
We estimated genetic correlations between DNA methylation and gene expression for all probe pairs passing Bonferroni correction threshold (3,321 pairs) with bivariate gREML utilizing SNP based genomic relationship matrix [22, 23] (GRM) (Fig. 7). It is important to note that in this settings SNP based GRM reconstitutes pedigree structure of the BSGS dataset. Probe pairs that share QTL(s) have greater mean genetic correlation −0.69/0.68 (for positive and negative peaks respectively) in contrast to −0.48/0.4 and −0.45/0.45 for same chromosome probe pairs with no shared QTL and probe pairs on different chromosomes respectively (Fig. 7 and Additional file 8: Table S4).