- Research article
- Open Access
Guidance for DNA methylation studies: statistical insights from the Illumina EPIC array
© The Author(s). 2019
- Received: 1 October 2018
- Accepted: 2 May 2019
- Published: 14 May 2019
There has been a steady increase in the number of studies aiming to identify DNA methylation differences associated with complex phenotypes. Many of the challenges of epigenetic epidemiology regarding study design and interpretation have been discussed in detail, however there are analytical concerns that are outstanding and require further exploration. In this study we seek to address three analytical issues. First, we quantify the multiple testing burden and propose a standard statistical significance threshold for identifying DNA methylation sites that are associated with an outcome. Second, we establish whether linear regression, the chosen statistical tool for the majority of studies, is appropriate and whether it is biased by the underlying distribution of DNA methylation data. Finally, we assess the sample size required for adequately powered DNA methylation association studies.
We quantified DNA methylation in the Understanding Society cohort (n = 1175), a large population based study, using the Illumina EPIC array to assess the statistical properties of DNA methylation association analyses. By simulating null DNA methylation studies, we generated the distribution of p-values expected by chance and calculated the 5% family-wise error for EPIC array studies to be 9 × 10− 8. Next, we tested whether the assumptions of linear regression are violated by DNA methylation data and found that the majority of sites do not satisfy the assumption of normal residuals. Nevertheless, we found no evidence that this bias influences analyses by increasing the likelihood of affected sites to be false positives. Finally, we performed power calculations for EPIC based DNA methylation studies, demonstrating that existing studies with data on ~ 1000 samples are adequately powered to detect small differences at the majority of sites.
We propose that a significance threshold of P < 9 × 10− 8 adequately controls the false positive rate for EPIC array DNA methylation studies. Moreover, our results indicate that linear regression is a valid statistical methodology for DNA methylation studies, despite the fact that the data do not always satisfy the assumptions of this test. These findings have implications for epidemiological-based studies of DNA methylation and provide a framework for the interpretation of findings from current and future studies.
- DNA methylation
- Epigenome-wide association study (EWAS)
- Multiple testing
- Illumina EPIC array
There is increasing interest in the role of epigenetic processes in health and disease, with the primary focus of most epigenetic epidemiological studies being on DNA methylation (DNAm) . Platforms such as the Illumina 450 K Human Methylation microarray (450 K array) and the Illumina EPIC Human Methylation microarray (EPIC array) have enabled the economical, high-throughput profiling of methylomic variation across large numbers of samples. In recent years a number of epigenome-wide association studies (EWAS), which aim to identify DNAm differences associated with environmental exposure and disease, have been reported for a range of complex phenotypes including cancer [2–4], autoimmune disorders [5–7], psychiatric illnesses [8, 9], neurodevelopmental disorders [10, 11] and dementia [12, 13].
Primarily due to the dynamic nature of the epigenome throughout development, across different cell types and in response to environmental exposures, much has previously been written regarding the specific nuances of performing an EWAS compared to a genome-wide association study (GWAS) of genetic variation [14–16]. However, this literature is mainly focused on study design and interpretation rather than specific analytical issues relating to the characteristics of the data. One concern that has merited some discussion relates to whether the distribution of DNAm data violates the assumptions of Gaussian linear regression [17, 18], the most commonly used analysis model as it allows for the inclusion of covariates relating to both biological and technical confounders. For each molecule of DNA in a single cell, DNAm is a binary entity, in that at any cytosine it is either present or absent. However, as almost all DNAm studies profile either bulk tissue - comprising multiple cell types - or a population of purified cells, the analyses are essentially measuring the proportion of cells (taking a value between 0 and 1) in a sample that are methylated at a specific genomic position . While across the sites profiled on Illumina arrays DNAm has a bimodal distribution with peaks of hypomethylation (i.e. unmethylated sites) and hypermethylation (i.e. methylated sites), there is a significant subset of sites exhibiting intermediate levels of DNAm (proportion of methylated alleles = ~ 0.5). As the presence/absence of DNAm primarily distinguishes different cell types and tissues, in studies of a single tissue, which the majority of epigenetic epidemiology studies are, it is unlikely that the distribution at individual DNAm sites (the standard unit of analysis in an EWAS) will be bimodal. However, it is likely that the distributions will be variable and often non-normal, meaning that the assumption that the residuals of a linear regression fit are normally distributed may not hold. Furthermore, as DNAm levels are bounded by the limits of 0 and 1 it means that at the extreme ends of the distribution the variance is compressed. States of hypo and hypermethylation often define cell types and would not be expected vary biologically within a cell type, beyond any technical noise in the assay. This is exacerbated by the fact that the sensitivity of the microarray technology is less precise at these extremes of the distribution, and hence some measured variation is often present for these theoretically non-variable sites. This property of the data is called heteroskedasticity, defined as a relationship between the mean and variance of a dataset, and violates another assumption of linear regression. Although these concerns should be considered when it comes to deciding the statistical methodology, it is not currently known whether these violations are sufficient to bias analyses and introduce false positive or even false negative findings.
Consistent with studies of other types of genomic variation, another challenge for EWAS is how to account for the multiple testing burden in a typical analysis; for example, the Illumina EPIC array assays DNAm at base pair resolution for > 850,000 sites across the genome. Currently, a range of approaches are used to establish an appropriate significance threshold and there is no standard significance threshold as is used in GWAS. A common approach is a Bonferroni correction for the number of probes on the array [20–23] although this is often presumed to be too conservative as DNAm values at neighboring probes are known to be correlated , and many sites on the array are non-variable. An alternative, potentially more powerful, approach sets a permissible false discovery rate (FDR), and identifies the top associated sites that satisfy this criterion . While FDR can be calculated by generating the empirical null distribution of test statistics , it is most commonly applied using the approach introduced by Benjamini and Hochberg . This makes the assumption that under the null hypothesis the p-values across individual sites are uniformly distributed , which is not necessarily true. In EWAS it is not uncommon to see inflated test statistics [29, 30], even in the scenario of no true associations , indicating a skewed p-value distribution and perhaps reflecting unaccounted confounders such as differences in cellular composition, or certain environmental exposures such as smoking. This variation in the distribution of p-values across studies means that the FDR approach often demonstrates variable behaviour making it challenging to compare results across studies. A better approach would be to estimate the number of independent tests performed in a EWAS and make the appropriate adjustment to the significance level. Saffari and colleagues have previously applied the methodology successfully used for GWAS to DNAm data profiled on the Illumina 450 K array  in an attempt to establish a standard multiple testing threshold, however this is yet to be repeated for the EPIC array.
In this study, we used a large population based study, Understanding Society (n = 1175), where DNA extracted from whole blood was profiled using the EPIC array [33, 34] to investigate potential statistical biases of DNAm association analyses, with the goal of providing recommendations for future epigenetic epidemiology studies. First, we used a permutation procedure to establish an appropriate significance threshold that accounts for the multiple testing burden of the EPIC array. Second, we investigated whether the assumptions of linear regression are satisfied when measuring DNAm as beta-values and whether any violations bias the results of DNAm studies. Although transformations of beta-values (e.g. conversion to M-values ) have been proposed in order to better satisfy the assumptions of linear regression, these approaches have not been unanimously adopted by the community therefore we seek to determine the validity of studies that analysed beta-values. Finally, we used the significance threshold derived from our simulations to explore the statistical power of DNAm studies across various scenarios. These results of our analyses will inform the optimal approach to designing and analysing DNAm data.
Estimating a multiple testing corrected significance threshold for the EPIC array
After a stringent quality control (QC) pipeline (see Methods) and the exclusion of DNAm sites that may be technically biased by either the presence of genetic variants or cross-hybridisation to multiple genomic loci, our final dataset included DNAm estimates for 804,826 sites across the autosomes and X chromosome derived from 1175 individuals. Applying the Bonferroni correction formula for multiple testing, the significance threshold for hypothesis testing would be set to P < 6.21 × 10− 8 (0.05/804,826). In order to establish a significance threshold for EPIC array DNAm studies that controls for the number of independent tests (as opposed to the total number of sites tested), we used a permutation approach previously applied to GWAS  and 450 K array DNAm studies . This method preserves the correlation structure between sites and simulates null association studies by randomly assigning case control status. Repeating this process 100 times generates the distribution of p-values obtained by chance. From this distribution we calculated the 5% family-wise error rate (FWER) to be 9.42 × 10− 8 (Additional file 1: Figure S1). Using the inverse of the Bonferroni correction formula this is equivalent to correcting for 530,639 independent tests (0.05/9.42 × 10− 8), a reduction of 34.1% compared to the total number of sites included in the analysis.
Estimating multiple testing corrected significance threshold for a genome-wide DNAm study
Testing the assumptions of linear regression for DNAm analyses
Summary of DNA methylation sites significantly rejecting the assumptions of linear regression
N reject null hypothesis
% reject null hypothesis
Characterising DNAm sites that infringe the assumptions of linear regression
Recently, M-values have been proposed as an alternative to beta-values in EWAS analyses of traits and exposures due to their more desirable statistical properties . Although a direct comparison of beta-values and M-values is beyond the scope of this manuscript, we repeated our analyses on M-values to further interpret the results presented above. Using our experiment-wide significance threshold, 70.1% of DNAm sites demonstrated significant bias of at least one assumption when using M-values; that is just 1.09% less than the original analysis based on beta-values (Additional file 2: Table S5). Furthermore, 85.9% of DNAm sites that are considered statistically inappropriate based on beta-values were also classed as statistically inappropriate when analysed as M-values. As with the beta-value analysis, the primary assumption violated by M-values related to the shape of the distribution of residuals. In fact, a comparable number of sites demonstrated excess kurtosis regardless of whether beta-values (67.6%) or M-values (66.7%) were used. Furthermore, albeit more subtly, DNAm sites with methylation levels at the extreme ends of the distribution were more likely to fail the statistical tests (Additional file 1: Figure S8), consistent with the results of the analysis using beta-values.
Evaluating the impact on DNAm studies of sites that do not meet the assumptions of linear regression
Estimating the power of an EPIC array DNAm study
Summary of statistical power to significantly detect differential methylation between cases and controls
Mean Difference = 2%
Mean Difference = 5%
This study used a large DNAm dataset generated using the Illumina EPIC array to assess the statistical properties that influence the analytical design for hypothesis testing in epigenome-wide association studies. We estimated that there are 530,639 independent tests in a whole blood EPIC array DNAm study, which equates to a corrected significance threshold of 9.42 × 10− 8. For ease, we propose 9 × 10− 8 would be an appropriate EPIC array experiment-wide significance threshold that should be adopted by the field to minimize the reporting of false positives. Although this EPIC array experiment-wide threshold is not substantially different to a Bonferroni correction for the actual number of tests, our estimate is comparable to that proposed using a similar methodology to data from the older 450 K array , which includes approximately half the number of sites (P = 2.4 × 10− 7) that were converted to M-values. Our results indicate that the correlation in DNAm across sites included on the Illumina EPIC array is relatively small and does not encompass large genomic regions; Saffari and colleagues also observed that strong correlations between neighboring sites were typically only observed within 1 kilobase , consistent with the minimal reduction from number of actual tests to number of independent tests we report. This challenges the argument that a Bonferroni correction is too conservative and therefore a more relaxed multiple testing threshold can be applied. Existing and future studies which report results at a more lenient threshold, particularly those with small sample sizes and lower statistical power should be interpreted with caution.
We attempted to extrapolate from the experiment-wide threshold for the EPIC array to estimate an appropriate threshold for all potential tests across the genome, including those not currently profiled by the EPIC array, by using simulations to profile how the number of independent tests changes as the coverage of the microarray increases. At sufficient density, the number of independent tests should plateau; however this behaviour was not really evident across the range of densities we were able to simulate, suggesting that the EPIC array does not interrogate a large part of the variation in DNAm across the genome. Therefore, our estimate of the number of independent tests in the genome is likely to be imprecise. Moreover, given the wide confidence interval around the estimated genome-wide multiple testing burden, we recommend this result is taken with some caution. Future large population based studies that include more DNAm sites across the genome would be required to address this question. We propose that our experiment-wide significance threshold should be adopted for all future EPIC array EWAS. The use of a standardized significance threshold would benefit the field by providing a common standard for reporting associations and facilitate the comparison of results across different studies. While the threshold has been determined to minimize the reporting of false positives, it does not prevent them entirely; prudent study design and effective control of confounders are still required for high quality EWAS studies. Furthermore, replication of associations across independent datasets is still required to validate robust associations.
We also tested the assumptions of linear regression, the most commonly used tool for identifying associations between differential DNAm and a trait, when measuring DNAm using beta-values (i.e. as a proportion) and conclude that the majority of sites do not satisfy the assumption of normally distributed errors. This was particularly the case for DNAm sites that have low levels of variation or are located at the extreme ends of the distribution. While we use our experiment-wide p-value threshold to quantify the number of probes not satisfying these assumptions in order to gauge the pattern of results, we caution against using this threshold to classify sites as passing or failing these assumptions. As the statistical evidence required to reject the null hypothesis in these tests is unlikely to equate to the degree of violation of the assumption needed to influence the results of the regression analysis, it may not follow that sites that fail these tests will lead to incorrect conclusions if a linear regression model is used. As these assumptions were tested on an EWAS of chronological age, it is possible that our results are specific to this particular analysis. Furthermore, we used a European adult whole blood cohort as a basis for our assessment, which may mean that the results are not applicable to studies of other tissues, cell-types, ages or ethnicities. It is also likely that these violations of these assumptions will be more important for studies based on smaller sample sizes. For these reasons, rather than report a list of DNAm sites that do not satisfy the assumptions, we focused on characterising these sites in order to provide general guidelines. Although the specific sites that not do vary within a sample may differ between studies, we predict that it is always the non-variable sites that fail the tests of the assumptions. Some studies remove non-variable sites prior to hypothesis testing [38–40] and our results support such a filtering step. However, as we found no evidence that the lack of normal residuals, an incorrectly specified link function, or heteroskedasticity leads to either false positive or false negative associations, our data also suggests that this is not strictly necessary. A number of studies have used transformations of beta-values, for example using log ratios of methylation percentage referred to as M-values in order to obtain a normal distribution [3, 10, 18] or regression based on an alternative distribution (e.g. beta regression ); our results show that the use of linear regression with beta values in DNAm studies, even if the data do not satisfy the standard assumptions of this test, does not appear to lead to biased results. Despite considering the four key assumptions of linear regression, we did not specifically investigate the effect of outlier DNAm values, which may arise due to either technical or biological artefacts (e.g. rare SNP effects). The presence of outliers can introduce false positive associations as linear regression estimates are derived by minimising the sum of the residuals, therefore extreme values, which would lead to large residuals, can lead to larger, and therefore significant, estimated slope coefficients.
Finally, we performed power calculations to ascertain the sample size required for EPIC array studies using our proposed experiment-wide significance threshold. Most complex phenotypes are expected to be associated with small effects (typically < 5% difference between cases and controls), and our calculations indicate that with a sample size of 500 cases and 500 controls, 81% of sites have > 80% power to detect an effect of 2%. This estimate should be reassuring to the epigenetic community, as there are an increasing number of studies approaching or surpassing this sample size [9, 42–45]. Our approach advances previous efforts  by taking into account the individual properties of each DNAm site and uses an empirically derived significance threshold to provide an overview of power across the EPIC array. Finally, we have developed an online tool (https://epigenetics.essex.ac.uk/shiny/EPICDNAmPowerCalcs/) where users can perform their own bespoke calculations to quantify the power of their specific study for individual DNAm sites; we are currently extending this power calculation application for use with quantitative trait variables, and will implement an updated version in the near future.
We show that linear regression is a valid statistical methodology for DNAm studies, despite the fact that the data do not always satisfy the assumptions of the test. Additionally, we propose that a significance threshold of P < 9 × 10− 8 should be adopted to adequately control the false positive rate for EPIC array based analyses and should be accepted as a standard for reporting results. These findings have implications for epidemiological-based DNAm studies and provide a framework for the interpretation of findings from current and future studies.
All analyses were performed using the statistical language R .
Genomic-wide profiling of DNAm in understanding society
The DNAm dataset generated as part of the Understand Society study has been analysed in two previously published studies [34, 48] and a detailed description of the sample, DNAm data generation and data preprocessing can be found in the original publication . Briefly, Understanding Society (https://www.understandingsociety.ac.uk) is a longitudinal panel survey of 40,000 UK households which has collected sociodemographic information, biomedical measures and blood samples from participants. DNA was extracted from whole blood samples to facilitate genomic profiling including DNAm.
DNAm data preprocessing
DNAm was profiled for a subset of 1193 individuals from the Understanding Society study using the Illumina Infinium HumanMethylationEPIC BeadChip. Raw signal intensity data were processed from idat files through a standard pipeline using the bigmelon  and wateRmelon  packages in R. A number of quality control steps were performed to these data prior to normalization. First, outlier samples were identified using principal component analysis and mahalanobis distance equivalents, second, successful bisulphite conversion was confirmed using control probes, third the ages of the samples were estimated using the Horvath Epigenetic Clock algorithm  and compared to reported age at sampling, and fourth visualisation of principal components. These data were then normalized using the dasen method , which performs background adjustment and between-sample quantile normalization of methylated (M) and unmethylated (U) intensities separately for Type I and Type II probes. A second round of sample filtering was then performed excluding samples that were either dramatically altered as a result of normalisation or samples that had > 1% of sites with detection p-value > 0.05. DNAm sites were filtered to exclude those with a bead count < 3 or > 1% of samples with detection p-value > 0.05. The raw DNAm data of the final sample set was then re-normalized with the dasen method. Prior to data analysis, SNP probes, probes with non-specific binding, probes affected by common SNPs , and 65 probes annotated to the Y chromosome were additionally removed. The final dataset contained 1175 individuals and 804,826 DNAm sites (787,400 annotated to autosomes, and 17,426 annotated to the X chromosome).
Estimating a significance threshold for DNAm studies using the EPIC array
To estimate an experiment-wide significance threshold for the EPIC array, we applied the permutation procedure previously described by Dudbridge and Gusnanto . For each permutation, 50% of our 1175 samples (n = 557) were randomly assigned as “cases” and 50% (n = 558) as “controls” to simulate a null EWAS (i.e. no differences between cases and controls). Each of the 804,826 sites was then tested for association with this simulated phenotype using a linear regression model including sex, age, and six estimated cellular composition variables (B cells, CD8 T cells, CD4 T cells, monocytes, granulocytes, natural killer T cells) [52, 53] as covariates. We repeated this procedure 1000 times recording the smallest p-value (i.e. the most significant) from each permutation. The EPIC array significance threshold was estimated by taking the 5th percentile point of these 1000 minimum p-values representing the 5% family-wise error rate (FWER).
Estimating a genome-wide significance threshold for DNAm studies
This function was fitted using a least squares approach in R to find the value of u, which represents the number of independent tests in the entire DNA methylome. To calculate the methylome-wide significance threshold we applied the Bonferroni correction using this estimate (Pgenome = 0.05/u).
Testing the assumptions of linear regression models used in DNAm studies
To assess the validity of linear regression models in studies of DNAm, an EWAS of age was performed including sex, processing chip and six estimated cellular composition variables (B cells, CD8 T cells, CD4 T cells, monocytes, granulocytes, natural killer T cells) [52, 53] as covariates. For each of the 804,826 models (one per DNAm site) we tested for violations of the assumptions of linear regression using the gvlma (Global Validation of Linear Model Assumptions) R package . This package performs four tests to test the performance of the model fit with regards to the four assumptions of a linear regression: linearity, homoskedasticity, uncorrelatedness and normality of the residuals (Additional file 1: Figure S2). The gvlma package provides a numerical measure of violation through significance testing for skewness, kurtosis, link function, and heteroskedasticity. Briefly, the package calculates a directional test statistic for each assumption using the standardized residuals from the fitted linear model. These test statistics are each compared to a 1 degree-of-freedom chi-square distribution to calculate a p-value for hypothesis testing. In addition to obtaining a p-value for each of these four tests, the software also generates a “global” p-value, which is an omnibus test of the four others. The global test statistic is the sum of the four components (one for each assumption) and compared to a 4 degree-of-freedom chi-square distribution. The formula for each component and further details can be found in the original manuscript proposing the method . The null hypothesis for the global test is that all four assumptions hold, and the alternative hypothesis is that at least one does not (i.e. a significant p-value indicates that a linear model is not appropriate). In order to assess how DNAm sites on the EPIC array performed across these five tests we plotted Quantile-Quantile (QQ) plots of the observed vs expected p-values. To characterize sites which perform poorly in these tests we visualized correlations between the p-values from the five gvlma tests and both the mean level of DNAm and two measures of variance (standard deviation and range of the middle 80% of values). For the purpose of assessing which assumptions are most commonly violated, and which are most commonly violated simultaneously, we applied the experiment wide p-value threshold derived in the previous sections (P < 9.42 × 10− 8), to identify sites that reject the assumptions of linear regression. Finally to investigate the impact of violating the assumptions of linear regression, we calculated the mean rank across the 1000 null EWAS permutations as an indicator of how likely a site was to be associated by chance and any bias in association analyses. These mean ranks were then compared with the p-values of the gvlma tests.
Estimating statistical power for EPIC array studies
Power calculations were performed for each of the 804,826 sites in the dataset using the function pwr.t.test from the R package pwr . We consider the scenario with a binary outcome (i.e. case control study), using a two-sample t-test to compare the means of the two groups where the null hypothesis of each test is that the means of the two groups are equal. To calculate power, the parameters sample size, effect size and significance level were provided. The significance level was set as our previously calculated experiment-wide threshold of 9.42 × 10− 8. The effect size was provided as Cohen’s d, which is the expected difference between the two group means divided by their pooled standard deviation . In order to get a power estimate for the overall study, calculations were performed for every site individually using that site’s variance estimated from the Understanding Society dataset, for two different mean differences (2 and 5%). Power calculations were also performed for a range of total sample sizes (n = 100, 200, 500, 1000, 2000 and 5000) consisting of equal numbers of cases and controls. For each combination of parameters (sample size and mean difference), we calculated the percentage of sites that had sufficient statistical power across the full range of possible values (0–100%). While we only present results for a subset of the possible scenarios as a guide to the power of a typical EWAS study, we have also developed an R shiny app  to allow users to perform bespoke power calculations (https://epigenetics.essex.ac.uk/shiny/EPICDNAmPowerCalcs/). In this app, the user can specify sample size and mean difference to generate a summary results table and downloadable figure. As performing > 800,000 power calculations is time consuming, the app uses a binning method, grouping sites with similar variances and plotting a smoothed curve, to speed up the calculation. For more accurate results the user can increase the number of bins, or chose to calculate the power for all sites individually. There is also an option to search for a specific DNAm site of interest and calculate its power under user defined parameters.
Analysis was facilitated by access to the Genome high performance computing cluster at the University Of Essex School Of Biological Sciences.
DNA methylation data generation in UKHLS was funded through enhancements to the Economic and Social Research council (ESRC) grants ES/K005146/1 and ES/N00812X/1. MK is supported by the University of Essex and ESRC (RES-596-28-0001). EH, JM and LS time on this project was supported by MRC grant K013807. The funding bodies played no role in the design of the study, data collection, analysis, and interpretation or in the writing of the manuscript.
Availability of data and materials
Individual level DNA methylation are available on application through the European Genome-phenome Archive under accession EGAS00001001232 (https://www.ebi.ac.uk/ega/home). Specific details can be found here (https://www.understandingsociety.ac.uk/about/health/data). Phenotype linked to DNA methylation data are available through application to the METADAC (www.metadac.ac.uk). Analysis scripts used in this manuscript are available on https://github.com/ejh243/EPICStatsPaper.
EH designed the study. GM undertook primary statistical analyses with input from TG, YB, MK, LS, JM and EH. GM and EH drafted the manuscript. All authors read and approved the final manuscript.
Ethics approval and consent to participate
Ethical approval for the Understanding Society nurse visit was obtained from the National Research Ethics Service (Reference: 10/H0604/2). Participants gave written consent for blood sampling.
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Murphy TM, Mill J. Epigenetics in health and disease: heralding the EWAS era. Lancet. 2014;383(9933):1952–4.PubMedView ArticleGoogle Scholar
- Heyn H, Carmona FJ, Gomez A, Ferreira HJ, Bell JT, Sayols S, Ward K, Stefansson OA, Moran S, Sandoval J, et al. DNA methylation profiling in breast cancer discordant identical twins identifies DOK7 as novel epigenetic biomarker. Carcinogenesis. 2013;34(1):102–8.PubMedView ArticleGoogle Scholar
- Irizarry RA, Ladd-Acosta C, Wen B, Wu Z, Montano C, Onyango P, Cui H, Gabo K, Rongione M, Webster M, et al. The human colon cancer methylome shows similar hypo- and hypermethylation at conserved tissue-specific CpG island shores. Nat Genet. 2009;41(2):178–86.PubMedPubMed CentralView ArticleGoogle Scholar
- Lange CP, Campan M, Hinoue T, Schmitz RF, van der Meulen-de Jong AE, Slingerland H, Kok PJ, van Dijk CM, Weisenberger DJ, Shen H, et al. Genome-scale discovery of DNA-methylation biomarkers for blood-based detection of colorectal cancer. PLoS One. 2012;7(11):e50266.PubMedPubMed CentralView ArticleGoogle Scholar
- Liu Y, Aryee MJ, Padyukov L, Fallin MD, Hesselberg E, Runarsson A, Reinius L, Acevedo N, Taub M, Ronninger M, et al. Epigenome-wide association data implicate DNA methylation as an intermediary of genetic risk in rheumatoid arthritis. Nat Biotechnol. 2013;31(2):142–7.PubMedPubMed CentralView ArticleGoogle Scholar
- Rakyan VK, Beyan H, Down TA, Hawa MI, Maslau S, Aden D, Daunay A, Busato F, Mein CA, Manfras B, et al. Identification of type 1 diabetes-associated DNA methylation variable positions that precede disease diagnosis. PLoS Genet. 2011;7(9):e1002300.PubMedPubMed CentralView ArticleGoogle Scholar
- Murphy TM, Wong CC, Arseneault L, Burrage J, Macdonald R, Hannon E, Fisher HL, Ambler A, Moffitt TE, Caspi A, et al. Methylomic markers of persistent childhood asthma: a longitudinal study of asthma-discordant monozygotic twins. Clin Epigenetics. 2015;7:130.PubMedPubMed CentralView ArticleGoogle Scholar
- Pidsley R, Viana J, Hannon E, Spiers HH, Troakes C, Al-Saraj S, Mechawar N, Turecki G, Schalkwyk LC, Bray NJ, et al. Methylomic profiling of human brain tissue supports a neurodevelopmental origin for schizophrenia. Genome Biol. 2014;15(10):483.PubMedPubMed CentralView ArticleGoogle Scholar
- Hannon E, Dempster E, Viana J, Burrage J, Smith AR, Macdonald R, St Clair D, Mustard C, Breen G, Therman S, et al. An integrated genetic-epigenetic analysis of schizophrenia: evidence for co-localization of genetic associations and differential DNA methylation. Genome Biol. 2016;17(1):176.PubMedPubMed CentralView ArticleGoogle Scholar
- Ladd-Acosta C, Hansen KD, Briem E, Fallin MD, Kaufmann WE, Feinberg AP. Common DNA methylation alterations in multiple brain regions in autism. Mol Psychiatry. 2014;19(8):862–71.PubMedView ArticleGoogle Scholar
- Berko ER, Suzuki M, Beren F, Lemetre C, Alaimo CM, Calder RB, Ballaban-Gil K, Gounder B, Kampf K, Kirschen J, et al. Mosaic epigenetic dysregulation of ectodermal cells in autism spectrum disorder. PLoS Genet. 2014;10(5):e1004402.PubMedPubMed CentralView ArticleGoogle Scholar
- De Jager PL, Srivastava G, Lunnon K, Burgess J, Schalkwyk LC, Yu L, Eaton ML, Keenan BT, Ernst J, McCabe C, et al. Alzheimer's disease: early alterations in brain DNA methylation at ANK1, BIN1, RHBDF2 and other loci. Nat Neurosci. 2014;17(9):1156–63.PubMedPubMed CentralView ArticleGoogle Scholar
- Lunnon K, Smith R, Hannon E, De Jager PL, Srivastava G, Volta M, Troakes C, Al-Sarraj S, Burrage J, Macdonald R, et al. Methylomic profiling implicates cortical deregulation of ANK1 in Alzheimer's disease. Nat Neurosci. 2014;17(9):1164–70.PubMedPubMed CentralView ArticleGoogle Scholar
- Mill J, Heijmans BT. From promises to practical strategies in epigenetic epidemiology. Nat Rev Genet. 2013;14(8):585–94.PubMedView ArticleGoogle Scholar
- Relton CL, Davey Smith G. Epigenetic epidemiology of common complex disease: prospects for prediction, prevention, and treatment. PLoS Med. 2010;7(10):e1000356.PubMedPubMed CentralView ArticleGoogle Scholar
- Rakyan VK, Down TA, Balding DJ, Beck S. Epigenome-wide association studies for common human diseases. Nat Rev Genet. 2011;12(8):529–41.PubMedPubMed CentralView ArticleGoogle Scholar
- Laird PW. Principles and challenges of genomewide DNA methylation analysis. Nat Rev Genet. 2010;11(3):191–203.PubMedView ArticleGoogle Scholar
- Du P, Zhang X, Huang CC, Jafari N, Kibbe WA, Hou L, Lin SM. Comparison of Beta-value and M-value methods for quantifying methylation levels by microarray analysis. BMC Bioinformatics. 2010;11:587.PubMedPubMed CentralView ArticleGoogle Scholar
- Birney E, Smith GD, Greally JM. Epigenome-wide association studies and the interpretation of disease -omics. PLoS Genet. 2016;12(6):e1006105.PubMedPubMed CentralView ArticleGoogle Scholar
- Panni T, Mehta AJ, Schwartz JD, Baccarelli AA, Just AC, Wolf K, Wahl S, Cyrys J, Kunze S, Strauch K, et al. A genome-wide analysis of DNA methylation and fine particulate matter air pollution in three study populations: KORA F3, KORA F4, and the normative aging study. Environ Health Perspect. 2016;124(7):983–90.PubMedPubMed CentralView ArticleGoogle Scholar
- Spiers H, Hannon E, Schalkwyk LC, Smith R, Wong CC, O'Donovan MC, Bray NJ, Mill J. Methylomic trajectories across human fetal brain development. Genome Res. 2015;25(3):338–52.PubMedPubMed CentralView ArticleGoogle Scholar
- Cardenas A, Houseman EA, Baccarelli AA, Quamruzzaman Q, Rahman M, Mostofa G, Wright RO, Christiani DC, Kile ML. In utero arsenic exposure and epigenome-wide associations in placenta, umbilical artery, and human umbilical vein endothelial cells. Epigenetics. 2015;10(11):1054–63.PubMedPubMed CentralView ArticleGoogle Scholar
- Elliott HR, Tillin T, McArdle WL, Ho K, Duggirala A, Frayling TM, Davey Smith G, Hughes AD, Chaturvedi N, Relton CL. Differences in smoking associated DNA methylation patterns in south Asians and Europeans. Clin Epigenetics. 2014;6(1):4.PubMedPubMed CentralView ArticleGoogle Scholar
- Ong ML, Holbrook JD. Novel region discovery method for Infinium 450K DNA methylation data reveals changes associated with aging in muscle and neuronal pathways. Aging Cell. 2014;13(1):142–55.PubMedView ArticleGoogle Scholar
- Heyn H, Moran S, Hernando-Herraez I, Sayols S, Gomez A, Sandoval J, Monk D, Hata K, Marques-Bonet T, Wang L, et al. DNA methylation contributes to natural human variation. Genome Res. 2013;23(9):1363–72.PubMedPubMed CentralView ArticleGoogle Scholar
- Noble WS. How does multiple testing correction work? Nat Biotechnol. 2009;27(12):1135–7.PubMedPubMed CentralView ArticleGoogle Scholar
- Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc. 1995;57(1):289–300.Google Scholar
- Moskvina V, Schmidt KM. On multiple-testing correction in genome-wide association studies. Genet Epidemiol. 2008;32(6):567–73.PubMedView ArticleGoogle Scholar
- Zaghlool SB, Al-Shafai M, Al Muftah WA, Kumar P, Falchi M, Suhre K. Association of DNA methylation with age, gender, and smoking in an Arab population. Clin Epigenetics. 2015;7(1):6.PubMedPubMed CentralView ArticleGoogle Scholar
- Absher DM, Li X, Waite LL, Gibson A, Roberts K, Edberg J, Chatham WW, Kimberly RP. Genome-wide DNA methylation analysis of systemic lupus erythematosus reveals persistent hypomethylation of interferon genes and compositional changes to CD4+ T-cell populations. PLoS Genet. 2013;9(8):e1003678.PubMedPubMed CentralView ArticleGoogle Scholar
- Lehne B, Drong AW, Loh M, Zhang W, Scott WR, Tan ST, Afzal U, Scott J, Jarvelin MR, Elliott P, et al. A coherent approach for analysis of the Illumina HumanMethylation450 BeadChip improves data quality and performance in epigenome-wide association studies. Genome Biol. 2015;16:37.PubMedPubMed CentralView ArticleGoogle Scholar
- Saffari A, Silver MJ, Zavattari P, Moi L, Columbano A, Meaburn EL, Dudbridge F. Estimation of a significance threshold for epigenome-wide association studies. Genet Epidemiol. 2018;42(1):20–33.PubMedView ArticleGoogle Scholar
- Gorrie-Stone TJ, Smart MC, Saffari A, Malki K, Hannon E, Burrage J, Mill J, Kumari M, Schalkwyk LC. Bigmelon: tools for analysing large DNA methylation datasets. Bioinformatics. 2018:bty713.Google Scholar
- Hannon E, Gorrie-Stone TJ, Smart MC, Burrage J, Hughes A, Bao Y, Kumari M, Schalkwyk LC, Mill J. Leveraging DNA-methylation quantitative-trait loci to characterize the relationship between Methylomic variation, gene expression, and complex traits. Am J Hum Genet. 2018;103(5):654–65.PubMedPubMed CentralView ArticleGoogle Scholar
- Dudbridge F, Gusnanto A. Estimation of significance thresholds for genomewide association scans. Genet Epidemiol. 2008;32(3):227–34.PubMedPubMed CentralView ArticleGoogle Scholar
- Bell JT, Tsai PC, Yang TP, Pidsley R, Nisbet J, Glass D, Mangino M, Zhai G, Zhang F, Valdes A, et al. Epigenome-wide scans identify differentially methylated regions for age and age-related phenotypes in a healthy ageing population. PLoS Genet. 2012;8(4):e1002629.PubMedPubMed CentralView ArticleGoogle Scholar
- Peña EA, Slate EH. Global validation of linear model assumptions. J Am Stat Assoc. 2006;101(473):341.PubMedPubMed CentralView ArticleGoogle Scholar
- Glossop JR, Nixon NB, Emes RD, Haworth KE, Packham JC, Dawes PT, Fryer AA, Mattey DL, Farrell WE. Epigenome-wide profiling identifies significant differences in DNA methylation between matched-pairs of T- and B-lymphocytes from healthy individuals. Epigenetics. 2013;8(11):1188–97.PubMedView ArticleGoogle Scholar
- Fryer AA, Emes RD, Ismail KM, Haworth KE, Mein C, Carroll WD, Farrell WE. Quantitative, high-resolution epigenetic profiling of CpG loci identifies associations with cord blood plasma homocysteine and birth weight in humans. Epigenetics. 2011;6(1):86–94.PubMedPubMed CentralView ArticleGoogle Scholar
- Gao Z, Fu HJ, Zhao LB, Sun ZY, Yang YF, Zhu HY. Aberrant DNA methylation associated with Alzheimer’s disease in the superior temporal gyrus. Exp Ther Med. 2018;15(1):103–8.PubMedGoogle Scholar
- Triche TJ, Laird PW, Siegmund KD. Beta regression improves the detection of differential DNA methylation for epigenetic epidemiology. bioRxiv. 2016:054643.Google Scholar
- Hannon E, Schendel D, Ladd-Acosta C, Grove J, Hansen CS, Andrews SV, Hougaard DM, Bresnahan M, Mors O, Hollegaard MV, et al. Elevated polygenic burden for autism is associated with differential DNA methylation at birth. Genome Med. 2018;10(1):19.PubMedPubMed CentralView ArticleGoogle Scholar
- Wahl S, Drong A, Lehne B, Loh M, Scott WR, Kunze S, Tsai PC, Ried JS, Zhang W, Yang Y, et al. Epigenome-wide association study of body mass index, and the adverse outcomes of adiposity. Nature. 2017;541(7635):81–6.PubMedView ArticleGoogle Scholar
- Joehanes R, Just AC, Marioni RE, Pilling LC, Reynolds LM, Mandaviya PR, Guan W, Xu T, Elks CE, Aslibekyan S, et al. Epigenetic signatures of cigarette smoking. Circ Cardiovasc Genet. 2016;9(5):436–47.PubMedPubMed CentralView ArticleGoogle Scholar
- Joubert BR, Felix JF, Yousefi P, Bakulski KM, Just AC, Breton C, Reese SE, Markunas CA, Richmond RC, Xu CJ, et al. DNA methylation in newborns and maternal smoking in pregnancy: genome-wide consortium meta-analysis. Am J Hum Genet. 2016;98(4):680–96.PubMedPubMed CentralView ArticleGoogle Scholar
- Tsai PC, Bell JT. Power and sample size estimation for epigenome-wide association scans to detect differential DNA methylation. Int J Epidemiol. 2015.Google Scholar
- R Development Core Team. R: A language and environment for statistical computing. Vienna: R Foundation for Statistical Computing; 2008.Google Scholar
- Gorrie-Stone TJ, Smart MC, Saffari A, Malki K, Hannon E, Burrage J, Mill J, Kumari M, Schalkwyk LC. Bigmelon: tools for analysing large DNA methylation datasets. Bioinformatics. 2019;35(6):981–6.PubMedView ArticleGoogle Scholar
- Pidsley R, Wong CCY, Volta M, Lunnon K, Mill J, Schalkwyk LC. A data-driven approach to preprocessing Illumina 450K methylation array data. BMC Genomics. 2013;14.PubMedPubMed CentralView ArticleGoogle Scholar
- Horvath S. DNA methylation age of human tissues and cell types. Genome Biol. 2013;14(10):R115.PubMedPubMed CentralView ArticleGoogle Scholar
- McCartney DL, Walker RM, Morris SW, McIntosh AM, Porteous DJ, Evans KL. Identification of polymorphic and off-target probe binding sites on the Illumina Infinium MethylationEPIC BeadChip. Genomics Data. 2016;9(September):22–4.PubMedPubMed CentralView ArticleGoogle Scholar
- Houseman EA, Accomando WP, Koestler DC, Christensen BC, Marsit CJ, Nelson HH, Wiencke JK, Kelsey KT. DNA methylation arrays as surrogate measures of cell mixture distribution. BMC Bioinformatics. 2012;13:86.PubMedPubMed CentralView ArticleGoogle Scholar
- Koestler DC, Christensen B, Karagas MR, Marsit CJ, Langevin SM, Kelsey KT, Wiencke JK, Houseman EA. Blood-based profiles of DNA methylation predict the underlying distribution of cell types: a validation analysis. Epigenetics. 2013;8(8):816–26.PubMedPubMed CentralView ArticleGoogle Scholar
- Champely S: pwr: Basic Functions for Power Analysis. In., 1.2-2 edn. https://CRAN.R-project.org/package=pwr; 2018.
- Cohen J. Statistical power analysis for the behavioral sciences, Second edn. Hillsdale, N.J: Lawrence Erlbaum Associates; 1988.Google Scholar