In humans, methylation occurs mainly in the context of cytosines followed by guanines (CpGs) . Over 70% of CpG sites throughout the genome are methylated; however, CpG-rich regions (known as CpG islands), found in approximately 60% of gene promoter regions, are usually unmethylated [2, 3]. DNA methylation is an important epigenetic mechanism used by cells to regulate gene expression and is essential for normal cell development [4, 5]. Aberrant DNA methylation patterns have been observed in various human diseases [6, 7], including cancer, where hypermethylation of CpG islands with resultant transcriptional silencing of tumour suppressor genes is recognized as a common mechanism for gene regulation [8, 9]. As such, determining genome-wide DNA methylation status plays a crucial role in improving our understanding of the mechanisms of disease formation.
Several methods have been developed to detect the DNA methylation of cytosines distributed over the human genome. These include methylated DNA immunoprecipitation sequencing (MeDIP-seq ), reduced representation bisulfite sequencing (RRBS ), methylated DNA capture by affinity purification (MethylCap-seq ), whole-genome bisulfite sequencing (WGBS) and lower-resolution assays such as the Infinium HumanMethylation27 (HM27K) array [13, 14] and the Infinium HumanMethylation450 BeadChip (HM450K bead array; Illumina, Inc, CA, USA) . Each of these methods has advantages and shortcomings when detecting differentially methylated regions in disease studies (see [13, 14, 16–19] for reviews). The choice of technology is usually determined by cost, with array technologies providing a cheaper option, albeit at lower resolution. However, a recent study profiling regions of differential methylation across a range of human samples demonstrated that only a small fraction of CpGs across the genome vary in methylation status . This means that whole-genome sequencing approaches may not be necessary for comprehensive methylation profiling, and suggests great promise for the continued use of array-based approaches such as the HM27K and HM450K assays.
The HM450K array is a cost- and time-efficient technology that makes it possible to assess the methylation status of over 450,000 CpGs in the genome for large sample cohorts . The array covers 96% of CpG islands and CpG shores, 99% of RefSeq genes, and 94% of the loci present on the HM27K bead array, together with additional CpGs identified as variable in various WGBS methylation investigations .
To detect the methylation status at individual CpG loci, the Illumina Infinium assay relies on hybridization of bisulfite-converted DNA fragments to bead-bound probes . Two probe types exist on the array, Infinium I and Infinium II. Infinium I probes interrogate the methylation status of a CpG using the ratio between two probes that hybridize to either the methylated or the unmethylated DNA template flanking the CpG of interest. Infinium II probes use a single probe with a single fluorescently tagged base ligation that is specific for the methylated or unmethylated state of the CpG of interest. Because Infinium I and Infinium II probes use different chemistries and interrogate different populations of CpGs, the two probe groups have different distributions of DNA methylation measurements on the HM450K bead array . As such, several R packages offering a range of normalization methods for the HM450K bead array have been developed to account for this difference [13, 21–25].
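The ratio-based readout described above is commonly summarized as a beta value, the fraction of total signal that comes from the methylated channel. The following is a minimal sketch of that calculation, not the array vendor's software: the function name `beta_value` is hypothetical, and the stabilizing offset of 100 is an assumption based on the commonly used convention for these arrays.

```python
def beta_value(methylated: float, unmethylated: float, offset: float = 100.0) -> float:
    """Return the methylated fraction of total probe signal.

    `methylated` and `unmethylated` are non-negative fluorescence
    intensities for one CpG. `offset` (assumed default: 100) keeps the
    ratio stable when both intensities are close to zero.
    """
    if methylated < 0 or unmethylated < 0:
        raise ValueError("intensities must be non-negative")
    return methylated / (methylated + unmethylated + offset)
```

The result lies between 0 (unmethylated) and just under 1 (fully methylated). The same formula applies to both probe types; what differs is where the two intensities come from (two bead types for Infinium I, two colour channels of one bead type for Infinium II), which is one reason the resulting distributions differ.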
While recent reports have illustrated the accuracy and reproducibility of this platform [15, 26–28], many studies have also reported that probes on the array may produce erroneous results due to genomic factors other than methylation that affect hybridization or base ligation. For example, a number of studies have shown that a probe’s ability to measure DNA methylation accurately can be affected by single nucleotide polymorphisms (SNPs) at the interrogated CpG (20,879 probes in , 40,484 probes in , and 66,877 probes in ). It has also been suggested that SNPs within 10 bp of the interrogated CpG can affect probes (36,535 in ). Given that previous studies on gene expression arrays, which also rely on hybridization, have shown that SNPs and short insertions and deletions (INDELs) overlapping probe regions affect hybridization [32–34], SNPs are likely to impact our ability to measure methylation using the HM450K array. In addition to variants affecting methylation calling, a significant number of probes have been shown to map to multiple locations in the genome. Cross-reactivity at these regions can compromise true signal detection by the array, and many studies have suggested removing these probes from analysis (29,233 X chromosome probes  and 40,590 autosomal probes ). This effect is compounded by bisulfite treatment, which converts unmethylated C to T and so yields a “bisulfite genome” of reduced complexity in which more probes map to multiple locations. It has also been suggested that probes spanning regions of the genome that contain repeats yield erroneous methylation calls [13, 35] and should be filtered. Probes have even been filtered in regions of copy-number change , despite a study that analyzed the effect of copy number on observed methylation at CpG sites using the HM27K array and concluded that there was no systematic copy number effect on methylation status .
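Two of the ideas above can be illustrated with a short sketch: why bisulfite conversion increases multi-mapping, and how a distance-based SNP filter flags probes. This is an illustration under simplifying assumptions, not the filtering procedure of any cited study. The function names, probe names and coordinates are hypothetical, and the conversion assumes that no cytosine is methylated, so every C becomes T.

```python
def bisulfite_convert(seq: str) -> str:
    """In-silico bisulfite conversion, assuming no cytosine is methylated.

    Every C is read as T, so the sequence collapses onto a three-letter
    alphabet (A, G, T).
    """
    return seq.upper().replace("C", "T")


def probes_near_snp(probe_cpg_pos, snp_pos, window=10):
    """Return sorted names of probes whose CpG lies within `window` bp of a SNP.

    probe_cpg_pos: dict mapping probe name -> (chromosome, CpG position)
    snp_pos:       iterable of (chromosome, SNP position)
    A distance of 0 (SNP at the CpG itself) is also flagged.
    """
    snps = list(snp_pos)
    flagged = []
    for name, (chrom, pos) in probe_cpg_pos.items():
        if any(c == chrom and abs(p - pos) <= window for c, p in snps):
            flagged.append(name)
    return sorted(flagged)
```

For instance, the distinct sequences `ATCGTTAC` and `ATTGTTAT` both convert to `ATTGTTAT`, so a probe designed against one can no longer be distinguished from the other after conversion. The default `window=10` mirrors the 10 bp distance mentioned above; a real pipeline would also need to account for strand, probe type and allele frequency.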
In many of these previous studies, the effects of factors causing noise in methylation measurement were not determined directly but inferred from observed trends, such as increased standard deviation across multiple samples from the same tissue in probes affected by SNPs at the CpG . Usually, in the absence of more information, a conservative approach has been taken, aggressively filtering any probe that may be affected. None of the previous studies has performed a systematic analysis of all the potential factors affecting probes on the HM450K bead array. In this study, we perform a rigorous analysis of the effects of SNPs, INDELs, repeats and multi-mapping probes. In contrast to previous studies, we compare these confounding effects against WGBS data. Our analysis yields a set of probes that should be removed during analysis because we have shown that they provide a noisy signal (i.e. increased deviation in measurements from the HM450K bead array compared with whole-genome bisulfite sequencing). By removing these probes, we show recovery of biologically relevant results that would have been missed without our probe filtering approach.
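The notion of a noisy signal used here, deviation of the array measurement from WGBS, can be sketched as a per-group summary. The snippet below is a hedged illustration rather than the analysis performed in this study: the function name, the use of mean absolute deviation as the summary statistic, and the dictionary-based inputs are all assumptions made for the example.

```python
from statistics import mean


def mean_abs_deviation(array_beta, wgbs_fraction, probe_ids):
    """Mean absolute difference between array and WGBS methylation levels.

    array_beta:    dict mapping probe name -> array methylation level (0 to 1)
    wgbs_fraction: dict mapping probe name -> WGBS methylated fraction (0 to 1)
    probe_ids:     probes to summarize; probes missing from either
                   platform are skipped.
    """
    diffs = [
        abs(array_beta[p] - wgbs_fraction[p])
        for p in probe_ids
        if p in array_beta and p in wgbs_fraction
    ]
    if not diffs:
        raise ValueError("no probes measured on both platforms")
    return mean(diffs)
```

Computing this value separately for probes flagged by a candidate factor (for example, a SNP at the CpG) and for unflagged probes gives a direct comparison: a clearly larger deviation in the flagged group would support filtering on that factor, whereas similar values would suggest the filter discards usable probes.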