RELIC: REgression on Logarithm of Internal Control probes
The Illumina HumanMethylation450 BeadChip contains 93 pairs of internal normalization control probes (named with the prefix NORM_A, NORM_T, NORM_G or NORM_C), while its successor, the MethylationEPIC BeadChip, contains 85 pairs. The two probes in each pair are designed to target the same DNA region within housekeeping genes and contain no underlying CpG sites: one probe extends to incorporate a base A or T (NORM_A or NORM_T, red channel), and the other incorporates a base G or C (NORM_G or NORM_C, green channel). These probe pairs were designed to monitor array performance in different color channels and thus can be used for dye-bias correction. If there were no dye-bias, the intensity values from the two probes of each pair would be expected to be the same, with a ratio close to 1.
Scatterplots of the log-transformed intensity values between the red and green channels for these normalization control probes demonstrate a clear linear relationship in every sample. A typical plot is shown in Additional file 1: Figure S1, which is for a normal breast tissue sample from [4] (GEO accession number: GSM815146). This motivates the new method, RELIC, which first performs a regression on the logarithms of the intensity values of the normalization control probes to derive a quantitative relationship between the red and green channels, and then uses that relationship to correct for dye-bias in the intensity values of the whole array. Specifically, for each sample RELIC adjusts all intensity values from the green channel as:
$$ {I}_{i,adj}={e}^{{\widehat{\beta}}_1 \log \left({I}_i\right)+{\widehat{\beta}}_0}, $$
where \( i \) is the probe index, \( {I}_i \) is the intensity value of the probe in the green channel and \( {I}_{i,adj} \) is the adjusted intensity value; \( {\widehat{\beta}}_0 \) and \( {\widehat{\beta}}_1 \) are the linear regression coefficients from the regression of the logarithms of intensity values between paired normalization control probes for the same sample (the logarithms of intensity values from the green channel, i.e., NORM_G and NORM_C, are modeled as the independent variable). The intensity values from the red channel remain unchanged. One advantage of deriving the relationship between the red and green channels from log-transformed intensity values of the normalization control probes is that the adjusted values are obtained by exponentiation and are therefore guaranteed to be non-negative.
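The adjustment described above can be illustrated with a minimal sketch for a single sample. This is not the authors' implementation: the function name `relic_adjust` and the argument names are hypothetical, the control probe arrays are assumed to be aligned pair by pair, and all intensities are assumed to be strictly positive so that the logarithm is defined.

```python
import numpy as np


def relic_adjust(green, norm_green, norm_red):
    """Sketch of the RELIC green-channel adjustment for one sample.

    green      : green-channel intensities of all probes to adjust
    norm_green : green-channel control intensities (NORM_G / NORM_C)
    norm_red   : red-channel control intensities (NORM_A / NORM_T),
                 assumed to be in the same pair order as norm_green

    Returns the adjusted green intensities and the fitted (beta0, beta1).
    Assumption: every intensity is strictly positive.
    """
    green = np.asarray(green, dtype=float)
    x = np.log(np.asarray(norm_green, dtype=float))  # independent variable
    y = np.log(np.asarray(norm_red, dtype=float))    # dependent variable

    # Ordinary least squares fit of log(red) = beta0 + beta1 * log(green).
    # np.polyfit returns coefficients from highest degree to lowest.
    beta1, beta0 = np.polyfit(x, y, deg=1)

    # Apply the fitted relationship to every green-channel intensity;
    # exponentiation keeps the adjusted values positive.
    adjusted = np.exp(beta1 * np.log(green) + beta0)
    return adjusted, beta0, beta1
```

Under these assumptions the function would be called once per sample, with the red-channel intensities left as they are.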
Evaluation datasets
Dataset 1: 450K dataset of 39 methylation laboratory standard control samples reported by [13]. Human unmethylated DNA (HCT116 double knockout (DKO) of both DNA methyltransferases, DNMT1 (-/-) and DNMT3b (-/-)) and fully methylated DNA (HCT116 DKO DNA enzymatically methylated) were obtained commercially (Zymo Research, Irvine, CA) and mixed in different proportions to create laboratory control samples with specific methylation levels: 0, 5, 10, 20, 40, 50, 60, 80 and 100% methylated. Replicates for each methylation level (n = 10, 3, 2, 3, 3, 2, 3, 3 and 10, respectively) were independently assayed on different arrays.
Dataset 2: 450K dataset of 22 samples reported by [4]. These samples included three replicates from the HCT116 WT cell-line, three replicates from the HCT116 DNMT1 and DNMT3B double KO (DKO) cell-line, and 16 other samples (GEO accession number: GSE29290). To evaluate RELIC and other dye-bias correction methods, we used the six replicates from the HCT116 WT and HCT116 DKO cell-lines, together with the matched bisulfite pyrosequencing (BPS) data for 15 probes in the two cell-lines reported in Table 1 of [4]. As described in [4], the fifteen CpGs were selected for technical validation of the 450K array measures (six sites for the Infinium I assay and nine sites for the Infinium II assay) using the more accurate BPS method as the “gold standard”.
Dataset 3: 450K dataset of 24 samples reported by [6]. These samples included 12 blood samples and 12 saliva samples from ten individuals, with two individuals having two technical blood/saliva replicates (GEO accession number: GSE73745). We used these samples and the matched bisulfite pyrosequencing (BPS) data for three probes (cg19754622, cg16106427, cg08899523) to evaluate RELIC and other dye-bias correction methods.