Muscle and adipose NGT and T2D samples were collected as previously described. Briefly, we attempted to contact participants and participants’ relatives from previous diabetes-related studies [45,46,47,48] and also recruited subjects through newspaper advertisements. We excluded individuals with any diseases or drug treatments that might confound analyses. We defined the glucose tolerance categories of NGT and T2D using World Health Organization (WHO) criteria. Biopsies were performed by 9 experienced and well-trained physicians from 2009 to 2013 at 3 study sites (Helsinki, Kuopio, and Savitaipale). The study was approved by the coordinating ethics committee of the Hospital District of Helsinki and Uusimaa. Written informed consent was obtained from all subjects.
Islet samples were collected as previously described. Briefly, samples were procured from the Integrated Islet Distribution Program, the National Disease Research Interchange (NDRI), or ProdoLabs. Islets were shipped overnight from distribution centers, prewarmed in shipping media for 1–2 h before harvest, and cultured in tissue culture-treated flasks. Genomic DNA was then isolated from islet explant cultures and used for sequencing.
Whole genome sequencing libraries were generated from 50 ng of genomic DNA fragmented by Covaris sonication. DNA end repair was achieved using Lucigen DNA Terminator Repair Enzyme Mix. Sequencing adapters were added according to the Illumina PE Sample Prep instructions. Libraries were size-selected on Invitrogen 4–12% polyacrylamide gels, excising 200–250 bp fragments. Libraries were amplified with 10 PCR cycles and purified using AMPure beads (Beckman).
Whole-genome bisulfite sequencing
Whole-genome bisulfite sequencing was performed using the EpiGnome/TruSeq DNA Methylation Kit (Illumina). Libraries were prepared for each sample from 50 ng of input DNA, which was denatured at 98 °C for 10 min. Bisulfite conversion was performed at 64 °C for 2.5 h, and the DNA was purified using the EZ DNA Methylation Gold Kit (Zymo Research). Bisulfite-converted libraries were generated by random-primed DNA synthesis, 3′ tagging, and purification using AMPure beads (Beckman). Sample-specific index sequences were added with 10 cycles of amplification.
Library quality was assessed using Qubit (Thermo Fisher Scientific) and Agilent Bioanalyzer. Paired-end 125 bp sequencing was performed on Illumina HiSeq 2500 instruments to 30× genome coverage.
Genomic DNA was extracted from each tissue using DNeasy Blood and Tissue Kits (QIAGEN), according to the manufacturer’s recommendations. 200 ng of genomic DNA per sample was submitted to the Center for Inherited Disease Research at The Johns Hopkins University, where samples were bisulfite-converted using EZ DNA Methylation Kits (Zymo Research), as part of the TruSeq DNA Methylation protocol (Illumina). DNA methylation was measured using the Illumina Infinium HD Methylation Assay with Infinium MethylationEPIC BeadChips according to the manufacturer’s instructions.
WGS data processing
Raw FASTQ files were evaluated with FastQC. Adapter sequences were trimmed using Atropos, and read pairs with at least one read shorter than 25 bp were excluded. Reads were aligned to the reference genome (GRCh37) using BWA-MEM, followed by Samblaster for marking duplicates.
WGS variant calling
SNPs and indels were called separately for all sample BAM files using GATK HaplotypeCaller. Variants were filtered using GATK Variant Quality Score Recalibration. Quality score cutoffs were chosen by comparing rates of discordance with SNP array genotypes.
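As a rough illustration of how such cutoffs can be chosen, the sketch below computes discordance against array genotypes at shared sites. The genotype encoding, site keys, and quality values are hypothetical, not taken from the study’s pipeline; the study used GATK VQSR tranches for the same purpose.

```python
# Sketch: choose a variant quality cutoff by comparing WGS genotype calls
# against SNP array genotypes at shared sites. All names and the toy data
# below are illustrative.

def discordance_rate(wgs_calls, array_calls, qual_cutoff):
    """Fraction of shared sites where a WGS genotype passing the quality
    cutoff disagrees with the array genotype."""
    shared = [(site, gt) for site, (gt, q) in wgs_calls.items()
              if q >= qual_cutoff and site in array_calls]
    if not shared:
        return 0.0
    discordant = sum(1 for site, gt in shared if gt != array_calls[site])
    return discordant / len(shared)

# Toy example: genotypes coded as allele strings, each with a call quality.
wgs = {"chr1:100": ("0/1", 50.0), "chr1:200": ("1/1", 12.0),
       "chr1:300": ("0/0", 80.0)}
array = {"chr1:100": "0/1", "chr1:200": "0/1", "chr1:300": "0/0"}

low = discordance_rate(wgs, array, qual_cutoff=10.0)   # includes the bad call
high = discordance_rate(wgs, array, qual_cutoff=30.0)  # filters it out
```

Raising the cutoff trades call quantity for concordance with the array truth set.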
WGBS data processing
Raw FASTQ files were pre-processed as above and aligned using bwa-meth. Methylation values were extracted using the MethylDackel ‘extract’ command, including bias correction based on the values recommended by the ‘mbias’ command, and forward- and reverse-strand CpGs were merged with a minimum coverage cutoff of 10 (https://github.com/dpryan79/methyldackel). Methylation data from the X and Y chromosomes were excluded.
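The strand-merging and coverage step can be sketched as follows; MethylDackel performs this internally, and the count layout here is only illustrative.

```python
# Sketch: merge methylated/unmethylated counts from the forward- and
# reverse-strand cytosines of a CpG, then compute a beta value subject to
# the 10x minimum coverage cutoff described in the text.

MIN_COVERAGE = 10

def merged_beta(fwd_meth, fwd_unmeth, rev_meth, rev_unmeth):
    """Return the merged beta value, or None if coverage < MIN_COVERAGE."""
    meth = fwd_meth + rev_meth
    total = meth + fwd_unmeth + rev_unmeth
    if total < MIN_COVERAGE:
        return None  # CpG treated as missing in this sample
    return meth / total

beta = merged_beta(fwd_meth=6, fwd_unmeth=1, rev_meth=5, rev_unmeth=2)     # 14x
missing = merged_beta(fwd_meth=2, fwd_unmeth=1, rev_meth=3, rev_unmeth=1)  # 7x
```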
EPIC Array data processing
The EPIC data are part of a much larger, unpublished study; as such, all samples were processed jointly with other samples from that larger study. We processed raw signal IDAT files using minfi v1.20.2 with the Illumina normalization method. We analyzed the quality of each sample by looking for outliers across a variety of measures, including the fraction of failed probes (detection p-value > 0.05), median methylated and unmethylated intensity, control probe signal (using the returnControlStat function from shinyMethyl v1.10.0), distribution of the overall methylation profile, and principal component analysis. None of the samples included in this study were flagged as outliers. In addition, we verified the identity of each sample by comparing genotypes assayed on the EPIC array to genotypes imputed using the HRC reference panel r1.1 and Illumina Omni2.5 array genotypes.
For both the earlier 450k and the more recent EPIC Illumina methylation arrays, previous studies [59,60,61,62] have identified poor-quality probes that either do not map uniquely to the reference genome or contain common genetic variation. These properties make the signal at such probes unreliable, so we removed them from the EPIC array. First, we removed cross-reactive probes on the EPIC chip by mapping non-control probes back to the entire bisulfite-converted genome using Novoalign’s -b4 option, allowing up to three mismatches in the probe alignment (-R120 option). We kept only uniquely mapping probes. Second, we removed probes with a SNP within 10 bp of the 3′ end of the probe or within the target CpG itself and, in the case of type I probes, probes where the variant overlapped the single-base extension site. We used a 10 bp cutoff for consistency with previous studies. For SNPs, we used common (MAF ≥ 1%) SNPs, indels, or structural variants in the phase 3 1000 Genomes European dataset; common (MAF ≥ 1%) SNPs in the HRC reference panel r1.1; and SNPs appearing at all in our own samples, even at low frequency, after imputation to the Haplotype Reference Consortium (HRC) reference panel. As a final step, we combined our blacklist with a previously published blacklist for a total of 120,627 probes, which were removed from analysis. In addition, we removed probes per tissue with a high detection p-value (p-value > 0.05 in ≥ 5% of samples from the larger study). After the blacklist filters, these per-sample filters removed an additional 578 adipose probes, 733 muscle probes, and 2206 islet probes.
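The per-tissue detection p-value filter can be sketched as below; the probe IDs and p-values are illustrative.

```python
# Sketch: drop array probes whose detection p-value exceeds 0.05 in at
# least 5% of samples, as described in the text.

def failed_probes(detection_p, threshold=0.05, max_fail_frac=0.05):
    """detection_p: dict probe_id -> list of per-sample detection p-values.
    Returns the set of probe IDs failing in >= max_fail_frac of samples."""
    failed = set()
    for probe, pvals in detection_p.items():
        n_fail = sum(1 for p in pvals if p > threshold)
        if n_fail / len(pvals) >= max_fail_frac:
            failed.add(probe)
    return failed

pvals = {
    "cg0001": [0.001, 0.002, 0.001, 0.003],  # detected in every sample
    "cg0002": [0.001, 0.200, 0.001, 0.001],  # fails in 25% of samples
}
bad = failed_probes(pvals)
```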
Identification of higher across-tissue variance regions in WGBS
Using all 58 WGBS samples, we calculated the variance in beta values for each CpG. We then searched for blocks of consecutive CpGs that had 1) variance above the third quartile of variance levels and 2) a non-missing methylation value in at least 20 samples, a cutoff which was determined by looking at the distribution of variance values as a function of missing values (Additional file 1: Figure S12). This analysis identified approximately 200,000 blocks of high across-tissue variance CpGs genome-wide. Blocks contained an average of eight CpGs, spanned an average of 533 bp, and had higher relative enrichment in enhancer chromatin states (Additional file 1: Figure S5).
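A minimal sketch of the block search, assuming per-CpG variances and non-missing counts have already been computed; the requirement that a block span at least two consecutive CpGs is our illustrative reading of “blocks of consecutive CpGs,” not a stated parameter.

```python
# Sketch: scan CpGs in genomic order and collect runs of consecutive CpGs
# whose across-sample variance exceeds the third quartile and that have at
# least 20 non-missing values, following the thresholds in the text.

def high_variance_blocks(variances, n_nonmissing, min_samples=20):
    ranked = sorted(variances)
    q3 = ranked[int(0.75 * (len(ranked) - 1))]  # third-quartile cutoff
    blocks, current = [], []
    for i, (v, n) in enumerate(zip(variances, n_nonmissing)):
        if v > q3 and n >= min_samples:
            current.append(i)
        else:
            if len(current) > 1:  # illustrative: a block needs >= 2 CpGs
                blocks.append(current)
            current = []
    if len(current) > 1:
        blocks.append(current)
    return blocks

# Toy data: 12 CpGs, one isolated high-variance CpG and one 2-CpG run.
variances = [0.1] * 6 + [0.8, 0.9, 0.85] + [0.1, 0.9, 0.9]
n_nonmissing = [30] * 12
blocks = high_variance_blocks(variances, n_nonmissing)
```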
Feature construction for BoostMe and random forests
We used the same 648 features in the BoostMe and random forest algorithms (see Additional file 1: Table S3 for a detailed list). Prior to feature construction, we applied a further set of exclusion criteria to filter the CpGs included in training, validation, and testing. Only autosomal CpGs were used (n = 25,586,776). We overlapped the WGS data with the WGBS data from all samples and excluded CpGs for which the CG dinucleotide on either strand was disrupted by a SNP or by an indel of up to 2 bp (indels longer than 2 bp were not considered). We also excluded all CpGs located in ENCODE blacklist regions.
Features constructed from the WGBS data included the nearest non-missing neighboring CpG beta values, the distances to these CpGs, and the sample average feature. Neighboring CpG features were taken within the sample of interest. The sample average feature was created by taking the average beta value of all samples within each tissue at the CpG of interest, not including the sample being interrogated. Samples in which the CpG was not sequenced above 10× coverage were excluded from the calculation. CpGs without a measurement above 10× coverage from at least two additional samples were also excluded.
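Two of these WGBS-derived features can be sketched as follows; the positions, beta values, and function names are illustrative, not the study’s code, and None marks a missing (low-coverage) CpG.

```python
# Sketch of two features from the text: the nearest non-missing neighboring
# CpG beta (with its distance) within a sample, and the leave-one-out
# average beta across the other samples of the same tissue.

def upstream_neighbor(positions, betas, i):
    """Nearest non-missing CpG before index i: (beta, distance) or None."""
    for j in range(i - 1, -1, -1):
        if betas[j] is not None:
            return betas[j], positions[i] - positions[j]
    return None

def sample_average(tissue_betas, sample_idx, min_others=2):
    """Mean beta over the other samples of the tissue at one CpG, requiring
    at least min_others non-missing measurements (as in the text)."""
    others = [b for k, b in enumerate(tissue_betas)
              if k != sample_idx and b is not None]
    if len(others) < min_others:
        return None
    return sum(others) / len(others)

pos = [100, 150, 400]
betas = [0.9, None, 0.2]
nb = upstream_neighbor(pos, betas, 2)            # skips the missing CpG at 150
avg = sample_average([0.8, 0.6, None, 0.7], sample_idx=0)
```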
We constructed both general and tissue-specific genomic features. General genomic features were the same across all tissues and included GC content, recombination rate, GENCODE annotations, and CpG island (CGI) information. GC content data was downloaded from the raw data used to encode the gc5Base track on hg19 from the UCSC Genome Browser [64, 65]. DNA recombination rate annotations from HapMap were downloaded from the UCSC hg19 annotation database (http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/). CGI coordinates were obtained from the UCSC Genome Browser, and CGI shores and shelves were calculated from the CGI coordinates by taking 2 kb flanking regions. GENCODE v25 transcript annotations were downloaded from the GENCODE data portal (ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_25).
Tissue-specific genomic features included ATAC-seq, chromatin states, histone marks, and transcription factor binding sites (TFBS). These features were all binary, with 0 indicating that the CpG of interest did not overlap that feature and 1 indicating overlap. Chromatin state annotations were obtained from a previously published 13-chromatin-state model for 31 diverse tissues that included islets, skeletal muscle, and adipose. This model was generated from cell/tissue ChIP-seq data for H3K27ac, H3K27me3, H3K36me3, H3K4me1, and H3K4me3, and input from a diverse set of publicly available data [20, 66,67,68]. ATAC-seq data was obtained from previously published studies for islets, skeletal muscle, and adipose. TFBS data was obtained as previously described, with additional PWMs. TFBS data was filtered for each tissue by the ATAC-seq feature to only include hits overlapping an ATAC-seq peak. We merged hits from multiple motifs of the same transcription factor to reduce the number of variables included in the algorithm and optimize computational efficiency.
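Converting an annotation track into a binary overlap feature can be sketched with a binary search over sorted peak intervals; the peak coordinates here are illustrative.

```python
# Sketch: turn tissue annotation intervals (e.g. ATAC-seq peaks) into a
# 0/1 feature per CpG via binary search over sorted, non-overlapping
# intervals.

import bisect

def overlap_feature(cpg_pos, starts, ends):
    """1 if cpg_pos falls inside any [start, end) interval, else 0.
    starts/ends are parallel, sorted, non-overlapping interval lists."""
    i = bisect.bisect_right(starts, cpg_pos) - 1
    return 1 if i >= 0 and cpg_pos < ends[i] else 0

peak_starts = [1000, 5000, 9000]
peak_ends = [1500, 5200, 9600]
features = [overlap_feature(p, peak_starts, peak_ends)
            for p in (1200, 3000, 9599)]
```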
BoostMe and random forests implementation
For BoostMe, we used the xgboost package (version 0.6-4) in R (version 3.3.1). For random forests, we used the ranger package (version 0.6.0) in R, which facilitates random forest training and testing on multiple CPUs. For both algorithms, we used regression trees to predict a continuous methylation value between 0 and 1 for CpGs of interest. Algorithms were trained on individual samples within each tissue and disease state combination. We trained only on CpGs with at least 10× and no more than 80× coverage. Random forest variable importance was calculated as the mean decrease in variance at each split, as implemented in the ranger package. BoostMe variable importance was evaluated for each variable as the loss reduction after each split using that variable, as implemented in the xgboost package.
Due to the computational constraints of decision tree algorithms, we replicated a previously described form of repeated random subsampling validation to evaluate both BoostMe and random forests. Briefly, within each sample, algorithms were trained on a random sample of 1,000,000 CpGs and validated on a hold-out set of 1,000,000 CpGs. Final performance benchmarking was done on another held-out test set of 1,000,000 CpGs (Table 2). This process of random sampling, algorithm training, and benchmarking was repeated with ten different random seeds, and prediction performance was calculated by averaging the performance statistics across the ten resulting models. Hyperparameters tuned for random forests included the number of trees grown and the number of variables considered for splitting at each node; however, similar to previous work, we found that performance was robust to different settings and thus did not tune these parameters further.
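The subsampling scheme can be sketched as below; a simple mean predictor stands in for the actual BoostMe/random forest models so the sketch stays self-contained, and the split sizes are scaled down from the 1,000,000 CpGs per split used in the study.

```python
# Sketch of repeated random subsampling validation: within one sample, draw
# disjoint train/validation/test subsets of CpGs, fit a model, and average
# the test RMSE over several random seeds.

import random

def rmse(truth, preds):
    return (sum((t - p) ** 2 for t, p in zip(truth, preds)) / len(truth)) ** 0.5

def one_round(betas, n_per_split, seed):
    rng = random.Random(seed)
    idx = list(range(len(betas)))
    rng.shuffle(idx)
    train = [betas[i] for i in idx[:n_per_split]]
    test = [betas[i] for i in idx[2 * n_per_split:3 * n_per_split]]
    mean_pred = sum(train) / len(train)  # stand-in for the trained model
    return rmse(test, [mean_pred] * len(test))

betas = [i / 299 for i in range(300)]    # toy methylation values in [0, 1]
scores = [one_round(betas, n_per_split=50, seed=s) for s in range(10)]
avg_rmse = sum(scores) / len(scores)     # averaged across the ten seeds
```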
For BoostMe, we used the mlr package (version 2.9) in R to perform a hyperparameter grid search to tune the following xgboost package parameters: learning rate (eta), minimum loss reduction required to make a further partition on a leaf node of the tree (gamma), maximum depth of a tree (max_depth), minimum sum of instance weights needed in a child node (min_child_weight), subsample ratio of the training instances (subsample), subsample ratio of columns when constructing each tree (colsample_bytree), L2 regularization term (lambda), and L1 regularization term (alpha). We found that the optimal hyperparameter combination varied across samples; however, the improvements in RMSE were only marginal relative to using all default settings for each sample. Additionally, given the number of hyperparameters available to tune, an exhaustive search of the hyperparameter space was computationally expensive. Therefore, we used all default settings as described in the xgboost package to benchmark BoostMe performance.
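The shape of such a grid search can be sketched by enumerating candidate settings for the parameters named above; the candidate values here are illustrative (the study performed the search with mlr in R and ultimately kept the xgboost defaults).

```python
# Sketch: enumerate a hyperparameter grid over the xgboost parameters named
# in the text. Candidate values are illustrative, not the study's grid.

from itertools import product

grid = {
    "eta": [0.05, 0.1, 0.3],
    "gamma": [0, 1],
    "max_depth": [4, 6],
    "min_child_weight": [1, 5],
    "subsample": [0.8, 1.0],
    "colsample_bytree": [0.8, 1.0],
    "lambda": [1],
    "alpha": [0],
}

# One dict per combination; each would be evaluated on the validation set.
combos = [dict(zip(grid, values)) for values in product(*grid.values())]
n_combos = len(combos)  # 3*2*2*2*2*2*1*1 = 96 settings
```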
We implemented DeepCpG (version 1.0.4) as previously described. Briefly, for each of the five tissue and T2D status combinations (adipose NGT, adipose T2D, muscle NGT, muscle T2D, and islet), the data was first divided by chromosome into training (chr. 1, 3, 5, 7, 9, 11, 13, 15), validation (chr. 16, 17, 18, 19, 20, 21, 22), and test sets, corresponding to a rough 40–20–40 split. The DNA module and CpG module were trained on separate NVIDIA Tesla K80 GPUs, and the performance of each module was evaluated individually on the test set. The joint module was trained with the best-performing DNA and CpG modules, and its predictions were used for final benchmarking. In contrast to the original single-cell bisulfite implementation of DeepCpG, which was trained and tested on binary methylation values, we trained and tested on continuous methylation values to parallel our implementation of BoostMe and random forests. We found that this change made no difference in the accuracy of the model (Additional file 1: Table S4).
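The chromosome-level split can be sketched as:

```python
# Sketch of the split from the text: odd autosomes 1-15 for training,
# chromosomes 16-22 for validation, remaining autosomes for testing,
# giving a rough 40-20-40 division of CpGs.

TRAIN = {f"chr{i}" for i in (1, 3, 5, 7, 9, 11, 13, 15)}
VALID = {f"chr{i}" for i in range(16, 23)}

def split_of(chrom):
    if chrom in TRAIN:
        return "train"
    if chrom in VALID:
        return "valid"
    return "test"

assignments = [split_of(c) for c in ("chr1", "chr2", "chr17")]
```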
We experimented with six different hyperparameter combinations for each DNA model, including three architectures (CnnL2h128, CnnL2h256, CnnL3h256) and two dropout rates (0, 0.2). We then selected the best-performing combination based on AUC and reported the motifs significantly matching the filters from the first convolutional layer of that model. Similarly, we tested both RnnL1 and RnnL2 for the CpG model for each tissue. For the joint module, we tested JointL1h512, JointL2h512, and JointL3h512. The best-performing joint model was selected to evaluate RMSE, AUROC, AUPRC, and accuracy for each tissue. We used a default learning rate of 0.001 for all models. Similar to random forests and BoostMe, performance was generally robust with respect to different architectures. For a detailed explanation of all model architectures, see http://deepcpg.readthedocs.io/.