Participants in this analysis include 1062 mother-offspring pairs from a substudy of the Norwegian Mother and Child Cohort Study (MoBa) [37–39]. In a previous study with this cohort, individual CpG sites in newborns were tested for differential methylation in relation to maternal smoking . This dataset is referred to as MoBa1 and was used as the discovery cohort. We subsequently measured DNA methylation in an additional 685 newborns. This dataset is referred to as MoBa2 and was used as the replication cohort. The study has been approved by the Regional Committee for Ethics in Medical Research, the Norwegian Data Inspectorate and the Institutional Review Board of the National Institute of Environmental Health Sciences, USA, and written informed consent was provided by all mothers participating.
Covariates and cotinine measurements
Information on maternal age, parity, and maternal education was collected from questionnaires completed by the mother or from birth registry records. Maternal age was included as a continuous variable. Parity was categorized as 0, 1, 2, or >=3 births. Maternal educational level was categorized as previously described Joubert et al. , indicative of less than high school/secondary school, high school/secondary school completion, some college or university, and 4 years of college/university or more. Maternal smoking during pregnancy (none, stopped before 18 weeks of pregnancy, smoked past 18 weeks of pregnancy) was assessed by maternal questionnaire and verified with maternal plasma cotinine measured by liquid chromatography - tandem mass spectrometry at approximately 18 weeks gestation .
For MoBa1, cotinine, a quantitative biomarker of smoking, was measured in maternal plasma and was analyzed as a continuous variable. No cotinine was detected in 736 participants, and of the participants with detectable cotinine levels (N = 326) the mean cotinine level was 191 (SE = 11). For MoBa2, cotinine measurements were not available for most mothers. Therefore, a three-category variable based on the mother’s report of smoking during pregnancy was created and supported using cotinine measurements when available (N = 221 MoBa2 participants had cotinine data). The three categories represented no smoking (N = 512), stopped during pregnancy (N = 103), or smoked throughout pregnancy (N = 70).
Details of the DNA methylation measurements and quality control for the MoBa1 participants were previously described  and the same reagents, platforms and protocols were used for the MoBa2 participants. All biological material was obtained from the Biobank of the MoBa study . Briefly, DNA was extracted from umbilical cord whole blood samples . Bisulfite conversion was performed using the EZ-96 DNA Methylation kit (Zymo Research Corporation, Irvine, CA) and DNA methylation was measured at 485,577 CpGs in cord blood using Illumina’s Infinium HumanMethylation450 BeadChip [41, 42]. The package minfi in R was used to calculate the methylation level at each CpG as the beta-value (β = intensity of the methylated allele (M)/(intensity of the unmethylated allele (U) + intensity of the methylated allele (M) + 100)) from the raw intensity (idat) files [43, 44].
Probe and sample-specific quality control filtering was performed separately in MoBa1 and MoBa2 datasets. Control probes (N = 65) and probes on X (N = 11,230) and Y (N = 416) chromosomes were excluded in both datasets. Remaining CpGs missing >10% of methylation data were also removed (N = 20 in MoBa1, none in MoBa2). Samples indicated by Illumina to have failed or have an average detection p-value across all probes < 0.05 (N = 49 MoBa1, N = 35 MoBa2) and samples with gender mismatches (N = 13 MoBa1, N = 8 MoBa2) were also removed. For each dataset, we accounted for the two different probe designs by applying the intra-array normalization strategy Beta Mixture Quantile dilation (BMIQ) .
The gPCA program was used to determine the presence of batch effects, using plate to represent batch and ComBat was applied for batch correction using the SVA package in R for both MoBa 1 and MoBa 2 cohorts [44, 46–48]. A total of 473,772 markers remained after data processing, and 365,860 of these markers mapped to at least one of 21,231 genes using Illumina provided annotation based on human reference genome [NCBI build 37].
All analysis was conducted in the statistical programming language, R . Initially, potential clinical and demographic variables: maternal age, newborn gender, education, asthma, folate, and parity were evaluated as potential covariates prior to association analysis. Each potential covariate was tested for association with maternal cotinine using linear least squares regression, with categorical variables dummy encoded in the model(s). Two-sided p-values from each regression analysis were recorded, and a False Discovery Rate (FDR) correction for multiple comparisons was applied to limit false positives. Covariates with an FDR-adjusted q value < 0.1 were included in subsequent models . In addition, cell type fractions (CD8T, CD4T, natural killer cell, B cell, monocyte, granulocyte) for each subject were calculated using the reference-based Houseman method in the minfi package in R [43, 44, 50], and these fractions were forced as covariates into subsequent models. The same selection criteria was used for both the discovery and replication dataset. The only resulting covariate was maternal education for MoBa1 (q < 0.1), and maternal age, education, folate, and parity were selected as covariates for MoBa2 (q < 0.1).
Univariate association analysis
Statistical tests for the association of each CpG marker and maternal plasma cotinine levels (continuous) were performed using linear least-squares regression for the MoBa1 cohort. Significant covariates and cell type fractions were included in the model to reduce confounding. All CpG p values, on the -log10 scale, were plotted according to genomic sequence in a Manhattan plot (Fig. 1).
Gene score calculation
To perform gene-level association analysis, CpG markers were collapsed by gene using the Illumina provided annotation based on human reference genome [NCBI build 37]. For each gene, the CpG data was combined into a gene-level p value using the Sequence Kernel Association Test (SKAT) software implemented in R [12, 13]. The SKAT null model for MoBa1 was created using significantly associated covariates: maternal education (q < 0.1), and cell type fractions (CD8T, CD4T, natural killer cell, B cell, monocyte, granulocyte). The same modeling strategy was implemented for the SKAT null model for MoBa2 and included significantly associated covariates and the cell type fractions. The SKAT model was then run using an unweighted, linear kernel with the ‘is_check_genotype’ flag set to FALSE. In order to account for the underlying correlation structure for the p value gene scores, the SKAT null model was created with the cotinine values and covariate values randomly shuffled, and then SKAT was run on the residuals until 1000 permuted gene scores were created. To control for multiple comparisons, we report gene scores with a FDR q < 0.25 as being associated with cotinine levels.
The results from the SKAT gene-level association analysis (specifically p-values) were used for pathway-level analysis. Genes were grouped into a priori pathways (gene sets) using the Molecular Signatures Database v4.0 (MSigDB) . MSigDB contains gene sets from a collection of popular resources such as Gene Ontology (GO) and the Kyoto Encyclopedia of Genes and Genomes (KEGG) . A subset of pathways was selected for analysis based on a set of four criteria: 1) the pathway must be composed of a set of genes from Homo sapiens, 2) the number of genes in a pathway cannot exceed 250 genes, 3) at least one gene in the pathway must be present in the list of available gene scores, and 4) pathways representing positional gene sets (C1), motif gene sets (C3), and computational derived gene sets (C4) were excluded. This resulted in a total of 5836 pathways for analysis. These pathways came from the either curated gene sets (C2), GO gene sets (C5), oncogenic signatures gene sets (C6), or the immunologic signatures gene sets (C7) collections in MSigDB . Each pathway consists of a set of genes that are considered biologically relevant to a given biological function or signaling network, and individual genes are often represented in multiple pathways.
The pathway-level score was calculated from the individual gene scores that overlapped with the genes in each pathway gene set. The pathway level score is the combined p-value across all gene-level results from the SKAT analysis. There are a number of approaches for combining p-values, but most assume that the individual p-values are not correlated. Pathway analysis actually relies on the fact that genes scores within a pathway are correlated, so a collapsing approach that explicitly takes that into account was used. More specifically, the individual gene scores were combined into pathway-level scores using the correlated Lancaster method in Dai et al. (TA) . This resulted in a final p-value for each pathway from MSigDB. It is important to note that this combined p-value represents a self-contained pathway analysis, where the null hypothesis is that gene sets are not more strongly associated than expected by chance. Because of the large number of pathways tested, we controlled for multiple comparisons using a conservative Bonferroni correction. We chose a conservative approach, even though the p-values from each pathway are not independent, since genes appear in multiple pathways. Pathways with a corrected p < .05 (n = 5836; p < 8.6 × 10−6) were considered statistically significant in the discovery cohort.
The statistically significant pathways (p < 8.6 × 10−6) were tested for replication using MoBa2. The CpG values were combined for genes that occurred in significant pathways in MoBa1, using SKAT as described above. Gene scores were then combined using the Lancaster approach to calculate a pathway-level score for the replication cohort. Pathways p values were adjusted using both an FDR and a more conservative Bonferroni approach and were considered to be successfully replicated with an FDR q < 0.05 (Table 2). Pathway analyses are commonly divided into self-contained or competitive approaches. Here we use a self-contained, global null approach to pathway analysis. An advantage of this approach is that it lends itself toward replication in smaller cohorts because only genes in significant pathways from the discovery cohort need to be tested for replication. Competitive pathway analysis methods test a different null hypothesis, and subsequently require all genes to be tested, which can make replication with smaller cohorts unfeasible.
Pathway hierarchical clustering
Hierarchical clustering was performed using R and the ‘APE’ package [44, 52]. All unique genes within replicated pathways (q < .05) were tabulated. All gene-pathway combinations were recorded as either a “1” if the pathway contained the gene or a “0” if the pathway did not contain the gene. Clustering was then performed using Euclidean distance and Ward’s method. The resulting dendrogram (Fig. 3) was then cut and colored so that six groups were defined based on gene set similarity.