Study population and phenotyping
MoBa is an ongoing nationwide pregnancy cohort study [11]. Participants in MoBa were enrolled in the study (1999–2008) from 50 of the 52 hospitals in Norway, and they are predominantly of Caucasian ancestry. Trained nurses at the hospitals measured the children’s birth weight. The genotypes in the MoBa dataset were generated from randomly selected umbilical-cord blood DNA samples (N=11,490) from the MoBa biobank [12]. Children were excluded if they were stillborn, deceased, or twins, or if they had missing data in the Medical Birth Registry of Norway (MBRN).
Materials, genotyping platform, and imputation
A total of 11,490 mother–father–newborn trios in the MoBa dataset were genotyped using the Illumina HumanCoreExome BeadChip (Illumina, San Diego, CA, USA), which contains more than 240,000 probes. Visual checks of principal components (PCs) were performed to remove ethnic outliers. For imputation, we used the Haplotype Reference Consortium (HRC) reference panel version HRC.r1.1 (http://www.haplotype-reference-consortium.org/) and the free genotype imputation and phasing service of the Sanger Imputation Server (https://imputation.sanger.ac.uk/). For fast and accurate phasing, the Sanger server uses the Positional Burrows–Wheeler Transform (PBWT) to index multiple sequence alignments across different genomes [13, 14]. We checked the results of the imputation for consistency by hard-calling markers with an INFO quality score above 0.7. Additionally, we checked for Mendelian inconsistencies, excess heterozygosity, deviations from Hardy–Weinberg equilibrium (HWE), and high rates of missingness to ensure that no major technical errors were introduced in the pre-phasing and imputation steps. A total of 7,947,894 SNPs met the following criteria and were included in the current analyses: call rate ≥98%, minor allele frequency (MAF) ≥1%, and HWE test P ≥1 × 10⁻⁴. Samples with a call rate ≤98% or with excess heterozygosity ≥4 SD were excluded.
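The marker-level QC thresholds above can be expressed as a simple filter. The sketch below is purely illustrative (the record fields and example SNPs are hypothetical); in practice such filtering is done with dedicated tools rather than ad hoc scripts:

```python
# Illustrative SNP-level QC filter mirroring the thresholds in the text:
# call rate >= 98%, MAF >= 1%, HWE test P >= 1e-4.
# The record fields and SNP values are hypothetical, not MoBa data.

def passes_qc(snp, min_call_rate=0.98, min_maf=0.01, min_hwe_p=1e-4):
    # MAF is the frequency of the less common allele.
    maf = min(snp["allele_freq"], 1.0 - snp["allele_freq"])
    return (snp["call_rate"] >= min_call_rate
            and maf >= min_maf
            and snp["hwe_p"] >= min_hwe_p)

snps = [
    {"id": "rs1", "call_rate": 0.995, "allele_freq": 0.30, "hwe_p": 0.52},
    {"id": "rs2", "call_rate": 0.970, "allele_freq": 0.25, "hwe_p": 0.40},   # low call rate
    {"id": "rs3", "call_rate": 0.999, "allele_freq": 0.995, "hwe_p": 0.81},  # MAF < 1%
    {"id": "rs4", "call_rate": 0.990, "allele_freq": 0.10, "hwe_p": 1e-6},   # HWE failure
]

kept = [s["id"] for s in snps if passes_qc(s)]  # only rs1 survives all three filters
```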
Comparison with other studies
The MoBa dataset used here was included in the largest GWAMA of BW to date by Warrington et al. (2019) [1]. Our results are therefore not independent of the findings reported in that GWAMA. To perform an independent and unbiased comparison, we cross-checked our findings against those of the next largest GWAMA of BW that did not include the MoBa dataset, which is the study by Horikoshi et al. [15]. Horikoshi and colleagues identified 60 genome-wide significant loci in a multi-ancestry sample comprising 153,781 genotyped individuals [15]. In terms of sample size, the MoBa dataset used here is approximately 26 times smaller than the Warrington et al. study [1], 18 times smaller than the Horikoshi et al. study [15], and 10 times smaller than another published GWAMA of BW from 2013 [16]. For further validation, we used the MoBa dataset to compare the performance of the modified WaveQTL against the standard methodology used by Horikoshi and colleagues [15].
Application of the modified WaveQTL
The original WaveQTL by Shim and Stephens [9] tests for association between an individual function and a covariate of interest using wavelets. Below we provide a brief description of wavelets and the modeling used in WaveQTL. We then provide details of our modified version of WaveQTL.
Wavelets and WaveQTL modeling
WaveQTL aims at identifying associations between a population of functions and a univariate phenotype Φ measured once per function. WaveQTL tests for association between the functions and the trait by testing for association between the wavelet-transformed functions and the trait. Wavelets are mathematical functions used to conduct a Fourier-like transform. There are different types of wavelets [17], and, for the sake of simplicity, we present here only the most straightforward type of wavelet – the Haar wavelet. Like the Fourier transform, the wavelet transform represents a function as a set of coefficients. The wavelet transform of a function on a given interval is computed via local integrals of the function. These integrals, called wavelet coefficients, are computed for regions of decreasing size, halving at each step. Each wavelet coefficient is indexed by a pair (s, l), where the first index, s, corresponds to the scale or level of resolution in Fig. 1, and the second index, l, corresponds to the location. We refer the reader to the textbook by Nason [18] for a more comprehensive introduction to wavelets and their applications in R.
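The Haar decomposition described above can be made concrete with a few lines of code. The sketch below uses unnormalized pairwise averages and differences for readability (WaveQTL itself uses a standard orthonormal wavelet transform), and illustrates the (s, l) indexing:

```python
# Haar wavelet decomposition of a length-2^J signal, illustrating the (s, l)
# indexing in the text: detail coefficients are local differences, computed
# over regions that double in size as the scale coarsens.
# Unnormalized averaging/differencing filters are used here for readability;
# this is a teaching sketch, not WaveQTL's orthonormal transform.

def haar_decompose(signal):
    coeffs = {}            # maps (scale, location) -> detail coefficient
    approx = list(signal)
    scale = 0              # scale 1 = finest level in this sketch
    while len(approx) > 1:
        scale += 1
        details = [(approx[2 * l] - approx[2 * l + 1]) / 2 for l in range(len(approx) // 2)]
        approx  = [(approx[2 * l] + approx[2 * l + 1]) / 2 for l in range(len(approx) // 2)]
        for l, d in enumerate(details):
            coeffs[(scale, l)] = d
    coeffs[(0, 0)] = approx[0]   # the last approximation is the overall mean
    return coeffs

c = haar_decompose([4, 2, 6, 6, 5, 3, 4, 0])
# c[(0, 0)] recovers the global mean of the signal (3.75 here);
# finer scales localize differences within ever-smaller regions.
```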
In essence, WaveQTL tests for association between a population of functions and the trait using a hierarchical Bayesian model that tests, for each wavelet coefficient, whether the trait is associated with that coefficient, with prior probability \(p(M_{0_{s,l}}) = 1- \pi _{s}\) on the null model:
$$\begin{array}{*{20}l} &M_{0}: \tilde{G}_{{sl}} = \beta_{sl,0} +\beta_{sl,C}C+\epsilon \\ &M_{1}: \tilde{G}_{{sl}} = \beta_{sl,0} + \beta_{sl,1}\Phi +\beta_{sl,C}C+\epsilon \end{array} $$
(1)
where \(\tilde{G}_{sl}\) is the wavelet coefficient at scale s and location l based on genotype, Φ is the phenotype of interest (here, BW), and C is a confounder. The coefficient \(\beta_{sl,1}\) can be interpreted as the effect of the phenotype on the wavelet coefficient (s, l). Additionally, π is a vector of length S, with S being the highest level of resolution. Each component of π, \(\pi_{s}\), represents the proportion of wavelet coefficients at scale s associated with Φ.
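The M0-versus-M1 comparison for a single wavelet coefficient can be sketched numerically. The code below fits both models by least squares and approximates the Bayes factor with BIC; this is only an illustration of the model-comparison structure, not WaveQTL's exact Bayes factor computation under conjugate priors, and the toy data are invented:

```python
# Illustrative M0-vs-M1 comparison for one wavelet coefficient: regress the
# coefficient on the phenotype (M1) or on an intercept alone (M0), then
# approximate log BF(M1 vs M0) by -(BIC_1 - BIC_0)/2.
# WaveQTL computes exact Bayes factors under conjugate priors; this BIC
# sketch only conveys the structure of the per-coefficient test.
import math

def ols_rss(y, xs):
    # Residual sum of squares from simple linear regression of y on xs.
    n = len(y)
    mx, my = sum(xs) / n, sum(y) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (yy - my) for x, yy in zip(xs, y))
    b = sxy / sxx
    a = my - b * mx
    return sum((yy - (a + b * x)) ** 2 for x, yy in zip(xs, y))

def log_bf_bic(wavelet_coef, phenotype):
    n = len(wavelet_coef)
    mean = sum(wavelet_coef) / n
    rss0 = sum((g - mean) ** 2 for g in wavelet_coef)   # M0: intercept only
    rss1 = ols_rss(wavelet_coef, phenotype)             # M1: intercept + phenotype
    bic0 = n * math.log(rss0 / n) + 1 * math.log(n)
    bic1 = n * math.log(rss1 / n) + 2 * math.log(n)
    return (bic0 - bic1) / 2

# Toy data: the coefficient tracks the phenotype closely, so M1 is favored.
phen = [0.1, 0.5, 0.9, 1.4, 1.9, 2.3, 2.7, 3.2]
coef = [0.2, 0.6, 1.0, 1.3, 2.0, 2.2, 2.8, 3.1]
lbf = log_bf_bic(coef, phen)   # strongly positive: evidence for association
```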
WaveQTL tests for association between the functions and the phenotype by testing the following hypothesis:
$$ H_{0}:\pi =(0,\dots,0) \enspace vs \enspace H_{1}:\exists s \in \{1,\dots,S\},\; \pi_{s} \ne 0 $$
(2)
The significance of π is assessed using the following likelihood ratio:
$$ \Lambda (\pi, \tilde{G}, \Phi) = \frac{p(\tilde{G} \mid \pi, \Phi) }{p(\tilde{G} \mid \pi\equiv 0, \Phi)} $$
(3)
For additional details on how to assess the significance of \(\Lambda(\pi, \tilde{G}, \Phi)\), we refer the reader to the original paper by Shim and Stephens [9]. For a fast computation of the p-value, we refer the reader to our recent paper [19].
Main run of the modified WaveQTL
Here, we treat each individual genotype as a “signal” and BW as the univariate continuous phenotype. For every screened region, we transform the individual genotype into wavelet coefficients and test for association with BW using the WaveQTL framework. An association is identified if the likelihood ratio p-value of a region is below the Bonferroni threshold.
In our adaptation of WaveQTL to enable the current GWAS, we used a sliding-window approach to sequentially screen the entire genome for associations. WaveQTL (and, by extension, the modified WaveQTL) is fast and reduces the number of tests to be performed by using overlapping windows of 1 Mb in length. By employing an alternative modeling to the single-SNP linear regression, WaveQTL enhances the detection of associations that are potentially missed by conventional GWAS methodology. The original software implementation of WaveQTL is available at https://github.com/heejungshim/WaveQTL. The modified WaveQTL is distributed as an R package on GitHub under the name mWaveQTL. The R package of the modified WaveQTL includes the zooming strategy (https://github.com/william-denault/mWaveQTL) and a comprehensive example of a typical run.
The user has to specify four parameters to run a GWAS using the modified WaveQTL: i) the region size, ii) the maximum distance between two consecutive SNPs, iii) the level of resolution, and iv) the prior standard deviation for the wavelet effect size. We recommend using half-overlapping regions of 1 Mb as the “region size” parameter and a maximum distance between two consecutive SNPs of 10 kb as the “maximum distance” parameter. Following the recommendations of Zhou and Guan [20], we set the prior standard deviation to \(\frac{0.2}{\sqrt{n}}\), where n is the number of samples. Based on these criteria, we defined 5,170 regions spanning the entire genome. In addition, the user needs to choose the depth of the analysis. In Fig. 1, the y-axis shows how the results of the modified WaveQTL differ by depth of analysis. As a rule of thumb, we aim for roughly ten SNPs per wavelet coefficient when choosing the “level of resolution” parameter; accordingly, we set the depth of our analysis to nine. It is important to select an appropriate depth of analysis, because an analysis with insufficient depth might overlook some loci. For example, we observed that most associated regions in the current analysis corresponded to a level of resolution of five or above. These results suggest that using a depth of analysis of four or less would have resulted in not detecting most of the loci.
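The recommended region-size and maximum-distance settings can be sketched as follows. The positions below are hypothetical base-pair coordinates, and the code is an illustration of the windowing logic, not the mWaveQTL implementation:

```python
# Illustrative construction of half-overlapping 1 Mb windows along a
# chromosome, breaking runs of SNPs wherever two consecutive SNPs lie
# more than 10 kb apart, per the recommended settings in the text.
# SNP positions are hypothetical.

REGION_SIZE = 1_000_000   # "region size": 1 Mb windows
MAX_GAP = 10_000          # "maximum distance" between consecutive SNPs: 10 kb

def split_on_gaps(positions, max_gap=MAX_GAP):
    # Break a sorted list of SNP positions into runs with no gap > max_gap.
    runs, current = [], [positions[0]]
    for p in positions[1:]:
        if p - current[-1] > max_gap:
            runs.append(current)
            current = [p]
        else:
            current.append(p)
    runs.append(current)
    return runs

def half_overlapping_windows(start, end, size=REGION_SIZE):
    # Windows of `size`, each shifted by size/2 relative to the previous one.
    step = size // 2
    return [(s, s + size) for s in range(start, max(start + 1, end - step), step)]

# A dense run of SNPs followed by a > 10 kb gap splits into two runs:
runs = split_on_gaps([100, 5_000, 9_000, 25_000, 30_000])
windows = half_overlapping_windows(0, 2_000_000)
```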
Zooming strategy
One of the main drawbacks of the original WaveQTL is that the sliding-window size is not easily adjustable. If the window is too wide, the signal may be lost in the optimization step due to the large background noise. To overcome this, we developed a “zooming strategy”. As the wavelet coefficients generally remain the same except at the lowest levels, a sub-region can be analyzed using the Bayes factors computed using a larger window size. We implemented the following procedure:
- Detect all the regions that have a Bayes factor above a given threshold (here set to 1).
- For each selected region, extract the sub-region that contains all the Bayes factors above the set threshold.
- Refit the optimization process of Shim and Stephens (2015) [9] on this sub-region to estimate the p-value.
A sub-region was considered statistically significant if the associated p-value was smaller than \(\frac{0.05}{n_{\text{region}}}\times \frac{\text{size}_{\text{sub-region}}}{\text{size}_{\text{region}}}\), where \(n_{\text{region}}\) is the number of regions initially analyzed (here 5,170). This significance criterion corresponds to the multiple-testing correction for a genome-wide screening based on using regions of the size of the considered sub-region.
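The sub-region extraction and the size-scaled significance cutoff can be sketched as follows. The Bayes factors, positions, and region sizes below are illustrative; the actual mWaveQTL implementation refits the full optimization on the sub-region rather than this simplified extraction:

```python
# Sketch of the zooming strategy's selection and thresholding steps:
# (1) keep positions whose Bayes factor exceeds the threshold,
# (2) take the smallest interval containing them, and
# (3) shrink the genome-wide cutoff in proportion to the sub-region's
#     relative size, as in the formula in the text.
# All numeric values here are illustrative, not study results.

BF_THRESHOLD = 1.0
N_REGIONS = 5170           # number of regions screened genome-wide

def extract_subregion(positions, bayes_factors, threshold=BF_THRESHOLD):
    # Smallest interval covering every position whose BF exceeds threshold.
    hits = [p for p, bf in zip(positions, bayes_factors) if bf > threshold]
    if not hits:
        return None
    return (min(hits), max(hits))

def significance_cutoff(size_subregion, size_region, n_regions=N_REGIONS):
    # 0.05 / n_region * (size_sub-region / size_region)
    return 0.05 / n_regions * (size_subregion / size_region)

pos = [10_000, 200_000, 400_000, 600_000, 800_000]
bfs = [0.2, 3.1, 8.4, 2.5, 0.7]
sub = extract_subregion(pos, bfs)                     # middle three SNPs exceed BF = 1
cutoff = significance_cutoff(sub[1] - sub[0], 1_000_000)
```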