### Read depth in RRBS data follows a negative binomial distribution, while the level of DNAm is bimodally distributed

As part of an ongoing study of aging, we profiled DNAm in 125 frontal cortex samples dissected from mice aged 2–10 months old using the original RRBS protocol [20] (see Methods). Prior to quality control filtering, a mean of 41,199,876 (SD = 6,753,486) single end reads were generated per sample (Additional file 2). The quality of the sequencing data was assessed using *FastQC* [31], before reads were aligned to the *mm10* reference (GRCm38) genome using *Bismark* [17]. Here, we define DNAm sites as vectors, such that each DNAm site has a DNAm point per sample, containing read depth and DNAm values. That is, DNAm site = {DNAm point_{1} = {m_{1}, rd_{1}}, …, DNAm point_{i} = {m_{i}, rd_{i}}, …, DNAm point_{n} = {m_{n}, rd_{n}}}, for i in 1 to n samples, where m_{i} represents the proportion of DNAm at a DNAm point_{i}, and rd_{i} is the read depth, defined here as the total number of reads at the DNAm point. If rd_{i} is 0, there will be no DNAm point associated with sample i (Fig. 1). Across all samples, there was a total of 64,199,621 distinct DNAm points covered (including CpG, CpH and CHH sites), with a total of 3,419,677 different DNAm sites assayed, and each sample containing a mean of 2,170,454 (SD = 124,281) DNAm points across all DNAm sites. We characterized the distribution of read depth for each sample across DNAm points, observing a unimodal discrete distribution, skewed to the left and characterized by a long tail (Fig. 2a). This distribution is typical of count data and is expected in sequencing datasets where the vast majority of DNAm points are covered by relatively few reads and a minority of DNAm points are covered by a large number of reads. Across all DNAm points, 22.1% (60,117,549) had less than or equal to than 5 reads and 3.30% (8,941,868) had more than 100 reads. Next, we visualized the distribution of DNAm levels across all DNAm points, observing the expected bimodal distribution, with the majority of DNAm sites being either completely methylated (50% of DNAm sites > 0.95) or unmethylated (49% of DNAm sites < 0.05) [32] (Fig. 2b).

### Read depth has a dramatic, non-linear effect on accuracy of DNAm estimates

One consequence of low read depth in RRBS data is reduced accuracy for the quantification of DNAm at DNAm points. While DNAm points that are either completely methylated or unmethylated can theoretically be characterized precisely with a single read, this is not the case for DNAm points with intermediate levels of DNAm, which may be inaccurately classed as methylated or unmethylated at low read depths. To understand the extent of this problem, we compared the proportion of DNAm values at extremes (less than 0.05 or greater than 0.95), with increasing read depths across DNAm points (Fig. 3A). As expected, the proportion of DNAm sites estimated to have extreme levels of DNAm was greater at lower read depths; 86.1% (SD = 4.94) of sites were estimated to have DNAm > 0.95 or < 0.05 at a read depth of 5, compared to 64.7% (SD = 6.90) at a read depth of 50. This suggests that, compared to DNAm points with a read depth of 50, more than 20% of DNAm points with a read depth of 5 may have been inaccurately classified as having an extreme level of DNAm.

To formally quantify the error in estimating DNAm, we used simulations of increasing read depth to estimate DNAm for a hypothetical DNAm site with an intermediate level of DNAm (0.50), calculating the difference between the estimated and true DNAm level. For read depths < 10, we observed a discrete distribution of estimated DNAm (Fig. 3B), with the range of predictions spanning 0.00–1.00 but centered on 0.50. In line with the Central Limit Theorem, we observe that as read depth increases, the distribution of estimated DNAm levels becomes more continuous and normally distributed around a DNAm value of 0.50. We expanded these simulations to consider DNAm sites with DNAm levels across the full distribution of possible values. We simulated 10,000 DNAm points with DNAm uniformly sampled between 0.00–1.00 and sampled 10,000 RRBS DNAm points with matched DNAm levels for comparison (see Methods). We found that as read depth increases, the correlation across DNAm points between estimated and actual DNAm level tends towards 1.00 (Fig. 3Ci) and the root mean squared error (RMSE) tends towards 0.00 (Fig. 3Cii). However, these effects are non-linear, with more dramatic improvements in accuracy occurring at lower read depths; i.e. there is a jump from a correlation of 0.589 to 0.926 between 1 and 10 reads with relatively minimal gains after that. Similarly, the RMSE drops from 0.404 at a read depth of 1.00 to 0.124 at a read depth of 10.

### RRBS and Illumina arrays DNAm values correlate highly

Commercial DNAm arrays, such as the Illumina EPIC BeadChip array, are commonly utilized as an alternative strategy to bisulfite sequencing approaches in large human studies, due to their relatively low cost and the ease of interpreting data [33]. To further characterize the accuracy and sensitivity of RRBS, we performed a comparison with DNAm levels quantified using a novel Illumina Beadchip vertebrate DNAm array [30] on an overlapping set of 80 mouse frontal cortex DNA samples. A total of 3552 unique DNAm sites were quantified in both the RRBS and array datasets, with each RRBS sample containing a mean of 2263 overlapping DNAm data points (SD = 104). First, we compared the distribution of DNAm estimates across all DNAm points between the two technologies, observing the expected bimodal distribution with both approaches (Supplementary Figure 1). Of note, the array data contains a higher proportion of DNAm sites with intermediate levels of DNAm (0.05–0.95), and the unmethylated and methylated peaks are shifted inwards from the boundaries, highlighting the reduced sensitivity of the array for quantifying extreme levels of DNAm [16]. In contrast, the peaks in the RRBS data are at 0.00 and 1.00. The array samples also have less variability between samples, with distributions looking nearly identical, due to DNAm points being consistently characterized for each DNAm site. Directly comparing the estimated level of DNAm between the two assays, we observed a strong positive correlation (Pearson correlation = 0.794) even with no read depth filtering in the RRBS data (Fig. 3D). The correlation between assays increases as more stringent read depth filtering is applied to the RRBS data, with the maximum correlation (Pearson correlation = 0.840) obtained at a read depth threshold of 22 (Fig. 3E, Fi). Although this correlation indicates a relatively strong relationship between the estimates of DNAm quantified using RRBS and the Illumina array, it does not necessarily indicate that the DNAm estimates generated by the two platforms are equal. Closer inspection showed that the relationship between RRBS- and array-derived DNAm estimates is not linear (Fig. 3D), and therefore we also explored absolute differences in DNAm estimates between the two assays. We observed a notable skew, with DNAm estimates from the array being generally higher than those from RRBS (mean difference = 0.112, SD = 0.223), and this relationship was observed regardless of read depth (Supplementary Figure 2). As expected, the RMSE between DNAm estimates generated using array and RRBS decreases as the stringency of read depth filtering in the RRBS dataset increases (Fig. 3Fii), plateauing at a read depth of ~ 30. Of note, the minimum RMSE observed was 0.180, suggesting some systemic differences between the two platforms in estimated DNAm levels. Our findings corroborate previous findings in which DNAm estimates generated using Illumina arrays and BS data are strongly correlated [34,35,36,37].

### RRBS enrichment results in a subset of DNAm sites that have consistent read depth across DNAm points

In order to perform a statistical analysis of DNAm differences between groups (e.g. in a study of cases vs controls), multiple samples, usually representing biological replicates, are required. We have demonstrated the importance of filtering RRBS data by read depth on obtaining accurate estimates of DNAm, however, this has the consequence of increasing the number of missing DNAm points (Fig. 4a). As expected, we found that read depth is not random across DNAm sites, but highly correlated between pairs of samples (Fig. 4b). To demonstrate this further, we iteratively increased the number of samples and calculated the proportion of DNAm points shared across DNAm sites (Fig. 4c). The proportion of DNAm points present decreases in a non-linear manner before plateauing at 0.20, demonstrating that there is a subset of DNAm sites for which read depth is greater than 0 across all or most DNAm points. DNAm sites containing all possible DNAm points, that is, each DNAm point had a read depth > 1, were found to have consistently higher read depth, with a strong correlation in read depths between DNAm points (Fig. 4d). This correlation in read depth between samples is a result of the enrichment strategy used in RRBS, meaning that specific CpG-rich regions are dramatically overrepresented in the sequencing data across all samples. As expected, the common DNAm sites containing all possible DNAm points were enriched in CpG islands compared to all DNAm sites (Fig. 4e) reflecting the *MspI*-based enrichment strategy used in RRBS [20].

### Simulated data demonstrates the consequence of read depth, sample size, and mean DNAm difference per group on power

Statistical power to identify differences in DNAm between two groups (e.g. cases vs controls), defined as the proportion of successfully detected true positives, will vary across DNAm sites and is influenced by multiple variables. In bisulfite sequencing studies, these include read depth, the number of samples in each group, the ratio of group sizes, the mean DNAm level, and the expected difference in DNAm between groups. We explored how each of these variables influences power by simulating bisulfite sequencing data for a given DNAm site following the framework laid out in Fig. 5. Briefly, a two group comparison was simulated, with sample size, mean read depth, μDNAm (the mean DNAm across the DNAm point) and ΔμDNAm (the mean difference in DNAm between groups) used as input variables that were either kept constant or varied to observe the effect on power. Each exemplar DNAm site was simulated 10,000 times, containing all DNAm points for the given sample size. A two-sided t-test was used to compare groups and power calculated as the proportion of *p*-values smaller than 5 × 10^{− 6}. It is important to note that all parameters, including r, the *p*-value threshold for power, and number of DNAm sites simulated, were selected with the aim of visualising how power might change with each variable in turn. Subsequent findings are based on exemplar DNAm sites, and exact values should be taken as such; they may not be representative of a wider study, as our aim was solely to characterize the relationship between each variable and statistical power. The values used to generate the results for each variable shown in Fig. 6 can be found in Supplementary Table 1.

As expected, increased read depth had a positive effect on power across each of the scenarios we considered, however, the potential gains are highly dependent upon the specific combination of parameters (Fig. 6a). For example, in a scenario where each group contains 30 samples and the mean DNAm level is 0.25, there is a relatively dramatic increase in power to detect a DNAm difference of 0.20 between groups as read depth increases, with 80% power at a mean read depth of 37, although there are minimal gains with read depths > 50. In contrast the gain in power with increased read depth is much less pronounced when detecting a mean DNAm difference of 0.10, and there is very little power at any read depth to detect a DNAm difference of 0.05. Therefore, if small effect sizes are relevant for the phenotype under study, power will need to be increased through other methods, e.g. increased sample size, as read depth filtering alone will not be sufficient. The relationship between read depth and power was previously also found in WGBS data, with higher total sequencing depth increasing the true positives detected [38].

We next investigated the effect of sample size and the ratio of group sizes on power (Fig. 6b), concluding that the optimal design in terms of maximizing power is to have equal sized comparison groups, assuming that the total sample size is constant. Fixing mean read depth to be 20 and a mean DNAm level of 0.25, our simulations showed that to have 80% power to detect a DNAm difference of 0.20 between groups a total sample size of 94 is required when the sample size ratio between groups is 60:40 (56 and 38 samples, respectively), which is only two more samples than required when the sample size ratio is balanced (i.e. 50:50). In the most extreme scenario we considered, an 80:20 ratio between groups, a total of 154 samples (123 and 31, respectively) are needed to have 80% power to detect a DNAm difference of 0.20 between groups. This has implications for the handling of DNAm sites where DNAm points are missing; it suggests that there may be a tolerable level of ‘missingness’ when comparing DNAm between groups that can be ‘rescued’ by having a greater sample size in the second comparison group. As with read depth (Fig. 6a), we found a non-linear relationship with power for both sample size (Fig. 6b) and mean DNAm difference between groups (Fig. 6c). Where each of these variables is the limiting factor, we found that the greatest gains in power occurred initially, with diminishing returns at higher levels and an eventual plateau. Where other variables act to reduce the overall power, the power curve is flattened and a plateau is not reached. One interesting observation from our simulations was the U-shaped relationship between power and mean level of DNAm at a given site (Fig. 6d). Power is highest at DNAm sites with either very low or very high levels of DNAm, and decreases to a minimum at intermediate levels of DNAm. We hypothesize that this reflects the relationship between the mean and variance in DNAm [39] (Fig. 6e), where the variance is lowest at the extremes, an artefact of DNAm being measured as proportion bounded at 0.00 and 1.00.

### Simulated bisulfite sequencing studies can be utilized to estimate power given suggested filtering

Our results indicate that, given the complex interplay of multiple experimental parameters, the choice of threshold for filtering DNAm sites is not always straightforward and will depend on the specific research question being addressed. Furthermore, the power calculations presented so far only consider a single DNAm site, whereas genome-wide comparisons of DNAm typically involve the analysis of hundreds of thousands of DNAm sites; given the effect of the properties of DNAm sites (e.g. in mean DNAm level) on power, no DNAm site can be considered to be ‘representative’ of the others. Therefore, we extended our simulation framework to quantify a study-level power statistic that considered all DNAm sites, allowing for the calculation of power given an RRBS dataset, and the read depth and minimum DNAm points per DNAm site filtering to be carried out. The extension of the simulation framework can be seen in Fig. 7 and is described in Methods. Briefly, an actual RRBS data set was used to estimate the simulation parameters (namely, sample size, μDNAm, μRD and negative binomial parameter, r) so that simulations reflect the real data. We compared the real and simulated data finding that the distribution of simulated read depths is highly comparable to real data for lower read depths (Fig. 8 Ai). Higher read depths do not seem to be captured as accurately by the negative binomial distribution, however, given that 95% of DNAm points have read depth < 85 (Fig. 8 Aii), this should be less important to the simulation. Overall, simulated DNAm estimates were similar to real DNAm levels across DNAm points, although there was some deviation, for example, a slight under representation of DNAm points with DNAm proportions above 0.25 and an overrepresentation of DNAm points with DNAm proportions above 0.25 (Fig. 8B).

To demonstrate the methodology, we considered a hypothetical study design with a total of 125 samples, specifying an expected mean DNAm difference between groups of 0.06, picked arbitrarily to allow for power visualization. To profile how read depth influences the power of the study, we incrementally increased the minimum read depth from 1 to 75, and to investigate the effect of the minimum number of DNAm points needed to find a difference between groups (i.e. the minimum effective sample size at any DNAm site given filtering by read depths and the often sparse nature of RRBS), we chose three arbitrary values: 2, 30 and 60. Power only increased subtly as read depth filtering became more stringent (Fig. 8C), compared to the gain of increasing the number of DNAm points. However, the gain is not consistent across all study designs, with greater gains in smaller studies (Fig. 8D). For example, with a minimum of two DNAm points per group, increasing the read depth threshold from 1 to 75 resulted in an increase in power of 10.9%, compared to a smaller increase of 4.83 and 4.89%, respectively, when the minimum DNAm points were set at 30 or 60. Our analysis reaffirms the interplay between all study-specific experimental variables. However, it should be noted that even with the most extreme read depth filtering, the maximum power for a group with a minimum of two DNAm points is still dramatically lower that the power of a study with a larger minimum and no or negligible filtering. Finally, we summarized our study wide power calculation in the R function POWer dEtermined REad Depth filtering for Bisulfite Sequencing (POWEREDBiSeq), which is available as a resource to the community at https://github.com/ds420/POWEREDBiSeq. The calculation results in largely consistent and normally distributed predictions of power, however, outliers can occur, suggesting that multiple iterations should be performed (Supplementary Figure 3).