Benchmarking of speed
To demonstrate MethylStar’s performance we analyzed bulk WGBS data from a selection of 200 A. thaliana ecotypes (paired-end, 295 GB, ∼8.63× depth, 85.66% genome coverage, GSE54292), 75 Maize strains (paired-end, 209 GB, ∼0.36× depth, ∼22.12% genome coverage, GSE39232) and 88 Human H1 cell lines (single-end, 82 GB, ∼0.12× depth, ∼10.62% genome coverage, GSM429321). MethylStar was compared with Methylpy, nf-core/methylseq and gemBS. All pipelines were run with default parameters on a computing cluster with a total of 88 cores (CPU 2.2 GHz with 378 GB RAM). Speed performance was assessed for a series of batch sizes (A. thaliana: 50, 100, 150, 200 samples; Human H1 cell line: 22, 44, 66, 88 samples; Maize: 15, 30, 45, 60, 75 samples) and was restricted to a fixed number of jobs (= 32) (Fig. 2a-c and Additional file 1: Table S2). Although gemBS achieved the fastest processing times for the A. thaliana samples, MethylStar clearly outperformed the other pipelines when applied to the more complex genomes of Maize and Human, which are computationally more expensive and resource-demanding (Fig. 2b-c). For instance, for 88 Human WGBS samples (82 GB of data), MethylStar showed a 75.61% reduction in processing time relative to gemBS, the second fastest pipeline (∼909 mins vs. ∼3727 mins). Extrapolating from these numbers, we expect that for 1000 Human WGBS samples, MethylStar could save ∼22.24 days of run time (∼4× faster).

To show that MethylStar can also be applied to single-cell WGBS data, we analyzed DNA methylation of 200 single cells from Human early embryo tissue (paired-end, 845 GB, ∼0.38× depth, ∼9.97% genome coverage, GSE81233) split into batches of 100 and 200 (Fig. 2d and Additional file 1: Table S2). MethylStar’s processing times were compared to those of Methylpy, which also supports single-cell data. For 100 cells, MethylStar required only ∼2225 mins as compared to ∼5518 mins required by Methylpy. Hence, MethylStar presents an efficient analysis solution for deep single-cell WGBS experiments.
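The extrapolation quoted above for the Human H1 benchmark can be reproduced with a few lines of arithmetic. The sketch below is purely illustrative and assumes that total run time scales linearly with the number of samples:

```python
# Illustrative arithmetic only: reproduces the quoted reduction and the
# extrapolation to 1000 Human WGBS samples from the reported run times.
methylstar_min = 909      # ~total minutes for 88 Human H1 samples (MethylStar)
gembs_min = 3727          # ~total minutes for 88 Human H1 samples (gemBS)
n_benchmark, n_target = 88, 1000

reduction = 1 - methylstar_min / gembs_min
print(f"reduction in processing time: {reduction:.2%}")  # ~75.61%

# Linear extrapolation: assume per-sample cost stays constant.
methylstar_target = methylstar_min / n_benchmark * n_target
gembs_target = gembs_min / n_benchmark * n_target
saved_days = (gembs_target - methylstar_target) / (60 * 24)
speedup = gembs_target / methylstar_target
print(f"estimated saving for {n_target} samples: "
      f"~{saved_days:.1f} days (~{speedup:.1f}x faster)")  # ~22.2 days, ~4x
```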
To demonstrate that MethylStar’s processing speed does not come at the expense of poor read alignments, we analyzed the read mapping statistics of 50 samples each of A. thaliana, Maize, Human H1 cell line and single-cell Human data using MethylStar, Methylpy, nf-core/methylseq and gemBS. Our results show that MethylStar and nf-core/methylseq, both of which employ the Bismark alignment tool, provide the most accurate and sensitive alignments. This observation is consistent with recent benchmarking results [30, 31]. By contrast, Methylpy and gemBS use their own inbuilt aligners and generally display poorer alignment statistics. Interestingly, although gemBS was the fastest pipeline for the A. thaliana samples, its percentage of ambiguously mapped reads was considerably higher than that of MethylStar, demonstrating a trade-off between speed and mapping performance. We also noticed that the percentage of reads mapped ambiguously by gemBS increased even further for the Maize samples (Additional file 1: Fig. S1 and Table S1). This could indicate that gemBS’s alignment performance is particularly challenged in complex plant genomes, although this hypothesis should be explored in more detail.
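The mapping statistics underlying such comparisons can be collected directly from the per-sample report files written by Bismark. The sketch below is a minimal example; the report filename and field labels correspond to typical paired-end Bismark/Bowtie2 reports and may differ between Bismark versions and single-end runs:

```python
# Hedged sketch: extract a few headline mapping statistics from a Bismark
# paired-end report. Field labels and filename are assumptions based on
# typical Bismark/Bowtie2 PE reports and may vary between versions.
FIELDS = (
    "Sequence pairs analysed in total",
    "Mapping efficiency",
    "Sequence pairs did not map uniquely",
)

def parse_bismark_report(path):
    """Return a dict mapping selected report fields to their (string) values."""
    stats = {}
    with open(path) as handle:
        for line in handle:
            key, sep, value = line.partition(":")
            if sep and key.strip() in FIELDS:
                stats[key.strip()] = value.strip()
    return stats

if __name__ == "__main__":
    # Hypothetical per-sample report name; Bismark writes one such file per sample.
    print(parse_bismark_report("sample_01_bismark_bt2_PE_report.txt"))
```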
Memory usage statistics
Along with benchmarking of speed, we also evaluated the performance of the MethylStar, gemBS, nf-core/methylseq and Methylpy pipelines in terms of system memory utilization using the MemoryProfiler Python module (https://github.com/pythonprofilers/memory_profiler) (Fig. 2e). We assessed the CPU time versus peak memory usage of all four pipelines (default settings) on a computing cluster (specifications above). For 10 random samples from the above A. thaliana benchmarking dataset (paired-end, 16 GB, GSE54292), MethylStar and Methylpy showed the best balance between peak memory usage (∼12000 MB and ∼15000 MB, respectively) and total run time (∼177 mins and ∼333 mins, respectively). In contrast, nf-core/methylseq and gemBS exhibited strong trade-offs between memory usage and speed, with nf-core/methylseq showing the lowest peak memory usage (∼700 MB) but the longest CPU time (∼697 mins), and gemBS the highest peak memory usage (∼21000 MB) but the shortest run time (∼42 mins) (Fig. 2e and Additional file 1: Table S5).
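As a minimal sketch of how such measurements can be scripted, the snippet below samples the resident memory of an arbitrary pipeline command with memory_profiler's memory_usage function and reports the peak together with the elapsed run time; the command string is a placeholder rather than the exact invocation used in our benchmarks:

```python
# Minimal sketch: record peak memory and run time of a pipeline command with
# the memory_profiler module. The command below is a placeholder.
import subprocess
import time
from memory_profiler import memory_usage

def profile_command(cmd):
    """Run `cmd` and return (peak_memory_mib, run_time_minutes)."""
    start = time.time()
    proc = subprocess.Popen(cmd, shell=True)
    # Sample resident memory of the process (and its children) once per second
    # until it exits; memory_usage returns the list of samples in MiB.
    samples = memory_usage(proc, interval=1.0, include_children=True)
    proc.wait()
    return max(samples), (time.time() - start) / 60.0

if __name__ == "__main__":
    peak_mib, minutes = profile_command("bash run_pipeline.sh")  # placeholder command
    print(f"peak memory: {peak_mib:.0f} MiB, run time: {minutes:.1f} min")
```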
Furthermore, we inspected the run times of MethylStar’s individual pipeline components, both with and without the parallel implementation (Fig. 2f and Additional file 1: Table S3). Our results clearly show that the parallel implementation is considerably faster for all components, although it comes with a higher peak memory usage. For instance, the Bismark alignment step required ∼141 mins with the parallel implementation as compared to ∼210 mins without it, a ∼33% reduction in processing time; in exchange, peak memory usage increased by ∼65%. Thus, with sufficient computational resources, MethylStar’s parallel implementation of the Bismark alignment can be very effective in handling large numbers of read alignments in considerably less time (Fig. 2f).
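To illustrate the general idea behind the parallel alignment step (this is not MethylStar's actual implementation), the sketch below launches Bismark paired-end alignments for several samples concurrently under a fixed cap on simultaneous jobs; the genome folder and sample names are placeholders:

```python
# Illustrative only: run Bismark paired-end alignments for several samples in
# parallel with a fixed worker cap. Genome folder, sample names and the job
# cap are placeholders; MethylStar's own parallel implementation may differ.
import subprocess
from concurrent.futures import ThreadPoolExecutor

GENOME_DIR = "/path/to/bisulfite_genome"            # hypothetical genome folder
SAMPLES = ["sample_01", "sample_02", "sample_03"]   # hypothetical sample IDs
MAX_JOBS = 4                                        # cap on concurrent alignments

def align(sample):
    """Launch one Bismark paired-end alignment for `sample`."""
    cmd = [
        "bismark", GENOME_DIR,
        "-1", f"{sample}_R1.fastq.gz",
        "-2", f"{sample}_R2.fastq.gz",
        "-o", f"aligned/{sample}",
    ]
    subprocess.run(cmd, check=True)
    return sample

if __name__ == "__main__":
    # Threads suffice here because the heavy lifting happens in the external
    # Bismark processes, not in Python.
    with ThreadPoolExecutor(max_workers=MAX_JOBS) as pool:
        for done in pool.map(align, SAMPLES):
            print(f"finished alignment for {done}")
```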
We further benchmarked memory usage using 10 random samples from the above Maize dataset (paired-end, 23 GB, GSE39232). For this analysis, we focused on gemBS and MethylStar due to their shorter processing times for these datasets as compared to nf-core/methylseq and Methylpy. For this Maize dataset, gemBS’s peak memory usage was ∼110000 MB as compared to ∼81000 MB for MethylStar (∼1.3 times less memory; Additional file 1: Table S4), with total run times of ∼667 mins and ∼508 mins, respectively. Using MethylStar’s parallel implementation, we observed a 76% reduction in processing time for the Maize samples relative to the non-parallel implementation (Additional file 1: Table S4). Taken together, these benchmarking results clearly show that MethylStar exhibits favorable performance in terms of processing time and memory, and that it is therefore an efficient solution for the pre-processing of large numbers of samples even on a computing cluster with limited resources.