An integrative approach for efficient analysis of whole genome bisulfite sequencing data
© Lee et al. 2015
Published: 9 December 2015
Whole genome bisulfite sequencing (WGBS) is a high-throughput technique for profiling genome-wide DNA methylation at single nucleotide resolution. However, the applications of WGBS are limited by low accuracy resulting from bisulfite-induced damage on DNA fragments. Although many computer programs have been developed for accurate detecting, most of the programs have barely succeeded in improving either quantity or quality of the methylation results. To improve both, we attempted to develop a novel integration of most widely used bisulfite-read mappers: Bismark, BSMAP, and BS-seeker2.
A comprehensive analysis of the three mappers revealed that the mapping results of the mappers were mutually complementary under diverse read conditions. Therefore, we sought to integrate the characteristics of the mappers by scoring them to gain robustness against artifacts. As a result, the integration significantly increased detection accuracy compared with the individual mappers. In addition, the amount of detected cytosine was higher than that by Bismark. Furthermore, the integration successfully reduced the fluctuation of detection accuracy induced by read conditions. We applied the integration to real WGBS samples and succeeded in classifying the samples according to the originated tissues by both CpG and CpH methylation patterns.
In this study, we improved both quality and quantity of methylation results from WGBS data by integrating the mapping results of three bisulfite-read mappers. Also, we succeeded in combining and comparing WGBS samples by reducing the effects of read heterogeneity on methylation detection. This study contributes to DNA methylation researches by improving efficiency of methylation detection from WGBS data and facilitating the comprehensive analysis of public WGBS data.
DNA methylation, defined as the addition of methyl group on 5-carbon in cytosine, is a widely spread epigenetic mark. The DNA methylation pattern can serve to identify cells and guides cell development and tissue maintenance . For decades, researchers have focused on methylation at CpG sites (mCpG) and found that differentially methylated regions (DMRs) among various cells are involved to cell-specific functions, aging and deceases [2–5]. Recently, methylation at CpH (mCpH; where H can be A, C, or T) sites has been confirmed to be a key regulator of brain development and embryonic stem cell (ESC) differentiation [6–9]. Therefore, profiling both mCpG and mCpH in a genome scale is crucial for understanding of various biological processes.
To analyze the methylation modifications, high-throughput methods coupled with microarray and next-generation sequencing have been widely used. Bisulfite microarray is a specially designed genotyping microarray combined with bisulfite treatment. Although this method is a useful strategy for targeted DNA methylation analyses, it is not suitable for genome-scale studies due to low genome coverage; only 0.8% of CpGs and 0.02% of CpHs have been covered in the newest version . Reduced representation bisulfite sequencing (RRBS)  utilizes enzyme bindings to "CCGG" sites to detect information-rich regions, whereas the enzyme reaction leads experimental bias and limits the detection of mCpH . Alternatively, as the widely-accepted gold standard method, whole genome bisulfite sequencing (WGBS) can detect both mCpG and mCpH at single nucleotide resolution in a genome scale .
For efficiently detecting the methylated sites with WGBS data, many computer programs have been developed. In particular, Bismark , BSMAP , and BS-seeker2  are the most widely used bisulfite-read mappers that employ different strategies; BSMAP is a wild-card type mapper that converts all cytosine bases (Cs) of a reference genome to a letter Y and then aligns sequenced Cs and thymine bases (Ts) to the Y  by using SOAP . Bismark and BS-seeker2 are three-letter type mappers that convert all Cs to Ts in both sequenced reads and a reference genome. Bismark and BS-seeker2 employ Bowtie2  in global and local alignment modes, respectively. It has been reported that wild-card type mappers are better in mapping rate (percentage of reads being aligned) but struggle with mapping accuracy (percentage of reads mapped at correct positions) . Three-letter type mappers show exactly opposite tendency . Therefore, the choice of bisulfite-read mapper is an important issue for not only specific studies that use costly WGBS data but also comprehensive large-scale analyses of public WGBS datasets.
Results and Discussion
Overview of integrative approach
We integrated the methylation results from three bisulfite-read mappers: Bismark, BSMAP, and BS-seeker2. Bismark and BSMAP are the most widely used three-letter type and wild-card type mappers, respectively [18–20], and BS-seeker2 is the newest three-letter type mapper, which has shown higher mapping rate than Bismark .
The evaluation and integration of Bismark, BSMAP, and BS-seeker2 were conducted as described below (Figure 1b). First, bisulfite-read sets were generated by Sherman , with randomly designated methylation levels for every block of 500 base pairs (bps) in human chromosome 19. Then we mapped the reads by Bismark, BSMAP, and BS-seeker2 and evaluated the performances of the three mappers with respect to mapping rate and accuracy; the mapping rate is the portion of mapped read number over total read number, whereas the mapping accuracy is the portion of correctly mapped read number over mapped read number. Lastly, we integrated the methylation results from the mappers with three strategies and evaluated the performances in terms of detection accuracy (d-accuracy) and amount of detected Cs (d-amount). The d-accuracy was determined by the similarity between generated and detected methylation levels at each block (Figure 1c).
Read-dependent performances of the three mappers
To investigate the performances of the three mappers under diverse read conditions, we analyzed the mapping results of the three mappers in the context of varying read quality, read length, and methylation levels.
The read length also affected the performances of the three mappers (Additional file 1: Figure S1). We compared mapping results of 50 bp-long reads with those of 100 bp-long reads. When read error rate was low (2%), both mapping rate and mapping accuracy were higher within long reads. When read error rate was high (8%), mapping rate of Bismark and BSMAP were higher within short reads. Remarkably, the mapping accuracy of Bismark was also higher in short reads.
Additionally, we found that the performances of the three mappers varied greatly within repeat regions. In particular, the reads generated from short interspersed nuclear elements (SINEs) tended to be unmapped by Bismark and BS-seeker2 (Figure 2c) and incorrectly mapped by BSMAP (Figure 2d), which clearly showed the difference in performances between wild-card type and three-letter type mappers.
Lastly, we found that hypo-methylated reads tended to be incorrectly mapped by BSMAP and BS-seeker2 (Figure 2e, and 2f). This tendency was not found in the mapping results of Bismark. This may be explained in part by that the increased number of Ts, induced by the bisulfite conversion of unmethylated Cs, hindered the correct mapping of BSMAP and BS-seeker2. To confirm that, we measured the percentage of Ts in reads that correctly and incorrectly mapped by the three mappers. For BSMAP and BS-seeker2, the incorrectly mapped reads contained higher amount of Ts than the correctly mapped ones (Figure 2g).
In summary, Bismark, BSMAP, and BS-seeker2 performed differently in different read conditions. Bismark mapped reads with great accuracy and was not affected by the density of Ts in reads. However, Bismark tended to lose both mapping rate and accuracy when read error rate was higher in longer reads. BSMAP generally mapped a large number of reads to incorrect positions. Additionally, the mapping accuracy of BSMAP was affected by the density of Ts in reads. Both the mapping rate and mapping accuracy of BS-seeker2 were only slightly affected by the read error rate, whereas the mapping accuracy was affected by the density of Ts in reads.
Integrative approach improved both the accuracy and amount of methylation detection
Based on the different performances of Bismark, BSMAP, and BS-seeker2 in varying read conditions, we integrated the mapping results of the three mappers using three strategies: Ave - average of the methylation levels from the three mappers, wAve - weighting by read depths, and pwAve - weighting by probabilistic method (i.e. Poisson distribution, see Method).
d-accuracy of the integration methods
Long reads (100 bps)
Short read (50 bps)
Read error (%)
The superior performance of wAve may be explained by the correlation between d-accuracy and mapping rate. Using read depth as weight, the wAve considered mapping rate as a first element on determining the certainty of the methylation levels from each mapper. On the other hand, pwAve indirectly employed mapping accuracy on weighting by considering the characteristics of the mappers; a mapper that maps larger number of reads compared with other mappers tended to maps reads at incorrect positions. The tendency was clearly revealed within short reads containing low error, so the d-accuracy of pwAve was the highest among the integration methods in those read conditions. Generally, however, d-accuracy was more strongly correlated with mapping rate (Pearson correlation coefficient equals to 0.83) than mapping accuracy (Pearson correlation coefficient equals to 0.64, Additional file 4: Figure S3), resulting in the higher d-accuracy of wAve than that of pwAve in most read conditions.
It is noteworthy that the d-accuracy of wAve exhibited the reduced dependency on read conditions (i.e. read length and quality, Figure 3g). We checked the distribution of d-accuracies of the three mappers and wAve, gathered from mapping results of 100 read sets (see Method).The wAve shows relatively less variance of d-accuracies among varied read conditions. Especially, the wAve decreased the difference of d-accuracy between high and low read error cases (Figure 3d). This implies that the wAve successfully reduces the effects of heterogeneous read conditions on methylation detection, facilitating comprehensive analyses of methylation patterns among public WGBS samples from various experiments.
The integrative approach facilitated the comprehensive analysis of public WGBS data
Next, we re-analyzed 13 WGBS samples that were generated from various experiments with different read length and quality. In particular, we included six brain samples and four pluripotent stem cells, in which significant amount of CpH methylation is accumulated [6–9].
We also examined the correlation between samples in terms of CpG and CpH methylation levels detected by the wAve. We found that both methylation levels clearly grouped samples according to their tissue of origin (Figure 4b and 4c). In particular, while brain samples were produced from three different experiments, they were closely positioned in the dendrogram. Moreover, an unknown-brain sample, a WGBS data from brain of which age was not known, and a sample from liver were produced from the same experiment but were successfully grouped according to each tissue type. These trends were not observed for BS-seeker2 (Additional file 6: Figure S5). The wAve clearly reduced false correlation between brain and liver from the same experiment (Figure 4d). We also found that the wAve results in significantly higher correlation of CpG methylation levels in gene-body regions within brain samples compared with that observed by Bismark (Additional file 7: Figure S6). Especially, the correlation among brain samples from different experiments were relatively more increased than that from same experiment. Thus, the wAve decreased the false correlation between samples from the same experiment and increased the correlation among samples from the same tissue.
To efficiently detect DNA methylation from WGBS data, we analyzed and integrated the three most widely used bisulfite-read mappers, Bismark, BSMAP, and BS-seeker2. The procedure consisted of three steps: mapper analysis, analysis with simulated reads, and analysis with real WGBS dataset.
Firstly, we confirmed that with low read error rate, the performances of the three mappers were consistent with the results of former studies of wild-card type (e.g. BSMAP) and three-letter type (e.g. Bismark and BS-seeker2) . In particular, the two types of mappers performed distinctly in SINEs, in which the wild-card type mappers falsely mapped reads, whereas the three-letter type mappers failed to map reads. It should be further investigated what distinction in algorithm induces the difference in mapping results in SINEs. In addition, the performances of Bismark and BSMAP dramatically decreased in case of high error reads, whereas BS-seeker2 did not affected much by the read error rates. Lastly, the mapping accuracies of BSMAP and BS-seeker2 were found to be dependent on the methylation level, whereas Bismark were not. Based on the complementary performances of the three mappers across varying read conditions, we integrated the mapping results of the three mappers with three methods: average (Ave), read depth-weighted average (wAve), and probabilistically weighted average by Poisson distribution (pwAve).
With the simulated reads, the wAve method resulted in significantly higher detection accuracy than that obtained with individual mappers and other integration methods. On the other hand, pwAve showed decreased accuracy compared with wAve. It should be further studied what probabilistic methods could improve the detection accuracy compared with read-depth weighting. In addition, the wAve exhibited higher detection of Cs than Bismark. Indeed, existing bisulfite mappers exhibit smaller increases in either quality or quantity of the methylation results compared with former systems. It is remarkable that the integration improved both the accuracy and amount of methylation detection. Furthermore, the integration reduced the dependency of detection accuracy on read conditions (i.e. error rate and length), proving that our method can facilitate the comprehensive analyses of multiple WGBS samples of which read conditions are heterogeneous.
With real WGBS samples, the wAve reduced the false correlation between WGBS samples generated from same experiments and increased the true correlation between those originated from same tissues. Thus, our method succeeded in facilitating comprehensive analyses of multiple WGBS datasets from various experiments by reducing the dependency of methylation results on read conditions.
In summary, our integrative approach improved both quality and quantity of methylation results from WGBS data, and facilitated the comprehensive analyses of DNA methylation among various read conditions. This study may contribute to researches about methylation patterns among samples in different conditions (e.g. tissue, age, or some diseases) by combining a massive public WGBS data. In addition, this study may give a new clue to algorithmic improvement of bisulfite-read mappers to enhance epigenetic researches.
Generating artificial sequenced reads
The sequenced reads were generated by Sherman . The Sherman generates virtually bisulfite-treated reads with specific read number, length, error rate, and methylation level. We designated methylation level of CpG and CpH randomly and separately, on every 500 bps block of human genome (hg 19) chromosome 19. The human chromosome 19 is short but reveals the highest repeat rate among all chromosomes . Therefore, we could effectively observe the diverse mapping results of the three bisulfite-read mappers with special focus on repeat regions. From the randomly methylated reference genome, we generated long (100 bps) and short (50 bps) reads separately. The numbers of generated reads were 50 for 100 bps reads and 100 for 50 bps reads in a block to adjust the average coverage depth equals to 10. Also, we generated reads with designating error rate to 0%, 2%, 4%, 6%, and 8%, in order to determine the dependency of mapping results on read error. We repeatedly generated all the cases of read sets 10 times. In total, we generated 100 read sets (2 read length cases × 5 read error cases × 10 repeat) for which the read number was 5.4 million and 10.8 million for long and short reads, respectively.
Parameters for read generation
– For short reads
/Sherman -l 50 -n 100 -e [ERROR] --genome_folder [DIR] -CG [mCG] -CH [mCH]
– For long reads
/Sherman -l 100 -n 50 -e [ERROR] --genome_folder [DIR] -CG [mCG] -CH [mCH]
[DIR]: Directory to fasta file that include 500 bp-long sequences. Before running Sherman, human chromosome divided into 500 bps sequences and saved in separate directories.
[ERROR]: repeatedly set to 0, 2, 4, 6, and 8
[mCG] and [mCH]: randomly and independently set from 0 to 100. After the read generation completed, Sherman reported exact value of the bisulfite-treated rates on CpG and CpH positions. We used the reported values as designated methylation level at each block.
Read mapping and extracting methylation level
We mapped both artificial reads and real WGBS reads with Bismark, BSMAP, and BS-seeker2. In mapping, we unified the maximum mismatch threshold to 4% of the read length, to determine the distinct performances of the three mappers in unified parameters. The command lines for each mapper were as below;
Bismark; we used bowtie2 as an aligner for better performance .
perl ./bismark -o [ouput] --bowtie2 [reference genome] [input fastq] --score_min L,0,-0.24
BSMAP; we set the maximum number of equal best hits to one .
./bsmap -a fastq file -d [reference genome] -o [output] -w 1 -v 0.04
BS-seeker2; we used bowtie2 as an aligner for better performance. 
python ./bs_seeker2-align.py -i [input fastq] -o [output] -g [reference genome] --aligner = bowtie2 -m 0.04
After mapping, we removed duplicates possibly induced by PCR amplification. The duplicated reads from Bismark, BSMAP, and BS-seeker2 were removed by picard , samtool rmdup , and a program of BS-seeker2 , respectively. After removing duplicates, methylation levels of each C were extracted by programs of each mapper. In results with simulated reads, we considered methylation levels at Cs that covered by more than one read. In results with real WGBS data, however, we considered methylation levels at Cs that covered by more than five reads in order to increase the confidence of methylation level at each C . The methylation level at each C was calculated as the ratio of unconverted Cs over the total mapped read number.
Integration of mapping results
where W p is the weight by the probability function f of Poisson distribution with parameter λ that is the average read depth of a mapper over whole genome.
WGBS data preparation
WGBS data description
middle frontal gyrus
middle frontal gyrus
middle frontal gyrus
middle frontal gyrus
– Parameters for quality-trimming
fastq_quality_trimmer -t 20 -l [half of read length] -i sample.fastq -o [output directory] -Q [phred score scale]
Computational resource was supported by Human Genome Center, the Institute of Medical Science, the University of Tokyo. Also, JHL acknowledges the support of the Japanese Government Scholarship (MEXT).
Publication charges for this article have been funded by JSPS KAKENHI Grant Number 25290067.
This article has been published as part of BMC Genomics Volume 16 Supplement 12, 2015: Joint 26th Genome Informatics Workshop and 14th International Conference on Bioinformatics: Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/16/S12.
- Meissner A, et al: Genome-scale DNA methylation maps of pluripotent and differentiated cells. Nature. 2008, 454 (7205): 766-70.PubMedPubMed CentralGoogle Scholar
- Day K, et al: Differential DNA methylation with age displays both common and dynamic features across human tissues that are influenced by CpG landscape. Genome Biol. 2013, 14 (9): R102-PubMedPubMed CentralView ArticleGoogle Scholar
- Varley KE, et al: Dynamic DNA methylation across diverse human cell lines and tissues. Genome Res. 2013, 23 (3): 555-67.PubMedPubMed CentralView ArticleGoogle Scholar
- Horvath S: DNA methylation age of human tissues and cell types. Genome Biol. 2013, 14 (10): R115-PubMedPubMed CentralView ArticleGoogle Scholar
- Baylin SB: DNA methylation and gene silencing in cancer. Nat Clin Pract Oncol. 2005, 2 (Suppl 1): S4-11.PubMedView ArticleGoogle Scholar
- Lister R, et al: Global epigenomic reconfiguration during mammalian brain development. Science. 2013, 341 (6146): 1237905-PubMedPubMed CentralView ArticleGoogle Scholar
- Ziller MJ, et al: Genomic distribution and inter-sample variation of non-CpG methylation across human cell types. PLoS Genet. 2011, 7 (12): e1002389-PubMedPubMed CentralView ArticleGoogle Scholar
- Guo JU, et al: Distribution, recognition and regulation of non-CpG methylation in the adult mammalian brain. Nat Neurosci. 2014, 17 (2): 215-22.PubMedPubMed CentralView ArticleGoogle Scholar
- Lister R, et al: Human DNA methylomes at base resolution show widespread epigenomic differences. Nature. 2009, 462 (7271): 315-22.PubMedPubMed CentralView ArticleGoogle Scholar
- Morris TJ, S Beck: Analysis pipelines and packages for Infinium HumanMethylation450 BeadChip (450k) data. Methods. 2015, 72: 3-8.PubMedPubMed CentralView ArticleGoogle Scholar
- Meissner A, et al: Reduced representation bisulfite sequencing for comparative high-resolution DNA methylation analysis. Nucleic Acids Res. 2005, 33 (18): 5868-77.PubMedPubMed CentralView ArticleGoogle Scholar
- Bock C: Analysing and interpreting DNA methylation data. Nat Rev Genet. 2012, 13 (10): 705-19.PubMedView ArticleGoogle Scholar
- Krueger F, Andrews SR: Bismark: a flexible aligner and methylation caller for Bisulfite-Seq applications. Bioinformatics. 2011, 27 (11): 1571-2.PubMedPubMed CentralView ArticleGoogle Scholar
- Xi Y, Li W: BSMAP: whole genome bisulfite sequence MAPping program. BMC Bioinformatics. 2009, 10: 232-PubMedPubMed CentralView ArticleGoogle Scholar
- Guo W, et al: BS-Seeker2: a versatile aligning pipeline for bisulfite sequencing data. BMC Genomics. 2013, 14: 774-PubMedPubMed CentralView ArticleGoogle Scholar
- Alper BS: SOAP: solutions to often asked problems. Choice of antihistamines for urticaria. Arch Fam Med. 2000, 9 (8): 748-51.PubMedView ArticleGoogle Scholar
- Langmead B, Salzberg SL: Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012, 9 (4): 357-9.PubMedPubMed CentralView ArticleGoogle Scholar
- Chatterjee A, et al: Comparison of alignment software for genome-wide bisulphite sequence data. Nucleic Acids Res. 2012, 40 (10): e79-PubMedPubMed CentralView ArticleGoogle Scholar
- Tran H, et al: Objective and comprehensive evaluation of bisulfite short read mapping tools. Adv Bioinformatics. 2014, 2014: 472045-PubMedPubMed CentralView ArticleGoogle Scholar
- Kunde-Ramamoorthy G, et al: Comparison and quantitative verification of mapping algorithms for whole-genome bisulfite sequencing. Nucleic Acids Res. 2014, 42 (6): e43-PubMedPubMed CentralView ArticleGoogle Scholar
- Sherman - bisulfite-treated Read FastQ Simulator. [http://www.bioinformatics.babraham.ac.uk/projects/sherman/]
- Grover D, et al: Alu repeat analysis in the complete human genome: trends and variations with respect to genomic composition. Bioinformatics. 2004, 20 (6): 813-7.PubMedView ArticleGoogle Scholar
- Picard. [http://broadinstitute.github.io/picard/]
- Samtools rmdup. [http://samtools.sourceforge.net/]
- Ziller MJ, et al: Charting a dynamic DNA methylation landscape of the human genome. Nature. 2013, 500 (7463): 477-81.PubMedView ArticleGoogle Scholar
- Guo W, et al: Characterizing the strand-specific distribution of non-CpG methylation in human pluripotent cells. Nucleic Acids Res. 2014, 42 (5): 3009-16.PubMedPubMed CentralView ArticleGoogle Scholar
- Wen L, et al: Whole-genome analysis of 5-hydroxymethylcytosine and 5-methylcytosine at base resolution in the human brain. Genome Biol. 2014, 15 (3): R49-PubMedPubMed CentralView ArticleGoogle Scholar
- Court F, et al: Genome-wide parent-of-origin DNA methylation analysis reveals the intricacies of human imprinting and suggests a germline methylation-independent mechanism of establishment. Genome Res. 2014, 24 (4): 554-69.PubMedPubMed CentralView ArticleGoogle Scholar
- Whole Genome Bisulfite Sequencing by ENCODE/HAIB. [http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE40832]
- UCSD Human Reference Epigenome Mapping Project. [http://www.roadmapepigenomics.org/]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.