High resolution profiling of human exon methylation by liquid hybridization capture-based bisulfite sequencing

Background DNA methylation plays important roles in gene regulation during both normal developmental and disease states. In the past decade, a number of methods have been developed and applied to characterize the genome-wide distribution of DNA methylation. Most of these methods endeavored to screen whole genome and turned to be enormously costly and time consuming for studies of the complex mammalian genome. Thus, they are not practical for researchers to study multiple clinical samples in biomarker research. Results Here, we display a novel strategy that relies on the selective capture of target regions by liquid hybridization followed by bisulfite conversion and deep sequencing, which is referred to as liquid hybridization capture-based bisulfite sequencing (LHC-BS). To estimate this method, we utilized about 2 μg of native genomic DNA from YanHuang (YH) whole blood samples and a mature dendritic cell (mDC) line, respectively, to evaluate their methylation statuses of target regions of exome. The results indicated that the LHC-BS system was able to cover more than 97% of the exome regions and detect their methylation statuses with acceptable allele dropouts. Most of the regions that couldn't provide accurate methylation information were distributed in chromosomes 6 and Y because of multiple mapping to those regions. The accuracy of this strategy was evaluated by pair-wise comparisons using the results from whole genome bisulfite sequencing and validated by bisulfite specific PCR sequencing. Conclusions In the present study, we employed a liquid hybridisation capture system to enrich for exon regions and then combined with bisulfite sequencing to examine the methylation statuses for the first time. This technique is highly sensitive and flexible and can be applied to identify differentially methylated regions (DMRs) at specific genomic locations of interest, such as regulatory elements or promoters.


Background
The methylation of cytosines at CpG dinucleotides is an important regulatory modification in the somatic cells of mammals and other vertebrates [1,2]. It plays a vital role in regulating gene transcription during diverse biological processes, including embryonic development, X-chromosome inactivation, and the maintenance of pluripotency and chromosome stability [3][4][5][6][7][8]. Aberrant DNA methylation is associated with many common diseases. Recent evidence has demonstrated that the epigenetic silencing of tumour suppressor genes due to abnormal DNA hypermethylation is involved in cancer development and progression [8,9]. Due to its relative genomic stability and well-established role in cancer pathogenesis, it is likely that DNA methylation is an important biomarker for the diagnosis and prognosis of certain diseases [10].
Understanding the exact role of DNA methylation in normal developmental and disease states requires knowledge of the distribution of methylation in the genome. Fortunately, reference genome assemblies and massively parallel sequencing enable the high-resolution genome-wide profiling of cytosine methylation. With this profiling, it is possible to identify methylation biomarkers in clinical samples using adequately powered epigenome-wide association studies.
There are three main types of strategies that are currently available for the high-resolution detection of DNA methylation based on the next generation sequencing platform. The first is whole genome bisulfite sequencing (WGBS), in which sodium bisulfite is applied to the genomic samples to convert all unmethylated cytosines to uracils. The uracil is subsequently recognized as thymine after PCR amplification [11,12]. This technique is the gold standard for cytosine methylation analysis in both CpG and non-CpG contexts because it allows for methylation detection at a single-base resolution [11,12]. However, whole genome sequencing is extremely time consuming and costly, especially when used to screen complex genomes. For this reason, it is impractical for the survey of multiple biological samples.
The second approach is based on the enrichment of methylated DNA fragments such as MeDIP-seq [13,14] and MBD-seq [15], which employ 5-methylcytosine-specific antibodies and methyl-binding domain proteins, respectively, to precipitate methylated fractions from a randomly sheared genomic DNA sample. Although these two methods are able to enrich fragments containing 5-methylcytosine, they screen the whole genome without representation and cannot determine the methylation statuses of individual CpGs within the fragments [13][14][15][16][17][18]. Even when combined with bisulfite conversion, these methods were still hampered by the enrichment bias that was caused by the CpGs densities and methylation statuses of the DNA fragments [16].
The third approach includes reduced representation bisulfite sequencing (RRBS) and bisulfite padlock probes (BSPP) [19][20][21]. RRBS digests genomic DNA with a specific restriction enzyme followed by bisulfite conversion and sequencing of the size-selected fractions [19,20]. This technique is able to cover multiple regions, including CpG islands, promoters and enhancer elements, and can yield data at a single-base resolution. It is a robust and relatively low-cost method but is limited to profiling the defined regions that correspond to the recognition sites of certain restriction enzymes [17,18,20,22]. Theoretically, BSPPs could be applied to detect any genomic region of interest by the design of specific probes. However, because they are involved in the hybridization of bisulfite-converted genomic DNA, each probe can only target a limited number of CpGs. This limitation hampers the flexibility of the probe design and applicability of this technology [21].
In the present study, we designed a novel strategy, termed liquid hybridization capture-based bisulfite sequencing (LHC-BS), which utilized biotinylated RNA probes to capture target regions of native genomic DNA by liquid hybridization platform and then followed by sodium bisulfite treatment and next-generation sequencing to survey the methylation statuses of specific genomic regions. Exome capture sequencing is a widely used technique to study diseases, and methylation status is gaining recognition as an important topic [23]. Here, we performed target (exome) capturing and parallel bisulfite sequencing to test the reliability of our platform. To validate the applicability of our approach, two types of samples were used: the YanHuang peripheral blood sample (YH) and a mature dendritic cell (mDC) line. Our results demonstrated that the current LHC-BS method was as effective as the bisulfite system in detecting methylation statuses of the captured regions. The technique provided the foundations for future highthroughput examinations of the methylation statuses of any genomic regions of interest.

Overview of LHC-BS
The LHC-BS assay was based on a protocol that applied predesigned biotinylated RNA probes to capture target genomic regions of interest for further analysis. In the current study, we used the Agilent SureSelect Human All Exon Kit to enrich all human exons, which totalled approximately 38 Mb. The capturing was achieved using a liquid hybridization system that required only 2-3 μg of genomic DNA for the library construction in contrast with the 20 μg that is required for the array-based capture method [24,25]. To briefly summarize the experimental procedures, 2-3 μg of genomic DNA were randomly fragmented to mean sizes of approximately 250 bp. The manufacturer's protocols were followed for creating the blunt ends and for the dA additions to the 3'ends and the methylcytosines modified adapter ligations. A total of 500 ng of each adapter-ligated library (147 ng/μl) were denatured and hybridised with biotinylated RNA probes at 65°C for 24 h. Then, Dynabeads ® M-280 Streptavidin was used for the separation of target fragments, after which the captured beads were washed and the DNA fragments were eluted. After the fragments were treated with sodium bisulfite, PCR amplification was performed and the amplicons were sequenced using the Illumina/HiSeq 2000 ( Figure 1).

Data generation
LHC-BS was successful in achieving sufficient coverage of the target regions based on a relatively small amount of sequencing. In this study, 88 M and 132 M raw reads were generated for the YH blood sample and mDC cell line, respectively (Additional File 1 Table S1). Of these reads, more than 75 M and 106 M reads were uniquely mapped back to the reference genome, giving unique map rates of 85.36% and 80.27%, respectively. On average, the depths of coverage were 58× and 63× for the YH and mDC exomes, respectively. We found that more than 97% of the target exons were covered by unique reads in both samples. Of these, 90.21% and 89.16% of the regions were covered by at least 10 reads for each sample, respectively. Furthermore, 71.93% of the reads were enriched in the target regions for the YH blood sample, and 75.96% for the mDC cell line, thus indicating the high capture specificity of this assay for exome.
Because bisulfite conversion can result in allele dropouts at low DNA concentrations, we further evaluated the potential rate of allele loss for the YH sample by evaluating heterozygous SNPs (more than 2 alleles) (unpublished). Totally, 7172 heterozygous alleles were identified in the target regions with sequencing depths of more than 10× in YH whole genome bisulfite sequencing research, and 6992 of them were identified heterozygous by the current designed method. These results indicated that the rate of allele dropout was 2.5% (Additional File 1 Table S2). The heterozygosity of A/G, C/T, G/A and C/T was not included in this analysis because it was possible that they were generated during the bisulfite conversion.

Accuracy of target exon capture
Based on the massive sequencing reads that were uniquely mapped to the reference genome, we further examined the read distribution across the whole genome. As indicated in Figures 2a and 2b, the reads were distributed across all chromosomes, indicating the efficient coverage of the target regions. The average sequencing depth for each chromosome was over 30× for the negative strands in both the YH and mDC samples. To estimate the accuracy of target capture, we randomly isolated data from a specific region of the DIP2B gene that is located on human chromosome 12 and used the Integrative Genomics Viewer (IGV) to assess the read distributions for both the YH and mDC samples. Using the IGV, we were able to determine that the profiles of the reads from the target region followed normal distributions, and most of them were enriched in the gene region ( Figure 2c). These results verified the accuracy of this technique in the capture of target regions. The read distribution across chromosome 12 is presented in Additional File 2 Figure S1.

Accuracy of methylation estimation of target regions
To estimate the methylation statuses of the captured exon regions and evaluate the efficacy of LHC-BS in detecting genomic DNA methylation, methylation levels were assessed based on the following equation for one certain cytosine site: mC reads/(C reads +mC reads)*100%. Considering that not all unmethylated cytosines could be converted to thymine following the bisulfite treatment and because it is believed that non-CpG methylation is barely discernable in human somatic cells [26], we estimated the conversion rate by calculating the methylation levels of the non-CpG sites and adjusted the methylation levels of the CpG sites accordingly. We had previously performed WGBS on both samples and obtained sufficient data to evaluate the genomic methylation statuses (unpublished data), and we were thus able to apply these data to estimate the accuracies of the methylation measurements that were obtained using the LHC-BS method. We isolated WGBS reads that represented exon regions with at least nine-fold depths of coverage and performed correlation analyses with the LHC-BS reads from both samples. A correlation coefficient of 0.907 was obtained for the YH blood sample from the comparison of data of the two methods (Additional File 3 Figure S2 a). Similarly, a correlation coefficient of 0.925 was reached for the comparison with the mDC cell line (Additional File 3 Figure S2 b). Such high consistencies between the two sets of data that were generated by the two distinct methods are indicative of the accuracy of LHC-BS because WGBS is considered to be the gold standard for measuring methylation. Secondly, we addressed the methylation levels in all gene regions of the YH blood sample and mDC cell line. As indicated in Figure 3, the methylation levels across the target gene regions were very similar between the two samples, suggesting similarities between the two cell types that are both primarily comprised of immunocytes. Due to these parallels, we were able to examine technological reproducibility to some extent; specifically, the average methylation level of all CpGs in the genomic regions around the TSSs was identified to be 3%, which is consistent with the published results from the whole genome bisulfite sequencing of YH peripheral blood mononuclear cells [26].
To characterize the slight differences in the DNA methylation profiles between the samples, we performed pair-wise comparisons between the YH blood cells and mDC cell line. As expected, the results revealed that only 21 out of the 165637 targeted regions significantly differed between the two samples (see Methods), indicating highly similar DNA methylation profiles corresponding with the whole exon regions. A Gene Ontology (GO) analysis was employed to address the molecular functions of the genes from the 21exon regions that exhibited significantly different DNA methylation patterns ( Table 1).

Validation of DNA methylation by BS-PCR
Certain biases may be introduced during exome capture, thus hampering the accuracy of LHC-BS. To confirm the methylation events that were detected by LHC-BS, BS-PCR was performed on four randomly chosen genomic fragments. These fragments had low, moderate or high methylation levels in both the YH blood sample and mDC cell line, and there were 27 individual CpG sites in total. Fisher's tests were applied to verify statistical efficacy, and it was observed that none of the CpG sites presented with significant differences in DNA methylation levels when the LHC-BS and BS-PCR results were compared (Additional File 1 Table S3). For instance, a region that is located in the 5'UTR of the    were observed for the two samples following BS-PCR (Figure 4a). In the other three verified regions, sequences from the DSPP gene presented with nearly undetectable methylation levels (Figure 4b), and the WDR37 and DIP2B genes were shown to be heavily methylated using both technologies (Figure 4c and 4d). Thus, taking into account the random selection of the DNA fragments for validation, we did not observe any bias in the detection of methylation levels using LHC-BS.

Discussion
Along with the completion of various genome sequencing projects and more wide spread uses of highthroughput sequencing technologies, many strategies have been developed to assess DNA methylation that require the enrichment of specific genomic regions of interest. For example, microarray-based strategies utilized bisulfite-converted DNA to probe defined regions by the bimodal hybridization of either unconverted or converted probes [24]. This technique enables the detection of original methylation states at specific CpG sites, but it is difficult to design an adequate number of unique probes to extend this approach to a genomewide scale. However, Hodges et al. [25] recently modified this strategy and integrated it with deep sequencing. The methylation level represents the average methylation level of the CpGs in the region. The P value was calculated using the chi-square test. For each CpG island, they designed two sets of probes; one designed to detect no conversion at any of the CpG residues and another to distinguish the full conversion of CpGs to TpGs. They hypothesized that even with completely random CpG methylation patterns, only half of the CpG sites within a given probe would contribute a mismatch, and that was tolerated. However, the methods that were based on microarray hybridization enrichment were either laborious or required large volumes of DNA and were therefore not easily adapted to the massive screening of clinical samples. In the current study, we described a novel technique that combined sequence capture with bisulfite conversion and deep sequencing. In contrast to the strategy that Hodges et al. described [25], we utilized biotinylated RNA probes without bisulfite conversion to capture native genomic regions of interest. This adjustment in the experimental procedures allowed for a drastic decrease in the amount of DNA that was required for capture and more accurate measurements of methylation statuses. Previously, Hodges et al. used 20 μg of PCRamplified genomic DNA for a microarray-based capture experiment. In contrast, 2-3 μg of genomic DNA was used in the library construction for the present study, and 500 ng of the constructed library was sufficient for the hybridization capture and subsequent bisulfite treatment. To protect the captured library, 200 ng of sheared λDNA was added to act as carriers of the DNA that had been eluted from the liquid capture in the current bisulfite treatment process. Based on these modifications, we were not only able to facilitate the probe design process by not requiring any prior consideration of methylation sites but were also able to utilize small amounts of native, unamplified input DNA to examine the methylation statuses of specific regions by designing probes that covered different genomic regions. Recent investigations involving DNA methylation within different gene regions showed that exons are disproportionately enriched in densely methylated genomic elements [21,[26][27][28][29]. Hypermethylation that occurs downstream of the TSS and first exon is closely associated with transcriptional silencing, whereas methylation that occurs in the more downstream portions of the gene body is not [26,29]. In highly expressed genes, the promoter regions are hypomethylated while the rest of the gene-body regions are considerably methylated. However, weakly expressed genes are moderately methylated in both the promoter and gene-body regions. Thus, the function of intragenic DNA methylation is ambiguous. Previous research has indicated that intragenic DNA methylation, exonic nucleosomes and histone modifications may function together to regulate transcript splicing and gene expression [23,30,31].
Here, we profiled the human whole exome methylation statuses of two samples and obtained the same patterns as were previously detected using WGBS [26]. Using this set of probes, we covered 31% of the promoter regions and approximately 100% of the exons. Most of the regions that were not covered (Figure 2a) were distributed in chromosomes 6 and Y. Because we only calculated uniquely mapped reads, we determined that a large number of multiple mapping were present in the uncovered regions of chromosomes 6 and Y. For this reason, the data that was obtained using LHC-BS did not provide accurate methylation information for these regions (Additional File 1 Table S4). Additionally, to collect sufficient information, we generated massive amounts of data using excessive sequencing depths for the two samples, which reasonably led to the high duplication rate of 40.32% for the YH blood sample and 61.04% for the mDC cells. However, this value can be adjusted by reducing the amount of raw sequencing data in future studies. It has been reported that bisulfite conversion may result in allele dropouts at low DNA concentrations [32]. We calculated the allele loss rate to be 2.5%, indicating efficient allele enrichment.
We applied a commercially available liquid hybridization system for exon capture, for which all of the probes were designed using the Watson strand of the human genome. Therefore, only information regarding the methylation of the Crick strand was obtained in this study. However, DNA methylation occurs primarily at CpG dinucleotides in mammalian genomes [1,2,33]. For each round of DNA replication, DNMT1 DNA methyltransferase, which has a preference for hemimethylated DNA, fills in the missing methylation sites on the newly synthesized strand [34][35][36]. Thus, the presence of symmetric CpG methylation was presumed for the two DNA strands composing each human chromosome. For non-CpG methylation, which plays an important role in maintaining the pluripotency of stem cells [11], we may design probes in future studies that cover the doublestranded regions of interest.

Conclusions
In the present study, we applied a liquid hybridisation capture system to enrich for specific genomic regions, which was combined with bisulfite sequencing to examine the methylation statuses of the human exome for the first time. Two different samples were profiled, and the methylation patterns of the exons were compared with methylation patterns that had been obtained using WGBS. These comparisons demonstrated the high accuracy of methylation estimation that is possible using this novel LHC-BS method. In contrast with the array-based capture strategy [25], we captured target regions without pre-converting genomic DNA using a bisulfite treatment. Instead, we used 2-3 μg of genomic DNA to construct a library and 500 ng of them was then subjected to target regions capture and a following bisulfite treatment. This modification improved the efficacy of this platform for methylation profiling, especially with regard to its cost-effective use in the analysis of multiple clinical samples for biomarker detection.

Sample preparation and DNA library construction
Two types of genomic DNA were isolated from a peripheral blood sample that was obtained from the same individual whose genome and methylome had been previously sequenced (YH) [26,37] and from a mature human dendritic cell line (mDC). The total DNA from the peripheral blood sample was prepared using the QIAamp DNA Blood Mini Kit following the manufacturer's instructions, and the DNA from the mature human dendritic cells was isolated by proteinase K digestion and phenol/chloroform extraction.
Prior to the library construction, 2-3 μg of genomic DNA from each sample was fragmented using a Covarias sonication system to mean sizes of approximately 200 bp. After fragmentation, libraries were constructed according to the Illumina Paired-End protocol. Briefly, the purified, randomly fragmented DNA was treated with a mix of T4 DNA polymerase, Klenow fragments, T4 polynucleotide kinase and a nucleotide triphosphate mix to repair the ends by blunting and phosphorylation. The blunted DNA fragments were subsequently 3'-adenylated using the Klenow fragment (3'-5' exo-) and ligated by T4 DNA ligase to BGI-designed PE Index Adaptors that had been synthesized with 5'-methylcytosine in place of cytosine. After each step, the DNA was purified using the QIAquick PCR Purification Kit (Qiagen). The constructed libraries were stored at -20°C until the next step of hybridization.

Liquid hybridization-based exon capture
The SureSelect Human All Exon 38 Mb Kit was used for the hybridization process. The probes nearly covered all of the exons that were annotated by the CCDS in September of 2008. The constructed libraries for the capture were quantified using the Quant-iT dsDNA HS Assay Kit with the Invitrogen Qubit fluorometer. Five hundred nanograms of each library at concentrations of 147 ng/μl were required for the hybridizations as described in the protocol for the SureSelect Target Enrichment System for an Illumina Paired-End Sequencing Library. The hybridizations were performed using a thermal cycler that was set at 65°C for 24 hours with the lid heated to 105°C; then, Dynabeads ® M-280 Streptavidin (Invitrogen) was used to capture the biotinylated RNA probes that had hybridized with the fragments of the target regions. Finally, the captured library samples were washed and purified for the bisulfite conversion.

Bisulfite conversion and PCR amplification
The bisulfite conversion of the captured exon DNA was performed according to the instructions for the ZYMO EZ DNA Methylation-Gold Kit™ with some modifications. A total of 200 ng of fragmented λDNA were added to act as carriers for the elution products from the hybridization. After the samples were denatured and treated with sodium bisulfite, they were desulfonated, and the cytosine was converted to uracil. PCR was carried out in a final reaction volume of 50 μl that included 10 μl eluted bisulfite conversion products, 4 μl 2.5 mM dNTP, 5 μl 10× buffer, 0.5 μl JumpStart™ Taq DNA Polymerase, 2 μl PCR primers and 28.5 μl water. The thermal cycling program was as follows: 94°C for 1 min, 18 cycles of 94°C for 10 s, 58°C for 30 s, 72°C for 30 s and a final 5 min incubation at 72°C. The temperature was held at 12°C following the termination of the cycling program. PCR products that ranged in size from 200 to 400 bp were selected and gel purified using the QIAquick Gel Extraction Kit (Qiagen). They were then analyzed by the Bioanalyser Analysis System (Agilent, Santa Clara, USA) and quantified using real time PCR. According to the real time PCR measurements, two PCR products of distinct samples were pooled on one lane and sequenced by the Illumina Hiseq2000 using 90 bp paired-end sequencing reads.

Bisulfite sequencing of specific regions
Specific regions of bisulfite-treated genomic DNA from each of the samples were PCR amplified, and the products were cloned and sequenced using conventional Sanger sequencing. Briefly, 500 ng of genomic DNA from each sample were converted according to the ZYMO EZ DNA Methylation-Gold Kit™ manufacturer's instructions. PCR primers were designed using the online MethPrimer software http://www.urogene.org/ methprimer/index.html. The PCR primer information is listed in Additional File 1 Table S5. These primers were designed to recognize regions that lack CpG sites to avoid any amplification bias that may have occurred due to the differences between the methylated and unmethylated sequences. Thermal cycling was performed as follows: 94°C for 1 min, 30 cycles of 94°C for 10 s, 58°C for 30 s, 72°C for 30 s and a final 5 min incubation at 72°C. The temperature was held at 12°C following the termination of the cycling program. The PCR products were purified using the QIAquick Gel Extraction Kit (Qiagen). The purified PCR products were subcloned, and colonies from each region were sequenced using the 3730 Genetic Analyzer (Applied Biosystems) to assess the methylated cytosine levels.

Bioinformatic analysis
The captured fragments were sequenced using an Illumina Hiseq2000 sequencer that generated raw data in a 90 bp paired-end Fastq format. The data were processed using the Illumina base-calling pipeline. Subsequently, all reads were aligned to the hg18 reference genome using the SOAP2.01 aligner [38]. Briefly, we used the human reference genome to derive the soap libraries for the assessment of the mapping results of the bisulfite treatment. In the mapping process, the seed of a read was 30 bp, and 2 mismatches were permitted. Only mismatches of 5 bp were allowed in the total read. Methylcytosines were identified according to a previously published strategy [26]. To reduce any bias that resulted from the PCR amplification, we chose optimal reads that possessed fewer gaps and mismatches when more than one read was mapped to the same position. Additionally, we preferred the paired-end mapped reads. Methylation levels were calculated using the following formula for certain cytosine types: mC reads/(C reads +mC reads) *100%. Because not all of the unmethylated cytosines could be converted to uracils by the bisulfite treatment, we estimated the non-conversion rate using the methylation rate of non-CpG sites.
To validate the results of the LHC-BS, Pearson's correlation analysis were utilized to conduct pair-wise comparisons between datasets that were derived from the LHC-BS and WGBS assays. For the BS-PCR validation analysis, the BLAST algorithm was applied to align the sequences (expect value = 1E-10). The Fisher's test was used to statistically define the significances of the differences between the methylation information that was generated by LHC-BS and BS-PCR (P value < 0.01).
Three parameters were applied to identify the differentially methylated regions between the mDC cell line and YH blood sample. First, we summed all of the methylated and unmethylated nucleotides in the CpG sites for each region of the two samples, respectively. The Fisher's test was used to analyze the methylation differences for each target region using a P value < 0.01. Second, we separated out the regions that showed over two-fold increases in methylation levels between samples. Third, considering the non-uniformity of the methylation levels, we selected regions with differences in methylation of over 20% between the two samples. The obtained differentially methylated regions were then examined using the BGI WEGO software to determine the functional information for their respective genes.