Investigating the relationship of DNA methylation with mutation rate and allele frequency in the human genome
© Xia et al.; licensee BioMed Central Ltd. 2012
Published: 17 December 2012
Skip to main content
© Xia et al.; licensee BioMed Central Ltd. 2012
Published: 17 December 2012
DNA methylation, which mainly occurs at CpG dinucleotides, is a dynamic epigenetic regulation mechanism in most eukaryotic genomes. It is already known that methylated CpG dinucleotides can lead to a high rate of C to T mutation at these sites. However, less is known about whether and how the methylation level causes a different mutation rate, especially at the single-base resolution.
In this study, we used genome-wide single-base resolution methylation data to perform a comprehensive analysis of the mutation rate of methylated cytosines from human embryonic stem cell. Through the analysis of the density of single nucleotide polymorphisms, we first confirmed that the mutation rate in methylated CpG sites is greater than that in unmethylated CpG sites. Then, we showed that among methylated CpG sites, the mutation rate is markedly increased in low-intermediately (20-40% methylation level) to intermediately methylated CpG sites (40-60% methylation level) of the human genome. This mutation pattern was observed regardless of DNA strand direction and the sequence coverage over the site on which the methylation level was calculated. Moreover, this highly non-random mutation pattern was found more apparent in intergenic and intronic regions than in promoter regions and CpG islands. Our investigation suggested this pattern appears primarily in autosomes rather than sex chromosomes. Further analysis based on human-chimpanzee divergence confirmed these observations. Finally, we observed a significant correlation between the methylation level and cytosine allele frequency.
Our results showed a high mutation rate in low-intermediately to intermediately methylated CpG sites at different scales, from the categorized genomic region, whole chromosome, to the whole genome level, thereby providing the first supporting evidence of mutation rate variation at human methylated CpG sites using the genome-wide sing-base resolution methylation data.
Germline mutation can profoundly affect genetic variation within populations and divergence between species. The identification of factors that modulate mutation rates in germ cells, particularly in humans, has been a long-standing scientific pursuit . Several factors are thought to be related to mutation rate, including DNA replication timing , genetic recombination , and categorized genomic regions . The mutation rate is also known to vary throughout the genome in a contextual fashion , the most well-studied example being the influence of CpG dinucleotides in mammalian genomes. CpGs are generally methylated in mammalian genomes, and since the deamination rates of methylated cytosine is 2.0 to 3.2 fold more than unmethylated cytosine , it leads to a 10- to 50-fold higher transition rate of methylated CpG to TpG than other transitional changes [6, 7]. This rate difference in turn causes the suppression of CpG sites in the mammalian genomes [8, 9]. However, whether and how a variation in methylation level causes the variation in the high mutation rate is still unknown.
Based on the analysis of single nucleotide polymorphisms (SNPs), Fryxell and Moon  have concluded that methylation-dependent transition rates are exponentially dependent on local GC content. Our re-examined studies further indicated that these methylation-dependent transition rates in the human genome were dependent on both the local sequence length and the genomic region . In a recent study of CpG mutability in intronic regions of the human genome , CpG substitution rates were observed to be significantly correlated with the extent of methylation level. In the same study, they also found that the CpG substitute rate is positively correlated with non-CpG divergence and negatively correlated with GC content . However, their conclusion is based on MeDIP methylation data, which is calculated in a low resolution for a 100-bp (base pair) window size . Moreover, although their results showed that the methylation level affected the mutation rate, their investigation limited in the intronic regions only, and the mutation pattern in other genomic regions remained unclear.
More recently, next-generation sequencing has provided single-base resolution DNA methylome in the human genome , thus making more comprehensive analysis possible [14–17]. The availability of whole-genome methylation maps with single-nucleotide resolution provides an opportunity to refine studies pertaining to CpG mutation rate variation. Compared with array data, the advantage of sequencing data can not only be used to estimate methylation broadness but also be used to analyze methylation deepness for each single base .
Here, we used genome-wide, single-base-resolution maps of methylation data  in human embryonic stem cell line (H1) to perform an extensive investigation of the correlation between the level of cytosine methylation and CpG mutability of the human genome. In addition to the methylation on CpG sites, we also examined the effect of methylation level on non-CpG sites (e.g., CHG/CHH, where H = A, C, or T) sites. We observed that among methylated CpG sites, the mutation rate, as reflected by the human SNP density, is markedly increased in low-intermediately to intermediately methylated CpG sites at different scales, from the categorized genomic region, whole chromosome, to the whole genome level. This observation was also confirmed by human-chimpanzee divergence data. Finally, we found a significant correlation between methylation level and cytosine allele frequency in the human genome.
The single-base resolution DNA methylation data were derived from Lister et al. , including whole genome bisulfite sequencing data for two human cell lines: H1 human embryonic stem cell and IMR90 fetal lung fibroblast. We removed all methylcytosines that could not be mapped to reference CpG sites (NCBI Build 36) in our analysis. For the remaining mapped methylcytosines, we used methylation deepness  to evaluate the methylation level in the CpG site, which represents the extent of the methylation level of the cytosines based on the ratio of the number of reads having methylated cytosine over the total number of reads mapped to each reference CpG site. CpG sites can be classified as methylated CpG sites if there is at least one read covering methylated cytosine (methylation level > 0%). Otherwise, they are classified as the unmethylated CpG sites, i.e. methylation level = 0%. Methylated CpG sites are further categorized into five groups of increasing methylation levels: lowly methylated (less than or equal to 20%), low-intermediately methylated (20-40%), intermediately methylated (40-60%), high-intermediately methylated (60-80%), and highly methylated (greater than 80%).
The human reference genome sequence (Build 36) was downloaded from NCBI website . Genomic annotations, including the start and end positions of genes and exons, were defined based on the ENSEMBL database . Promoter was defined as an interval of -1500 to +500 bp around the transcriptional start site (TSS) as previously described. A genic region was defined as the transcribed region from the start to the end of transcription sites. An intergenic region was defined as a region in which no known gene was annotated. Similarly, an intronic region was defined as the region in which no known exon was annotated . According to our previous comparisons [21, 22], we used Takai and Jones' algorithm  to identify CpG islands (hereafter, we abbreviated as CGIs): length ≥ 500 bp, GC content ≥ 55%, and CpG observed/expected (O/E) ratio ≥ 0.65.
To estimate the mutation rates, we followed the approach in previous studies [2, 24] to use the data from both human SNP density and human-chimpanzee divergence. We downloaded the human SNPs data from the NCBI dbSNP database (build 130) website  and only considered the validated reference SNPs . A total of 16,969,034 validated reference SNPs were available (with reference SNP ID and reference annotation). Among them, 12,236,967 were bi-allelic and uniquely mapped in the genome, which were used for our further analysis. The SNP density is estimated by computing the total number of C/A, C/G, and C/T SNPs over the total number of cytosine bases in the human genome.
We estimated the divergence of CpGs between the human and chimpanzee genomes based on the human and chimpanzee alignment. We downloaded alignments of the hg18 version of the human genome and PanTro2 version of the chimpanzee genome from UCSC Genome Browser . The divergence is calculated as the proportion of different sites between human and chimpanzee genomes among the total sites that could be aligned between human and chimpanzee genomes.
For human allele frequency data, we used CEU (Utah residents with Northern and Western European ancestry from the CEPH collection) population data that we downloaded from UCSC Genome Browser . This data set also includes chimpanzee nucleotide information for each allele set when available; thus, we could determine which allele was the ancestral one based on our previous study .
We compared the mutation rate of methylated CpG sites in H1 and IMR90 cell lines (see additional file 1). We found that the mutation pattern in IMR90 cell line is overall similar to that in H1 cell line, though it is not as evident as in H1 cell line. One possible reason is that the H1 cell line is derived from human embryonic stem cells while the IMR90 cell line is originated from human fetal lung fibroblasts. It has been suggested that stem cells have more opportunity to accumulate mutations compared with mature cells . Therefore, the mutation pattern shows stronger difference in CpG sites with 20-60% methylation level in H1 cell line than IMR90 cell line.
To examine whether the strand bias affected the mutation pattern, we further analyzed the data in the minus strand to compensate for the plus strand pattern observed above. We observed the same pattern when the methylation data of the minus strand was analyzed (see additional file 2). This observation is consistent with the previous report that the DNA methylation levels of two strands in any given genomic region are highly symmetric .
The methylation profile we used in this study was generated by single-base resolution bisulfite sequencing technology. According to this technology, the more reads that cover the CpG site, the higher confidence we can obtain regarding the base it detects. Therefore, we checked whether our observation was affected by the read coverage. We found an overall similar pattern regardless of the cut-off values of the coverage we used (see additional file 3). Particularly for CpG sites with low-intermediate and intermediate levels of methylation, we found an even higher mutation rate with higher coverage. For example, we observed that the mutation rate increased to 15.26% when intermediately methylated CpG sites were covered by at least 10 sequencing reads compared with 11.00% when they were covered by at least 5 reads. This finding suggests that our observation that CpG sites with low-intermediate and intermediate levels of methylation actually had a much higher mutation rate is not affected by the sequence coverage of methylated sites. With the increased coverage, our observations could be even more apparent.
A similar mutation pattern could not be observed in the X chromosome due to an overall low mutation rate in this sex chromosome. We only found a slightly higher mutation rate in the CpG sites with a lowly methylated level (0-20%) (Figure 3).
The H1 cell line has abundant non-CpG methylation (mCHG and mCHH, where H = A, C or T) that accounted for nearly 25% of all methylated cytosines. Therefore, we investigated the mutation rate in these methylated non-CpG sites. A summary of the mutation rate from mCHG and mCHH sites was provided in additional file 5. We found that the mutation rates in methylated non-CpG sites were low across the different methylation levels in these sites (0.38-0.42% for mCHG, and 0.39-0.54% for mCHH); thus, we did not observe the pattern as in the methylated CpG sites.
To further test whether the methyltion level depends on the allele frequency information, we examined the methylation level in CpG sites with different allele frequencies. First, we compared the methylation level when the major allele is cytosine with that when it is not. The methylation level was higher with cytosine as the major allele than with other bases as the major allele (0.798 vs. 0.605, P < 2.2E-16). The methylation level was also higher when the ancestral allele is cytosine than other bases (0.792 vs. 0.703, P < 2.2E-16). The directionality observed in these values is consistent with the fact that deamination of methylated cytosine results in C → T mutations that likely remain unrepaired and, therefore, become polymorphisms. For those alleles with cytosine as the ancestral allele, we further categorized them as to whether their current major allele is cytosine or other base. We found that for those alleles with cytosine as the ancestor, their methylation level significantly increased when cytosine became the current major allele relative to another base as the current major allele (0.810 vs. 0.610, P < 2.2E-16). The increased methylation level of CpG sites with cytosine as both the ancestral allele and the current major allele further suggest that these methylated CpGs are representative of hypermutable sites. Taking together, these results indicated that the subset of methylated CpGs in the derived allele dataset is not a random subset of the genome. The observed overrepresentation of methylated CpGs in all data sets suggests that the methylated CpGs variable is indicative of the previously known hypermutability of methylated cytosines.
Previous studies have suggested that methylated CpG sites correlate strongly to a high rate of mutation. However, whether and how the higher methylation level causes the higher mutation rate is largely unknown. Our analysis showed that CpG sites with low-intermediate to intermediate levels of methylation (20-60%) actually had a much higher mutation rate as reflected in recent human nucleotide diversity as compared with CpG sites with other methylation levels at different scales, including the categorized genomic regions, whole chromosome, and the whole genome. Consistently, we observed a similar pattern by using human-chimpanzee divergence data.
We further examined variation in the mutation rate among chromosomes. Within autosomal chromosomes, we observed that CpG sites with low-intermediate and intermediate levels of methylation had a higher mutation rate compared with CpG sites with other methylation levels. We found that autosomes with low gene density tend to have an overall higher mutation rate. Because of male-driven evolution  in humans, mutation rates are typically highest on the Y chromosome, intermediate on autosomes, and lowest on the X chromosome. Indeed, the mutation rate was higher on the CpG sites in autosomes than that in the X chromosome. However, we could not estimate the mutation rate in the Y chromosome due to the overall low quality data. Further investigation is necessary when high quality data in this sex chromosome is available.
In summary, our study found a significant correlation between the extent of germ line methylation and the mutation rate at human CpG sites. Specifically, we found that there is a high mutation rate in low-intermediately to intermediately methylated CpG sites at different scales, from the categorized genomic region, whole chromosome, to the whole genome level. Mutation rate has been found to be correlated with proximal DNA methylation patterns [11, 28], local GC content, local sequence lengths and genomic regions, local or regional nucleotide composition, crossover rate, distance to telomeres, chromatin compaction , as well as the replication timing . Our analysis provides the first supportive evidence regarding the mutation rate variation at human methylated CpG sites using genome-wide single base resolution methylation data. However, further integrative investigation of genetic factors on the mutation rate and methylation level is warranted. Finally, we have shown that there is a significant correlation between methylation level and cytosine allele frequency in the human genome.
We thank members of the Bioinformatics and Systems Medicine Laboratory for their useful discussion and suggestions. We also thank Rebecca Hiller Posey for her assistance with polishing the manuscript.
This article has been published as part of BMC Genomics Volume 13 Supplement 8, 2012: Proceedings of The International Conference on Intelligent Biology and Medicine (ICIBM): Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/13/S8.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.