Epigenetic and genetic alterations and their influence on gene regulation in chronic lymphocytic leukemia
© The Author(s). 2017
Received: 13 October 2016
Accepted: 10 March 2017
Published: 16 March 2017
To understand the changes of gene regulation in carcinogenesis, we explored signals of DNA methylation – a stable epigenetic mark of gene regulatory elements — and designed a computational model to profile loss and gain of regulatory elements (REs) during carcinogenesis. We also utilized sequencing data to analyze the allele frequency of single nucleotide polymorphisms (SNPs) and detected the cancer-associated SNPs, i.e., the SNPs displaying the significant allele frequency difference between cancer and normal samples.
After applying this model to chronic lymphocytic leukemia (CLL) data, we identified REs differentially activated (dREs) between normal and CLL cells, consisting of 6,802 dREs gained and 4,606 dREs lost in CLL. The identified regulatory perturbations coincide with changes in the expression of target genes. In particular, the genes encoding DNA methyltransferases harbor multiple lost-in-cancer dREs and zero gained-in-cancer dREs, indicating that the damaged regulation of these genes might be one of the key causes of tumor formation. dREs display a significantly elevated density of the genome-wide association study (GWAS) SNPs associated with CLL and CLL-related traits. We observed that most of dRE GWAS SNPs associated with CLL and CLL-related traits (83%) display a significant haplotype association among the identified cancer-associated alleles and the risk alleles that have been reported in GWAS. Also dREs are enriched for the binding sites of the well-established B-cell and CLL transcription factors (TFs) NF-kB, AP2, P53, E2F1, PAX5, and SP1. We also identified CLL-associated SNPs and demonstrated that the mutations at these SNPs change the binding sites of key TFs much more frequently than expected.
Through exploring sequencing data measuring DNA methylation, we identified the epigenetic alterations (more specifically, DNA methylation) and genetic mutations along non-coding genomic regions CLL, and demonstrated that these changes play a critical role in carcinogenesis through damaging the regulation of key genes and alternating the binding of key TFs in B and CLL cells.
KeywordsDNA methylation Regulatory elements Genetic mutation Transcription factor binding site Genome-wide association study
Cancer, a leading cause of death worldwide , is a major focus of biological and clinical research. Dramatic phenotypic alterations in cancer cells have often been attributed to gene mutation and gene regulatory variation . In the last decade, evidence has been accumulating that the malfunction of gene regulatory elements, such as promoters, enhancers, etc., makes a substantial contribution to cancer initiation and progression. For example, the promoter inactivation of von Hippel-Lindau (VHL), leading to the silencing of this gene, has been reported as a biomarker of renal cancer . Similarly, in many cancers, the transcription of cyclin-dependent kinase inhibitor 2A (CDKN2A), an important tumor suppressor gene, has been found to be terminated after the chromatin blocking of its promoter region . Also, the disruption of super-enhancers plays a key role in inhibiting the oncogene MYC in multiple myeloma . More recently, aberrant DNA methylation along super-enhancers has been reported in a broad spectrum of cancers, such as breast, colon, lung cancer .
To delineate the activity alteration of regulatory elements (REs) during carcinogenesis, the signals of epigenetic marks are commonly measured and compared between cancer and normal cells . DNA methylation, predominantly occurring at the 5’ position of the cytosine in CpG dinucleotides, is a stable epigenetic mark that can be combined with other epigenetic modifiers, such as Histone 3 lysine 4 trimethylation (H3K4me3), for defining the function of the DNA. DNA methylation preliminarily affects the activity of regulatory elements, prompting research into how DNA methylation alters gene regulation. Since the original report in 1983 that DNA methylation is substantially decreased in tumor tissues, aberrant DNA methylation has been well-established as a signature in cancer [8–10]. Global hypomethylation of repetitive DNAs elements has been found to be responsible for promoting multiple cancers, such as inducing the overexpression of oncogenes in leukemia cells , silencing the tumor suppressor genes in colorectal cancer , and enhancing the chromatin instability in lymphoma .
With the knowledge that de-methylation is strongly correlated with activation of regulatory elements , we developed a computational model, in which a genome-wide methylation profile was analyzed to map REs in cancer and normal cells. The comparison between these RE maps in turn established differentially-activated REs (dREs), including dREs gained and lost during cancer development. We tested this model on chronic lymphocytic leukemia (CLL), due to its relatively abundant data resources, and observed that the gained and lost dREs were enriched in the neighborhood of up- and down-regulated genes during CLL carcinogenesis. The genes encoding transcription repressors and DNA methyltransferases have multiple lost dREs in their loci, suggesting an important role for these genes in maintaining normal B-cells and initiating CLL development. Also, dREs are enriched for the GWAS SNPs associated with CLL or, more broadly, cancer traits. CLL genetic mutations, i.e., the substitution of wild - type alleles with CLL-susceptible alleles, are associated with a change in binding of major B-cell TFs. In this study, we identified epigenetic and genetic changes during carcinogenesis and evaluated the impact of these changes on gene regulation.
Data processing of reduced representative bisulfite sequencing (RRBS)-seq profiling
We analyzed the genome-wide methylation profiles from 32 B cells of 32 chronic lymphocytic leukemia (CLL) patients and 10 normal CD19+ B cells  (which have been deposited to Gene Expression Omnibus GSE66121 by the authors of the referenced study). Methylation levels of CpG sites were measured using reduced representative bisulfite sequencing (RRBS)-seq.
We downloaded the raw RRBS-seq reads to establish the methylation profiles and detect the genetic mutations in CLL. We established a workflow to analyze these raw sequence data (Additional file 1: Figure S1). Bismark  coupled with Bowtie2  was used to align the raw reads to the human genome with the settings “-q --phred64-quals -n 1 -l 40”. The alignment results, i.e., one sam file per sample, were transformed into bam files using the samtools (“samtools view -bT”) . The bam files were used as input to BisSNP  to calculate the methylation levels of CpG sites and to call genotypes. The parameters for BisSNP were set as default, i.e., -maxQ 40, -stand_call_conf 8, -stand_emit_conf 0, -mmq 30, -mbq 0. Only SNPs with the minor allele frequency (MAF) > 0.01 in 1000 Genomes Project  were used to run BisSNP.
Mapping consensus dREs in a sample class
where r ik is the number of the reads at the site k from the sample i. w i , the weight of the sample i, is determined as the reciprocal of the total number of the aligned reads in the sample i. After replacing r ik with mr ik , the number of the methylated reads, in eq. (1), we obtained the combined number of methylated reads at k. After collecting these numbers, we had a combined methylation profile for each tested sample class. We then applied MethylSeekR  to each combined methylation profile with the setting of chr.sel = chr2, meth.cutoff = 0.5 and nCpG.cutoff = 3. At the end, we established a map of consensus dREs, together with sREs and hiMRs, in each sample class.
To categorize dREs based on their genomic location, we employed the annotatePeaks.pl script from HOMER with default settings. The obtained gained dREs, lost dREs and sREs, with average lengths of 660, 814, and 1094 bp, have the average CpG density of 2.4, 3.9 and 5.6 CpGs per 100 bp, respectively.
Hierarchal clustering and PCA
After filtering out CpGs with less than five aligned reads, we used the classic hierarchal clustering algorithm to analyze the similarity of methylation profiles from different samples (32 CLL cells and 10 control B-cells). For this purpose, we employed the MATLAB function “linkage” to build a hierarchal clustering tree using the distance of “Euclidean” and the method of “ward.” The constructed tree was then visualized using the MATLAB function “dendrogram” with the default settings.
We also used principal component analysis (PCA) to visualize the distribution of samples. PCA was conducted by using the MATLAB function “princomp” with the default settings.
Alignment of human and mouse genomes
To map genomic regions (i.e., dREs, sREs and hiMRs) from the human to the mouse genome, we used the software “bnMapper” (available at https://bitbucket.org/james_taylor/bx-python/wiki/bnMapper). The pair-wise genome alignment (chain file) between the mouse and human required by bnMapper was downloaded from the UCSC Genome browser. bnMapper was run with the setting “--gap 20 --threshold 0.1”. A human genomic region was considered as conserved between human and mouse when the aligned sequence was longer than 20 bps.
Repeat composition along dREs
We used the repeat tables downloaded from the UCSC Genome browser to investigate the repeat content of dREs. Given a group of dREs, the fraction of these regions covered by repetitive elements was calculated. Similarly, the repeat composition of sREs and hiMRs was estimated and was used as a baseline to evaluate the enrichment of repeats in dREs.
Enrichment of dREs in loci of genes differentially expressed in CLL
The RNA-seq profiles of the 32 CLL samples, together with five normal B-cell samples (of which two samples were also included in the methylation data), were downloaded from Gene Expression Omnibus (accession number GSE66117). To avoid unreliable RNA-seq measurements, we filtered out genes with very low expression, i.e., those for which the average expression was less than 0.1 in either CLL or normal B-cells. For each of the remaining genes, the fold change of its expression in CLL was then calculated as the ratio of the average expression in CLL to that in normal B-cells. Ranking the genes based on their expression fold-change, we identified genes up- and down-regulated in CLL by selecting a percentile of top differentially regulated genes.
Following a general rule, we assigned a genomic region (either RE or hiMR) to the gene with the closest transcription start site (TSS). Given gained (or lost) dREs and a group of genes (say R and G, respectively), we identified dREs linked to any given gene, and calculated the fraction of these dREs from all dREs associated with the genes having reliable RNA-seq measurements (denoted as fract(R, G)). Using the sREs (represented by S) as background, we evaluated the enrichment of G in the surrounding of R as the ratio of fract(R, G) to fract(S, G). The significance of this enrichment was measured under a binomial test.
Functional analysis of dREs and genes
We used the Genomic Regions Enrichment of Annotations Tool (GREAT, available at http://bejerano.stanford.edu/great/public/html/)  to examine the function of dREs with the whole human genome as the background. Also, the Database for Annotation, Visualization and Integrated Discovery (DAVID, available at https://david.ncifcrf.gov/)  was used to estimate the function of a set of genes with the whole list of human genes as the background.
Distribution analysis of dREs
Given a class of dREs, we calculated the distance from each dRE to its nearest within-class neighbor and then computed the distribution of these distances. Through randomly shuffling class labels among dREs, sREs and hiMRs, we generated a background class and assessed the distribution of within-class distances in the background class. We generated 1,000 background classes independently and used the average of their within-class distributions as background for statistical analysis. Similarly, we built the distribution of cross-class distances of gained dREs to their nearest lost dREs and compared this distribution with the background estimated the same way as in the case for within-class computations.
The bimodal distribution of within-class distances among lost dREs (Additional file 1: Figure S2) implies that parts of lost dREs are clustered close to each other (the distance of <10 kb). To investigate the function of these lost dREs, we identified the lost dREs with the distances to their nearest within-class neighbor less than 10 kb. We used GREAT to evaluate the function of these lost dREs (see Functional analysis of dREs).
GWAS analysis of dREs
We downloaded the NHGRI GWAS Catalog in April 2015 . For each GWAS SNP, we identified all SNPs in a tight LD with the GWAS SNP (r 2 > 0.8 and distance < 500 kb) based on at least one population from the 1000 Genomes Project (CEU, YRI and CHB/JPN) by using SNAP . After that, we linked these tight-LD SNPs to the corresponding traits. At the end, we had 1,759 GWAS traits associated with 324,454 SNPs.
Considering that 54% of the traits are linked to a small number of SNPs, i.e., less than five tagged SNPs, we agglomerated similar traits together to obtain reliable statistical results. For example, we identified the GWAS SNPs associated with the lymphoma traits (but not CLL), and marked them as “lymphoma”. Similarly, we built SNP categories for “CLL” and “cancer” due to their immediate and close relevance with CLL. Finally, the GWAS SNPs not included in these categories were marked as “irrelevant” and were used as the baseline of our statistical analysis.
To evaluate the association between a class of dREs (e.g., gained dREs) and a GWAS trait, we identified all SNPs from the 1000 Genomes Project in gained dREs. After that, we counted among these SNPs the ones that have been associated with a given trait. This count measures the overlap between the given GWAS trait and gained dREs. To examine the significance of this count, we adopted a permutation strategy. We randomly shuffled class labels among dREs, sREs and hiMRs, and counted the SNPs linked to the tested trait in the randomly-labeled gained dREs. After repeating this process 1,000 times, we examined the probability of randomly-labeled gained dREs displaying a higher number of given-trait-associated SNPs than the gained dREs. This probability measures the significance of the association between the gained dREs and the tested given trait.
Identification of SNPs and their alleles associated with CLL
We investigated RRBS reads at SNP positions. For those SNPs that were not polymorphic in the set of RRBS reads, we dubbed them non-assayed if they overlapped with less than 10 reads or non-mutated otherwise.
where n k,c is the occurrence count of k in the CLL samples, and n c is the summation of the occurrence count of all alleles in the CLL samples. p k,n is the frequency of k in the control samples. We used the MATLAB function “binocdf” for this calculation. We also examined the significance of each diploid genotype state in CLL samples with reference to controls. The minimum of the p values (i.e., Pr s) of the alleles and genotype states measures the significance of genotypic difference between CLL and control. The nucleotide positions having minimum of Prs < 0.05 were marked CLL-associated SNPs. In this study, we detected 305 and 186 CLL-associated SNPs located in lost and gained dREs, respectively. Furthermore, for a CLL-associated SNP, the allele enriched in CLL was considered as the CLL-associated allele.
Haplotype association between alleles
TFBS representation and enrichment along dREs
We used the TRANScription FACtor (TRANSFAC) version 2010.3  and JASPAR  databases of TFBS. We scanned dREs sequences using position weight matrices (PWMs) from these two databases using Find Individual Motif Occurrences (FIMO)  with the default settings.
Given a dREs, we randomly sampled the human genome to obtain 10 control sequences with matching GC content, repeat density, and sequence length. TFBS enrichment in the dREs was calculated as the ratio of a TFBS density in dREs to counterpart in control sequences.
Binding affinity changes at CLL-associated allele substitution positions
where p t is Fract(lost|t) in controls.
where p t is Fract(gained|t) in the background positions generated using the strategy for lost dRE SNPs.
Methylation of non-promoter CpG sites is informative for distinguishing CLL from control
Next, we directly compared the CpG methylation profiles in CLL with those of normal B-cells. To control the noise introduced during data generation and processing, we focused on the CpGs having a considerable methylation change (i.e., the difference of methylation level between CLL and normal cells is greater than 0.3), denoted as methyl-change CpGs. Thus, every methyl-change CpG is either highly methylated in normal cells but not CLL (we refer to this class as de-methylated in CLL) or, in the opposite case, highly methylated in CLL but not in normal (we refer to those as highly methylated in CLL). After focusing on regulatory elements (i.e., non-promoter, intronic and intergenic elements), we observed that about 70% of non-promoter methyl-change CpGs are de-methylated in CLL (69.1% and 69.7% for intergenic and intronic, respectively), significantly higher than that in promoters (33.7%, binomial test p < 10− 100 intergenic/introns vs. promoters, Fig. 1b), which demonstrates that non-promoter CpGs predominantly lose methylation, while promoter CpGs become predominantly methylated. This is in accordance with the report that de-methylation is widespread in intergenic and intronic regions in cancer cells . Promoters display the smallest fraction of methyl-change CpGs among all genomic regions (Fig. 1b), reflecting that the promoters are more likely to become methylated than other genomic regions in carcinogenesis .
We applied a classic hierarchical clustering algorithm to the CpGs methylation profiles in regulatory elements (see Methods). Using the methylation levels of CpGs located in non-promoter gene regulatory loci, all CLL samples, being clustered into a homogeneous group, were distinguished from normal samples (Fig. 1c). On the other hand, the methylation signals of promoter CpGs could be used to cluster the majority but not all CLL samples together (Fig. 1c). These findings are further supported when using principal component analysis (PCA) to visualize the distribution of CLL and control samples (see Methods, Additional file 1: Figure S6). Collectively, the nucleobase-resolution methylation profiles of CpGs in gene regulatory elements (including promoter and non-promoter elements) contain sufficient information to discriminate CLL from control samples. Especially, non-promoter CpGs are capable of distinguishing two sample classes better than promoter CpGs (Fig 1c and Additional file 1: Figure S6).
Mapping consensus REs in CLL and control
As methylation is highly correlated over short genomic distances and the methylation change at individual CpG sites correlates with chromatin accessibility and transcription factor association of the flanking context , our next step was to expand base-resolution methylation levels to local methylation states. By connecting multiple adjacent CpGs (i.e., at least three CpGs in this study) with similar methylation levels , we identified the de-methylated regions and marked those located in gene regulatory loci as activated REs (see Methods and Additional file 1: Figure S1). Our assumption was that methylation change corresponds to the change in the activity of a RE—as methylated REs are likely inactive, long spans of CpG de-methylation in CLL likely correspond to REs that have been inactive in normal cells, but have been activated in CLL. Throughout the rest of the manuscript, we use these differential methylation data in reference to REs that are active in either normal or CLL cells.
dREs occupy non-promoter regions (i.e., intronic and intergenic genomic loci) more often than sREs (88% gained and 79% lost dREs vs. 60% sREs, binomial test p < 10− 100, Fig. 2b), which is in line with the preceding finding that non-promoter CpGs display larger methylation changes than promoter CpGs (Fig. 1b), and suggests a pronounced role of enhancer changes during CLL development.
In addition, gained dREs in 22.6% of promoters and 3.6% of non-promoters overlap CpG islands (CGIs), which is significantly lower than their lost dRE and sRE counterparts (44% and 58.7% of promoters lost dRE and sRE, respectively, and 18% and 28.7% of non-promoters lost dRE and sRE, respectively; p < 10− 100, Additional file 1: Figure S8, see Methods). The significant depletion of CpG islands (CGIs) along the gained dREs coincides with the report that DNA methylation in tumors is higher within CGIs but is lower outside of CGIs .
dREs of different categories show distinct evolutionary conservation
We assessed evolutionary conservation of dREs by aligning their human and mouse counterparts (see Methods). First of all, more than half of sREs and dREs have mouse orthologues (65% of sREs, 60.9% of lost dREs and 52.3% of gained dREs), which is significantly higher than that of hiMRs (40.8%, binomial test p < 10− 100, Fig. 2c). This elevated evolutionary conservation is suggestive of molecular maintenance of dRE and sRE functionality. Moreover, dREs show higher sequence divergence than sREs (60.9% and 52.3% v.s. 65%, binomial test p < 10− 100), indicating the propensity of functional change of dREs during CLL development.
In addition, forty-four percent of hiMR sequence nucleotides are DNA sequence repeats, which is consistent with DNA repeats and repeat-derived regions spanning about half of the human genome  (Fig. 2c and Additional file 1: Figure S9). The low repeat density of dREs and sREs (17.65%, 22.14%, and 36.15% in sRE, lost dRE, and gained dRE sequences, respectively) is in agreement with a previous observation of decreased repeat content in regulatory elements  and correlates with their elevated evolutionary conservation. As compared with sREs and lost dREs, the gained dREs show the higher content of all classes of retrotransposons (Additional file 1: Figure S9, S10 and Additional file 2: Table S1), which supports the implication of retrotransposons in cancer initiation .
Gain and loss of dREs positively correlate with the change of target gene expression
Next, we binned human genes according to the presence of dREs and sREs in their neighborhood (see Methods). About 69% of the genes associated with the gained dREs (2,626/3,784) also harbor one or more sREs in their loci, while only 50% of the lost dRE genes (1,464/2,905) host sREs (Additional file 1: Figure S11). The genes linked to either the gained dREs or sREs or both are enriched with the genes participating in apoptosis, cell death and immunological process. All these biological processes are activated in normal B-cells and are impaired in cancer cells . Also, genes exclusively linked to the lost dREs play a role in cell motility (Fig. 3c and Additional file 2: Table S3), and abnormal motility has been found in CLL cells . Furthermore, the observation that the genes associated with both the gained dREs and sREs have a function in T cell differentiation and activation (Fig. 3c) partially explains the finding that T cell numbers are increased in most patients with CLL .
Lost dREs cluster near the genes encoding DNA methyltransferases and transcription repressors
The analysis of the distribution of dREs revealed that the distances between two neighboring gained dREs (i.e., within-class distances between gained dREs) are significantly shorter than expected (Wilcoxon rank-sum test p < 10− 16, Additional file 1: Figure S3, see Methods). Similarly, the within-class distances of the lost dREs are much smaller than expected (p < 10− 16). The cross-class distance (i.e., the distance of a gained dRE to its nearest lost dRE) is longer than expected (p < 10− 16, Additional file 1: Figure S3). These findings show that dREs having the same activity likely cluster together, suggesting that the change of DNA activation occurs selectively, rather than randomly, along the human genome during CLL development. That is, certain genomic regions are subject to become activated (e.g., methylation decrease), while others tend to be de-activated (e.g., methylation increase).
Functional analysis of multi-lost dREs with respect to all lost dREs
Number of elements
oxidoreductase activity, acting on NAD(P)H, oxygen as acceptor
superoxide-generating NADPH oxidase activity
peptidyl-histidine dioxygenase activity
oxygen sensor activity
peptidyl-asparagine 3-dioxygenase activity
protein methyltransferase activity
protein-lysine N-methyltransferase activity
CLL and CLL-related GWAS SNPs fall in dREs
We also examined the association of dREs with individual GWAS traits (see Methods). The results consistently support the aforementioned functions of dREs (Fig. 3c), since the gained and lost dREs are strongly enriched for the SNPs linked to CLL and/or other cancers (p < 5 × 10− 3, Fig. 4b and Additional file 2: Table S5). Also, the sREs are remarkably linked to immunity-related traits, such as asthma and adaptive immunity (p < 1 × 10− 8), which is in line with the observations that sREs are significantly associated with T cell activation and differentiation (Fig. 3c).
Examples of GWAS SNPs in dREs
Another example lies at rs3806624, a lost-dRE SNP. rs3806624 has been associated with Hodgkin’s lymphoma and has G as a risk allele in a GWAS study . Our analysis shows that the allele G of rs3806624 is significantly enriched in CLL (allele frequency is 0.57 and 0.21 in CLL and control, respectively; binomial test p = 0.0029, Fig. 5b), indicating the possible deleteriousness of this allele in CLL. The allele substitution of G to A potentially disrupts the binding motif of P53 and CCAAT-enhancer-binding protein (CEBP) (Fig. 5b), TFs known to play roles in apoptosis and hematopoietic cell differentiation. The coincidence between the CLL-enriched allele and the reported risk allele, together with the binding disruption caused by the CLL mutation, supports the possible pathogenicity of rs3806624.
We have a total of six cancer-associated GWAS SNPs exhibiting a significant difference of allele frequency between CLL and control (Additional file 2: Table S6). Among these SNPs are the above example SNPs, rs1976684, rs2151512, rs8077394 and rs133018 (see Additional file 1: Figure S14 and S15). Most of these SNPs (5/6) exhibit a prominent haplotype association between the CLL-enriched alleles and the risk alleles detected in GWASs (Additional file 2: Table S6). These cancer SNPs, coupled with the statistical results presented in the previous section, suggest a significant association of dREs and genetic mutations inside these regions with CLL or CLL-related traits. We also observed that genetic mutations are able to change the binding sites of CLL/normal B-cell TFs, which may be the driver of phenotypic alterations.
Changes of TFBS in CLL development
Next, we examined how genetic mutations, i.e., the allele substitutions at SNPs identified in this study, impact the binding affinity of TFs. We did not have genetic variation data directly available for the tested CLL samples. Therefore, we explored RRBS-seq data to identify SNPs strongly associated with CLL, in which the genotype in CLL samples is significantly different from the controls, along with the CLL-associated substitutions at these SNPs (see Methods). In total, 491 such SNPs were identified in dREs, of which 305 were located in the gained dREs and 186 were in lost dREs. We assessed TF binding alterations potentially caused by the CLL-associated substitutions (see Methods). By a comparison to the random positions having matched base-pair composition along the lost dREs, we noticed that the detected CLL-associated substitutions are associated with the loss of binding site of P53, NFKB2, E2F1, and PAX5 more frequently than expected (enrichment of TFBS loss > 1.5 and p < 0.05, Fig. 6b and Additional file 2: Table S8). E2F1 and PAX5 are the major regulators in normal B-cells, of which TFBSs have been found to be enriched uniquely along the lost dREs (Fig. 6a and Additional file 2: Table S7). Also, in the context of the gained dREs, 15 TFBSs are enriched only in the sequences carrying CLL-associated alleles (p < 0.01, see Methods, Fig. 6c and Additional file 2: Table S9). Most of these TFBS correspond to well-known CLL TFs, such as TCF3 and HIF1. HIF1 is required for the survival of leukemia stem cells under hypoxic environments, such as bone marrow niches [10, 49]. In addition, the CLL-association substitutions are more likely than expected to alter the binding affinity of NFKB and PAX5 in both the gained and lost dREs, compatible with the functions of these TFs in CLL as well as normal B-cells.
Discussion and conclusion
In this study, we established a workflow to identify differentially-activated REs (dREs) in carcinogenesis and applied it to CLL data. Most of the CLL dREs are located in non-promoter gene regulatory loci, indicating a substantial role enhancer alterations play in CLL carcinogenesis. We found that dRE changes are strongly correlated with the change of gene expression, i.e., gained/lost dREs are enriched in the loci of up-/down-regulated genes in CLL, respectively.
We found that lost and gained dREs rarely co-occur in the same gene loci, suggesting reprogramming of the regulatory architecture is locus-long and not necessarily targeting individual regulatory elements in carcinogenesis. As expected, gained dREs are significantly associated with CLL-induced biological processes. For example, 68% of the genes having the function of DNA damage response exclusively harbor gained dREs, which is 2.3 times higher than expected. Also 74% of genes regulating B cell activation host gained dREs. DNA methyltransferase genes, for example, DNMT3A and MGMT, which are essential for maintaining cell cycle and methylation levels of normal B-cells, harbor multiple lost dREs but zero gained dREs in their neighborhood. In addition, both gained and lost dREs significantly coincide with CLL, lymphoma, and, more broadly, cancer-associated GWAS SNPs. Furthermore, most of the cancer-associated alleles at these SNPs (83%) are in predominant haplotypes with the risk alleles reported in GWAS. All of these findings indicate the phenotypic consequence of RE changes during CLL development.
By examining TFBS enrichment in dRE sequences, we observed that normal and CLL B-cells recruit distinct gene regulatory pathways, although both of them employ common TFs, such as NFKB and P53. Apart from these common TFs, the key TFs in normal B-cells include PAX5, E2F1 and AP2, while CLL employs TCF3, PPAR, etc. Moreover, through analyzing the impact of the identified CLL-associated mutations on TF binding, we found that these mutations change the binding activity of key TFs, i.e., disrupting/generating TF binding sites in the lost/gained dREs, more frequently than expected.
Overall, through exploring sequencing data of chromatin states, we established the maps of REs in normal and cancer cells and identified genetic mutations during CLL development. The comparison between these RE maps enabled us to identify gene regulatory variations during cancer initiation in different layers, such as TF binding and chromatin interaction. To test the generalization of our pipeline, we applied it to a liver tumor dataset consisting of 4 tumor and 4 control samples , and noticed that the distribution of dREs is highly correlated with the change of expression of local genes (Additional file 1: Figure S16), which is similar to the finding on the CLL data analysis. This indicates that our observations are likely not limited to CLL and could be generalized to other cancers.
Chronic lymphocytic leukemia
RE differently activated during cancer development
Highly-methylated regions across different cells
Regulatory element shared by normal and cancer cells
This research was supported by the Intramural Research Program of the National Institutes of Health, National Library of Medicine.
Availability of data and materials
The data sets supporting the results of this article are included within the article and its additional files.
DH and IO conceived of and designed the data analysis processes. DH and IO drafted the manuscript. All authors read and approved the final manuscripts.
The authors declare that they have no competing interests.
Consent for publication
Ethics approval and consent to participate
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Stewart BW, Wild CP. World cancer report 2014, Lyon, France: International Agency for Research on Cancer, Geneva, Switzerland: Distributed by WHO Press, World Health Organization, 2014. 2014.
- Khurana E, Fu Y, Chakravarty D, Demichelis F, Rubin MA, Gerstein M. Role of non-coding sequence variants in cancer. Nat Rev Genet. 2016;17(2):93–108.View ArticlePubMedGoogle Scholar
- Moore LE, Nickerson ML, Brennan P, Toro JR, Jaeger E, Rinsky J, Han SS, Zaridze D, Matveev V, Janout V, et al. Von Hippel-Lindau (<italic > VHL</italic>) Inactivation in Sporadic Clear Cell Renal Cancer: Associations with Germline < italic > VHL</italic > Polymorphisms and Etiologic Risk Factors. PLoS Genet. 2011;7(10):e1002312.View ArticlePubMedPubMed CentralGoogle Scholar
- Merlo A, Herman JG, Mao L, Lee DJ, Gabrielson E, Burger PC, Baylin SB, Sidransky D. 5[prime] CpG island methylation is associated with transcriptional silencing of the tumour suppressor p16/CDKN2/MTS1 in human cancers. Nat Med. 1995;1(7):686–92.View ArticlePubMedGoogle Scholar
- Lovén J, Hoke HA, Lin CY, Lau A, Orlando DA, Vakoc CR, Bradner JE, Lee TI, Young RA. Selective Inhibition of Tumor Oncogenes by Disruption of Super-Enhancers. Cell. 2013;153(2):320–34.View ArticlePubMedPubMed CentralGoogle Scholar
- Belinsky SA. Gene-promoter hypermethylation as a biomarker in lung cancer. Nat Rev Cancer. 2004;4(9):707–17.View ArticlePubMedGoogle Scholar
- Sharma S, Kelly TK, Jones PA. Epigenetics in cancer. Carcinogenesis. 2010;31(1):27–36.View ArticlePubMedGoogle Scholar
- Esteller M. Cancer epigenomics: DNA methylomes and histone-modification maps. Nat Rev Genet. 2007;8(4):286–98.View ArticlePubMedGoogle Scholar
- Feinberg AP, Vogelstein B. Hypomethylation distinguishes genes of some human cancers from their normal counterparts. Nature. 1983;301(5895):89–92.View ArticlePubMedGoogle Scholar
- Yonekura S, Itoh M, Okuhashi Y, Takahashi Y, Ono A, Nobuo N, Tohda S. Effects of the HIF1 Inhibitor, Echinomycin, on Growth and NOTCH Signalling in Leukaemia Cells. Anticancer Res. 2013;33(8):3099–103.PubMedGoogle Scholar
- Roman-Gomez J, Jimenez-Velasco A, Agirre X, Cervantes F, Sanchez J, Garate L, Barrios M, Castillejo JA, Navarro G, Colomer D, et al. Promoter hypomethylation of the LINE-1 retrotransposable elements activates sense//antisense transcription and marks the progression of chronic myeloid leukemia. Oncogene. 2005;24(48):7213–23.View ArticlePubMedGoogle Scholar
- Miki Y, Nishisho I, Horii A, Miyoshi Y, Utsunomiya J, Kinzler KW, Vogelstein B, Nakamura Y. Disruption of the APC Gene by a Retrotransposal Insertion of L1 Sequence in a Colon Cancer. Cancer Res. 1992;52(3):643–5.PubMedGoogle Scholar
- Howard G, Eiges R, Gaudet F, Jaenisch R, Eden A. Activation and transposition of endogenous retroviral elements in hypomethylation induced tumors in mice. Oncogene. 2007;27(3):404–8.View ArticlePubMedGoogle Scholar
- Kieffer-Kwon K-R, Tang Z, Mathe E, Qian J, Sung M-H, Li G, Resch W, Baek S, Pruett N, Grøntved L, et al. Interactome Maps of Mouse Gene Regulatory Domains Reveal Basic Principles of Transcriptional Regulation. Cell. 2013;155(7):1507–20.View ArticlePubMedGoogle Scholar
- Pei L, Choi J-H, Liu J, Lee E-J, McCarthy B, Wilson JM, Speir E, Awan F, Tae H, Arthur G, et al. Genome-wide DNA methylation analysis reveals novel epigenetic changes in chronic lymphocytic leukemia. Epigenetics. 2012;7(6):567–78.View ArticlePubMedPubMed CentralGoogle Scholar
- Krueger F, Andrews SR. Bismark: a flexible aligner and methylation caller for Bisulfite-Seq applications. Bioinformatics. 2011;27(11):1571–2.View ArticlePubMedPubMed CentralGoogle Scholar
- Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9(4):357–9.View ArticlePubMedPubMed CentralGoogle Scholar
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, Genome Project Data Processing S. The sequence alignment/map format and SAMtools. Bioinformatics. 2009;25(16):2078–9.View ArticlePubMedPubMed CentralGoogle Scholar
- Liu Y, Siegmund K, Laird P, Berman B. Bis-SNP: Combined DNA methylation and SNP calling for Bisulfite-seq data. Genome Biol. 2012;13(7):R61.View ArticlePubMedPubMed CentralGoogle Scholar
- The Genomes Project C. A global reference for human genetic variation. Nature. 2015;526(7571):68–74.View ArticleGoogle Scholar
- Burger L, Gaidatzis D, Schübeler D, Stadler MB. Identification of active regulatory regions from DNA methylation data. Nucleic Acids Res. 2013;41(16):e155.View ArticlePubMedPubMed CentralGoogle Scholar
- McLean CY, Bristor D, Hiller M, Clarke SL, Schaar BT, Lowe CB, Wenger AM, Bejerano G. GREAT improves functional interpretation of cis-regulatory regions. Nat Biotech. 2010;28(5):495–501.View ArticleGoogle Scholar
- Quackenbush J. Computation analysis of microarray data. Nat Rev Genet. 2001;2:418–27.View ArticlePubMedGoogle Scholar
- Welter D, MacArthur J, Morales J, Burdett T, Hall P, Junkins H, Klemm A, Flicek P, Manolio T, Hindorff L, et al. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res. 2014;42(D1):D1001–6.View ArticlePubMedGoogle Scholar
- Johnson AD, Handsaker RE, Pulit SL, Nizzari MM, O'Donnell CJ, de Bakker PIW. SNAP: a web-based tool for identification and annotation of proxy SNPs using HapMap. Bioinformatics. 2008;24(24):2938–9.View ArticlePubMedPubMed CentralGoogle Scholar
- Matys V, Kel-Margoulis OV, Fricke E, Liebich I, Land S, Barre-Dirrie A, Reuter I, Chekmenev D, Krull M, Hornischer K, et al. TRANSFAC(®) and its module TRANSCompel(®): transcriptional gene regulation in eukaryotes. Nucleic Acids Res. 2006;34(Database issue):D108–10.View ArticlePubMedGoogle Scholar
- Stormo GD. DNA binding sites: representation and discovery. Bioinformatics. 2000;16(1):16–23.View ArticlePubMedGoogle Scholar
- Bailey TL, Boden M, Buske FA, Frith M, Grant CE, Clementi L, Ren J, Li WW, Noble WS. MEME Suite: tools for motif discovery and searching. Nucleic Acids Res. 2009;37 suppl 2:W202–8.View ArticlePubMedPubMed CentralGoogle Scholar
- Puente XS, Bea S, Valdes-Mas R, Villamor N, Gutierrez-Abril J, Martin-Subero JI, Munar M, Rubio-Perez C, Jares P, Aymerich M, et al. Non-coding recurrent mutations in chronic lymphocytic leukaemia. Nature. 2015;526(7574):519–24.View ArticlePubMedGoogle Scholar
- Hon GC, Hawkins RD, Caballero OL, Lo C, Lister R, Pelizzola M, Valsesia A, Ye Z, Kuan S, Edsall LE, et al. Global DNA hypomethylation coupled to repressive chromatin domain formation and gene silencing in breast cancer. Genome Res. 2012;22(2):246–58.View ArticlePubMedPubMed CentralGoogle Scholar
- Herman JG, Baylin SB. Gene Silencing in Cancer in Association with Promoter Hypermethylation. N Engl J Med. 2003;349(21):2042–54.View ArticlePubMedGoogle Scholar
- Pandiyan K, You JS, Yang X, Dai C, Zhou XJ, Baylin SB, Jones PA, Liang G. Functional DNA demethylation is accompanied by chromatin accessibility. Nucleic Acids Res. 2013;41(7):3973–85.View ArticlePubMedPubMed CentralGoogle Scholar
- Issa J-P. CpG island methylator phenotype in cancer. Nat Rev Cancer. 2004;4(12):988–93.View ArticlePubMedGoogle Scholar
- International Human Genome Sequencing Consortium. (2001). "Initial sequencing and analysis of the human genome". Nature. 409(6822):860-921.
- The mod EC, Roy S, Ernst J, Kharchenko PV, Kheradpour P, Negre N, Eaton ML, Landolin JM, Bristow CA, Ma L, et al. Identification of functional elements and regulatory circuits by drosophila modENCODE. Science (New York, NY). 2010;330(6012):1787–97.View ArticleGoogle Scholar
- Wilkins AS. The enemy within: An epigenetic role of retrotransposons in cancer initiation. BioEssays. 2010;32(10):856–65.View ArticlePubMedGoogle Scholar
- Billard C. Apoptosis inducers in chronic lymphocytic leukemia. Oncotarget. 2014;5(2):309–25.View ArticlePubMedGoogle Scholar
- Till KJ, Harris RJ, Linford A, Spiller DG, Zuzel M, Cawley JC. Cell Motility in Chronic Lymphocytic Leukemia: Defective Rap1 and αLβ2 Activation by Chemokine. Cancer Res. 2008;68(20):8429–36.View ArticlePubMedGoogle Scholar
- D’Arena G, Laurenti L, Minervini MM, Deaglio S, Bonello L, De Martino L, De Padua L, Savino L, Tarnani M, De Feo V, et al. Regulatory T-cell number is increased in chronic lymphocytic leukemia patients and correlates with progressive disease. Leuk Res. 2011;35(3):363–8.View ArticlePubMedGoogle Scholar
- Ley TJ, Ding L, Walter MJ, McLellan MD, Lamprecht T, Larson DE, Kandoth C, Payton JE, Baty J, Welch J, et al. DNMT3A Mutations in Acute Myeloid Leukemia. N Engl J Med. 2010;363(25):2424–33.View ArticlePubMedPubMed CentralGoogle Scholar
- Gerondakis S, Siebenlist U. Roles of the NF-κB Pathway in Lymphocyte Development and Function. Cold Spring Harb Perspect Biol. 2010;2(5):a000182.View ArticlePubMedPubMed CentralGoogle Scholar
- Lauc G, Huffman JE, Pučić M, Zgaga L, Adamczyk B, Mužinić A, Novokmet M, Polašek O, Gornik O, Krištić J, et al. Loci Associated with < italic > N</italic > -Glycosylation of Human Immunoglobulin G Show Pleiotropy with Autoimmune Diseases and Haematological Cancers. PLoS Genet. 2013;9(1):e1003225.View ArticlePubMedPubMed CentralGoogle Scholar
- Bauvois B, Djavaheri-Mergny M, Rouillard D, Dumont J, Wietzerbin J. Regulation of CD26/DPPIV gene expression by interferons and retinoic acid in tumor B cells. Oncogene. 2000;19(2):8.View ArticleGoogle Scholar
- Wickremasinghe RG, Prentice AG, Steele AJ. p53 and Notch signaling in chronic lymphocytic leukemia: clues to identifying novel therapeutic strategies. Leukemia. 2011;25(9):1400–7.View ArticlePubMedGoogle Scholar
- Kardava L, Yang Q, St. Leger A, Foon KA, Lentzsch S, Vallejo AN, Milcarek C, Borghesi L. The B lineage transcription factor E2A regulates apoptosis in chronic lymphocytic leukemia (CLL) cells. Int Immunol. 2011;23(6):375–84.View ArticlePubMedPubMed CentralGoogle Scholar
- Frampton M, da Silva Filho MI, Broderick P, Thomsen H, Försti A, Vijayakrishnan J, Cooke R, Enciso-Mora V, Hoffmann P, Nöthen MM, et al. Variation at 3p24.1 and 6q23.3 influences the risk of Hodgkin’s lymphoma. Nat Commun. 2013;4:2549.View ArticlePubMedPubMed CentralGoogle Scholar
- Zhu JW, Field SJ, Gore L, Thompson M, Yang H, Fujiwara Y, Cardiff RD, Greenberg M, Orkin SH, DeGregori J. E2F1 and E2F2 determine thresholds for antigen-induced T-cell proliferation and suppress tumorigenesis. Mol Cell Biol. 2001;21(24):8547–64.View ArticlePubMedPubMed CentralGoogle Scholar
- Medvedovic J, Ebert A, Tagoh H, Busslinger M. Chapter 5 - Pax5: a master regulator of B cell development and leukemogenesis. In: Frederick WA, editor. Advances in Immunology. Volume 111. Amsterdam: Academic Press; 2011. p. 179-206.
- Zhang H, Li H, Xi HS, Li S. HIF1α is required for survival maintenance of chronic myeloid leukemia stem cells. Blood. 2012;119(11):2595–607.View ArticlePubMedPubMed CentralGoogle Scholar
- Li X, Liu Y, Salz T, Hansen KD, Feinberg A. Whole-genome analysis of the methylome and hydroxymethylome in normal and malignant lung and liver. Genome Res. 2016;26(12):1730–41.View ArticlePubMedGoogle Scholar