Nuclear Receptor HNF4α Binding Sequences are Widespread in Alu Repeats
© Bolotin et al; licensee BioMed Central Ltd. 2011
Received: 18 March 2011
Accepted: 15 November 2011
Published: 15 November 2011
Alu repeats, which account for ~10% of the human genome, were originally considered to be junk DNA. Recent studies, however, suggest that they may contain transcription factor binding sites and hence possibly play a role in regulating gene expression.
Here, we show that binding sites for a highly conserved member of the nuclear receptor superfamily of ligand-dependent transcription factors, hepatocyte nuclear factor 4alpha (HNF4α, NR2A1), are highly prevalent in Alu repeats. We employ high throughput protein binding microarrays (PBMs) to show that HNF4α binds > 66 unique sequences in Alu repeats that are present in ~1.2 million locations in the human genome. We use chromatin immunoprecipitation (ChIP) to demonstrate that HNF4α binds Alu elements in the promoters of target genes (ABCC3, APOA4, APOM, ATPIF1, CANX, FEMT1A, GSTM4, IL32, IP6K2, PRLR, PRODH2, SOCS2, TTR) and luciferase assays to show that at least some of those Alu elements can modulate HNF4α-mediated transactivation in vivo (APOM, PRODH2, TTR, APOA4). HNF4α-Alu elements are enriched in promoters of genes involved in RNA processing and a sizeable fraction are in regions of accessible chromatin. Comparative genomics analysis suggests that there may have been a gain in HNF4α binding sites in Alu elements during evolution and that non Alu repeats, such as Tiggers, also contain HNF4α sites.
Our findings suggest that HNF4α, in addition to regulating gene expression via high affinity binding sites, may also modulate transcription via low affinity sites in Alu repeats.
As much as 50% of the ~3 billion base pairs in the human genome may be derived from repetitive DNA sequence. While repetitive DNA is often referred to as "junk" DNA, even when that term was originally coined it was hypothesized that junk DNA may play an active role in genome function. The notion that repetitive DNA may play a regulatory role and be involved in the evolution of gene regulation was also postulated early on, although it was not until recently that there was evidence to support those ideas [3–5].
A major category of repetitive DNA is short interspersed nuclear elements (SINEs), which are believed to have originated from the gene for 7SL RNA, a component of the signal recognition particle. In the human genome, the largest class of SINEs is the Alu repeats, which at ~1.2 million copies account for ~10% of the human genome. Alu elements were first characterized as ~300 nucleotide repetitive sequences that contain an AluI restriction site (5'-AGCT-3') from the bacterium Arthrobacter luteus [7, 8]. Alu elements, which are still mobile in the human genome by virtue of the action of a LINE-1 reverse transcriptase, are a relatively recent occurrence evolutionarily. They are found exclusively in primates, including humans, and hence are postulated to have entered the mammalian genome ~60-65 million years ago.
Alu elements have been implicated in several human diseases including leukemia, hemophilia and breast cancer, suggesting that their impact on human health may be significant. There are several well characterized examples of Alu insertions affecting splicing patterns and hence protein function. A variety of transcription factor (TF) binding sites (TFBSs) have also been characterized in Alu elements, including sites for YY1, Sp1, tumor suppressor p53, homeodomain and TATA binding proteins. Nuclear receptors (NRs), which belong to a superfamily of ligand-dependent TFs, have also been found to have binding sites in Alu elements: retinoic acid receptor (RAR, NR1B), estrogen receptor (ER, NR3A) [18, 19], progesterone receptor (PR, NR3C3) and vitamin D receptor (VDR, NR1I1). Alu insertions have also been shown to alter the expression of at least six human genes: CD8a (CD8A), keratin 18 (KRT18), parathyroid hormone (PTH), Wilms tumor 1 (WT1), receptor for Fc fragment of IgE, high affinity I, gamma polypeptide (FCER1G) and breast cancer 1, early onset (BRCA1). Therefore, Alu sequences may regulate the level of transcripts and hence proteins in the cell, as well as the function of those proteins.
Hepatocyte nuclear factor 4 alpha (HNF4α, NR2A1) is a member of the NR superfamily that is highly expressed in the liver, as well as the kidney, intestine (large and small), pancreas and stomach. HNF4α is best known for its role in the adult liver and pancreas, as well as in early development [24, 25]; it also has an emerging role in the gut [26–28]. The HNF4A gene is mutated in an inherited form of type 2 diabetes, maturity onset diabetes of the young 1 (MODY1), and was recently identified as a susceptibility locus in inflammatory bowel disease (IBD). Mutations in HNF4α binding sites have also been directly linked to human diseases, including hemophilia and MODY3 [31, 32]. Many NRs are common drug targets; the recent identification of the endogenous ligand of HNF4α that binds in a reversible fashion also makes HNF4α a potential drug target [34, 35].
In addition to its medical relevance, HNF4α also appears to play a unique role in the evolution of NRs. It is highly conserved across species, with 100% amino acid conservation in the DNA binding domain of all mammalian HNF4α. While HNF4α is most similar to the retinoid X receptor alpha (RXRα, NR2B1), unlike many other NRs, it does not heterodimerize with RXR. Rather, it binds DNA in the form of direct repeats separated by one nucleotide (DR1, AGGTCAxAGGTCA) exclusively as a homodimer. HNF4α has been found in every animal organism examined thus far, including sponge and coral, and has been postulated to be the ancestor of the entire NR family.
Many hundreds of HNF4α target genes have been identified by both classical promoter analysis and more modern genome-wide studies [32, 39–41]. During one such genomic study, we observed a very uneven frequency profile of individual HNF4α binding sequences. Specifically, we noted that a certain DNA sequence designated H4.141 (5'-AGGCTGaAGTGCA-3') was > 100-fold overrepresented compared to other HNF4α binding sites in the human, but not the mouse, genome (see additional file 1: Figure S1). In the current study, we investigate the notion that these and other HNF4α binding sequences are in Alu repeats. We use the powerful high throughput technology of protein binding microarrays (PBMs) to show that HNF4α does indeed bind numerous sequences in Alu repeats in vitro. We perform ChIP and luciferase assays to show that HNF4α binds at least some Alu sequences in vivo and that those binding events are associated with transcriptional activation. Finally, we investigate accessibility of these sites by correlation with DNase hypersensitivity data and evolutionary conservation by comparative genomic analysis.
Table: Frequency of Alu-derived sequences bound by HNF4α in Alu repeats and the human genome. Columns: # in Alus; # in hg18.
Table: Number of Alu repeats with HNF4α binding sites in the human genome (hg18). Columns: # of H4 sites in Alu; # of Alus; total # of H4 sites.
Table: Alu families in human genome (hg18) with HNF4α binding sites. Columns: # with H4; # in hg18.
Table: Alu subfamilies in human genome (hg18) with HNF4α binding sites. Columns: # with H4; # in hg18.
Table: Non Alu repeat families in human genome (hg18) with HNF4α binding sites. Two panels (sorted by number of HNF4α sites; sorted by prevalence of HNF4α sites), each with columns # with H4; # in hg18.
Others have shown that the region 5000 bp upstream from the TSS (+1) contains on average 3.63 Alu elements. We analyzed the same promoter region and found that every human gene has on average 2.91 HNF4α-Alu elements, consistent with the overall high proportion of Alu elements with an HNF4α site (Tables 3 and 4). To determine which Alu elements may be accessible, and hence potentially play a role in transcription regulation, we determined the number of HNF4α-Alu elements that reside within DNase hypersensitive regions using datasets from the ENCODE project [46, 47]. Genome-wide, 46,129 HNF4α-Alu elements (~6.2% of all HNF4α-Alu elements) are within DNase hypersensitive regions across multiple cell lines, with 5458 genes containing one or more HNF4α-Alu/DNase sites in their 5 kb promoter region. Approximately 7000 HNF4α-Alu elements are in DNase hypersensitive regions in HepG2 cells alone (6212 from Rep Track 1 and 8127 from Rep Track 2). While these findings may be an underestimate due to the difficulty of sequencing through repetitive elements, they nonetheless indicate that while the majority of the ~750,000 HNF4α-Alu elements may not be accessible in most cell types, a sizeable portion of HNF4α-Alu elements are in regions of open chromatin and hence may be transcriptionally active.
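The proportions quoted above can be checked with a few lines of arithmetic. The sketch below uses only the figures given in the text; the variable names are illustrative, and reading "~7000" as the mean of the two HepG2 replicate tracks is an assumption rather than something the text states.

```python
# Figures quoted in the text; variable names are illustrative only.
total_hnf4a_alu = 750_000      # approximate number of HNF4α-Alu elements genome-wide
in_dnase_clustered = 46_129    # elements in DNase hypersensitive regions (multiple cell lines)
hepg2_rep1, hepg2_rep2 = 6_212, 8_127  # HepG2 replicate tracks

# Fraction of HNF4α-Alu elements that fall in DNase hypersensitive regions.
fraction_accessible = in_dnase_clustered / total_hnf4a_alu

# Assumption: the "~7000" HepG2 figure is the mean of the two replicates.
hepg2_mean = (hepg2_rep1 + hepg2_rep2) / 2

print(f"accessible fraction: {fraction_accessible:.1%}")
print(f"HepG2 mean of replicates: {hepg2_mean}")
```

Because the ~750,000 total is itself rounded, the resulting percentage should be read as approximate.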
The functional relevance of repetitive DNA such as Alu repeats in the human genome has been debated ever since they were first discovered several decades ago. In this study, we show that the nuclear receptor HNF4α binds Alu-derived 13-mers in vitro as well as Alu elements in the promoters of HNF4α target genes in vivo. We show that HNF4α sites in Alu elements can drive gene expression in luciferase assays and that HNF4α binding sites are found in ~64% of all known Alu repeats in the genome (~1.2 million HNF4α sites in ~750,000 Alu elements). Additionally, we found that while HNF4α sites are predominantly found in Alu repeats, they are also found in other repeats such as SVA elements, which contain a portion of an Alu repeat, and the L2, MIR and Tigger families of retrotransposons.
Perhaps the most important question is how many of the HNF4α-Alu elements are functional. Several recent studies suggest that Alu elements may indeed play a role in regulating gene expression: Alu elements are enriched in regions with genes, particularly in housekeeping and metabolism genes. However, they are underrepresented in developmental genes, suggesting that their presence in those genes may be detrimental. Binding sites for other NRs have also been found in Alu repeats and several of those sites were found to affect transcription [17, 19–21]. To determine what types of genes contain HNF4α-Alu elements, we performed a Gene Ontology (GO) analysis of genes enriched with HNF4α-Alu elements (> 8 per 5 kb promoter region) and found RNA processing and transcription regulation genes, as well as macromolecular catabolic processes and complex assembly genes (see additional file 2: Table S6 for a full list of significant GO categories and relevant genes). RNA processing is not a category previously associated with classical HNF4α binding sites, but Alu elements have been found to play a direct role in alternative splicing.
In a detailed, genome-wide analysis of functional targets of HNF4α and binding sites, we recently found that only 30% of genes downregulated in an HNF4α RNAi experiment contained a potential classical HNF4α binding site. While the other 70% could be indirect targets, it is also possible that some of those genes are regulated by HNF4α-containing Alu elements, consistent with our finding here that on average every gene in the human genome contains ~2.91 HNF4α-Alu elements within 5000 bp upstream of the TSS. On an individual gene basis, we found that even though the HNF4α binding sites in Alu repeats are not high affinity sites compared to the majority of classical HNF4α sites, they are nonetheless capable of driving the expression of a heterologous gene on their own. In the context of the genome, however, the HNF4α-Alu elements are typically present in conjunction with other TFBSs in the promoter, including other HNF4α binding sites, suggesting that they may act in more of a modulatory capacity than as the sole drivers of transcription, as we observed on the APOA4 promoter. These results are similar to those found for other NRs, albeit on different binding sites within the Alu elements [19–21].
The functionality of HNF4α-Alu elements, as with any potential TFBS, will also depend on the state of the local chromatin and the accessibility of the site to HNF4α. While it has been reported that most Alu repeats in the human genome contain CpG dinucleotides that are methylated, potentially rendering them nonfunctional, the Alu elements that are hypomethylated tend to be in promoter regions, suggesting that they are accessible [52, 53]. Indeed, our analysis showed that there may be as many as ~46,000 HNF4α-Alu elements in DNase hypersensitive regions genome-wide, suggesting that they may be accessible for binding and therefore may affect transcription.
In addition to affecting transcription directly, it is tempting to speculate that the relatively large number of HNF4α-Alu elements, especially in regions of open chromatin, could act as a sink or reservoir for HNF4α protein. We have estimated by semi-quantitative immunoblotting that there may be as many as 450,000 molecules of HNF4α in the nucleus of an adult mouse hepatocyte (unpublished observation); this estimate is consistent with the fact that we originally had to purify HNF4α only ~5,000 to 10,000-fold from adult rat liver nuclei. Assuming that human hepatocytes have similar levels of HNF4α protein and keeping in mind that HNF4α binds DNA only as a dimer, this suggests that the presence of ~7000 to 46,000 HNF4α-Alu elements in accessible regions of the genome would not have a significant impact on the availability of ~225,000 HNF4α protein dimers in a normal adult hepatocyte nucleus. However, conditions that significantly alter the accessibility of the ~750,000 HNF4α-Alu elements genome-wide, or the amount of HNF4α protein, could in theory result in a situation in which the stoichiometry of HNF4α-Alu sites to HNF4α protein is indeed relevant. For example, global loss of DNA methylation has been associated with cancer progression and there is at least one report in which certain Alu elements lose methylation during tumor progression. Likewise, a decrease in the amount of functional HNF4α protein, such as that found in heterozygous MODY1 patients, activation of signaling pathways [56–61], DNA damage via p53 [62, 63], microRNAs, diet [35, 65, 66] and diseases such as colitis and cancer [67, 68] could tip the balance between HNF4α protein and potential binding sites, rendering the notion of Alu elements as a sink of HNF4α potentially relevant. The stoichiometry of HNF4α protein to total HNF4α binding sites may also differ in other tissues and developmental time points, which could alter the relevance of HNF4α-Alu elements.
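The stoichiometry argument above can be made concrete with a short calculation. This is a rough sketch using only the numbers quoted in the text; it assumes, for simplicity, one dimer per Alu element and ignores classical (non Alu) binding sites, and the function and variable names are illustrative.

```python
# Estimate quoted in the text for an adult mouse hepatocyte nucleus.
molecules_per_nucleus = 450_000
# HNF4α binds DNA only as a homodimer, so two molecules make one binding unit.
dimers = molecules_per_nucleus // 2

# Numbers of HNF4α-Alu elements under the scenarios discussed in the text.
scenarios = {
    "accessible in HepG2 (~7,000)": 7_000,
    "accessible across cell lines (~46,000)": 46_000,
    "all HNF4α-Alu elements (~750,000)": 750_000,
}

def elements_per_dimer(elements: int, dimer_count: int) -> float:
    """Ratio of Alu elements to available dimers (simplifying assumption: one dimer per element)."""
    return elements / dimer_count

for label, elements in scenarios.items():
    print(f"{label}: {elements_per_dimer(elements, dimers):.2f} elements per dimer")
```

Under these assumptions the accessible elements are outnumbered by dimers, whereas the full set of ~750,000 elements would exceed the dimer pool, which is the situation the sink hypothesis describes.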
The ~1.2 million HNF4α binding sites in ~750,000 Alu elements in the human genome have the potential to affect the expression of HNF4α target genes. Therefore, it will be important to keep the HNF4α-Alu elements in mind when investigating HNF4α function, especially when using non primates as models for humans and when investigating conditions, such as cancer, where there may be genome-scale alterations in chromatin accessibility. These results join the increasing number of reports of NR and other TF binding sites in Alu or other repeat elements and support the notion that repetitive DNA may be more than just "junk" DNA.
A custom-designed 8x15k Alu PBM (PBM3) containing 8 grids, each of which consisted of ~15,000 spots of DNA, was ordered from Agilent (Figure 1). An in silico Alu library of ~200 DNA sequences was made by extracting every unique 13-mer from every Alu element consensus from the RepBase database (http://www.girinst.org/repbase/). The human genome (hg18) was searched with the Alu library and the 100 most frequent sites were included on PBM3. The 13-mer Alu library was further searched with the support vector machine (SVM) model described in Bolotin et al. (The SVM is an algorithm trained on sequences bound by HNF4α in the PBM; it predicts HNF4α binding with a correlation of R² = 0.76.) The top 100 scoring potential HNF4α binding sites from the SVM search were included on PBM3 for a total of 200 Alu-derived sequences. Another 704 sequences were included from permutations of three adjacent positions in every combination of the DR1 consensus (5'-AGGTCAaAGGTCA-3') and 768 sequences from similar permutations of a DR2 consensus (5'-AGGTCAaaAGGTCA-3'). Additionally, 100 randomly generated 13-mers and 50 randomly generated 14-mers were included as negative controls for the DR1s and DR2s, respectively. Finally, an additional 2,061 unique sequences were generated from an SVM search of all human genes for a total of 3802 unique DNA sequences, each of which was replicated 4 times on the PBM for a total of 15,208 DNA spots. The linker and cap sequences were the same as those described in Bolotin et al. (See additional file 2: Table S5 for a list of all DNA sequences on PBM3 and the corresponding HNF4α binding score.)
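Two of the library construction steps described above lend themselves to a short sketch: extracting every unique 13-mer from a consensus sequence, and permuting three adjacent positions of a consensus. The code below is an illustration, not the scripts used in the study. It assumes that "permutations of three adjacent positions" means sliding a 3-nucleotide window along the consensus and substituting all 64 triplets at each position, an interpretation that reproduces the stated counts (11 windows × 64 = 704 for the 13-mer DR1; 12 windows × 64 = 768 for the 14-mer DR2). The function names and the toy input sequence are hypothetical.

```python
from itertools import product

def unique_kmers(seq: str, k: int = 13) -> set[str]:
    """Return every distinct k-mer found in a sequence (forward strand only)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def window_permutations(consensus: str, width: int = 3) -> list[str]:
    """Substitute every possible triplet at every window of adjacent positions.

    The returned list keeps duplicates (for example, the consensus itself
    reappears once per window), so the number of distinct sequences is smaller.
    """
    consensus = consensus.upper()
    variants = []
    for start in range(len(consensus) - width + 1):
        for bases in product("ACGT", repeat=width):
            variants.append(consensus[:start] + "".join(bases) + consensus[start + width:])
    return variants

# Toy sequence (hypothetical, not a real Alu consensus).
toy_kmers = unique_kmers("ACGTACGTACGTACGTA", k=13)

dr1_variants = window_permutations("AGGTCAaAGGTCA")
dr2_variants = window_permutations("AGGTCAaaAGGTCA")
print(len(toy_kmers), len(dr1_variants), len(dr2_variants))
```

Deduplicating the variant lists (for example with `set(dr1_variants)`) would give the number of distinct sequences, which is lower than the raw counts because overlapping windows regenerate the same variants.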
Crude nuclear extracts of COS-7 cells transfected with human HNF4α2 or HNF4α8 expression vectors were applied to PBM3 (~400 ng HNF4α protein per grid) and visualized and analyzed as described in Bolotin et al. The primary antibody was a mouse monoclonal that recognizes the C-terminal region of HNF4α (H1415 from R&D Systems); the secondary was NL-637 anti-Mouse IgG (NL008 from R&D Systems). PBMs were scanned using a GenePix Axon 4000B scanner (Molecular Devices, Sunnyvale, CA) at 543 nm (Cy3 dUTP) and 633 nm (Cy5-conjugated secondary antibody). Since there was no significant difference between the HNF4α2 and HNF4α8 isoforms, which differ by ~30 amino acids in the N-terminal region but have identical DNA binding and dimerization/ligand binding domains, the average of the four grids (two with HNF4α2 and two with HNF4α8) was used for the final PBM3 score. The sequences with a score > 0.612 (i.e., 2 SD above the mean of the random controls, p-value < 0.045) were considered to be HNF4α binders.
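The scoring rule described above (average the four grids, then call a sequence a binder if its score exceeds the mean of the random controls by 2 SD) can be sketched as follows. All scores here are made-up placeholder values, the function names are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev

def average_score(grid_scores: list[float]) -> float:
    """Final score for one sequence: the mean across the replicate grids."""
    return mean(grid_scores)

def binder_threshold(control_scores: list[float], n_sd: float = 2.0) -> float:
    """Cutoff set n_sd sample standard deviations above the mean of the random controls."""
    return mean(control_scores) + n_sd * stdev(control_scores)

# Placeholder values for illustration only (not data from the study).
random_control_scores = [0.10, 0.20, 0.30, 0.40, 0.50]
candidate_grid_scores = [0.9, 1.1, 1.0, 1.0]  # two HNF4α2 grids and two HNF4α8 grids

threshold = binder_threshold(random_control_scores)
candidate = average_score(candidate_grid_scores)
is_binder = candidate > threshold
print(f"threshold={threshold:.3f} candidate={candidate:.3f} binder={is_binder}")
```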
HNF4α ChIP from HepG2 cells was performed as described previously. Quantitative PCR (qPCR) following the ChIP was performed using BioRad IQ SYBR Green Supermix. Each 23.5-μl reaction included 12.5 μl of Supermix, 0.25 μl of 100 nmol of each primer, 0.5 μl of template and 10 μl of ddH2O. The qPCR was performed as follows: 95°C for 5 min (hot start), followed by 40 cycles of 95°C for 30 sec (melt) and 30 sec at the melting temperature (Tm) for annealing and extension, followed by a melt curve. The Tm was determined experimentally for each pair of primers by using a temperature gradient qPCR that was visualized on an ethidium bromide-stained agarose gel to control for product size. All qPCR was performed using BioRad iQ5 and myQ5 thermocyclers. (See additional file 2: Table S2 for a complete list of PCR primers giving a positive ChIP signal.) Affymetrix expression profiling data for the HNF4α RNAi knockdown in HepG2 cells were obtained from Bolotin et al.
Human embryonic kidney (HEK 293T) cells were plated (0.25 × 10⁶ cells) in 12-well plates. After 24 hr the cells were transfected using Lipofectamine 2000 according to the manufacturer's protocol (Invitrogen), with different amounts of empty vector (pcDNA3) or wild type human HNF4α2 in pcDNA3, 1 μg of the luciferase reporter and 200 ng of a CMV.βgal control. Cells were harvested after 24 hr using Triton lysis buffer (1% Triton X-100, 25 mM Gly-Gly pH 7.8, 15 mM MgSO4, 4 mM EGTA, 1 mM DTT). Luciferase and β-gal activity were measured as described earlier. Significant differences in luciferase activity between cells transfected with empty vector or human HNF4α2 were determined by the Student's t-test. APOM, PRODH2 and TTR luciferase constructs were created by cloning PCR products of the Alu elements in the respective promoters into pGL4.23 (Promega): the APOM construct used SfiI restriction sites and the PRODH2 and TTR constructs used NheI and KpnI sites. The APOA4.Luc construct was made by cloning a PCR product from the human APOA4 promoter (-1343 to +247) into the pGL4.10 vector (Promega) at HindIII and NheI sites. Site-directed mutations were introduced into the HNF4α binding sites in the Alu and PBM elements using the QuikChange kit (Stratagene). Luciferase reporter constructs with classical HNF4α response elements (RE-1 and RE-2) were made by inserting the appropriate synthetic oligonucleotides into pGL4.23. All constructs were sequence verified. (See additional file 2: Table S2 for the sequence of the PCR primers and oligonucleotides used in the constructions.)
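For readers unfamiliar with the comparison named above, the following is a minimal sketch of the two-sample Student's t statistic (equal-variance, pooled form) using only the Python standard library. The replicate values are invented placeholders, the function name is hypothetical, and the sketch stops at the t statistic and degrees of freedom; obtaining a p-value would require a t-distribution table or a statistics package.

```python
from math import sqrt
from statistics import mean, variance

def student_t(a: list[float], b: list[float]) -> tuple[float, int]:
    """Two-sample Student's t statistic with pooled variance, plus degrees of freedom."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    # Pooled variance weights each group's sample variance by its degrees of freedom.
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))
    return t, df

# Placeholder luciferase readings (arbitrary units), for illustration only.
empty_vector = [1.0, 2.0, 3.0]
with_hnf4a2 = [4.0, 5.0, 6.0]

t_stat, dof = student_t(empty_vector, with_hnf4a2)
print(f"t = {t_stat:.3f}, df = {dof}")
```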
Searches of human genome hg18 downloaded from UCSC Genome Browser (http://genome.ucsc.edu) were conducted using all of the sequences that HNF4α bound in PBM3 using Seqmap. Alu and non Alu repeats with HNF4α sites were identified by comparing the HNF4α genome-wide search results to the repeat coordinates obtained from Repeat Masker Track version 3.2.7 in UCSC Genome Browser. The results were processed using custom Perl scripts and an SQL database. To determine accessibility of HNF4α-Alu sites, we used the BEDtools software package to cross reference our list of ~750,000 HNF4α-Alu elements (Table 2) with DNase hypersensitivity tracks in the ENCODE Project in UCSC Genome Bioinformatics, allowing for one nucleotide or more of overlap. We used both the clustered track that contains data from multiple human cell lines (http://genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid=211217271&g=wgEncodeRegDnaseClustered) as well as tracks for two different repetitions of HepG2 cells (http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg18&g=wgEncodeUwDnaseSeq). Gene Ontology analysis of genes containing HNF4α-Alu elements was done using DAVID. We used a cutoff of eight HNF4α-Alu elements within 5 kb upstream of +1, two SD above the average number of sites (2.91 + 4.22).
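The logic of the search and cross-referencing steps described above can be illustrated in plain Python. This is a conceptual sketch only: it does not use or reproduce Seqmap, BEDtools or the authors' Perl scripts, and the toy sequence, repeat annotation and function names are hypothetical. It scans both strands for exact matches to a binding sequence and then reports matches that overlap an annotated repeat by at least one nucleotide, using 0-based half-open coordinates as in the BED format.

```python
def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_sites(sequence: str, motifs: list[str]) -> list[tuple[int, int, str, str]]:
    """Exact matches on both strands as (start, end, strand, motif), 0-based half-open."""
    sequence = sequence.upper()
    hits = []
    for motif in motifs:
        motif = motif.upper()
        for strand, pattern in (("+", motif), ("-", reverse_complement(motif))):
            pos = sequence.find(pattern)
            while pos != -1:
                hits.append((pos, pos + len(pattern), strand, motif))
                pos = sequence.find(pattern, pos + 1)  # allow overlapping matches
    return sorted(hits)

def overlaps(a_start: int, a_end: int, b_start: int, b_end: int, min_overlap: int = 1) -> bool:
    """True if two half-open intervals share at least min_overlap positions."""
    return min(a_end, b_end) - max(a_start, b_start) >= min_overlap

def sites_in_repeats(hits, repeats):
    """Pair each hit with every repeat interval (start, end, name) it overlaps."""
    return [(hit, name) for hit in hits
            for (r_start, r_end, name) in repeats
            if overlaps(hit[0], hit[1], r_start, r_end)]

h4_141 = "AGGCTGaAGTGCA"  # H4.141 sequence from the text
# Toy sequence: one forward-strand site, then one reverse-strand site.
toy = "TTTT" + h4_141.upper() + "CCCC" + reverse_complement(h4_141.upper()) + "GG"
toy_repeats = [(0, 20, "hypothetical Alu")]  # made-up repeat annotation

hits = find_sites(toy, [h4_141])
in_repeats = sites_in_repeats(hits, toy_repeats)
print(hits)
print(in_repeats)
```

A palindromic binding sequence would be reported once per strand at the same position by this sketch; a real pipeline would need to decide whether to count such a match once or twice.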
We thank D. Mane-Padros and L. Vuong for the luciferase constructs with classical HNF4α response elements and B. Fang for predicting mutations in HNF4α binding sites. This work was funded by a PhRMA Foundation fellowship to EB, and grants to FMS from the UCR Institute for Integrative Genome Biology and the NIH (R21 MH087397, R01 DK053892). KC, W H-V, CY and JMS were supported by NIH R01 DK053892. EB was supported by NIH R21 MH087397. The funding bodies did not have any role in the study design, data collection, manuscript preparation or submission.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.