Evolution of conserved secondary structures and their function in transcriptional regulation networks
© Xie et al. 2008
Received: 05 June 2008
Accepted: 02 November 2008
Published: 02 November 2008
Many conserved secondary structures have been identified within conserved elements in the human genome, but only a small fraction of them are known to be functional RNAs. The evolutionary variations of these conserved secondary structures in human populations and their biological functions have not been fully studied.
We searched for polymorphisms within conserved secondary structures and identified a number of SNPs within these elements even though they are highly conserved among species. The density of SNPs in conserved secondary structures is about 65% of that of their flanking, non-conserved, sequences. Classification of sites as stems or as loops/bulges revealed that the density of SNPs in stems is about 62% of that found in loops/bulges. Analysis of derived allele frequency data indicates that sites in stems are under stronger evolutionary constraint than sites in loops/bulges. Intergenic conserved secondary structures tend to associate with transcription factor-encoding genes with genetic distance being the measure of regulator-gene associations. A substantial fraction of intergenic conserved secondary structures overlap characterized binding sites for multiple transcription factors.
Strong purifying selection implies that secondary structures are probably important carriers of biological functions for conserved sequences. The overlap between intergenic conserved secondary structures and transcription factor binding sites further suggests that intergenic conserved secondary structures have essential roles in directing gene expression in transcriptional regulation networks.
Conserved genomic elements are shared by a wide spectrum of organisms, and with the increased availability of sequenced genomes, it is now feasible to identify these elements by comparative genomic analysis of highly divergent species. A series of studies have focused on the identification of conserved elements in the human genome, and have revealed that about 5% of the genome is composed of these conserved elements [1, 2]. The precise number of conserved elements identified in a genome varies between studies, however, owing to the specific search criteria used and the degree of divergence between the genomes analyzed [1, 3, 4]. The primary criteria for identification were largely based on sequence identity. For example, Bejerano et al. defined 481 ultraconserved elements as sequences of at least 200 base pairs showing 100% identity in human-mouse and human-rat genomic comparisons. An alternative strategy was used by Cooper et al., who calculated a "rejected substitutions" (RS) value for sequences, where RS is computed by comparing the number of observed substitutions to the number expected if the sequences were evolving neutrally. Sequences with high RS values therefore show high identity, and with a threshold of 8.5 RS this method achieved about 95% confidence in the identification of conserved elements. In the human genome, conserved elements range in size from dozens to thousands of base pairs. While some elements overlap protein coding sequences, most are located in intergenic and intronic regions of the genome. Even in simpler organisms, conserved elements are an important component of the genome. Searches in vertebrate, insect, worm and yeast genomes have found that as genome size increases, a larger fraction of the conserved elements is located outside of the exons of protein coding genes.
Despite the well documented existence of conserved elements, the significance of these sequences remains largely unknown. Evidence suggests that conserved elements represent a variety of different types of DNA sequences. Some families of ancient repetitive sequences have been found to be under strong purifying selection and are conserved among many species. Some conserved elements have been identified as genes encoding microRNAs, for example the microRNAs in insects. The highest number of microRNA genes estimated for metazoans and plants is about 2,500, with only about 1,000 of these genes estimated in humans; microRNA genes can therefore represent only a tiny fraction of the conserved elements. Other attempts have been made to characterize the potential functions of conserved sequences, most of which document a statistically significant association between conserved elements and gene families for transcription factors and developmental regulators [3, 4, 11, 12]. Experimental assays have characterized the transcriptional activities of only a handful of conserved elements, a few of which were found capable of driving the expression of proximal genes [11, 13, 14], strongly suggesting that these conserved elements may have enhancer activity. Conserved elements may exert their regulatory activity over great genomic distances. A recent analysis, based on duplicated conserved elements, indicated that the distance of regulatory activity of conserved elements can vary dramatically, with more than half of the elements regulating target genes that are more than 250 kb, and as much as 2 Mb, away. In addition, Frazer et al., in a study of conserved elements in the SIM2 interval, uncovered an additive effect of adjacent elements on promoting gene expression, suggesting that some of the conserved elements function together despite the great distances that separate them from their target genes.
The distance between highly conserved elements is also conserved. Less variation is observed in the distances between conserved elements than in the distances between protein coding sequences in human-mouse and human-dog genome pairs. This observation implies that the size or orientation of the intervening space may be important for the co-function of these elements. Abnormal action of conserved elements can lead to genetic diseases. Many developmental diseases have been attributed to the malfunction of conserved noncoding sequences, including preaxial polydactyly, blepharophimosis syndrome, and Van Buchem disease.
Many conserved secondary structures (CSSs) were identified in the human genome by using an eight-way genome-wide alignment. Some of the CSSs identified in this alignment correspond to known functional RNAs, such as microRNAs, histone 3'-UTR stem-loops, and some genetic recoding elements. In insects, a conserved element with a secondary structure has been implicated in the control of alternative mRNA splicing, so some of the human elements may potentially have similar roles. However, the functions of most of the identified CSSs remain unknown. In this study, we analyzed the evolutionary constraint acting upon CSSs using data from SNPs and demonstrated that about 1/3 of the mutations in CSSs were eliminated by selection in human populations and that sites in the stems of the predicted secondary structures are under stronger constraint than sites in loops/bulges. A substantial number of intergenic CSSs overlap the binding sites for transcription factors and are located proximal to transcription factor-encoding genes; we therefore speculate that they may function in transcriptional regulation networks. We suggest that a substantial portion of intergenic CSSs function as cis-regulators and that the structural conservation is partially attributable to a steric requirement for interacting with transcription factors.
We initiated this study by reanalyzing the CSS data originally produced by Pedersen et al. CSSs were predicted with the EvoFold program from sequences defined as conserved by the PhastCons method, using a whole-genome alignment generated by the MULTIZ program from the human, chimpanzee, mouse, rat, dog, chicken, zebrafish, and pufferfish genomes. Only long secondary structures (at least 15 pairing bases) were included in our analysis, with a focus on examining polymorphisms within them and the associations between intergenic CSSs and their neighboring genes. Of the total of 9404 long CSSs, 4473 are located in intergenic regions, 2690 are located inside intronic regions, 1428 overlap protein coding sequences (CDSs), and the remaining 783 are located in untranslated regions (UTRs) of genes. To measure the evolutionary constraint acting upon CSSs, SNPs were used to determine the polymorphism density and derived allele frequencies. Data on SNPs and recombination rates were obtained from the HapMap project and from the dbSNP database. Genetic distances between SNPs spanning intergenic CSSs and flanking genes were calculated using the recombination rate information and were used to investigate the associations between CSSs and their flanking genes.
To investigate whether the SNP density varies within CSSs, we classified nucleotide sites of the CSSs as being in predicted stems or loops/bulges according to their positions in the structural folding predicted by EvoFold. A total of 559,960 nucleotides mapped to stems and 311,897 nucleotides mapped to loops/bulges of CSSs. Of the 746 SNPs located within CSSs, 392 SNPs mapped to stems and 354 SNPs to loops/bulges, demonstrating that stems have a much lower SNP density (0.70 SNP/kb), about 62% of the density of loops/bulges (1.13 SNP/kb). This result implies that a large fraction of the mutations that occur in stems are deleterious and are removed by selection within human populations, suggesting that mutations in the stems of secondary structures have a greater impact on the function of CSSs than mutations in loops/bulges. Indirectly, these data also further support the existence of CSSs in the human genome, as we would not expect to observe differences in SNP density if the secondary structures were simply due to false positive folding. The SNP density of loops/bulges is still lower than that of the flanking sequences, suggesting that sites in the loops/bulges are also constrained.
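The densities quoted in this paragraph follow directly from the reported counts. A minimal arithmetic check is sketched below; the function name `snp_density_per_kb` is introduced here only for illustration and is not part of the original analysis.

```python
def snp_density_per_kb(n_snps, n_sites):
    """Return the number of SNPs per kilobase for a class of nucleotide sites."""
    return 1000.0 * n_snps / n_sites


# Counts reported in the text for sites in predicted stems and in loops/bulges.
stem_density = snp_density_per_kb(392, 559_960)
loop_density = snp_density_per_kb(354, 311_897)
ratio = stem_density / loop_density

print(f"stems:        {stem_density:.2f} SNP/kb")
print(f"loops/bulges: {loop_density:.2f} SNP/kb")
print(f"stem / loop ratio: {ratio:.2f}")
```

With the counts above this gives 0.70 and 1.13 SNP/kb and a ratio of 0.62, in agreement with the values stated in the text.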
We then investigated the associations between intergenic CSSs and their flanking genes. Genetic distance is roughly proportional to physical distance, and it appears reasonable to hypothesize that non-homologous recombination is less likely to happen between an intergenic CSS and its target gene; genetic distance, rather than physical distance, was therefore used to measure the tightness of association between intergenic CSSs and their flanking genes. The genetic distances between intergenic CSSs and their flanking genes were calculated using data for the recombination rates between SNPs spanning the interval where the genes were located. The gene that showed the minimum genetic distance from the intergenic CSS was taken to be the target gene of the CSS. Given this assumption, intergenic CSSs were found to be enriched near genes encoding transcription factors (P = 1.4 × 10⁻⁵, chi-square test), an observation consistent with previous reports [3, 11]. In total, 1069 of the 16,574 protein coding genes annotated in the Gene Ontology (GO) and Gene Ontology Annotation (GOA) databases are associated with intergenic CSSs, and of these 1069 genes, 323 encode transcription factors, a fraction (0.30) much higher than the fraction (0.15, 2525/16574) in the annotated GO/GOA databases (P < 0.001, chi-square test). Enrichment of CSSs around transcription factor-encoding genes suggests that a substantial portion of intergenic CSSs may function as cis-regulatory elements. In addition, intergenic CSSs were also found to be statistically enriched proximal to genes that are involved in development and differentiation (P < 0.01, chi-square test). Detailed results are given in Additional file 1.
Strikingly, 138 of the 323 transcription factor-encoding genes associated with CSSs are also known to be important in development, a fraction (0.43) that is significantly higher than the fraction (0.18, 404/2202) of the remaining transcription factor-encoding genes that are involved in development but are not associated with intergenic CSSs (P = 2 × 10⁻²³, chi-square test). These observations suggest that transcription factor-encoding genes associated with intergenic CSSs regulate developmental processes.
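The comparison of 138/323 with 404/2202 can be laid out as a 2 × 2 contingency table. The sketch below assumes a Pearson chi-square test with one degree of freedom and no continuity correction; the text does not state which variant of the chi-square test was used, and the function name `chi_square_2x2` is hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns (statistic, p_value) for 1 degree of freedom.
    """
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom the upper-tail probability is erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Rows: transcription factor genes associated / not associated with intergenic CSSs.
# Columns: involved / not involved in development (counts taken from the text).
stat, p = chi_square_2x2(138, 323 - 138, 404, 2202 - 404)
print(f"chi-square = {stat:.1f}, P = {p:.1e}")
```

Under these assumptions the P value comes out on the order of 10⁻²³, the same order of magnitude as the value reported above.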
The clustering of a substantial portion of the intergenic CSSs in the proximity of transcription factor-encoding genes is similar to the organization of transcriptional regulation networks that regulate many transcription factor genes [25–27]. For example, Boyer et al. have experimentally identified the binding sites for several important transcription factors that affect stem cell identity, including OCT4, SOX2, and NANOG, and the binding sites of these three transcription factors are found proximal to many transcription factor genes, including their own genes, and therefore may form self-regulatory network loops. In the TFCONES database, a considerable fraction of conserved elements were annotated as overlapping with the binding sites of many transcription factors. Here we examined how many of the intergenic CSSs are potentially bound by these important transcription factors. When chromosomal coordinates were used to map transcription factor binding sites and intergenic CSSs, 15, 17, and 18 intergenic CSSs were found to overlap with binding sites for SOX2, OCT4, and NANOG, respectively. The 15 intergenic CSSs potentially bound by SOX2 associate with 14 protein coding genes, of which 8 encode transcription factors. Similarly, 13 and 16 protein coding genes associate with intergenic CSSs bound by OCT4 and NANOG, respectively, of which 10 and 12, respectively, are transcription factor-encoding genes. We also examined whether there was an overlap between intergenic CSSs and the binding sites for C-MYC and SUZ12, factors for which binding sites have also been experimentally mapped [26, 27]. We found that 174 (3.86% of the total) intergenic CSSs overlap with binding sites for SUZ12 and 9 (0.20% of the total) overlap binding sites for C-MYC.
The 174 intergenic CSSs bound by SUZ12 are associated with 100 genes, 67 of which encode transcription factors, while the 9 intergenic CSSs bound by C-MYC are associated with 7 genes, 5 of which are transcription factor-encoding genes. The overlap with binding sites for these five transcription factors indicates that a substantial number of intergenic CSSs may function at the experimentally verified binding sites for these transcription factors.
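Mapping binding sites onto intergenic CSSs by chromosomal coordinates amounts to an interval-overlap test. The following is a minimal sketch, assuming closed intervals given as `(chromosome, start, end)` tuples on the same genome assembly; the coordinate convention, the function names, and the toy coordinates are illustrative assumptions, not details taken from the original analysis.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if the closed intervals [a_start, a_end] and [b_start, b_end] share a base."""
    return a_start <= b_end and b_start <= a_end


def css_overlapping_sites(css_list, site_list):
    """Return the CSSs that overlap at least one binding site on the same chromosome."""
    # Group binding sites by chromosome so each CSS is compared only with
    # sites on its own chromosome.
    sites_by_chrom = {}
    for chrom, start, end in site_list:
        sites_by_chrom.setdefault(chrom, []).append((start, end))

    hits = []
    for chrom, start, end in css_list:
        candidates = sites_by_chrom.get(chrom, [])
        if any(overlaps(start, end, s, e) for s, e in candidates):
            hits.append((chrom, start, end))
    return hits


# Toy coordinates, invented for illustration only.
toy_css = [("chr1", 100, 150), ("chr1", 500, 560), ("chr2", 100, 150)]
toy_sites = [("chr1", 140, 300), ("chr2", 400, 500)]
toy_hits = css_overlapping_sites(toy_css, toy_sites)
print(toy_hits)
```

The linear scan per chromosome keeps the sketch short; for genome-scale binding-site sets, a sorted list with binary search or an interval tree would be a more efficient choice.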
In this study, we have conducted a systematic survey of the evolutionary constraints acting upon CSSs and investigated the relationship between the enrichment of intergenic CSSs and the enrichment of binding sites for transcription factors proximal to genes that encode transcription factors. Our survey of the evolutionary constraints revealed that intensive purifying selection acts against mutations in CSSs and has favored the maintenance of secondary structures, implying that the secondary structures in these conserved sequences are functionally important. The enrichment of intergenic CSSs near transcription factor-encoding genes suggests that these CSSs likely function as cis-regulatory elements rather than being transcribed into RNAs, since it is not necessary for RNA genes to be organized predominantly near any class of protein coding genes in the genome.
A recent study focusing on a secondary DNA structure near the gene Hoxb9 revealed that this structure functions as an important binding site for the protein FBXL10 and is conserved between human and mouse. For the Hoxb9 promoter, DNA fragments with identical sequence were isolated in two conformations, one linear and the other containing a secondary structure. Intriguingly, FBXL10 exhibits a high binding affinity for the structured promoter rather than for the linear promoter sequence, strongly suggesting that this protein's binding activity is structure-dependent. Similarly, in this study we have observed an overlap between intergenic CSSs and the binding sites for several transcription factors, which may be due to a steric requirement during the interaction between these transcription factors and genomic DNA sequences. However, protein binding to intergenic CSSs may still be sequence-dependent, since complementary substitutions in an inverted repeat of the Hlx gene promoter, which retained the secondary structure, did not restore promoter activity; this would explain why CSSs are highly conserved in both structure and primary sequence. In the case of Hoxb9, the protein FBXL10 binds competitively to the structured promoter, and this binding is critical for reducing the expression of Hoxb9. In our analysis we found several intergenic CSSs that could each bind at least 2 different transcription factors, suggesting that competitive binding of transcription factors to a single intergenic CSS could occur at these sites. We found 8 intergenic CSSs that are associated with genes co-regulated by the three transcription factors SOX2, OCT4, and NANOG (see Fig. 6B), of which 6 consist entirely of binding sites for each of these three transcription factors. Of these six intergenic CSSs, one also contains a binding site for SUZ12 (data not shown). Interestingly, Lee et al. have previously documented that the expression of genes regulated by SUZ12 changed from expressed to repressed, possibly due to the competitive binding of SUZ12 to cis-regulatory sequences that were previously bound by other transcription factors that activated gene expression. Many intergenic CSSs may act in a similar manner and function as a switch, with the alternative binding of transcription factors directly affecting the temporal expression of target genes.
Our study supports the hypothesis that DNA secondary structures are important units that function in the interaction between proteins and genomic DNA sequences. Investigating mutations that change the pairing status of CSSs should facilitate the identification of functional variants that predispose to genetic diseases.
Despite the evolutionary conservation of conserved secondary structures (CSSs), a considerable amount of variation in CSS sequences exists in the genomes of human populations. Analyses of the variant sequences that exist in human populations demonstrate that sites in stems of the predicted secondary structures are under stronger evolutionary constraint than sites in loops/bulges, which are still more constrained than non-conserved sequences. An overlap between CSSs and the binding sites of transcription factors was found to be enriched near transcription factor-encoding genes, suggesting a role for CSSs in transcriptional regulation networks.
Human genomes (assemblies hg18 and hg17) were downloaded from the UCSC Genome Bioinformatics Site. CSSs were retrieved from the EvoFold home page, but only long CSSs (at least 15 pairing bases) were used in this analysis. An application using hash tables was written to map CSSs from human genome hg17 to hg18. Locations and SNP ancestral allele data were obtained from the dbSNP database available at the NCBI. Frequencies of SNPs in the HapMap populations, namely Yoruba in Ibadan, Nigeria (YRI), Japanese in Tokyo, Japan (JPT), Han Chinese in Beijing, China (CHB), and Utah Residents with Northern and Western European Ancestry (CEU), and the recombination rates between SNPs were derived from the HapMap project (Release #22). Information on genes was retrieved from the GenBank database. Genes were categorized into different functional groups according to the annotations in the Gene Ontology database (GO) and in the Gene Ontology Annotation database (GOA). Binding sites for transcription factors SOX2, OCT4, NANOG, C-MYC, and SUZ12 were downloaded as described in previous reports [25–27]. Positions of SNPs relative to CSSs and the overlap between CSSs and the binding sites of transcription factors were calculated according to their chromosomal coordinates in the human genome. SNP densities of flanking sequences were calculated 50 times on each side of the CSSs with a moving window of 200 bp in size. DAFs for SNPs were calculated using ancestral alleles. Genetic distances between the intergenic CSSs and flanking genes were obtained by calculating the genetic distance between two SNPs spanning the genomic interval having the minimal physical distance. The target gene of a CSS was chosen as the flanking gene that showed the minimum genetic distance to the intergenic CSS. Statistical analyses were performed and figures were prepared by using the R software.
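The target-gene assignment described above can be sketched as follows. This is an illustration under stated assumptions, not the original implementation: the genetic map is assumed to be a sorted list of SNP positions with cumulative genetic positions in cM, the "spanning" SNP pair is taken to be the tightest pair that encloses the gap between the CSS and the gene, and all function names and toy values are hypothetical.

```python
from bisect import bisect_left, bisect_right


def spanning_genetic_distance(snp_pos, snp_cm, start, end):
    """Genetic distance (cM) between the tightest SNP pair enclosing [start, end].

    snp_pos: sorted SNP positions (bp); snp_cm: cumulative map positions (cM).
    Returns None when no SNP pair spans the interval.
    """
    left = bisect_right(snp_pos, start) - 1   # last SNP at or before `start`
    right = bisect_left(snp_pos, end)         # first SNP at or after `end`
    if left < 0 or right >= len(snp_pos):
        return None
    return snp_cm[right] - snp_cm[left]


def pick_target_gene(css, genes, snp_pos, snp_cm):
    """Return (gene_name, cM) for the flanking gene at minimal genetic distance."""
    css_start, css_end = css
    best = None
    for name, g_start, g_end in genes:
        if g_end < css_start:        # gene lies upstream of the CSS
            gap = (g_end, css_start)
        elif g_start > css_end:      # gene lies downstream of the CSS
            gap = (css_end, g_start)
        else:                        # overlapping gene: CSS is not intergenic here
            continue
        dist = spanning_genetic_distance(snp_pos, snp_cm, *gap)
        if dist is not None and (best is None or dist < best[1]):
            best = (name, dist)
    return best


# Toy map with a recombination hotspot between 1,000 and 2,000 bp (invented values).
toy_pos = [100, 500, 1000, 2000, 3000, 4000]
toy_cm = [0.00, 0.01, 0.02, 0.50, 0.51, 0.52]
toy_genes = [("geneA", 300, 900), ("geneB", 3500, 3900)]
target = pick_target_gene((2100, 2200), toy_genes, toy_pos, toy_cm)
print(target)
```

In this toy example geneA is physically closer to the CSS (a 1,200 bp gap versus 1,300 bp) but is separated from it by the hotspot, so geneB is selected. This illustrates why genetic and physical distance can disagree.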
CSS: conserved secondary structure
DAF: derived allele frequency
GOA: Gene Ontology Annotation
YRI: Yoruba in Ibadan, Nigeria
JPT: Japanese in Tokyo, Japan
CHB: Han Chinese in Beijing, China
CEU: Utah Residents with Northern and Western European Ancestry
We thank Zhou-Hai Zhu and Wei Jin for assisting in the preparation of figures and Xing-Lai Ji and Jiao-Long Huang for useful discussions. This work was supported by grants from the National Basic Research Program of China (973 Program, 2007CB411600), the National Natural Science Foundation of China (30621092, 30430110), and Bureau of Science and Technology of Yunnan Province.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.