Arrays of ultraconserved non-coding regions span the loci of key developmental genes in vertebrate genomes
- Albin Sandelin†1,
- Peter Bailey†2,
- Sara Bruce1, 3,
- Pär G Engström1,
- Joanna M Klos2,
- Wyeth W Wasserman4,
- Johan Ericson2Email author and
- Boris Lenhard1Email author
© Sandelin et al; licensee BioMed Central Ltd. 2004
Received: 02 December 2004
Accepted: 21 December 2004
Published: 21 December 2004
Evolutionarily conserved sequences within or adjoining orthologous genes often serve as critical cis-regulatory regions. Recent studies have identified long, non-coding genomic regions that are perfectly conserved between human and mouse, termed ultra-conserved regions (UCRs). Here, we focus on UCRs that cluster around genes involved in early vertebrate development; genes conserved over 450 million years of vertebrate evolution.
Based on a high resolution detection procedure, our UCR set enables novel insights into vertebrate genome organization and regulation of developmentally important genes. We find that the genomic positions of deeply conserved UCRs are strongly associated with the locations of genes encoding key regulators of development, with particularly strong positional correlation to transcription factor-encoding genes. Of particular importance is the observation that most UCRs are clustered into arrays that span hundreds of kilobases around their presumptive target genes. Such a hallmark signature is present around several uncharacterized human genes predicted to encode developmentally important DNA-binding proteins.
The genomic organization of UCRs, combined with previous findings, suggests that UCRs act as essential long-range modulators of gene expression. The exceptional sequence conservation and clustered structure suggests that UCR-mediated molecular events involve greater complexity than traditional DNA binding by transcription factors. The high-resolution UCR collection presented here provides a wealth of target sequences for future experimental studies to determine the nature of the biochemical mechanisms involved in the preservation of arrays of nearly identical non-coding sequences over the course of vertebrate evolution.
Comparative genome sequence analysis, often termed phylogenetic footprinting, has proven successful for the identification of cis-regulatory regions[1, 2]. Recent computational and experimental studies have identified a small number of large, highly conserved enhancers, or 'global control regions', associated with the regulation of important developmental genes such as DACH , SOX9 , Dlx bigene [5, 6], and HOX-D [7, 8] clusters. These regulatory regions can act at distances of several hundred kilobases from their target genes, while at the same time conferring an equivalent expression pattern to reporter genes over much shorter distances (e.g. ). A recent computational analysis proves that such highly conserved elements (termed ultra-conserved elements (UCRs)) are occurring far more often than expected . In the study by Bejerano et al., UCRs are defined as regions perfectly conserved between human and mouse longer than 200 base pairs (bp). The study reports a significant association of a non-transcribed subset of those elements with DNA-binding proteins; an equivalent observation has been made independently by Boffeli et al. for a limited number of most highly conserved elements between human and pufferfish. The stringent criteria for conservation applied in the two studies miss many known enhancer elements that are shorter than 200 bp, and highly conserved across all vertebrates. For instance, in a recently published study, Sabarinadh et al.  described a number of non-transcribed regions flanking the genes of HoxD gene cluster that are highly conserved across vertebrate genomes.
In this paper, we define a set of UCRs using high-resolution criteria that detect segments conserved between the human, mouse and pufferfish genomes. Analysis of this set provides insights into a previously unrealized organizational structure of UCRs in vertebrate genomes. We conclusively show that clusters of UCRs are globally associated with many of the genes that act as master regulators during vertebrate development. The clustered distribution of these regions along chromosomes and, importantly, around their presumptive target genes suggests that gene regulation involves the coordinated action of numerous, widely dispersed elements.
Definition and genomic environment of ultra-conserved non-coding regions (UCRs)
We initiated this study by applying comparative genomics to identify putative regulatory regions for a number of evolutionary conserved homeodomain transcription factors that control neural cell fate determination [12, 13]. When we examined the genomic landscapes surrounding homeodomain gene loci, we consistently found non-coding regions that exhibited a striking degree of sequence conservation between human and mouse over a minimum of 50 bp. Many of these regions are at least partially conserved over extended periods of evolution. The observed nucleotide identities between human and mouse sequences exceed even those of exon sequences encoding identical proteins. Such striking sequence conservation has previously been anecdotally associated with long-range enhancers for several developmental genes [3–8].
To test whether the association of UCRs with regulatory genes reflected a global genomic trend, we identified a comprehensive set of human/mouse/pufferfish UCRs for detailed analysis. We defined minimum requirements for a UCR (see Methods) and performed a genome-scale computational analysis that retrieved 3583 human/mouse/pufferfish UCRs. Since one of the requirements is that the UCRs are not overlapping actively transcribed genomic regions, they would belong to type II UCRs defined by Bejerano et al. .
The median UCR length was 125 bp, but extreme lengths (>1000 bp) were observed. Qualitative assessment of "genescapes", the gene structures, surrounding UCRs revealed them to be present either in introns, in dense clusters around a group of genes or in 'gene deserts' (up to several thousands kilobases from known genes). There appeared to be a strong association between locations of our set of UCRs and genes encoding transcription factors – even stronger than that reported by Bejerano et al. [see Additional file 1 and 2]. This observation will be proven in the subsequent analysis.
UCRs are strongly associated with DNA-binding proteins
Over-representation of protein domains in genes flanking UCRs. Bonferroni-corrected and uncorrected Fisher Exact Test p-values are shown for the 16 most over-represented InterPro domains. Typical transcription factor domains (DNA binding domains) are indicated in bold. A full list of all InterPro domains with P-values is given in [Additional file 3].
Fisher test P value
Corrected P value
UCRs clusters encompass the entire gene loci of key developmental genes
Rare duplications of UCRs across evolution
The human genome contains numerous ultra-conserved regulatory sequences that are shared broadly across vertebrates. These UCRs occur in arrays of highly conserved regulatory elements spanning large chromosomal regions. The clusters are co-localized with genes encoding key proteins for the regulation of development, with a particular correlation with genes encoding transcription factors. The strength of association between UCRs and diverse classes of DNA binding transcription factors validates that a relatively simple definition of UCRs captures a biologically meaningful set of functional sequences. The presence of non-coding UCRs is predictive for the presence of genes implicated in development, differentiation and malignancies. The list presented in [Additional file 6] hints at potentially crucial roles of currently uncharacterized transcription factor genes, while the collection of reported UCRs provides a wealth of regulatory locations for further study.
Exceptional mechanisms are brought to bear to retain UCRs over hundreds of millions of years of parallel evolution. UCRs are more strongly conserved than sequences encoding identical proteins, and exhibit sequence identity exceeding essentially all known cis-regulatory sequences. The retention properties suggest that UCRs have important functions in the vertebrate genome.
The observed UCRs could fall into multiple functional categories, including enhancers of transcription, regulators of chromatin structure and unknown genes for non-coding transcripts. A small subset of UCRs have been identified previously as enhancers of transcription [7, 3].
The high conservation and length of UCRs compared to binding sites for single transcription factors suggests that the mode of regulation must involve more than the binding of small number of transcription factors. Homeotypic clusters of binding sites, as seen in developmental genes in Drosophila melanogaster , represent one regulatory mechanism that could explain the occurrence of long, conserved non-coding regions. However, as transcription factors tolerate considerable variation between functional binding sites, a homeotypic cluster of binding sites as such cannot warrant the extreme level of conservation observed in UCRs. Alternatively, the recent emergence of the role of microRNAs in regulation suggests that there could be additional non-coding genes in the human genome, perhaps at the sites of ultra-conservation.
The clustering of UCRs suggests that UCR-mediated transcriptional regulation may involve molecular events on a greater scale, possibly involving chromatin structure. This potential link to chromatin structure is suggested by the striking pattern of UCRs in the IRX gene clusters. Most of the UCRs have no similarity between the two clusters, with the exception of a set of four UCRs that have retained both mutual sequence similarity and spatial position (Figure 4). It is tempting to assume that the retention of their mutual similarity is a consequence of IRX cluster co-regulation, the mechanism of which remains unknown.
Based on the preservation of nearly identical sequences over ~450 million years of vertebrate evolution, it is reasonable to postulate the influence of exceptional biochemical mechanisms. Numerous hypotheses could account for the observed data, broadly falling into two categories – active mechanism(s) resulting in the decrease of mutational frequency in UCRs, or negative pressure consistent with evolutionary selection against such mutations. Given the breadth of possibilities, we leave postulation until further data emerges.
Since Bejerano et al. focused on larger regions (200 bp) of perfect nucleotide identity compared to our more permissive settings (95% sequence identity over 50 bp), the genomic arrangement of UCR-containing regions with respect to their presumptive target genes was not fully realized. Our findings include critical new information about UCR clusters, particularly with regards to patterns of conservation, their genomic organization, and the insights they provide into potential chromatin regulating mechanisms. These mysterious regions retained over hundreds of millions of years of evolution appear to contribute to a novel mechanism of developmental regulation. Detailed studies of UCRs that will ensue from the discoveries reported here promise to advance our understanding of vertebrate development.
Definition of UCRs applied in this study
We defined UCRs as non-protein coding genomic regions having a sequence identity > 95% over a 50 bp sliding window of length in human/mouse comparison (based on the tight alignments track from the UCSC genome browser database, using human and mouse assemblies hg15 and mm3, respectively). As a further constraint, an UCR must overlap with sequences conserved between the human and pufferfish genomes, as defined in the UCSC genome browser databases (a BLAT  alignment between human and pufferfish with a minimum BLAT score of 20). In order to avoid inclusion of coding sequence, we required that a UCR must not overlap a mouse or human cDNA mapped to the genome (based on cDNA tracks from from the UCSC genome browser database) or overlap putative coding regions predicted by GenScan .
Calculation of UCR and gene distributions
The distribution of UCRs in the genome was calculated by counting the number of UCRs within a 500 kilobase (kb) window which was progressively slid over each chromosome in 100 kb intervals. The same approach was used to estimate the gene density; specifically by summing the number of bases within the window that aligned with human mRNA (from the UCSC Genome Browser database).
Gene-UCR distance calculation
Distances between a given gene and UCR on the same chromosome were defined as the shortest distance between the starting points and/or endpoints of UCR and gene in the human genome (UCSC assembly hg15), using EnsEMBL  gene annotation. Genes based solely on ESTs or computational predictions were not included.
Estimation of significance of Gene-UCR distances
The distances from genes within a set (for instance, all forkhead domain-containing genes) to the closest UCRs were calculated as above. The expected fraction of gene-UCR distances smaller than 8 kb was estimated by simulation: UCR genome coordinates were randomly chosen and distances measured as above. The simulation process was repeated 1000 times and the average fraction reported. In order to estimate if the observed distribution was significantly different from the expected, we used the chi-squared test.
Estimation of domain over-representation in genes closest to UCRs
For each UCR, the closest upstream and downstream gene within 2 Mbp was identified (UCRs inside introns of genes were analyzed separately). EnsEMBL InterPro  domain annotation was used to tabulate a contingency table consisting of the positive sample counts (number of genes in the set containing a certain domain), negative sample counts (number of remaining genes in the set), background positives (number of genes containing the same domain in the genome) and background negatives (remaining genes). For clarity, a given gene was only counted once, and multiple occurrences of the same domain within the same protein were not counted.
For each domain found in the UCR-proximal genes, we tested the null-hypothesis that the sample and background sets are drawn from the same population versus the alternative hypothesis that the sample set has a higher frequency of the domain, using Fisher's Exact Test  from the R statistical package http://www.r-project.org. Since the number of tests is considerable, we corrected for multiple sampling using the conservative Bonferroni method , in which the number of tests is multiplied with the P-value from the Fisher test with the number of unique domains tested (837). An analogous analysis was performed with genes containing one or more UCRs within their introns [see Additional file 4].
Estimation of clustering tendency
We used the distances between consecutive UCRs as a statistic indicating clustering. A neutral background distance distribution was created by assigning UCRs genome coordinates randomly, and subsequently measuring distances between consecutive UCRs. This process was repeated 1000 times. We compared the distance distribution between naturally occurring UCRs and the background using the Kolmogorov-Smirnov test , which assigns a probability that two distributions are similarly shaped.
UCR sequence similarity analysis
All possible pairs of UCRs were aligned using NCBI BLASTN  with standard settings. For any pair to be reported as near-identical, we required an HSP of at least 50 bp and a pairwise sequence identity exceeding 75%.
- UCR –:
ultraconserved non-coding region
- bp –:
- kbp –:
103 base pairs
AS and BL were supported in part by funding from Pharmacia Corporation (now Pfizer). JE is supported by the Royal Swedish Academy of Sciences, by a donation from the Wallenberg Foundation, The Swedish Foundation for Strategic Research, The Wallenberg Foundation, The Swedish National Research Council and the EC network grants: Brainstem Genetics: QLRT-2000-01467 and Stembridge: QLG3-CT-2002-01141. W.W. is supported by the Michael Smith Foundation for Health Research and the Canadian Institutes of Health Research.
- Lenhard B, Sandelin A, Mendoza L, Engstrom P, Jareborg N, Wasserman WW: Identification of conserved regulatory elements by comparative genome analysis. J Biol. 2003, 2: 13-10.1186/1475-4924-2-13.PubMed CentralView ArticlePubMedGoogle Scholar
- Ureta-Vidal A, Ettwiller L, Birney E: Comparative genomics: genome-wide analysis in metazoan eukaryotes. Nat Rev Genet. 2003, 4: 251-262. 10.1038/nrg1043.View ArticlePubMedGoogle Scholar
- Nobrega MA, Ovcharenko I, Afzal V, Rubin EM: Scanning human gene deserts for long-range enhancers. Science. 2003, 302: 413-10.1126/science.1088328.View ArticlePubMedGoogle Scholar
- Bagheri-Fam S, Ferraz C, Demaille J, Scherer G, Pfeifer D: Comparative genomics of the SOX9 region in human and Fugu rubripes: conservation of short regulatory sequence elements within large intergenic regions. Genomics. 2001, 78: 73-82. 10.1006/geno.2001.6648.View ArticlePubMedGoogle Scholar
- Sumiyama K, Irvine SQ, Ruddle FH: The role of gene duplication in the evolution and function of the vertebrate Dlx/distal-less bigene clusters. J Struct Funct Genomics. 2003, 3: 151-159. 10.1023/A:1022682505914.View ArticlePubMedGoogle Scholar
- Ghanem N, Jarinova O, Amores A, Long Q, Hatch G, Park BK, Rubenstein JL, Ekker M: Regulatory roles of conserved intergenic domains in vertebrate Dlx bigene clusters. Genome Res. 2003, 13: 533-543. 10.1101/gr.716103.PubMed CentralView ArticlePubMedGoogle Scholar
- Spitz F, Gonzalez F, Duboule D: A global control region defines a chromosomal regulatory landscape containing the HoxD cluster. Cell. 2003, 113: 405-417. 10.1016/S0092-8674(03)00310-6.View ArticlePubMedGoogle Scholar
- Santini S, Boore JL, Meyer A: Evolutionary conservation of regulatory elements in vertebrate Hox gene clusters. Genome Res. 2003, 13: 1111-1122. 10.1101/gr.700503.PubMed CentralView ArticlePubMedGoogle Scholar
- Bejerano G, Pheasant M, Makunin I, Stephen S, Kent WJ, Mattick JS, Haussler D: Ultraconserved elements in the human genome. Science. 2004, 304: 1321-1325. 10.1126/science.1098119.View ArticlePubMedGoogle Scholar
- Boffelli D, Nobrega MA, Rubin EM: Comparative genomics at the vertebrate extremes. Nat Rev Genet. 2004, 5: 456-465. 10.1038/nrg1350.View ArticlePubMedGoogle Scholar
- Sabarinadh C, Subramanian S, Tripathi A, Mishra RK: Extreme conservation of noncoding DNA near HoxD complex of vertebrates. BMC Genomics. 2004, 5: 75-10.1186/1471-2164-5-75.PubMed CentralView ArticlePubMedGoogle Scholar
- Cornell RA, Ohlen TV: Vnd/nkx, ind/gsh, and msh/msx: conserved regulators of dorsoventral neural patterning?. Curr Opin Neurobiol. 2000, 10: 63-71. 10.1016/S0959-4388(99)00049-5.View ArticlePubMedGoogle Scholar
- Briscoe J, Ericson J: Specification of neuronal fates in the ventral neural tube. Curr Opin Neurobiol. 2001, 11: 43-49. 10.1016/S0959-4388(00)00172-0.View ArticlePubMedGoogle Scholar
- Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Barrell D, Bateman A, Binns D, Biswas M, Bradley P, Bork P, Bucher P, Copley RR, Courcelle E, Das U, Durbin R, Falquet L, Fleischmann W, Griffiths-Jones S, Haft D, Harte N, Hulo N, Kahn D, Kanapin A, Krestyaninova M, Lopez R, Letunic I, Lonsdale D, Silventoinen V, Orchard SE, Pagni M, Peyruc D, Ponting CP, Selengut JD, Servant F, Sigrist CJ, Vaughan R, Zdobnov EM: The InterPro Database, 2003 brings increased coverage and new features. Nucleic Acids Res. 2003, 31: 315-318. 10.1093/nar/gkg046.PubMed CentralView ArticlePubMedGoogle Scholar
- Westfall PH, Wolfinger RD: Multiple Tests with Discrete Distributions. The American Statistician. 1997, 51: 3-8.Google Scholar
- Lettice LA, Horikoshi T, Heaney SJ, van Baren MJ, van der Linde HC, Breedveld GJ, Joosse M, Akarsu N, Oostra BA, Endo N, Shibata M, Suzuki M, Takahashi E, Shinka T, Nakahori Y, Ayusawa D, Nakabayashi K, Scherer SW, Heutink P, Hill RE, Noji S: Disruption of a long-range cis-acting regulator for Shh causes preaxial polydactyly. Proc Natl Acad Sci U S A. 2002, 99: 7548-7553. 10.1073/pnas.112212199.PubMed CentralView ArticlePubMedGoogle Scholar
- Karolchik D, Baertsch R, Diekhans M, Furey TS, Hinrichs A, Lu YT, Roskin KM, Schwartz M, Sugnet CW, Thomas DJ, Weber RJ, Haussler D, Kent WJ: The UCSC Genome Browser Database. Nucleic Acids Res. 2003, 31: 51-54. 10.1093/nar/gkg129.PubMed CentralView ArticlePubMedGoogle Scholar
- de Crombrugghe B, Lefebvre V, Behringer RR, Bi W, Murakami S, Huang W: Transcriptional mechanisms of chondrocyte differentiation. Matrix Biol. 2000, 19: 389-394. 10.1016/S0945-053X(00)00094-9.View ArticlePubMedGoogle Scholar
- Gomez-Skarmeta JL, Modolell J: Iroquois genes: genomic organization and function in vertebrate neural development. Curr Opin Genet Dev. 2002, 12: 403-408. 10.1016/S0959-437X(02)00317-9.View ArticlePubMedGoogle Scholar
- Lifanov AP, Makeev VJ, Nazina AG, Papatsenko DA: Homotypic regulatory clusters in Drosophila. Genome Res. 2003, 13: 579-588. 10.1101/gr.668403.PubMed CentralView ArticlePubMedGoogle Scholar
- Kent WJ: BLAT--the BLAST-like alignment tool. Genome Res. 2002, 12: 656-664. 10.1101/gr.229202. Article published online before March 2002.PubMed CentralView ArticlePubMedGoogle Scholar
- Burge C, Karlin S: Prediction of complete gene structures in human genomic DNA. J Mol Biol. 1997, 268: 78-94. 10.1006/jmbi.1997.0951.View ArticlePubMedGoogle Scholar
- Birney E, Andrews D, Bevan P, Caccamo M, Cameron G, Chen Y, Clarke L, Coates G, Cox T, Cuff J, Curwen V, Cutts T, Down T, Durbin R, Eyras E, Fernandez-Suarez XM, Gane P, Gibbins B, Gilbert J, Hammond M, Hotz H, Iyer V, Kahari A, Jekosch K, Kasprzyk A, Keefe D, Keenan S, Lehvaslaiho H, McVicker G, Melsopp C, Meidl P, Mongin E, Pettett R, Potter S, Proctor G, Rae M, Searle S, Slater G, Smedley D, Smith J, Spooner W, Stabenau A, Stalker J, Storey R, Ureta-Vidal A, Woodwark C, Clamp M, Hubbard T: Ensembl 2004. Nucleic Acids Res. 2004, 32 Database issue: D468-70. 10.1093/nar/gkh038.View ArticleGoogle Scholar
- Mehta CR, Patel NR: FEXACT: A Fortran subroutine for Fisher's exact test on unordered r*c contingency tables ACM Transactions on Mathematical Software, 12, 154–161. ACM Transactions on Mathematical Software. 1986, 12: 154-161. 10.1145/6497.214326.View ArticleGoogle Scholar
- Conover WJ: Practical nonparametric statistics. 1971, New York, John Wiley & SonsGoogle Scholar
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25: 3389-3402. 10.1093/nar/25.17.3389.PubMed CentralView ArticlePubMedGoogle Scholar
- Stein LD, Mungall C, Shu S, Caudy M, Mangone M, Day A, Nickerson E, Stajich JE, Harris TW, Arva A, Lewis S: The generic genome browser: a building block for a model organism system database. Genome Res. 2002, 12: 1599-1610. 10.1101/gr.403602.PubMed CentralView ArticlePubMedGoogle Scholar