The human genome contains numerous ultra-conserved regulatory sequences (UCRs) that are shared broadly across vertebrates. These UCRs occur in arrays of highly conserved regulatory elements spanning large chromosomal regions. The clusters co-localize with genes encoding key regulators of development, and are particularly correlated with genes encoding transcription factors. The strength of the association between UCRs and diverse classes of DNA-binding transcription factors indicates that a relatively simple definition of UCRs captures a biologically meaningful set of functional sequences. The presence of non-coding UCRs is predictive of the presence of genes implicated in development, differentiation and malignancies. The list presented in [Additional file 6] hints at potentially crucial roles for currently uncharacterized transcription factor genes, while the collection of reported UCRs provides a wealth of regulatory locations for further study.
Exceptional mechanisms appear to have acted to retain UCRs over hundreds of millions of years of parallel evolution. UCRs are more strongly conserved than sequences encoding identical proteins, and their sequence identity exceeds that of essentially all known cis-regulatory sequences. This degree of retention suggests that UCRs have important functions in the vertebrate genome.
The observed UCRs could fall into multiple functional categories, including enhancers of transcription, regulators of chromatin structure and as-yet-unidentified genes for non-coding transcripts. A small subset of UCRs has previously been identified as enhancers of transcription [7, 3].
The high conservation and length of UCRs, compared to binding sites for single transcription factors, suggest that the mode of regulation must involve more than the binding of a small number of transcription factors. Homeotypic clusters of binding sites, as seen in developmental genes in Drosophila melanogaster, represent one regulatory mechanism that could explain the occurrence of long, conserved non-coding regions. However, because transcription factors tolerate considerable variation between functional binding sites, a homeotypic cluster of binding sites cannot by itself account for the extreme level of conservation observed in UCRs. Alternatively, the recently recognized role of microRNAs in regulation suggests that there could be additional non-coding genes in the human genome, perhaps at the sites of ultra-conservation.
The clustering of UCRs suggests that UCR-mediated transcriptional regulation may involve molecular events on a greater scale, possibly involving chromatin structure. This potential link to chromatin structure is suggested by the striking pattern of UCRs in the two IRX gene clusters. Most of the UCRs show no sequence similarity between the two clusters, with the exception of a set of four UCRs that have retained both mutual sequence similarity and spatial position (Figure 4). It is tempting to speculate that the retention of their mutual similarity is a consequence of co-regulation of the IRX clusters, the mechanism of which remains unknown.
Given the preservation of nearly identical sequences over ~450 million years of vertebrate evolution, it is reasonable to postulate the influence of exceptional biochemical mechanisms. Numerous hypotheses could account for the observed data, falling broadly into two categories: active mechanism(s) that reduce the mutational frequency in UCRs, or negative selective pressure against such mutations. Given the breadth of possibilities, we defer further speculation until more data emerge.