Vertebrate conserved non coding DNA regions have a high persistence length and a short persistence time
© Retelska et al. 2007
Received: 01 June 2007
Accepted: 31 October 2007
Published: 31 October 2007
Skip to main content
© Retelska et al. 2007
Received: 01 June 2007
Accepted: 31 October 2007
Published: 31 October 2007
The comparison of complete genomes has revealed surprisingly large numbers of conserved non-protein-coding (CNC) DNA regions. However, the biological function of CNC remains elusive. CNC differ in two aspects from conserved protein-coding regions. They are not conserved across phylum boundaries, and they do not contain readily detectable sub-domains. Here we characterize the persistence length and time of CNC and conserved protein-coding regions in the vertebrate and insect lineages.
The persistence length is the length of a genome region over which a certain level of sequence identity is consistently maintained. The persistence time is the evolutionary period during which a conserved region evolves under the same selective constraints.
Our main findings are: (i) Insect genomes contain 1.60 times less conserved information than vertebrates; (ii) Vertebrate CNC have a higher persistence length than conserved coding regions or insect CNC; (iii) CNC have shorter persistence times as compared to conserved coding regions in both lineages.
Higher persistence length of vertebrate CNC indicates that the conserved information in vertebrates and insects is organized in functional elements of different lengths. These findings might be related to the higher morphological complexity of vertebrates and give clues about the structure of active CNC elements.
Shorter persistence time might explain the previously puzzling observations of highly conserved CNC within each phylum, and of a lack of conservation between phyla. It suggests that CNC divergence might be a key factor in vertebrate evolution. Further evolutionary studies will help to relate individual CNC to specific developmental processes.
Large-scale conservation of non-coding genomic regions has been discovered by Dermitzakis et al, after alignment of the human chromosome 21 to homologous regions of the mouse genome. This work reported that protein-coding genes were more conserved overall than non-genic regions, thus giving a large-scale confirmation that evolutionary conservation is a hallmark of biological function. At the same time, it showed that numerous short non-coding DNA fragments were extremely highly conserved between human and mouse, but absent from the Drosophilids genome . Subsequent work established that some of these sequences are highly conserved across all vertebrate species, whereas other are conserved only between pairs of species . Regions of > 200 bp of perfect identity between human, mouse and rat have been called ultraconserved elements (UCE) . Conserved non coding regions (CNC) are also referred to by others as conserved non-genic (CNG) regions, conserved non-coding elements (CNE)  or highly conserved elements (HCE) .
Although the conservation of these sequences pointed to an important biological role, their function remained elusive. A general confirmation of the functional relevance of CNC genomic sequences was given by Drake et al  who showed that the conservation is not due to lower regional mutation rate, but is best explained by purifying selection. In this study, a subset of conserved sequences shows SNP allele frequency shifts with magnitudes comparable to those for coding mis-sense variants, which suggests that they are likely to be under similar selective pressure.
In a recent work, Siepel et al  analyzed genomic conservation in multiple alignments from four different phyla: vertebrate, arthropods, nematodes, and fungi. They concluded that part of non coding bases are conserved in all genomes studied, but the fraction of conserved bases lying outside of exons of protein-coding genes is increasing with the complexity of the investigated lineage. Moreover, this study provided interesting clues about the function of non-coding sequences. In vertebrates, CNC regions are over-represented within 3'UTRs of regulatory genes, and show a strong enrichment in RNA secondary structure candidates. Non-coding RNAs are thus likely to contribute to the pool of CNC sequences . However, non-coding RNAs of known function, as well as UTRs of protein coding genes, have diverse, and often moderate, degrees of human-mouse conservation . These functional elements are thus more likely to contribute to the moderately conserved fraction of eukaryotic genomes than to the highly conserved fraction. Detailed studies on some of the highly conserved sequences demonstrated that some of them play important regulatory roles [8, 3]; and recent large-scale study showed that a significant proportion of vertebrate highly conserved CNC have a tissue-specific enhancer function .
Here we present a comprehensive exploratory analysis of conserved sequence regions based on genome alignments. Our analyses are motivated by two main questions: (i) why are vertebrate CNC not conserved in insects, in contrast to coding regions, and (ii) does the evolution of non-coding DNA explain the apparently higher complexity of vertebrates (which is not due to an increased gene content, since the protein coding genes content of metazoan genomes is surprisingly similar). To address these questions we designed a measure to quantify the conserved genetic information between pairs of vertebrate and insect genomes, and proved that the proportion of non-coding bases in the conserved fraction is similar between these phyla.
We investigated the persistence length of CNC and conserved coding regions in the two lineages. Persistence length is the length of a genomic region over which a certain percentage of sequence identity is consistently maintained. The concept of persistence length is loosely inspired by physical models of polymers, and gives some indication about the internal organization of the conserved regions. For coding regions, it is probably related to the conserved ungapped blocks, which are readily recognizable in multiple alignments of distantly related proteins; it might also reflect the exon lengths distribution in vertebrates and insects. For CNC, we hypothesize that persistence length reflects the length of a functional unit of genetic information, and thus can give us insights into the function of these elements. The last part of our analysis focuses on the dynamic properties of conserved regions. Operationally, we determine persistence time as the evolutionary time interval after which sequence divergence appears to be accelerated or sequence similarity becomes undetectable. We established the evolutionary kinetics of CNC over time, and show that the persistence time differs strikingly between CNC and conserved coding sequences. These results explain and reconcile most previous observations about conservation of different genomic regions, and open the way to more detailed studies on kinetics of CNC evolution.
Proportion of bases covered by pairwise alignments
For each functional sequence class (coding, non-coding, repeats), we assessed the distribution of sequence identity through measurable intervals (see Methods). The sequence identity of coding regions peaks at 70–80% for both Hs-Gg and Dm-Dv, (Additional file 2). We performed a similar analysis for a larger number of vertebrate and insect pairs. As expected, more closely related vertebrate species peaked at a higher percentage of identity (80%–90% for the Homo sapiens – Mus musculus alignments, 85%–95% for the Homo sapiens – Canis familiaris alignments, and 75%–85% for D.melanogaster-D.pseudoobscura alignments, not shown). We limited our analysis to two pairs having directly comparable coding sequence identity distribution. The observation that CDS identity is the same in Hs-Gg and Dm-Dv genome pairs is consistent with the values reported in the respective genome sequencing papers [10, 11], and with neutral genomic distances between these 2 species (see Additional file 1). These values can be explained by a faster evolution of the Drosophilid species, confirmed by several independent measures of genomic distance [12, 13].
Functional distribution of sequences conserved with more than 80% identity
Here, the parameter r is the equilibrium sequence identity value asymptotically reached after infinite divergence time, a parameter which can be empirically determined by aligning unrelated sequences from the same species pair. For gapped alignment algorithms, we typically find r values of approximately 0.45. The neutral distance d is the expected number of mutations per base-pair in the absence of any kind of selection. The estimates for d used in this work are derived from different sources (Additional file 1). Note that equation 1 corresponds to the Jukes-Cantor model  with a modified equilibrium value r.
For a given alignment, we define the amount of conserved sequence information as the number of bases in the reference sequence multiplied by the corresponding selection coefficient, which can be computed form the observed sequence identity by solving equation 1 for s (see Methods for more details). The resulting information is scaled in base-pair units (double bits). Following this principle, one can compute the total amount of sequence information conserved between two species from the number of bases contained in different conservation classes, as determined by our sliding window approach (see Methods). The amount of conserved sequence information is not identical to the amount of sequence information that is currently under selection in a given species. It is expected to decrease with increasing phenotypic divergence (although not much if one assumes that most biochemical and physiological processes are conserved within phyla).
Composition of genomic regions of >95%, >90% and > 85% sequence identity in Hs-Gg and Dm-Dv alignments.
Hs-Gg > 95%
Hs-Gg > 90%
Hs-Gg > 85%
Dm-Dv > 95%
Dm-Dv > 90%
Dm-Dv > 85%
When we compare conservation in long regions (≥100 bp), a very clear difference appears between vertebrates and insects. In vertebrates, for a total number of 3794 ≥100 bp long, > 95% conserved fragments, we retrieve 3320 CNC, 411 partly coding and 63 coding (1,66 % CDS); in the Dm-Dv pair, we retrieve 86 fragments, including 20 CNC, 40 partly coding and 26 coding (30.23 % CDS). While the number of long, highly conserved CDS in both species is comparable, the number of partially coding, highly conserved sequences might reflect a higher conservation of UTR sequences in vertebrates. However, the 166-fold higher number of long CNC in vertebrates clearly suggests that they represent a sequence class absent in insects. If we consider regions of at least 150 bp, we find 1751 CNC in the human genome, and not a single one in the D. melanogaster genome, which is an even more striking difference.
To exclude the possibility that vertebrate long CNC are associated with known genes, we checked our set of vertebrate CNC for evidence of transcription. We extracted functional annotation from Ensembl, and classified the CNC as either a transcribed (and untranslated) part of an exon, or associated with a gene (within 1000 bp of an exon), or distant (> 1000 bp of an exon). In our set of 3794 non-coding fragments of a minimal length of 100 bp, 506 (13.34 %) are transcribed, 291 (7.76%) are associated with genes, and the remaining 2997 (79.0%) are located further than 1000 bp of any documented gene. This confirms previous reports that human-chicken conserved elements are often located far from genes . It further suggests that most of these elements are not included in transcripts or proximal promoters of documented genes.
The overall frequency of rare alleles tends to increase with sequence identity (3B, 3C). which suggests that human non-coding sequences in distinct identity classes are subject to different evolutionary constraints. These results extend and complement the work of Drake et al . We confirm, based on a separate, unbiased dataset, that CNC sequences are constrained, and that human-chicken conserved sequences, which evolved early in the vertebrate lineage have been maintained under selective pressure until our recent past.
In drosophilids, CNC alignments are less conserved than CDS in most species, but the gap between coding and non-coding increases greatly in more distant species (see e.g. D. virilis and A. gambiae, Figure 4B). This result confirms that the Drosophilids genome has very few long highly conserved sequences. The rapid drop of the mean % identity of non coding alignments is due to the disappearance of CNC conservation in genome alignments of distant species, both for vertebrates and insects. For the most conserved class of vertebrate alignments, 58% of non coding regions are alignable between human and fish, versus 95% of coding alignments (66.48% identity in alignable non coding sequences, 69.23% in alignable CDS). These results clearly show that, in both vertebrates and insects, non-coding sequences have much shorter persistence time than coding sequences.
We analyzed the patterns of conservation of CNC in vertebrates and insects, and introduced for this purpose new methods and concepts. For most analyses, we used a sliding window technique which enables us to compute percent identity statistics for individual bases. This circumvents the problem of fragmenting genomes into conserved regions, a process that is very sensitive to the parameter settings of alignment algorithms and thus not robust. Moreover, there is no guarantee that conserved regions determined by computational procedures represent natural units of genome function and evolution. Our analysis of the persistence length of conserved sequence regions indicates that this approach is effective in unraveling conservation properties that are unique to vertebrate non-coding regions.
We chose to interpret genome-wide percent identity figures in the light of an evolutionary model enabling us to compute the total amount of conserved sequence information between two genomes. In doing so, we have opted for a simple but nevertheless flexible evolutionary model with few parameters. Flexibility results from an intrinsically simple way to take into account the effects of indels and biases due to the alignment scoring systems on observed % identity values. In fact, there is only one parameter that depends on the genome base composition and the alignment scoring system, the equilibrium identity value r, which can be readily determined by aligning large numbers of non-homologous sequences from the two genomes. The consistent results that we obtain for different species pairs indicate that our method of relating alignment -based sequence identity to conserved sequence information is an effective way to project % identity figures from species pairs with different mutational distances onto a comparable scale.
One of the major results of our study is that the proportion of conserved non-coding to coding bases is similar in both taxa, as is the distribution of conserved information across different sequence identity classes. Yet we observe a very significant difference in the length distribution of these sequences: the majority of conserved vertebrate base pairs occur in conserved DNA fragments longer than 100 bp, whereas insects CNC are organized in short fragments, closely mirroring the length distribution of conserved coding sequences. Our set of extremely long CNC reflect the same phenomena as vertebrate ultra-conserved elements (UCE) . UCE are defined as sequence segments longer than 200 base-pairs which are absolutely conserved between human, mouse and rat. Like persistently long sequence regions unraveled by our approach, they are unique to vertebrates and mostly non-coding. However, altogether they represent only about 120'000 bp. Conversely, the fraction of bases contained in persistently conserved sequence regions of at least 85% identity and length 150 or longer totals 2.1 million bases. At the same time, this fraction is about 20 times enriched in non-coding bases (94.4%). Our results indicate that UCE are thus just the tip of the iceberg of a much larger class of vertebrate-specific non-coding regions with unique conservation properties.
What is the function of these long CNC? Several publications focusing on a subset of highly conserved elements suggest that some of them are distant regulators of developmental genes [4, 8, 18, 19]. A large-scale proof of the enhancer activity of these sequences has recently been produced . Based on the increasing amount of recent evidence, we postulate that most of the highly conserved CNC are distal regulators of gene expression. Likewise, conserved elements in drosophilids and worms were found to occur in the vicinity of developmental regulators and transcription factors . Interestingly, and consistent with our results, drosophilid cis-regulatory elements have been postulated to be typically less than 50 bp . If so, the striking CNC length differences suggest that conserved regulatory elements are much longer in vertebrates than insects, which indicates that more transcription factor binding sites might be included in a regulatory module, and suggests a more complex regulation of gene expression in vertebrates.
To understand the rules governing the evolution of CNC, we investigated the persistence time of a large set of coding and non-coding sequences through the evolutionary trees of insects and vertebrates. In closely related vertebrate species, we observe that CNC are more conserved than coding regions. However, the reversal of this trend at larger evolutionary distances shows that CNC have a shorter persistence time than coding regions. A similar evolutionary phenomenon occurred at least twice, in vertebrates and in insects, where we observe a relative slow-down of coding region evolution over longer time periods. In the light of these observations, the complete disappearance of perceptible sequence conservation of non-coding regions across phyla appears to be the continuation of a trend that is also operating within phylum, though perhaps at lower intensity.
To our knowledge, this is the first systematic report describing the kinetic aspects of the evolution of non-coding regions of a whole genome. A couple of previous studies addressed this question for specific non-coding regions. Consistent with our observation, UCE as defined by Bejerano and coworkers  were found to have accelerated evolutionary rates in the lineages leading to birds and amphibians, as compared to the rates observed between mammals. A study on CNC evolution in the Hox gene cluster shows that the evolution of these regions is significantly faster in Xenopus, compared to more closely related mammalian species. The authors postulate a faster evolution of cis-regulatory sequences in the amphibian lineage, as well as in the stem lineage of mammals .
In order to understand the evolutionary forces that lead to different kinetic behaviors of CNC and coding regions, one has to compare the observations to predictions made by various models. The classical Dayhoff model upon which our estimation of conserved sequence information is based, assumes gene-specific evolutionary rates due to varying degrees of purifying selection. The overall strength of selection remains constant over time for a given gene; however, the specific constraints acting on a particular base will change, allowing for any possible base substitution over sufficiently long time periods. According to the Dayhoff model, orthologous sequences evolving in two different lineages will asymptotically approach the saturation percent identity value characteristic of alignments of unrelated sequences (formula 1).
The kinetic behavior of conserved coding regions as shown in Fig. 4 is not compatible with this model. The relatively high divergence observed between closely related species probably reflects rapid saturation with silent mutations and conservative amino acid replacements. The slower rate documented by more distant time points is consistent with the assumption that a fraction of the coding regions code for proteins that retain the same function throughout the animal kingdom. This fraction presumably evolves under invariable functional constraints that prohibit complete sequence divergence.
The evolutionary kinetics of insect CNC could in principle be explained by a Dayhoff process as described by formula 1. On the other hand, the time-course of vertebrate CNC shows accelerated evolutionary rates in the lineages leading to amphibians and fishes, which is not compatible with a neutral model of molecular evolution and suggest that specific evolutionary constraints underlie the divergence patterns of these sequences.
This conservation dynamics of vertebrate CNC is perhaps best explained by an evolutionary process consisting of long periods of high stability alternating with short periods of rapid and pervasive adaptive changes. In fact, such a scenario is part of an emerging view of animal body plan evolution put forward, for instance, in recent publications by Davidson, Prud'homme, and others [22–24]. The key assumptions underlying their model are: (i) different body plans of insects, vertebrates, worms and other phyla are executed by the same "toolbox" of genes, many of them encoding transcription factors, (ii) the different body plans result from different temporal and spatial expression patterns of toolbox genes, (iii) the different expression patterns result from changes in the cis-regulatory regions of individual genes, not from changes in the sequence specifiCity of the cognate trans-acting transcription factors. According to this view, morphological changes are the result of a rewiring of a hierarchical gene regulatory network via cis-regulatory mutations. Conversely, body plan stability requires high conservation of the cis-regulatory regions of toolbox genes.
This model is supported by several studies indicating that many of the highly conserved CNC may in fact be tissue-specific enhancers of developmental genes (see for example). Particularly relevant in this context is the work by Prabhakar et al. showing that non-coding regions conserved between primates, but lacking visible conservation in more distant vertebrates are active transcriptional regulators . The idea of a common toolbox is based on the intriguing observation that phyla-specific CNC of insects, worms and vertebrates, are associated with overlapping sets of conserved developmental genes . Several studies proved that changes in CNC produce a morphological phenotype via a change in the expression pattern of a nearby gene. For example, the different expression pattern of Hoxc8 in the thoracic region of mouse and chicken is associated with a relative expansion of the cervical region in the chicken. This crucial difference in the expression of a developmental gene is caused by a conserved enhancer region, differing by only a few nucleotides between mouse and chicken. The chicken Hoxc8 enhancer is sufficient to reproduce a chicken-like expression pattern in the mouse . Several observations on insect wing color and other developmental patterns, as well as the pelvic reduction in stickleback fishes (reviewed by) further suggest that morphological evolution often occurs via cis-regulatory changes that affect the expression of broadly conserved genes.
In view of the above model, the specific conservation kinetics of vertebrate CNC suggests a partial rewiring of developmental gene regulatory network at the early stages of amniote evolution. At later stages, the low rate of CNC changes speaks for high stability of major parts of body plan in lineages leading to chicken and man. Note further that differential mutation rates cannot explain the slower evolution of CNC relative to coding regions, since the level of Hs-Gg conservation strongly correlates with the selective constraints revealed by data from human population studies. Thus, the most stringent subset of highly conserved CNC has most likely evolved in amniotes and has been exposed to a constantly high selective pressure from the human-chicken common ancestor up to the recent spread-out of the human populations. This set of sequences is thus likely to contain key regions regulating amniote development.
Here we introduced the concepts of persistence time and length, and applied them to characterize evolutionary kinetics of coding and non-coding regions. We analyzed the length distribution of CNC in vertebrates and insects. Our results show that a similar proportion of conserved coding to non coding regions exists in vertebrates and insects, but they are organized in longer fragments in vertebrates. This observation gives insight into the design principles of regulatory regions in both phyla. We also show that non coding sequences have a much shorter evolutionary persistence time than coding sequences. Our results might explain why vertebrate CNC are not found in other phyla, and suggest that non-coding regions are an important factor of morphological evolution. With more genomes becoming available, more detailed analyses based on these criteria will help associate individual CNC with lineage specific physiological and morphological changes.
The following pairwise Blastz genomic alignments were downloaded from the UCSC FTP repository : Dm-Dv: dm2(BGDPv.4) – droVir1, Hs-Gg: hg18-galGal3 [28–30] (see Additional file 1 for species abbreviations). Sequence identity percentage was computed in an asymmetric fashion for the reference species (Hs or Dm) by counting the number of conserved bases in a sliding window containing 60 bp of the reference species. This number is used to compute the percent identity for the base at the center of the window (pos. 30). By moving the window in 1 bp steps, this procedure yields a % identity value for each aligned base of the reference genome in an aligned region. Bases closer than 30 bp to the limits of aligned regions are assigned the values of the first or last window, respectively.
The sequence identities were assigned to 10 discrete bins. The first includes the aligned sequence with 55 % or less identity (33 of 60 bases), and the next bins increase in 5% increments until 100% identity (last bin: 58–60 conserved identical bases). Each base in the reference genomes was further classified as coding or non-coding based on Ensembl genome annotation  (drosophila melanogaster_core_37_4e, and homo sapiens core_45_36g). Perl scripts based on the Ensembl Perl API  were used for this purpose.
Selection coefficients were computed from the median sequence identity of each bin (e.g. 0.983 for most conserved class) by solving equation (1) in the Results Section. The equilibrium identity value r was determined empirically for each species pair in the following way: 900 pairs of genomic segments of length 60 were extracted randomly from the genomes and aligned with the program align (Myers & Miller, CABIOS 4:11–17) from the fasta2 package . For each alignment, the identity level was computed as the fraction of bases in the reference sequence that are paired with identical bases in the target sequence. Note that align reports a different type % identity value, obtained by dividing the number of identical residue pairs by the length of the alignment including gaps. The alignments were generated with the HoxD55.q matrix downloaded from the UCSC genome browser web site. This matrix also provides default gap penalties which we left unchanged. The mean % identity for random sequences obtained in this way was 44.1% for Dm-Dv and 42.8% for Hs-Gg.
The total amount of conserved information for a species pair was obtained by summing the conserved information for percent identity classes which correspond to a selection coefficient between 0 and 1 according to equation 1, i.e. the classes with sequence identity equal or higher than value corresponding to the neutral mutational distance of the two species (parameter d).
Selective pressure for CNC was computed based on Perlegen allele frequency data . The data, obtained from the authors for the human genome version hg16, were 'lifted' to the hg18 version using the liftOver program) and corresponding chain files downloaded from the UCSC web site .
From 61'746 known SNPs in the human-chicken alignable regions, we used 41'775 in this study, for which genotyping data were available for all 71 individuals of the Perlegen panel. Each genotyped SNP was annotated with the Ensembl annotation. The sequence identity percentage, and allele frequency spectra were established for the different functional and sequence identity classes tested. Rare (1–3 alleles in 71 individuals) and frequent (>3) alleles' counts were compared using the exact Fisher's test.
To establish the persistence time of CNC, we used whole genome multiple alignments downloaded from the UCSC genome browser (17-species vertebrate alignments and 8-species insect alignments). Vertebrate genomes used in this analysis included hg17, rheMac2, mm7, monDom2, galGal2, xenTro1 and danRer3. Insect alignments included dm2, droSim1, droYak1, droAna1, dp3, droVir1 and anoGam1. We selected a subset of alignments at least 100 bp long, and including at least 6 of 7 of our target species. Identity percentage with reference genomes hg17 or dm2 was calculated for all species as described before. Identity was set to mutational equilibrium values (44% for insects and 43% vertebrates) for target regions missing in the alignments or including >50% of gaps.
The selected set included 44'709 insect alignments and 181'556 vertebrate alignments, with an average length of 120 bp. For each alignment, the bases of the reference genome sequence were labeled as coding, non-coding or repeat based on Ensembl annotations. We define as CDS all sequences with at least 80% of coding bases, and as NC, all sequences consisting exclusively of non-coding bases (excluding repeats). Alignments not satisfying either of these criteria were excluded. We split the set of alignments in 6 bins of equal size, stratifying them by the mean % identity values observed with two close target species. For vertebrates, we considered macaque and mouse; for insects D. simulans and D. yakuba. The two most conserved bins of the four classes (vertebrate/insect, coding/non-coding) were considered for further analysis.
We thank Stylianos Antonarakis, Winship Herr and the reviewers for helpful comments and pointing us to useful references, Rasmus Nielsen for suggestions regarding the population genetics analysis, Eivind Valen, Roberto Fabbretti, Christian Iseli and the whole Vital-IT team for technical support. This project was founded by grants from the Swiss Institute of Bioinformatics and the Swiss National Science Foundation (205321-103973).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.