Ancient Pbx-Hox signatures define hundreds of vertebrate developmental enhancers

Background Gene regulation through cis-regulatory elements plays a crucial role in development and disease. A major aim of the post-genomic era is to be able to read the function of cis-regulatory elements through scrutiny of their DNA sequence. Whilst comparative genomics approaches have identified thousands of putative regulatory elements, our knowledge of their mechanism of action is poor and very little progress has been made in systematically de-coding them. Results Here, we identify ancient functional signatures within vertebrate conserved non-coding elements (CNEs) through a combination of phylogenetic footprinting and functional assay, using genomic sequence from the sea lamprey as a reference. We uncover a striking enrichment within vertebrate CNEs for conserved binding-site motifs of the Pbx-Hox hetero-dimer. We further show that these predict reporter gene expression in a segment specific manner in the hindbrain and pharyngeal arches during zebrafish development. Conclusions These findings evoke an evolutionary scenario in which many CNEs evolved early in the vertebrate lineage to co-ordinate Hox-dependent gene-regulatory interactions that pattern the vertebrate head. In a broader context, our evolutionary analyses reveal that CNEs are composed of tightly linked transcription-factor binding-sites (TFBSs), which can be systematically identified through phylogenetic footprinting approaches. By placing a large number of ancient vertebrate CNEs into a developmental context, our findings promise to have a significant impact on efforts toward de-coding gene-regulatory elements that underlie vertebrate development, and will facilitate building general models of regulatory element evolution.


Background
Cis-regulatory elements play an essential role in the precise co-ordination of vertebrate development as illustrated by the increasing number of examples where mutations in such sequences lead to developmental malformations [1][2][3]. One of the major challenges in modern biology is the deciphering of the regulatory language, syntax and grammar, encoded in the genome, that directs spatio-temporally restricted gene expression. To achieve this requires the identification and functional characterisation of cis-regulatory elements, followed by the deconvolution of the TFBSs therein.
Cis-regulatory elements can be predicted by sequence conservation analysis, as tight clusters of functional TFBSs can be under strong evolutionary constraint [4][5][6][7]. Alternatively, targeted approaches involving chromatin immunoprecipitation (ChIP) can be used to identify binding-events between specific transcription factors and DNA [8][9][10]. These are complementary approaches, as ChIP analyses are restricted to identifying regulatory regions that are targets of selected TFs at the particular time-points chosen for the analysis, whilst sequence conservation can identify elements irrespective of the TFs that bind to them or the developmental time-points at which they act. Sequence conservation can also provide evidence for ancient gene regulatory network (GRN) interactions that are shared between species. Comparative approaches applied to vertebrate genomes have identified a set of putative regulatory elements showing extreme conservation across mammals (Ultraconserved elements [11]), as well as Conserved Noncoding Elements (CNEs) shared between mammals and fishes [6,12,13]. These elements are clustered around developmental genes [6,12] and a large proportion of CNEs that have been tested in transgenic assays drive spatially restricted reporter gene expression in mouse or zebrafish embryos [6,7,14]. Furthermore, a number of CNEs have been shown to have roles in developmental diseases [1,15]. Despite their high sequence conservation between vertebrates, only a minute fraction of CNEs can be traced back to invertebrate chordates [16]. Thus, CNEs represent a set of cis-regulatory elements that are likely to be fundamental during development of the vertebrate body plan and comprise a valuable resource for deciphering the genomic regulatory code for vertebrate development.
Phylogenetic footprinting has been successfully implemented to identify TFBSs that play key roles in the action of individual CNEs [17][18][19]. However, there have been very few studies seeking to place large numbers of deeply conserved CNEs into a developmental context through using this approach [20,21]. Furthermore, despite progress having been made identifying key sequence motifs within vertebrate promoter elements [22], ancient CNEs have remained somewhat recalcitrant to systematic motif-identification algorithms, despite some elegant targeted approaches. Within non-coding elements conserved amongst mammals, a large number of long motifs [12-22 nucleotides) [23] and some shorter motifs [24] have been identified as overrepresented. However, the majority of these were not matched to any known factors, nor linked to any patterns of enhancer activity, so the biological significance of these motifs is hard to interpret. Recently, an elegant study used a classifier algorithm to identify sequence motifs predictive of heart enhancer activity in mammalian CNEs [25]. However, mammalian CNEs represent a set of sequences that only partially overlap with the more ancient mammal-fish CNEs and it is not clear to what extent they are functionally and mechanistically alike. Studies seeking to identify motifs that contribute to tissue-specific expression of deeply conserved vertebrate CNEs have discovered novel motifs associated with forebrain enhancer activity [7,21]. As part of a large-scale project to characterise the in-vivo enhancer activity of CNEs, Pennacchio et al. [7] used four human-fugu CNEs that drove forebrain reporter expression in mouse embryos to identify 6 enriched 5 bp-long sequence motifs. 23 elements enriched for these motifs were tested for enhancer activity, of which 4 were found to drive forebrain expression -an enrichment for this expression domain compared to the original enhancer set. Li et al. [21] characterised 13 CNEs driving forebrain reporter expression in zebrafish embryos, identifying 5 enriched motifs of 6 bp and demonstrating that these sequences contributed to forebrain enhancer activity. These investigations go some way towards providing a developmental context for the CNEs with those motifs, but this is somewhat limited by the factors that bind to them remaining uncharacterised. Whilst it is unclear to what extent ancient vertebrate CNEs are composed of 'conventional', previously characterised TFBSs, candidate motif search approaches have provided evidence that mammalian UCEs are enriched for known TFBS motifs [26] and that ancient vertebrate CNEs associated with genes involved in CNS development show enrichment for Oct and Sox motifs [20]. The success of these isolated studies hints that it may be possible to systematically identify functional TFBSs within CNEs by phylogenetic footprinting.
The sea lamprey (Petromyzon marinus) hails from an anciently diverging jawless vertebrate lineage, the agnathans, which split from the jawed vertebrate lineage 550-650 million years ago [27]. We have previously found a significant number of CNEs that are conserved between lamprey and jawed vertebrates [28]. We predicted that the relatively low sequence identity between the lamprey and jawed-vertebrate homologous elements would facilitate the identification of conserved TFBS motifs within them. In addition, characterisation of these motifs could illuminate ancient GRN interactions common to all vertebrates. Thus, we sought to identify TFBS motifs in CNEs by performing phylogenetic footprinting, using the lamprey elements as a guide.
Here we identify deeply conserved TFBS motifs for the Pbx-Hox heterodimer within a cluster of CNEs associated with the meis2 gene. We use in-silico analyses to demonstrate that jawed vertebrate CNEs and other sets of conserved vertebrate enhancers are highly enriched in Pbx-Hox motifs. Using reporter assays in zebrafish and lamprey embryos, we show that these motifs correlate with enhancer function in the hindbrain and pharyngeal arches. These results represent a further step toward decoding vertebrate CNEs, allowing a large proportion of them to be more firmly placed into a developmental context and revealing ancient gene regulatory network interactions for hindbrain patterning that are shared across vertebrates. Finally, our findings enable us to hypothesise an evolutionary scenario regarding the role of many CNEs in the evolution of the vertebrate hindbrain and the branchial region of the head.

Results
A set of meis2 CNEs drive expression in the hindbrain and cranial ganglia in zebrafish and lamprey embryos We previously identified a genomic region, downstream of the developmental gene meis2, containing a number of CNEs that are conserved between jawed vertebrates and lamprey [28] (Additional File 1). We grouped these CNEs into five separate elements (Additional File 2) for functional testing in a zebrafish tol2 reporter assay [29]. Four of these elements drive discreet and complementary patterns of reporter expression in the hindbrain of zebrafish embryos, with homologous zebrafish and lamprey elements driving highly similar expression patterns ( Figure 1). These patterns of reporter expression are consistent with the endogenous expression of meis2 in the hindbrain [30,31], where Meis proteins play a crucial patterning role by interacting with Hox and Pbx transcription factors [32,33]. Lamprey and zebrafish CNE 3285 elements both drive GFP expression in the cranial ganglia and primary neurons of the hindbrain and spinal cord. CNE 3288 elements of zebrafish and lamprey drive GFP in neurons of the hindbrain posterior to rhombomere 4 (r4), as shown by comparison to RFP expression in r3 and r5 in a transgenic line containing RFP under the control of a krox20 regulatory element [34]. CNE 3299 elements up-regulate GFP in the anterior hindbrain -r2-4 for the lamprey homolog and r3-4 plus neural crest migrating into the hyoid pharyngeal arch for the zebrafish homolog. CNE 329X of lamprey and zebrafish both drive GFP expression in the anterior hindbrain and neurons of the midbrain. We have developed a parallel reporter assay in lamprey embryos (in submission) to assess the functional conservation of CNEs across vertebrates. Using this assay, we have tested lamprey CNEs 3285 and 3299 for enhancer activity during lamprey embryogenesis. In lamprey embryos, CNE 3285 drives GFP expression in the cranial ganglia and CNE 3299 in the anterior hindbrain ( Figure  1). Thus, for both of these elements, the pattern of reporter expression driven in lamprey embryos is almost identical to the pattern driven in zebrafish embryos (Figure 1). This provides compelling evidence that these CNEs are part of a gene-regulatory network for hindbrain patterning that is conserved across all vertebrates.

Some meis2 CNEs contain deeply conserved Pbx-Hox TFBS motifs
Because of its clear and specific expression pattern in the hindbrain of zebrafish and lamprey embryos, we chose element 3299 as a starting point for the identification of putative transcription-factor binding-sites by phylogenetic footprinting. A number of studies have documented a role for the anterior Hox proteins in regulating rhombomere-specific gene expression by binding as hetero-dimers and -trimers with the TALE-class homeodomain proteins Pbx and Meis [17,18,35,36]. These complexes bind to characteristic binding-sites composed of partially overlapping Pbx-Hox half-sites, frequently in conjunction with a distal Meis/Prep site [17,18,35,36]. In some cases it has been shown that the pbx-hox motif is both necessary and sufficient for highly specific patterns of reporter expression, for instance for activity of a mouse hoxb1 enhancer in r4 in the mouse hindbrain [17] and for r4 and pharyngeal arch activity of a mouse hoxb2 enhancer [36].
We identified two Pbx-Hox motifs within CNE 3299, conforming to the TGATNNAT consensus [37,38], that are conserved across all sequenced vertebrate genomes, each closely associated with conserved Meis motifs (TGACAG/A) [39] (Figure 2). In the zebrafish sequence, the first pair of Pbx-Hox and Meis motifs is also preceded by a Pbx-Meis motif (TGATTGACAG/A) [39]. We verified the essential nature of these motifs for rhombomere-specific activity of the enhancer through mutagenesis of the zebrafish element followed by reporter assay (Methods). Mutating the first cluster of motifs (sub1) resulted in a loss of the neural crest expression of the wild type enhancer and less anteriorly restricted expression in the hindbrain compared to the wild type element (Figure 2b, c). Mutation of the second Pbx-Hox and Meis motif cluster abrogated reporter expression by this enhancer altogether, whilst a construct in which both motif clusters were mutated (sub12) also drove no GFP expression. Interestingly, CNEs 3285, 3288 and 329X were also found to harbour conserved Pbx-Hox and Meis motifs. Together, the expression patterns of these elements in the hindbrain (3285, 3288, 3299, 329X), as well as in the pharyngeal arch neural crest (3299), suggests that these motifs may represent a common feature of CNEs that drive segment-specific expression patterns in the vertebrate hindbrain and pharyngeal arches.

Pbx-Hox motifs are enriched in CNEs and in other sets of conserved vertebrate enhancers
In order to address how widespread Pbx-Hox motifs are across conserved vertebrate enhancers, we performed a systematic scan for these motifs in vertebrate CNEs. We searched for instances of the canonical Pbx-Hox motif, TGATNNAT, that are completely conserved across CNE multiple sequence alignments. In a set of 246 alignments of CNEs between human, zebrafish, fugu and lamprey (Methods), we identified 61 conserved motifs, representing 22 fold enrichment over shuffled alignments (Methods and Additional File 3). Furthermore, in a set of 4259 gnathostome CNE alignments (of human, fugu and zebrafish sequences), 712 conserved motifs were identified; a 9 fold enrichment compared to shuffled alignment controls.
Further analysis of Pbx-Hox motifs in the gnathostome set reveals a paucity of cytosines at variable positions 5 and 6 ( Figure 3). This is a feature of characterised Pbx-Hox binding-sites, where T, A or G at these positions contribute to determining the Hox specificity of the binding site [38,40,41]. Furthermore, positions 9 and 10, immediately 3' to the canonical Pbx-Hox motif, show strong bias towards G/T and A/G respectively, thereby defining a more stringent TGATNNATKR (KR) consensus motif that is also consistent with previously characterised Pbx-Hox binding-sites [17,18,37,38] (Figure 3). Further analysis of the lamprey and gnathostome CNE alignment sets results in even stronger enrichment for this 'KR' motif (Additional File 3).
We complemented our 'bottom-up' search for Pbx-Hox motifs in CNEs with a 'top-down' de novo motif search using the tool Cis-Finder [42]. Strikingly, one of the top-scoring predicted motifs identified by Cis-Finder matches our consensus KR motif for a set of 6, 693 human sequences from the CONDOR CNE database [43] ('CONDOR CNEs') ( Figure 3, Methods). The KR motif occurs 562 times in this CNE set, representing a highly significant enrichment over shuffled versions of the motif (p = 5.7 × 10 -5 ), and when compared to control genomic regions and the entire human genome ( Table 1, Methods and Additional File 4). Interestingly, the Meis motif is also significantly enriched in the CONDOR CNE set (p = 1.0 × 10 -4 )(Additional File 5). We then examined the distribution of KR motifs in other sets of evolutionarily conserved non-coding sequences. The VISTA Enhancer Browser (EB) [44] contains over 1300 human sequences, around half of which drive reporter gene expression in mouse embryos at day 11.5. There is a significant enrichment for the KR motif (p = 0.0033) across the entire dataset compared with shuffled versions despite the fact that some of the sequences in EB are not deeply conserved (Table 1). Finally, we analysed a large set of deeply conserved human CNEs identified through comparison with the cartilaginous chimera, Callorhinchus milii [13], and once again found significant enrichment for the KR motif (p = 6.2 × 10 -5 ) ( Table 1 -'Shark CNEs').
Pbx-Hox motifs are associated with hindbrain and pharyngeal arch CNE enhancer function Next, we tested whether Pbx-Hox motifs within CNEs associate with segment-specific reporter expression in the hindbrain and pharyngeal arches. To do this, we assayed 21 zebrafish CNEs containing conserved Pbx-Hox motifs for reporter expression in zebrafish. All of these CNEs are conserved across gnathostomes, with 11 also identifiable in lamprey (Additional File 2). Elements were chosen to represent a range of different genes from the lamprey and gnathostome CNE sets. 12 of these 21 elements consistently up-regulate patterns of reporter expression, comprised of 8 from the lamprey set and 4 from the gnathostome set. It should be noted that some of the elements from which no consistent expression patterns were obtained may act as enhancers in-vivo, but not in our transient transgenic reporter assay, possibly due to being taken out of their genomic context. Remarkably, 11 of the 12 GFP-expressing elements (91.6%) drive expression either in the hindbrain, pharyngeal arches or both, with one element expressing in the trunk musculature ( Figure 4). In support of the hypothesis that these elements are directly regulated by specific Hox proteins, which have segmentally-restricted expression patterns, the majority of the elements expressing in the hindbrain do so in particular rhombomeres, as shown by comparison with r3r5 RFP expression (Figure 4). Hindbrain reporter expression driven by these elements is often further restricted dorso-ventrally (e.g. Nkx6-1_4281), medio-laterally (e.g. Pax2_217) and temporally (e.g. Tshz3_43509).
We next examined functional data from the VISTA Enhancer Browser (EB). Compared to shuffled motifs, the KR motif was found to be significantly enriched in those elements annotated as hindbrain positive as well as those positive for either hindbrain, branchial arch or cranial nerve expression (Table 1). Investigating those EB elements that overlap CNEs from the CONDOR set, we found significant enrichment for the KR motif in those with hindbrain expression (HB+, 64 motifs in 112 kb) compared with those with no hindbrain annotation (HB-, 85 in 238 kb) (chi-square p = 0.0042). We then focused upon those sub-regions within EB enhancers that align directly with CNEs. Within these deeply conserved regions, there was more than two-fold enrichment for the stringent Pbx-Hox motif (30 occurrences in 24990 bp of HB+ elements compared with 32 occurrences in 60341 bp of HB-elements; p = 0.001). Importantly, this enrichment demonstrates that Pbx-Hox motifs in ancient CNEs show a correlation with hindbrain reporter expression. We also analysed a smaller dataset from the cneBrowser [21] that contains evolutionarily conserved enhancers associated with genes expressed in forebrain and hindbrain during zebrafish development. Although only 18 of 146 enhancers are annotated as hindbrain positive, 7 out of a total of 17 identified KR motifs reside in hindbrain positive enhancers (p = 3 × 10 -4 ) ( Table 1). Figure 3 Pbx-hox motifs in CNEs strongly resemble verified PBX-HOX binding-sites. Position frequency logos generated from Gnathostome alignments (based on 712 conserved human TGATNNAT motifs in 4529 CNE alignments), Human CNEs (generated from the CONDOR CNE set using Cis-finder [41]) and from previous studies [36] (Literature). The relative base frequencies at positions 5 and 6, and 9 and 10, in CNEs, are in good agreement with known functional Pbx-Hox binding sites, supporting a strong KR consensus.
CNEs containing Pbx-Hox motifs are associated with genes that have roles in A-P patterning of the hindbrain and head We have examined the distribution of Pbx-Hox motifs across CNEs of different genes, to ask whether genes with the highest enrichment of Pbx-Hox motifs in their CNEs have roles in hindbrain or pharyngeal arch patterning (Table 2 and Additional file 6). In keeping with the common use of auto-regulation in gene-regulatory networks [45], we find the CNEs of the HOXD cluster and the Hox co-factors, PBX3 and MEIS2, to be amongst those with the highest number of these motifs. Many of the other genes with the highest density of Pbx-Hox motifs in their CNEs have characterised roles in anterior-posterior (A-P) head patterning and show segment specific patterns of expression during development. For instance, the ZNF503/703 (Nlz1 and Nlz2) zinc-finger proteins are essential for specification of rhombomere 4 in zebrafish [46,47]. The orphan nuclear receptor genes NR2F1/2 (COUP-TF1/2) are negative transcriptional regulators involved in the retinoic acid signalling pathway, which has a key role in A-P patterning of the hindbrain and pharyngeal arches [48]. The members of the teashirt protein family (TSHZ1, 2 and 3) show segment-specific hindbrain expression [49], Tshz1 being essential for segmentally restricted gene expression in the hindbrain and pharyngeal arches of frog and mouse [50,51].
There is good agreement between the genes highlighted by our in-silico binding-site search and by microarray screens for downstream targets of hoxb1 in rhombomere 4 of zebrafish [52] and mouse [53]. Specifically, the expression levels of znf503, tshz2, evi1, zic4, shox, and meis2.1 are decreased upon knock-down of Enrichment analysis for Pbx-Hox KR motifs, relative to shuffled versions (retaining G+C content for each binding site), within different sets of CNEs. CNEs from the VISTA enhancer browser (EB) and zebrafish cneBrowser (CB) sets have also been grouped according to annotated expression in the hindbrain (HB), branchial arches (BA) or cranial nerves (CN). All sequences are human except the Zebrafish cneBrowser set. N/S = not significant HoxB1 in zebrafish, with Znf503, Nkx6-1, Atbf1 and Mab21l2 down-regulated in HoxB1-/-mouse embryos. Accordingly, the CNEs around each of these genes are enriched in Pbx-Hox motifs ( Table 2 and Additional File 6). Thus, both microarray datasets are consistent with our prediction that Pbx-Hox motifs in CNEs represent direct regulatory links between Hox genes and their targets during development.

Discovery of Pbx-Hox motif enrichment is a further step toward de-coding CNEs
Despite a pervasive assumption that CNEs bind transcription factors in order to elicit gene activation, there is, perhaps surprisingly, very little direct evidence to confirm this. We sought to identify TFBS motifs in CNEs through phylogenetic footprinting, reasoning that the relatively high divergence of lamprey CNEs would highlight important motifs. The utility of this approach is confirmed by our identification of conserved Pbx-Hox and Meis TFBS motifs in CNEs. The enrichment of the Pbx-Hox TFBS motif in the jawed vertebrate CNE set reveals this motif to be a regulatory signature that is utilised by a large proportion of highly conserved cis-regulatory elements (the 6, 693 CONDOR CNEs contain 562 KR motifs and 1416 TGATNNAT motifs). Whilst enriched motifs identified in mammalian conserved elements include a few that show partial overlap with variants of the Pbx-Hox consensus motif [23,24], the link between those enriched motifs and Hox factors had not been made, and their strong enrichment in more ancient CNEs had not been characterised. This enrichment agrees with the crucial, conserved roles of Hox factors in development of the vertebrate body plan. Indeed, the association of these motifs with hindbrain and pharyngeal arch enhancer function is in keeping with the characterised roles of Pbx, Hox and Meis factors in patterning these domains.
Despite the crucial roles of Hox factors in patterning the vertebrate embryo, relatively few downstream target genes, other than the hox genes themselves, have been identified. Our data suggests that Pbx-Hox motifs in CNEs can identify such targets. The striking manner in which Pbx-Hox and Meis TFBS motifs are highlighted as conserved sequence blocks in multiple alignments, especially when lamprey sequences are included, leads us to predict that this footprinting approach will be useful for further deciphering the regulatory code within vertebrate enhancers. In combination, our in-silico and functional analyses form an important link between well characterised cis-regulatory motifs and a large proportion of relatively uncharacterised ancient CNEs, helping to better place these elements within a developmental context. This represents a significant further step in systematically de-coding the enhancers responsible for development of the vertebrate body plan and highlights the utility of the lamprey as a model system for investigating vertebrate gene regulation. The diversity of expression patterns driven by our tested elements suggests that Pbx-Hox TFBSs are just one component of a complex cis-regulatory logic encoded within these enhancers. Whilst responding to A-P patterning cues by interacting with particular Hox factors through Pbx-Hox TFBSs, these elements concomitantly determine the tissues in which they are active (e.g. hindbrain vs pharyngeal arch) and limit the expression patterns dorso-ventrally, medio-laterally and temporally. An example of this is the CNE Pax2_403, which drives reporter expression that is restricted to a ventrolateral population of neurons in r2-3 of the hindbrain (Figure 4). Furthermore, whilst some of our functionally characterised CNEs drive reporter expression in domains with sharp boundaries that are co-incident with rhombomere boundaries -similar to that of previously characterised Pbx-Hox regulated elements -this is not the case for all of them. This could be due to the Pbx-Hox input establishing a competence for the enhancer to drive expression within particular rhombomeres, which is further restricted to specific sub-domains within the rhombomeres by the influence of other regulatory inputs to the enhancer. This would result in expression domains that do not encompass the whole area of expression of the regulating Hox factor. It is likely that the reason why many previously characterised Pbx-Hox regulated elements show expression domains across whole rhombomeres and with tight boundaries co-incident with rhombomere boundaries is that the majority of these elements are regulating Hox factors, and thus setting up or maintaining the rhombomerespecific Hox expression patterns. Many of the elements described in this study may be acting downstream of this Hox network, utilising these AP patterning cues along with other cues to further pattern the hindbrain. The tissue specificity of these enhancers, as well as the restriction of expression to specific domains and time points, is presumably due to other factors acting as specifiers by binding to nearby TFBSs. Identifying these specifiers and characterising their TFBSs, as well as the nature of their interactions with Hox factors, are key tasks toward understanding the cis-regulatory logic underlying vertebrate development. The set of putative Hox-responsive cis-regulatory elements identified in this study provides a powerful resource that will facilitate efforts toward this end.
Our expression data from the mutated versions of zebrafish CNE 3299 suggests that the multiple Pbx-Hox and Meis sites predicted in this enhancer may interact with each other, to co-operatively modulate and restrict reporter expression. The two clusters of Pbx-Hox and Meis motifs do not contribute equally to the expression driven by this enhancer in the hindbrain and pharyngeal arch neural crest. The second Pbx-Hox and Meis motif cluster appears to be necessary for the general function of this enhancer, as its mutation results in the loss of reporter expression in both hindbrain and neural crest. In contrast, the first Pbx-Hox and Meis motif cluster appears to be necessary (but not sufficient) for neural crest expression, but not for hindbrain expression. Conversely, it appears to restrict the hindbrain expression, as reporter expression is seen more anteriorly when this cluster is mutated. This is reminiscent of interactions between Pbx-Hox and Meis/Prep binding sites within a Hoxb1 enhancer, which direct expression of this gene to r4 the hindbrain in mouse and chick [54]. In that case, it was found that the formation of a Pbx-Hox-Meis/Prep ternary complex on Pbx-Hox and Meis sites within this enhancer could be restricted by the binding of a Pbx1-Prep1 heterodimer to a nearby site, thus limiting the expression driven by this enhancer to r4. This highlights the complexity of the regulatory interactions between transcription factors that are likely to bind to CNEs, a complexity that could well underlie their high sequence constraint.
A potential role for CNEs in the evolution of vertebrate head patterning A strength of identifying conserved cis-regulatory elements is that they can provide compelling evidence for conserved GRNs. Our reporter assay data from zebrafish and lamprey embryos clearly demonstrate functional conservation of enhancers shared between the most distantly related extant vertebrate lineages. We deduce that all vertebrates share aspects of a GRN for hindbrain patterning, downstream of nested Hox expression. As the sea lamprey is from a vertebrate lineage that diverged prior to the evolution of many jawed vertebrate innovations, such as paired appendages and jaws [55], we predict that the lamprey reporter assay will be a crucial tool for investigating the gene regulatory changes involved in vertebrate evolution.
Without detailed knowledge of the function or mechanism of action of CNEs it has been difficult to derive scenarios of how they evolved and became fixed in vertebrate genomes. The findings from our in silico and functional analyses, coupled with previous characterisation of Pbx-Hox and Meis transcription-factor complexes, enable us to propose a hypothesis regarding the role of a large number of CNEs in vertebrate evolution. Recognising the same TFBS motifs in worms, flies and vertebrates, the Pbx, Hox and Meis factors are part of an ancient regulatory language shared across bilaterians [17,36,37,56,57]. Nevertheless, none of the CNEs containing these motifs are identifiable in invertebrate genomes, leading us to speculate that many of these elements may have arisen in the vertebrate lineage. Accordingly, our functional data suggest that many of these CNEs have roles in patterning an elaborate head and brain -key vertebrate innovations [54]. We hypothesise that the fundamental role of head patterning in vertebrates led to the functional conservation of these elements and that their reliance upon the precise organisation of TFBSs necessitated their strict sequence conservation. The mechanisms through which new cisregulatory elements arise in the genome are still largely unresolved. In this case, the finding that simple Pbx-Hox sites are sufficient to drive robust and specific, but modifiable, expression [17] hints that these particular TFBSs may pioneer new cis-regulatory elements, functioning as one of the fundamental seeds from which many CNEs were able to grow.

Conclusions
The finding that vertebrate CNEs are highly enriched for Pbx-Hox binding-site motifs represents a further step toward de-coding ancient vertebrate enhancers. Coupled with our experimental data, this enables a large proportion of these elements to be more firmly placed into a developmental context and reveals ancient gene regulatory network interactions for hindbrain and head patterning that were present in ancestral vertebrates. Finally, our findings lead us to hypothesise that the evolution of many of these CNEs contributed to the elaboration of the vertebrate hindbrain and the branchial region of the head.

Identification of CNEs
6, 693 non-redundant human CNEs (average length 116 bp) were retrieved from the CONDOR database [43] at http://condor.nimr.mrc.ac.uk (Additional File 7). We used these to search lamprey sequence reads available from the NCBI trace server at http://www.ncbi.nlm.nih. gov/Traces/trace.cgi with sensitive parameters (-W 7 -q -1 -e 5e-4) as described previously [28]. The lamprey trace sequences were searched because they represent a greater coverage of the lamprey genome than the publicly available draft genome assembly, which consists of many short contigs and thus provides little advantage with regard to identification of conserved syntenic regions. Lamprey sequences satisfying the initial parametric threshold were further analysed for contamination, and those with > 90% homology to human or chicken across the whole read (i.e. extending outside the evolutionarily conserved region in other vertebrates) were removed.

Alignments
The sequences of human, fugu and most zebrafish CNEs were retrieved from the CONDOR database [57]. Additional zebrafish CNEs that were not previously included in the CONDOR database due to absence from earlier assemblies were identified using BLAST against a more recent zebrafish genome assembly (Zv8 release 58). Sequences in each alignment were clipped to the same size to prevent unaligned edges. To align the sequences we used ClustalW version 1.83. These alignments formed the lamprey CNE set (comprised of alignments of CNEs from human, fugu, zebrafish and lamprey) and the gnathostome CNE set (containing alignments of CNEs from human, fugu and zebrafish). As a control, for each CNE we also generated 1000 multiple alignments by randomly shuffling the columns of each alignment using the seqboot implementation in Phylip version 3.67. The sequences of lamprey and zebrafish CNEs in these datasets are given in Additional File 8 and Additional File 9 respectively. The sequences of CNEs from the EB and cneBrowser datasets are given in Additional File 10 and Additional File 11 respectively.

Scanning CNEs and control sets for Pbx-Hox motifs
We searched for Pbx-Hox motifs in two different types of datasets. Firstly, in multiple sequence alignments of CNEs from the CONDOR CNE database (the lamprey and gnathostome CNE sets). Secondly, in datasets consisting of sequences from just one species (the EB CNEs, shark CNEs, human elements from the CON-DOR CNE set, cneBrowser CNEs). The two different types of datasets required different types of control to test for Pbx-Hox motif enrichment. For the alignment sets, we generated control sets of shuffled alignments. To find evolutionarily conserved Pbx-hox motifs (TGATNNAT and TGATNNATKR) we employed the software Cis-Finder [42] on our two alignment sets and their respective shuffled alignment controls. A motif match was only considered if it matched all aligned species and occurred at the exact same aligned position. For the single species sequence sets we generated shuffled motifs, based upon the KR motif, as a control to search across the same sets. In parallel we also employed a de-novo motif finding strategy implemented in Cis-Finder on the CONDOR CNE set. It scans a set of DNA sequences for over-represented position frequency matrices (PFMs), clusters these and then estimates significance using the false discovery rate [42]. The TGATNNAT and TGATNNATKR motif occurances in the CONDOR CNE set are detailed in Additional File 12 and Additional File 13.
To characterise the frequency of Pbx-Hox KR motifs in different gene loci, we used the CONDOR CNE set, in which CNEs are grouped according to gene locus as specified in the CONDOR database [57]. For each gene locus, we counted the frequency of Pbx-Hox KR motifs in the associated CNEs and compared this to the average frequency in 1000 sets of randomised versions of CNEs from that locus. A markov chain model of order zero was used to generate shuffled sequences. To model DNA sequences, 4 states (A, C, G, T) and 4 transitions were used. Transition probabilities were retrieved from the CNE set by calculating the relative frequencies of the bases.

Measuring relative enrichment of Pbx-Hox motifs
To measure the enrichment, we compared the occurrence of Pbx-Hox motifs in a test set against shuffled versions (Table 1), calculated mean and standard deviation and generated z-scores. The z-scores were then transformed into p-values under a normal distribution model. We also counted Pbx-Hox occurrence and shuffled versions in a number of control regions and across the whole human genome (Additional File 4).

Zebrafish transgenesis
CNEs were amplified from zebrafish and lamprey genomic DNA by PCR, sub-cloned into the Pcr8/GW/TOPO vector (Invitrogen) and then into a Tol2 construct (pGW_cfosEGFP) [29,58,59], using the Gateway LR Clonase II enzyme (Invitrogen). The Tol2 reporter assay was performed as described previously [29]. Transient transgenic zebrafish embryos were screened for GFP expression at 24-30hpf, 48-54hpf and 72-78hpf using a Leica M165FC microscope and photographs taken with a Leica DFC310FX camera. Expression patterns were deemed consistent when found in > 20% of founders, consistent with previous studies [25,60].

CNE Mutagenesis
Mutations in zebrafish CNE 3299 were introduced by PCR from genomic DNA with primers containing the desired mutations either through conventional PCR (for sub1) or megaprimer PCR (for sub2) [61]. Mutated CNE PCR products were then cloned for zebrafish transgenesis as described above.

Lamprey transgenesis
The transgenesis protocol was based upon that developed in Xenopus [62]. Lamprey CNEs were amplified from genomic DNA and cloned into the cFos-I-sceI-EGFP plasmid, which contains the mouse cFos minimal promoter and EGFP coding sequence flanked by I-sceI restriction sites. Plasmids were extracted using the EndoFree Plasmid Maxi Kit (Qiagen) and eluted with water through QIAQuick columns (Qiagen). Fresh restriction digests (20 μl containing 400 ng plasmid, 15 units I-SceI enzyme (NEB), 1 × I-SceI buffer + BSA, digested for 40 minutes at 37°C) were micro-injected into 5-6hpf lamprey embryos using a Pico-Spritzer with drop volume of 2-3 nl. Lamprey husbandry was performed as described previously [63]. Embryos were screened for GFP expression between embryonic days 7-16. Typical survival rates ranged from 20-50% of injected embryos. The promoter alone drives highly mosaic background expression in the ectoderm in roughly 50% of surviving embryos. Enhancer-specific expression was seen in approximately 10% of surviving embryos.

Additional material
Additional file 1: Genomic sequences used to generate MLAGAN alignment. The genomic sequences of human, fugu, zebrafish and lamprey used to generate the MLAGAN alignment in this study.
Additional file 2: CNEs functionally tested for this study. The sequences of the CNEs of zebrafish (dr) and lamprey (pm) that were tested by reporter assay in zebrafish embryos for this study.