Sequence conservation and combinatorial complexity of Drosophila neural precursor cell enhancers

Background The presence of highly conserved sequences within cis-regulatory regions can serve as a valuable starting point for elucidating the basis of enhancer function. This study focuses on regulation of gene expression during the early events of Drosophila neural development. We describe the use of EvoPrinter and cis-Decoder, a suite of interrelated phylogenetic footprinting and alignment programs, to characterize highly conserved sequences that are shared among co-regulating enhancers. Results Analysis of in vivo characterized enhancers that drive neural precursor gene expression has revealed that they contain clusters of highly conserved sequence blocks (CSBs) made up of shorter shared sequence elements which are present in different combinations and orientations within the different co-regulating enhancers; these elements contain either known consensus transcription factor binding sites or consist of novel sequences that have not been functionally characterized. The CSBs of co-regulated enhancers share a large number of sequence elements, suggesting that a diverse repertoire of transcription factors may interact in a highly combinatorial fashion to coordinately regulate gene expression. We have used information gained from our comparative analysis to discover an enhancer that directs expression of the nervy gene in neural precursor cells of the CNS and PNS. Conclusion The combined use of EvoPrinter and cis-Decoder has yielded important insights into the combinatorial appearance of fundamental sequence elements required for neural enhancer function. Each of the 30 enhancers examined conformed to a pattern of highly conserved blocks of sequences containing shared constituent elements. These data establish a basis for further analysis and understanding of neural enhancer function.


Background
Studies over the last two decades have revealed that cis-regulatory elements, i.e. enhancers, contain multiple DNA-binding sites for different transcription factors (TFs) that function cooperatively to direct the tissue-specific expression of their associated genes [1]. DNA sequence comparisons of different co-regulating enhancers suggest that many of these enhancers rely on different combinations of TFs to achieve coordinate gene regulation [2]. For example, during early Drosophila neural development, combinatorial interaction of proneural basic helix-loop-helix (bHLH) TFs with homeodomain proteins regulates commitment and patterning of neural precursors [3][4][5][6][7][8].
Cross-species analysis of individual Drosophila enhancers, using EvoPrinter or conventional alignment-based phylogenetic comparative analysis [9,10] and the twelve sequenced Drosophila genomes, representing over 160 million years of collective evolutionary divergence, reveals that these enhancers are made up of clusters of highly conserved sequence blocks (CSBs), separated by less conserved sequences of variable length [11]. CSBs that are longer than 8-10 bp are likely to be made up of adjacent or overlapping DNA-binding sites for different TFs. For example, the Drosophila Krüppel central domain enhancer contains overlapping highly conserved binding sites for its known regulators [12][13][14][10]. Specifically, work from the Jäckle laboratory [14] has shown that one CSB of the central domain enhancer, 16 base pairs in length, contains overlapping binding sites for the antagonistic Bicoid activator and the Knirps repressor TFs.
In order to initiate the functional dissection of CSBs that make up neural precursor gene enhancers and to gain a better understanding of their architecture in terms of the substructure of their constituent sequence elements, we have developed a multi-step protocol (collectively known as cis-Decoder) that allows for the rapid identification of short 6 to 14 bp DNA elements, termed cis-Decoder tags (cDTs), within enhancer CSBs; these cDTs are shared between CSBs of two or more enhancers with either related or divergent functions [11]. To discover enhancer type-specific elements that regulate gene expression in neural precursor cells, including genes expressed in early delaminating CNS neuroblasts (NBs) and the proneural clusters and sensory organ precursors of the PNS, we have performed cis-Decoder analysis of CSBs from in vivo characterized enhancers. For early CNS development, we have selected the previously described enhancers of six genes that activate expression in early delaminating CNS NBs: deadpan (dpn), hunchback (hb), nerfin-1, scratch (scrt; the SA enhancer), snail (sna) and worniu (wor) (Table 1) [15][16][17][18]. For the cis-regulatory regions that drive expression in the proneural clusters (PNCs) and sensory organ precursors (SOPs) of the PNS we selected the in vivo characterized enhancers for bearded (brd), deadpan (dpn), rhomboid (rho), scrt and sna (Table 1) [19][20][21][22][23][24].
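The idea of a sequence element shared between the CSBs of two or more enhancers can be illustrated with a minimal Python sketch. This is not the cis-Decoder implementation: the function names, the fixed element length `k`, and the toy CSBs are assumptions made for illustration only, and the sketch treats an element and its reverse complement as the same element.

```python
def revcomp(seq):
    """Reverse complement of a DNA string."""
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(pairs[base] for base in reversed(seq.upper()))

def canonical(kmer):
    """Represent an element and its reverse complement by one key."""
    return min(kmer, revcomp(kmer))

def shared_elements(csb_libraries, k=6):
    """Map each k-mer (either orientation) to the enhancers whose CSBs
    contain it, keeping only k-mers present in two or more enhancers."""
    owners = {}
    for enhancer, csbs in csb_libraries.items():
        for csb in csbs:
            csb = csb.upper()
            for i in range(len(csb) - k + 1):
                owners.setdefault(canonical(csb[i:i + k]), set()).add(enhancer)
    return {kmer: names for kmer, names in owners.items() if len(names) >= 2}

# Toy CSBs (hypothetical; not taken from Table 1)
toy = {"enhancer_A": ["TTAACAGCTGAT"], "enhancer_B": ["GGCAGCTGTTCC"]}
shared = shared_elements(toy, k=6)
```

In this toy case, AACAGCTG in the first CSB is matched by its reverse complement CAGCTGTT in the second, so the shared hexamers are recovered regardless of orientation. A fuller treatment would extend matches across the 6 to 14 bp range rather than using a single fixed length.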
Our analysis of the CSBs from these characterized enhancers has identified known TF DNA-binding sites and novel sequences of as yet unknown function. Enhancer type-specific sequence elements within CSBs appear in different combinations and contexts in enhancers of co-regulated genes. The information gained from cis-Decoder analysis of the neural precursor cell enhancer CSBs was used to discover a novel co-regulating enhancer that directs Drosophila nervy expression. Our studies indicate that although specific core DNA-binding sites (such as those for bHLH and homeodomain TFs) are enriched in enhancers of co-regulated genes, enhancer-binding specificity is most likely conferred through sequences that flank the consensus core docking sites. The fact that shared sequence elements of co-regulated enhancers reside in different combinations and positional ordering within each of the enhancers suggests that their combined presence, but not necessarily their relative positions, is required for cis-regulatory function.

Neural precursor cell enhancers share highly conserved core sequence elements
To determine the extent to which neural precursor cell enhancers share highly conserved sequence elements, we performed cis-Decoder analysis of in vivo characterized enhancers (Table 1) [15][16][17][18][19][20][21][22][23][24][25][26][27][28]. Our analysis revealed the presence of both novel elements and sequences that contained consensus DNA-binding sites for known regulators of early neurogenesis. Table 2 lists cDTs shared by multiple CNS or PNS neural precursor cell enhancers. None of the elements shown were present in our collection of 819 CSBs from in vivo characterized mesodermal enhancers, thus ensuring their enrichment in neural enhancers. Highlighted are consensus binding sites for known TFs: basic helix-loop-helix (bHLH) factors and Suppressor of Hairless [Su(H)], acting in the proneural and neurogenic pathways, respectively [7]; Antennapedia class homeodomain proteins [29], identified by their core ATTA binding sequence; and the ubiquitously expressed Pbx (Pre-B Cell Leukemia TF) class homeodomain protein Extradenticle, a cofactor of many TFs [30], identified by the core binding sequence ATCA. More than half the conserved cDTs were novel, without identified interacting proteins. Many of the CSBs consisted of 8 or more bp, and often contained core sequences identical to binding sites for known factors as well as other core sequences that aligned with shorter novel cDTs, suggesting that the longer cDTs may contain core recognition sequences for two or more TFs.
Most cDTs discovered in this analysis represent elements that are shared pairwise, i.e., by only two of the NB enhancers examined (see the website for a list of cDTs that are shared by only two of the enhancers examined). The fact that the majority of cDTs are shared two ways, with only a small subset of sequences being shared three or more ways, suggests that the cis-regulation of early neural precursor genes is carried out by a large number of factors acting combinatorially and/or that many of the identified cDTs may in fact represent interlocking sites for multiple factors, and the exact orientation and spacing of these sites may differ among enhancers.

Neural specific cDTs that contain bHLH TF DNA-binding sites
During Drosophila neurogenesis, bHLH proteins function as proneural TFs to initiate neurogenesis in both the central and peripheral nervous system. TFs encoded by the achaete-scute complex function in both systems, while the related Atonal bHLH protein functions exclusively in the PNS [31]. Different proneural bHLH TFs, acting together with the ubiquitous dimerization partner Daughterless, bind to distinct E-boxes that contain different core sequences [32]. In addition to the core recognition sequence, flanking bases are important to the DNA binding specificity of bHLH factors [33].
One of the principal observations of this study was that the core central two bases of the hexameric E-box DNA-binding site (CANNTG; core bases are bold throughout) were conserved in all the species used to generate the EvoPrint. All of the enhancers included in this study contained one or more conserved bHLH-binding sites (Table 3), with NB and PNS enhancers averaging 3.9 and 4.1 binding sites, respectively. More than a third of the NB bHLH sites contained a core GC sequence, and more than a third of the PNS bHLH sites contained either a core GC or a GG sequence. The most common E-box among the NB CSBs was CAGCTG with 14 sites in four of the six enhancers. The CAGCTG and CAGGTG E-boxes are high-affinity sites for Achaete/Scute bHLH proteins [22,34]. However, the CAGCTG site itself is not specific to NB enhancers, as evidenced by its presence in four of the mesodermal enhancer CSBs characterized previously [11]. The most common bHLH-binding site among PNS enhancers was also the CAGCTG E-box with 11 occurrences in six of the 13 enhancers. In contrast, the most common bHLH motif in enhancers of the E(spl)-complex [25][26][27][28] was CAAGTG (data not shown), with 16 occurrences in 8 of the 11 enhancers. CAGGTG, previously shown to be an Atonal DNA-binding site [32], was also common in E(spl) enhancers, with 9 occurrences in 8 of the 13 enhancers, but was less prevalent among NB enhancers. The CAGGTG box was also overrepresented in PNS and E(spl) enhancers relative to its appearance in NB enhancers, and it was also present in four of the characterized mesodermal enhancer CSBs. The CAGATG box was present six times among PNS enhancers but not at all among NB enhancers. Thus there appears to be some specificity of E-boxes in the different enhancer types. The fact that each of these E-boxes is conserved in all the species in the analysis suggests that there is a high degree of specificity conferred by the E-box core sequence.
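The kind of tally reported above, in which E-boxes are grouped by their hexameric sequence, can be sketched in a few lines of Python. The function name and the input list are assumptions for illustration; the sketch scans only the strand as written, so a site read from the opposite strand appears with a reverse-complemented core (for example, CACTTG for CAAGTG).

```python
import re
from collections import Counter

# CANNTG with a lookahead so that overlapping E-boxes are all reported
EBOX = re.compile(r"(?=(CA[ACGT]{2}TG))")

def ebox_counts(csbs):
    """Count each E-box hexamer found in a list of CSB sequences."""
    counts = Counter()
    for csb in csbs:
        for match in EBOX.finditer(csb.upper()):
            counts[match.group(1)] += 1
    return counts

# Two E-box-containing sequences quoted in the text
counts = ebox_counts(["CAGCTGaCAGCTG", "gCACTTG"])
```

Grouping by the full hexamer rather than by the central dinucleotide alone keeps the result directly comparable to the per-E-box counts given in the text.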
Our analysis also reveals that not only are the core bases of E-boxes shared between similarly regulated enhancers, but bases flanking the E-box were also found to be highly conserved and are also frequently shared by these enhancers. Among the E-boxes found in CSBs of NB enhancers (many are illustrated in Table 2) aaCAGCTG (core bases of E-box are bold, flanking bases lower case) is repeated three times in nerfin-1 and once in scrt; gCACTTG is repeated three times in scrt; CAGCTGCA is repeated twice in wor, and CAGCTGctg is repeated twice in scrt (see Fig 1).
In the dpn CNS NB enhancer, the E-box CAGCTG is found twice, separated by a single base (CAGCTGaCAGCTG). None of these sequences were present in mesodermal enhancers examined, but each is found in PNS enhancers; CAGCTGCA is repeated multiple times among PNS enhancers. Among the conserved PNS enhancer E-boxes (CAAATGca, gcCAAATG, cacCAAATGg, CACATGttg, gCACGTGtgc, ttgCACGTG, agCACGTGcc, aCAGATG, ggCAGATGt, CAGCTGccg, CAGCTGcaattt, gCAGGTGta and cCAGGTGa) each, including flanking bases, is found in two or three PNS enhancers, and these are distributed among all 13 enhancers. Of these, only agCACGTGcc, CAGCTGccg and cCAGGTGa were found once in our sample of neuroblast enhancers and none were found in our sample of mesodermal enhancers. The sequence aaCAAGTG is found in 4 E(spl) complex enhancers, those for E(spl)m8, mγ, HLHmδ and m6, and the sequence aCAGCTGc is found twice in E(spl)m8 and once in m4 and m6; neither sequence was found in our mesodermal enhancers. Therefore, although a given hexameric sequence may often be shared by all three types of enhancers, NB, PNS and E(spl), when flanking bases are taken into account there appears to be enhancer type-specific enrichment for different E-boxes.

Neural specific cDTs that contain Antennapedia class homeodomain DNA-binding sites
Antennapedia class homeodomain proteins play essential roles in multiple aspects of neural development including cell proliferation and cell identity [35]. The segmental identity of Drosophila NBs is conferred by input from TFs encoded by homeotic loci of the Antennapedia and bithorax complexes [36][37][38]. For example, ectopic expression of abd-A, which specifies the NB6-4a lineage, down-regulates levels of the G1 cyclin, CycE [38]. Loss of Polycomb group factors has been shown to lead to aberrant derepression of posterior Hox gene expression in postembryonic NBs, which causes NB death and termination of proliferation in the mutant clones [39].
We have examined the enhancer-type specificity of sequences flanking the Antennapedia class core DNA-binding sequence, ATTA [40]. Nearly 25% of the NB and PNS CSBs examined in this study contain this core recognition sequence. ATTA-containing sites were found multiple times in selected NB and PNS enhancers (Figure 1). The cis-Decoder analysis identified 18 different neural specific ATTA-containing cDTs that were exclusively shared by two or more PNS enhancers or CNS enhancers and 10 were found to be shared between PNS and CNS. The most common cDT, ATTAgca, was shared by two CNS and two PNS enhancers (Figure 1; consensus homeodomain-binding sites are bold, flanking sequence lower case). In addition, 6 homeodomain-binding site cDTs were found twice in wor CSBs, aATTAccg, tttgaATTA, aatcaATTA, ATTAATctt and aaacaaATTAg, but not in other CNS or PNS enhancer CSBs. In some cases these cDTs were found repeated in given enhancer CSBs. Only one of these cDTs aligned with CSBs of enhancers of the E(spl) complex. Given that 2/3 of the occurrences of HOX sites in these promoters can be accounted for by cDTs whose flanking sequences are shared between enhancers, it is unlikely that the appearance of these shared sequences occurs by chance.
In summary, the appearance of Hox sites in the context of conserved sequences shared by functionally related enhancers suggests that the specificity of consensus homeodomain-binding sites is conferred by adjacent bases, either through recognition of adjacent bases by the TF itself or in conjunction with one or more co-factors.

Neural specific cDTs that contain Pbx/Extradenticle sites
Examination of the cDTs from Drosophila NB and PNS enhancers revealed that many contained the core Pbx/Extradenticle docking site ATGA [41,42]. In Drosophila, Extradenticle has been shown to have Hox-dependent and independent functions [43]. Studies have also shown that Pbx factors provide DNA-binding specificity for homeodomain TFs, facilitating specification of distinct structures along the body axis [43]. In the CNS enhancers of Drosophila, most predicted Pbx/Extradenticle sites are not, however, found adjacent to Hox sites.
Our analysis revealed that 8 of the Pbx motifs were shared between CNS and PNS enhancer types, and 16 were shared between similarly expressed enhancers (Figure 2), indicating that there is some degree of specificity to Pbx site function when flanking bases are taken into account. Three of the Pbx binding-site containing elements also exhibit ATTA Hox sites: 1) the dodecamer GATGATTAATCT (Pbx site is ATGA, Hox sites in bold) shared by the PNS enhancers edl and amos (references in Table 1), contains a homeodomain ATTA site that overlaps the Pbx site by a single base, and 2) the smaller heptamer ATGATTA, shared by pfe and ato, likewise contains a homeodomain ATTA site (bold) that overlaps the ATGA Pbx site by a single base. Adjacent Hox and Pbx sites have been documented to facilitate synergy between the two factors [44]. Taken together, our findings suggest that, as with homeodomain-binding sites, the conserved bases flanking putative Pbx sites are functionally important. These flanking bases are likely to confer different DNA-binding affinities for Pbx factors or are required for binding of other TFs.
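The single-base overlap between the ATGA and ATTA sites in the two elements quoted above can be checked directly. The following Python sketch is illustrative only; the helper names are assumptions, and only the strand as written is examined.

```python
def find_all(seq, motif):
    """Start index of every occurrence of motif in seq, overlaps included."""
    seq, motif = seq.upper(), motif.upper()
    return [i for i in range(len(seq) - len(motif) + 1)
            if seq.startswith(motif, i)]

def site_overlaps(seq, motif_a, motif_b):
    """(start_a, start_b, shared_bases) for each overlapping pair of sites."""
    overlaps = []
    for a in find_all(seq, motif_a):
        for b in find_all(seq, motif_b):
            shared = min(a + len(motif_a), b + len(motif_b)) - max(a, b)
            if shared > 0:
                overlaps.append((a, b, shared))
    return overlaps

dodecamer = site_overlaps("GATGATTAATCT", "ATGA", "ATTA")
heptamer = site_overlaps("ATGATTA", "ATGA", "ATTA")
```

In both elements the final A of ATGA is the first A of ATTA, so each call returns one pair of sites sharing exactly one base.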

Neural specific cDTs that contain Suppressor of Hairless binding sites
Also indicating a degree of biological specificity of enhancer types is the distribution of Suppressor of Hairless [Su(H)] binding sites among neural enhancers. Su(H) is the Notch pathway effector TF of Drosophila [45]. The members of the E(spl) complex, both the multiple basic helix-loop-helix (bHLH) repressor genes and the Bearded family members, have been shown to be Su(H) dependent [23,26]. The consensus in vitro DNA binding site for Su(H) is RTGRGAR (where R = A or G) [25]. Notch signaling via Su(H) occurs through conserved single or paired sites [46] and the presence of conserved sites for other transcription regulators associated with CSBs containing Su(H) binding sites has been documented [47].
Figure 1. Shared cDTs that contain Antennapedia class homeodomain protein DNA-binding sites within CNS and PNS neural precursor cell enhancers. Shown is a Cytoscape display of CNS and PNS neural precursor cell enhancer cDTs that contain core ATTA homeodomain DNA-binding sites. cDTs flanking the enhancer names are shared by CSBs of a single enhancer type, and cDTs positioned between the enhancer names are shared in common by CSBs of the two different enhancer types. Only cDTs of 7 or more bases shared by two or more enhancers are portrayed.
Figure 2. Shared cDTs that contain Pbx/Extradenticle core DNA-binding sites.
Within the CSBs of the six NB enhancers examined, only two, dpn and wor, contained conserved putative Su(H)-binding sites; two dpn sites matched one of the Su(H) consensus sites (GTGGGAA) and two wor sites match the sequence ATGGGAA. Only one of the two dpn sites contained flanking bases conforming to the widely distributed CGTGGGAA site of E(spl) Su(H) binding sites and none of the NB enhancers contained paired Su(H) sites typical of the E(spl) enhancers [25,46]. [48,49], and an AGGA Tramtrack (Ttk) DNA-binding core recognition sequence [50], but the order and context of these three sites is different for each enhancer). Although Su(H) binding sites were present in only a minority of NB and PNS enhancers, the conservation of core bases, as well as the complexity of their flanking conserved sequences points to a diversity of Su(H) function and interaction with other factors.
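Matching a degenerate consensus such as RTGRGAR against CSB sequence on both strands can be sketched as follows. This is an illustrative Python example, not the procedure used in the study; the function names and the small IUPAC table are assumptions, and positions are reported as 0-based offsets on the strand as written.

```python
import re

# Subset of IUPAC nucleotide codes sufficient for this consensus
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "[AG]", "Y": "[CT]", "N": "[ACGT]"}

def revcomp(seq):
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(pairs[base] for base in reversed(seq.upper()))

def find_sites(seq, consensus):
    """Return (start, strand, matched_sequence) for each consensus match."""
    body = "".join(IUPAC[code] for code in consensus.upper())
    pattern = re.compile("(?=(" + body + "))")  # lookahead keeps overlaps
    seq = seq.upper()
    width = len(consensus)
    hits = [(m.start(), "+", m.group(1)) for m in pattern.finditer(seq)]
    # Matches on the reverse complement are mapped back to forward coordinates
    hits += [(len(seq) - m.start() - width, "-", m.group(1))
             for m in pattern.finditer(revcomp(seq))]
    return sorted(hits)

e_spl_type = find_sites("CGTGGGAA", "RTGRGAR")
wor_type = find_sites("ATGGGAA", "RTGRGAR")
minus_strand = find_sites("TTCCCACG", "RTGRGAR")
```

Both GTGGGAA and ATGGGAA satisfy the consensus, and the third call shows a site that is only visible when the opposite strand is read.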

Neural specific cDTs that contain core DNA-binding sites for other known TFs
Two of these elements, one exclusively present in NB enhancers (CAGGATA) and a second exclusively present in PNS enhancers (GTAGGA), contained consensus core
Most of the cDTs of Table 2 do not contain sequences corresponding to consensus binding-sites of known regulators of NB expression. The fact that they are represented multiple times in NB CSB sequences suggests that they contain binding sites for unknown regulators of neurogenesis in Drosophila.

cis-Decoder analysis reveals a complex sub-structure of enhancer CSBs
EvoPrint analysis revealed that all of the enhancer regions examined in this study contained multiple CSBs that were greater than 15 to 20 bases in length. The occurrence of overlapping DNA-binding sites for different TFs is currently the best explanation for the maintenance of intact CSB sequences across ~160 million years of collective species divergence. Our analysis has revealed that the sequence context, order and orientation of shared cDTs can differ between co-regulating enhancers.
Two examples are given here of the complex contextual appearance of cDTs that appear frequently in CNS and PNS enhancers (Figure 3). Each of the eight CSBs shown was nearly fully 'covered' by cDTs of the NB library (data not shown), suggesting that each contains multiple overlapping binding sites for a number of TFs. First, examination of the distribution of cDT GCTGCA reveals that it overlaps, by one and two bases, adjacent but different consensus bHLH sites in scrt CSB # 32, while in scrt CSB # 23 it overlaps a third consensus bHLH sequence by two bases. In the PNS enhancer char, in CSB # 17, GCTGCA overlaps a bHLH site, but in a different configuration (overlapping four bases) than found in the two CNS enhancers illustrated in Figure 3A. In amos CSB # 26, GCTGCA appears adjacent to a HOX site and does not overlap a bHLH site. Second, examination of the distribution of the cDT GGCACG reveals that it overlaps different consensus bHLH sites in scrt CSB # 32 and wor CSB # 106, overlapping the bHLH site in the former by one base and in the latter by four bases. GGCACG overlaps a CAGCTG bHLH-binding site in rho CSB # 18, but in a different configuration than the overlap with CAGCTG in the wor CSB. In the PNS enhancer scrt, GGCACG in CSB # 5 overlaps a Hairy site N-box (consensus CACNAG) [48,49]. N-boxes were most common in E(spl) CSBs, but were also present in NB and PNS enhancer CSBs. In these two examples, and others we have examined, there are no consistent spatial constraints to the association of known TF-binding sites (i.e., bHLH-binding E-box sites) with novel cDTs; a picture that emerges is one of combinatorial complexity, in which known or novel cDTs are associated with each other in different contexts on different CSBs.
Figure 3. Shared sequence elements are found in different orientations and patterns within CSBs of neural precursor cell enhancers.
As an initial step toward determining if different TFs interacted with one another or competed for flanking DNA-binding sites, we examined the proximity of known binding sites to one another in CSBs for bHLH, Hox, Pbx and Su(H). The results of this analysis for NB CSBs are shown in Table 4; data for other enhancer types are summarized here. Most striking was the presence of multiple adjacent Hox ATTA sites (10 instances on NB CSBs) and combinations of Hox and Pbx sites (9 instances on NB CSBs). A typical example is the association of one Pbx site, a bHLH site and two Hox sites on a wor NB enhancer CSB (AATCATTTGTAATAATTAG; Pbx site is ATCA, Hox sites are TAAT and ATTA, and bHLH site is bold). Associations of Hox and Pbx sites were also apparent in PNS enhancer CSBs, and in addition there was a high level of combined Hox and bHLH sites (11 instances on PNS CSBs), but in E(spl) enhancers only a higher level of the combination of Hox and Pbx sites (8 instances) was apparent. An example of the association of Hox and bHLH sites in a PNS enhancer is found in an achaete-scute dorso-central enhancer CSB (CAAAACAACACTTGCTCTATTAAC; bHLH site in bold and Hox site is ATTA). There was also a distinctly higher level of Pbx sites on the same CSBs as bHLH sites in NB CSBs (6 instances), but this combination was not apparent for PNS or E(spl) CSBs. Association of bHLH sites with Su(H) binding sites was apparent in E(spl) enhancer CSBs, especially when presence on adjacent CSBs (14 instances) was taken into account. Only in one of the 7 instances of paired Su(H) sites on E(spl) enhancers were these sites on the same CSBs, while in four other instances they were on adjacent CSBs. Although we often find sites in close proximity, both known and functionally uncharacterized sites are, with a few exceptions, not present in fixed uniform orientation in similarly regulated enhancers. This highlights the complex combinatorial arrangement and position flexibility of TF-binding sites within enhancer CSBs.
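A simple way to score which classes of known sites occur together on the same CSB is sketched below in Python. The patterns are assumptions chosen to follow the worked examples in this section (ATTA or TAAT for Hox, ATCA or its reverse complement TGAT for Pbx, CANNTG for bHLH); the function names are hypothetical, and the counts in Table 4 were not produced with this code.

```python
import re
from collections import Counter
from itertools import combinations

# Lookaheads report every start position, including overlapping sites
SITE_PATTERNS = {
    "bHLH": re.compile(r"(?=CA[ACGT]{2}TG)"),
    "Hox": re.compile(r"(?=ATTA|TAAT)"),
    "Pbx": re.compile(r"(?=ATCA|TGAT)"),
}

def site_positions(csb):
    """Start positions of each site class on one CSB."""
    csb = csb.upper()
    return {name: [m.start() for m in pattern.finditer(csb)]
            for name, pattern in SITE_PATTERNS.items()}

def cooccurrence(csbs):
    """Count CSBs on which each pair of site classes is found together."""
    pairs = Counter()
    for csb in csbs:
        present = sorted(n for n, pos in site_positions(csb).items() if pos)
        pairs.update(combinations(present, 2))
    return pairs

wor_csb = "AATCATTTGTAATAATTAG"            # wor NB example from the text
ac_sc_csb = "CAAAACAACACTTGCTCTATTAAC"     # achaete-scute example from the text
wor_sites = site_positions(wor_csb)
pair_counts = cooccurrence([wor_csb, ac_sc_csb])
```

Because TAAT is scanned as well as ATTA, the wor CSB yields three Hox start positions (two of them overlapping), which illustrates how the count of "sites" depends on how overlapping and opposite-strand matches are treated.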

The use of cis-Decoder, FlyEnhancer and EvoPrinter to identify novel enhancers
We have used the information derived from cis-Decoder analysis of neural precursor cell enhancers to search for other genomic sequences with similar cis-regulatory properties. Having identified cDTs found multiple times among NB enhancers, we used the genomic search tool FlyEnhancer [55] to identify Drosophila melanogaster genomic sequences that contained clusters of the following cDTs (number in parentheses is the total number of each cDT in our sample of six NB enhancers): GGCACG (6), GGAATC (4), TGACAG (6), TGGGGT (4), CAGCTG (14), TGATTT (9), CAAGTG (7), CATATTT (5), TGATCC (7) and CTAAGC (6). As a lower limit, a minimum of three CAGCTG bHLH sites was set for this search, because of the prevalence of this site in nerfin-1 and deadpan NB enhancers. Each sequence detected by this search was subjected to EvoPrinter analysis to determine the extent of its sequence conservation. Among the cDT clusters identified, our search identified a 5' region adjacent to the nervy gene that contained three conserved CAGCTG sites as well as five other sites identical to TGACAG, GGAATC, TGGGGT, GGCACG and CATATTT (see below). nervy, originally identified as a target of homeotic gene regulation, is expressed in a subset of early CNS NBs, as well as in PNS SOP cells [56]. Later studies have implicated nervy, along with cyclic adenosine monophosphate (cAMP)-dependent protein kinase (PKA), in antagonizing Sema-1a-PlexA-mediated axonal repulsion [57], and nervy has been shown to promote mechanosensory organ development by enhancing Notch signaling [58].
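The logic of a cluster search with a required minimum of three CAGCTG sites can be sketched in Python as a sliding window over motif hits. This is not FlyEnhancer and does not reproduce its interface; the function names, the window size, and the minimum total number of hits are hypothetical parameters, and only the strand as written is scanned.

```python
def motif_hits(seq, motifs):
    """Sorted (start, motif) pairs for every exact motif occurrence."""
    seq = seq.upper()
    return sorted((i, m) for m in motifs
                  for i in range(len(seq) - len(m) + 1)
                  if seq.startswith(m, i))

def find_clusters(seq, motifs, window=700, required="CAGCTG",
                  min_required=3, min_total=5):
    """Return (start, end, n_hits) for windows anchored at a motif hit that
    hold at least min_required copies of `required` and min_total hits."""
    hits = motif_hits(seq, motifs)
    clusters = []
    for index, (start, _) in enumerate(hits):
        inside = [(p, m) for p, m in hits[index:]
                  if p + len(m) <= start + window]
        n_required = sum(1 for _, m in inside if m == required)
        if n_required >= min_required and len(inside) >= min_total:
            end = max(p + len(m) for p, m in inside)
            clusters.append((start, end, len(inside)))
    return clusters

# Toy sequence (hypothetical): three CAGCTG sites plus two other cDTs
spacer = "A" * 10
toy_seq = spacer.join(["CAGCTG", "CAGCTG", "CAGCTG", "GGCACG", "TGACAG"])
toy_motifs = ["CAGCTG", "GGCACG", "TGACAG"]
wide = find_clusters(toy_seq, toy_motifs, window=100)
narrow = find_clusters(toy_seq, toy_motifs, window=40)
```

With the wide window all five hits fall into one cluster; with the narrow window the three CAGCTG sites are found but the total falls below the hypothetical threshold, so nothing is reported. Candidate regions from such a search would still need to be checked for conservation, as described above.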
EvoPrinter analysis revealed that the cluster of neural precursor cell enhancer cDTs positioned 90 bp upstream from the nervy transcribed sequence contains highly conserved sequences (Figure 4A; chr2R:20,162,556-20,163,290). This region contains 10 CSBs that include six conserved E-boxes, three of which conform to the CAGCTG sequence that was prominent in nerfin-1 and deadpan promoters. To determine if this region functions as a neural precursor cell enhancer, we generated transformant lines containing the nervy CSB cluster linked to a minimal promoter/GFP reporter transgene (see methods section). Our analysis of the reporter expression driven by the nervy upstream fragment revealed a pattern indistinguishable from early nervy mRNA expression [56] (Figure 5). Specifically, we detected expression in a large subset of early delaminating NBs and in SOPs and secondary precursor cells of the PNS. Significantly, the nervy enhancer, unlike nerfin-1 and deadpan NB enhancers, activates reporter expression in the PNS and not just in early NBs.
A new cDT-library was generated combining the nervy enhancer CSBs and the NB and PNS enhancer CSBs used to generate the libraries described above. The new cDTs, along with the previously defined cDTs, were aligned back to nervy CSBs (Figure 4B). Most cDTs were found only once in previously examined NB or PNS CSBs, but 21 cDTs appeared in our original analysis, described above, that did not include the nervy enhancer. The addition of this new enhancer to our analysis resulted in the discovery of a significant number of cDTs that had not been found previously. Three cDTs that were identified in the previous analysis, tCAGCTGc, cagCAGCTG and aaCAGCTG, contain bHLH DNA-binding sites (central bases of E-box in bold, flanking sequence is lower case). Aligning cDTs that are specific to the CNS or PNS may indicate sequences required to specifically drive expression in either the CNS or PNS.
Figure 4 (caption fragment). cDTs generated by the inclusion of the nervy CSBs in the cDT library construction are also shown. CNS neuroblast specific cDTs are highlighted in red typeface, PNS precursor cell specific are noted with blue typeface and those present in both are indicated with black typeface (the number of enhancers that contain a cDT is also indicated).

Conclusion
The major finding of this study is that enhancers of co-regulated genes in neural precursor cells possess complex combinatorial arrangements of highly conserved cDT elements. Comparisons between NB and PNS enhancers identified CNS and PNS type-specific cDTs and cDTs that were enriched in one or another enhancer type. cis-Decoder analysis also revealed that many of the conserved sequences contain DNA-binding sites for classical regulators of neurogenesis, including bHLH, Hox, Pbx, and Su(H) factors. Although in vitro DNA-binding studies have shown that many of these factors have a certain degree of flexibility in the sequences to which they bind, defined in terms of a position weight matrix [60], our studies show that for any given appearance these sites are actually highly conserved across all species of the Drosophila genus. The genus invariant conservation in many of these characterized binding sites indicates that there are distinct constraints to that sequence in terms of its function.
The high degree of conservation displayed in the enhancer CSBs could derive from unique sequence requirements of individual TFs, or the intertwined nature of multiple DNA-binding sites for different TFs. Thus there is a higher degree of biological specificity to these sites than the flexibility that is detected using in vitro DNA-binding studies.
As an example, the requirement for a specific core for the bHLH binding site, i.e., for a CAGCTG E-box for nerfin-1, deadpan and nervy, suggests that it is the TF itself that demands sequence conservation; however, the requirement for conserved flanking sequences suggests that additional specific factors may be involved. Although the inter-species conservation of core and flanking sites has been noted by others [25], the extent of this conservation is rather surprising. To what extent and how evolutionary changes in enhancer function take place, given the conservation of core enhancer sequences, remains a question for future investigation.
In addition to classic regulators of neurogenesis, cis-Decoder reveals additional conserved novel elements that are widely distributed or only detected in pairs of enhancers. Many of these novel elements flank known transcription binding motifs in one CSB, but appear independent of known motifs in another. The appearance of novel elements in multiple contexts suggests that they may represent DNA-binding sites for additional factors that are essential for enhancer function. Only through discovery of the factors binding these sequences will it become clear what role they play in enhancer function.
Preliminary functional analysis of CSBs within the nerfin-1 neuroblast enhancer reveals that CSBs carry out different regulatory roles (Alexander Kuzin, unpublished results).
Altering cDT sequences within the nerfin-1 CSBs reveals that most are required for cell-specific activation or repression or for normal enhancer expression levels. CSB swapping studies reveal that, for the most part, the order and arrangement of a number of tested CSBs was not important for enhancer function in reporter studies. The discovery of the nervy neural enhancer by searching the genome with commonly occurring NB cDTs underscores the potential use of EvoPrinter and cis-Decoder analysis for the identification of additional neural enhancers. By starting with known enhancers and building cDT libraries from their CSBs, one now has the ability to search for other genes expressed during any biological event.

Generation of EvoPrints and CSB-libraries
EvoPrinter analysis was performed as described [10,61]. This analysis used EvoPrinterHD (please see Availability & requirements for more information), a second-generation EvoPrinter program that uses an enhanced-BLAT algorithm for increased resolution of conserved sequences [61]. Detailed instructions are provided at the EvoPrinter web site.
Figure 5. Expression pattern of the nervy enhancer-GFP reporter transgene during embryonic CNS and PNS development [59].
When possible, all twelve Drosophila species were used for the EvoPrint analysis; species that exhibited sequencing gaps were excluded. CSBs within enhancers were curated from either an EvoPrint, which reveals bases conserved in all species, or a relaxed print (also known as an EvoDifference profile), which identifies base pairs that are conserved in all but one of the species. The collective evolutionary divergence for all of the EvoPrints was greater than 140 My and in most cases, when all twelve species were included in the analysis, EvoPrints represented over 200 My of additive divergence. With the exception of two NB enhancers, scrt and wor, the size of each curated sequence was less than 1800 bases (Table 1). CSBs of 6 bp or longer were extracted from the EvoPrints using EvoPrint parser to generate CSB libraries. The number of CSBs in each enhancer, the enhancer length, and the position of the enhancer relative to the transcriptional start site are shown in Table 1. Lists of CSBs for each library are given at the cis-Decoder web site (please see Availability & requirements for more information).
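The extraction step described above can be illustrated with a minimal sketch. It is not the EvoPrint parser itself; it assumes that an EvoPrint is available as a plain string in which conserved bases are uppercase and non-conserved bases are lowercase, and the function name `extract_csbs` is hypothetical.

```python
import re


def extract_csbs(evoprint: str, min_len: int = 6) -> list[str]:
    """Return every run of consecutive conserved bases of at least min_len.

    Assumption: conserved bases are uppercase and non-conserved bases are
    lowercase in the input string.
    """
    # A CSB is modeled here as an uninterrupted stretch of uppercase bases.
    pattern = re.compile(r"[ACGT]{%d,}" % min_len)
    return pattern.findall(evoprint)


if __name__ == "__main__":
    # Toy sequence, not taken from any enhancer in this study.
    toy_print = "acgtCAGCTGTTaaGGCcatTTTGCATAAgg"
    for csb in extract_csbs(toy_print):
        print(len(csb), csb)
```

In this toy input the three-base conserved run GGC falls below the 6 bp cutoff and is therefore not reported as a CSB.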

Generation of cis-Decoder Tag libraries
In order to focus the analysis on neural-specific and neural-enriched cDTs, those cDTs that were found at high frequency in non-neural (mesodermal) enhancers were placed in a shared/common cDT-library. To identify neural specific cDT elements, the frequency of cDTs was scored against an out-group of mesodermal CSBs [11], and subsequently the common elements were removed. Prior to removal of mesodermal cDTs, the number of NB cDTs was 856, whereas after removal of shared cDTs, the number dropped to 272, indicating that the majority of cDTs shared by NB enhancers were also present in mesodermal enhancers.
Three cDT-libraries were generated by alignment of NB, PNS and E(spl) CSBs and are provided at the cis-Decoder web site (please see Availability & requirements for more information). The number of cDTs in each library was 272, 333 and 226, respectively. Of the 272 NB cDTs, fewer than half (120) aligned exclusively with NB CSBs and did not align with PNS or E(spl) CSB sequences. Only 21% of the NB cDTs corresponded to PNS tags; in other words, only 21% of the NB tags aligned two or more times with PNS CSBs.
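As a rough illustration of the out-group filtering, the sketch below splits a list of cDTs into an enriched set and a shared set according to how many out-group CSBs contain each tag in either orientation. The cutoff `max_hits`, the function names and the toy sequences are assumptions made for the example; the frequency threshold actually used by cis-Decoder is not stated here.

```python
def revcomp(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def outgroup_hits(cdt: str, outgroup_csbs: list[str]) -> int:
    """Count out-group CSBs that contain the cDT on either strand."""
    rc = revcomp(cdt)
    return sum(1 for csb in outgroup_csbs if cdt in csb or rc in csb)


def split_cdts(cdts, outgroup_csbs, max_hits=1):
    """Separate cDTs into (enriched, shared) lists.

    A cDT is treated as shared when it occurs in more than max_hits
    out-group CSBs; max_hits is a hypothetical cutoff.
    """
    enriched, shared = [], []
    for cdt in cdts:
        if outgroup_hits(cdt, outgroup_csbs) > max_hits:
            shared.append(cdt)
        else:
            enriched.append(cdt)
    return enriched, shared


if __name__ == "__main__":
    # Toy data only; these are not CSBs from the mesodermal out-group.
    neural_cdts = ["CAGCTG", "TTTGCA", "GGCACG"]
    meso_csbs = ["AACAGCTGTT", "CCCAGCTGAA", "TTGCAAAGG"]
    print(split_cdts(neural_cdts, meso_csbs))
```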

Cytoscape analysis
We have adapted the biomolecular interaction network software Cytoscape [62] in order to display shared cDTs from different enhancer CSBs. The following data structure was used: node1 xx node2, where node1 is the name of an enhancer, xx refers to any designator and node2 is the cDT sequence. This data structure facilitates the display of enhancer identity and shared sequence elements in an interactive pattern. Cytoscape analysis requires elimination of the reverse complements of cDTs in order to avoid duplicate representation. To eliminate duplicate reverse-complement cDTs, we used the program cDT-Uncomplementer (please see Availability & requirements for more information). After removing duplicates, cDTcataloger was used to name each node according to the enhancer aligning with that cDT.
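The node1 xx node2 layout and the removal of reverse-complement duplicates can be sketched as follows. This is only an illustration under stated assumptions: it does not reproduce cDT-Uncomplementer or cDTcataloger, the rule of keeping the alphabetically smaller strand is our own choice for the example, and the enhancer-to-cDT assignments are toy data.

```python
def revcomp(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def canonical(cdt: str) -> str:
    """Pick one representative for a cDT and its reverse complement.

    Assumption: the alphabetically smaller of the two strands is kept.
    """
    return min(cdt, revcomp(cdt))


def to_edge_lines(enhancer_cdts: dict[str, list[str]], relation: str = "xx") -> list[str]:
    """Build 'node1 relation node2' lines, one per enhancer/cDT pair."""
    lines = set()
    for enhancer, cdts in enhancer_cdts.items():
        for cdt in cdts:
            # Both orientations of a cDT collapse onto the same node name.
            lines.add(f"{enhancer} {relation} {canonical(cdt)}")
    return sorted(lines)


if __name__ == "__main__":
    toy = {"nerfin-1": ["TTTGCA", "TGCAAA"], "deadpan": ["TGCAAA"]}
    print("\n".join(to_edge_lines(toy)))
```

Because each line names an enhancer node and a sequence node, an enhancer that shares a cDT with another enhancer is linked to it through a common sequence node when the file is loaded as a network.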

Identification of novel neural precursor cell enhancers
To identify novel enhancers that direct gene expression in neural precursor cells, we curated cDTs that were shared by multiple identified NB enhancers and submitted them to the web-based genomic search tool FlyEnhancer [55] to discover other genomic regions with similar densities of cDTs. Candidate sequences that contained such densities of cDT alignments were subjected to EvoPrinterHD analysis to determine the extent of conservation. Candidate enhancer regions were then selected for enhancer/reporter studies.
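The idea of scanning for regions with a high density of cDT matches can be sketched with a simple sliding-window count. This is not the FlyEnhancer algorithm; the window size, the hit threshold, the function names and the toy sequence are all assumptions chosen for illustration.

```python
from bisect import bisect_left


def revcomp(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def cdt_hits(sequence: str, cdts: list[str]) -> list[int]:
    """Sorted start positions of cDT matches on either strand."""
    motifs = set(cdts) | {revcomp(c) for c in cdts}
    hits = []
    for motif in motifs:
        start = sequence.find(motif)
        while start != -1:
            hits.append(start)
            start = sequence.find(motif, start + 1)
    return sorted(hits)


def dense_windows(sequence, cdts, window=1000, min_hits=5):
    """Return (start, count) for windows that begin at a hit and contain
    at least min_hits cDT matches within `window` bases."""
    hits = cdt_hits(sequence, cdts)
    found = []
    for i, start in enumerate(hits):
        # Index of the first hit that lies outside the current window.
        end = bisect_left(hits, start + window)
        if end - i >= min_hits:
            found.append((start, end - i))
    return found


if __name__ == "__main__":
    toy = "AAAA" + "CAGCTG" + "GG" + "TGCAAA" + "C" * 20 + "CAGCTG"
    print(dense_windows(toy, ["CAGCTG", "TTTGCA"], window=20, min_hits=2))
```

Regions flagged in this way would still need the conservation check described above before being treated as candidate enhancers.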

Generation and analysis of nervy enhancer/reporter transformant lines
Genomic DNA containing the putative nervy enhancer (734 bp) was amplified by PCR using standard methods. Primers for the nervy upstream region, which carry BglII and NheI sites as their first six bases, were AGATCTCTAAAGCCCTCGATGTGCCC (5') and GCTAGCTCCGACCAGTCGTAAGTGGCG (3'), respectively. Fragments were gel purified and cloned into the pCRII-TOPO double promoter vector. Sequencing verified the fidelity of the PCR and cloning. After digestion with BglII and NheI, fragments were gel purified and cloned into pH-Stinger [63]. Details of our procedure are available upon request. The generation of transformant lines and embryo immunohistochemistry were carried out as described previously [64].

Authors' contributions
WR and KB participated in the design and implementation of the algorithms. AK and MK participated in the cloning of enhancers. TB and WFO conceived of the study, participated in the design and coordination of the algorithms and prepared the manuscript. All authors have read and approved the final draft of the manuscript.