Comparative genomics using teleost fish helps to systematically identify target gene bodies of functionally defined human enhancers
© Parveen et al; licensee BioMed Central Ltd. 2013
Received: 9 November 2012
Accepted: 19 February 2013
Published: 23 February 2013
The human genome is enriched with thousands of conserved non-coding elements (CNEs). Recently, a medium-throughput strategy was employed to analyze the ability of human CNEs to drive tissue-specific expression during mouse embryogenesis. These data led to the establishment of a publicly available genome-wide catalog of functionally defined human enhancers. The scattering of enhancers over large regions in vertebrate genomes seriously impedes attempts to pinpoint their precise target genes. Such associations are a prerequisite for exploring the significance of this in vivo characterized catalog of human enhancers in development, disease and evolution.
This study is an attempt to systematically identify the target gene bodies for functionally defined human CNE-enhancers. For this purpose we adopted an orthology/paralogy mapping approach and compared the CNE-induced reporter expression with the reported endogenous expression patterns of neighboring genes. This procedure pinpointed specific target gene bodies for a total of 192 human CNE-enhancers. These associations enable us to gauge the maximum genomic search space for enhancer hunting: 4 Mb of genomic sequence around the gene of interest (2 Mb on either side). Furthermore, we used human-rodent comparison for a set of 159 orthologous enhancer pairs to infer that central nervous system (CNS) specific gene expression is closely associated with the cooperative interaction among at least eight distinct transcription factors: SOX5, HFH, SOX17, HNF3β, c-FOS, Tal1beta-E47S, MEF and FREAC.
In conclusion, the systematic wiring of cis-acting sites to their target gene bodies is an important step toward unraveling the role of the in vivo characterized catalog of human enhancers in development, physiology and medicine.
One of the main emerging challenges for genomics research is to discover all functional regions in the human genome. The completion of human genome sequencing and assembly, and its annotation using computational and comparative genomic approaches, has led to the cataloging of ~25,000 protein-coding genes. Key questions now relate to understanding how the spatial and temporal expression patterns of these human genes are established at the cellular and organismal level. In eukaryotes, transcriptional regulation tends to involve combinatorial interactions between several transcription factors, which allow for a sophisticated response to multiple conditions in the cellular environment [2, 3]. In metazoans the precise spatial and temporal patterns of gene expression also require enhancer elements, distant regions of DNA that can loop back to the promoter. To comprehend the molecular mechanisms that govern specific expression patterns, it is important to identify the distant-acting transcriptional regulatory elements (enhancers) associated with each predicted gene. Furthermore, the ability to identify such elements is an essential step toward understanding how gene expression is altered in pathological conditions. However, this task remains difficult owing to the lack of knowledge of the vocabulary controlling gene regulation and to the vast genomic search space, as many such distant-acting enhancers are positioned remotely from their target gene bodies.
Metazoan genes hold extremely intricate regulatory sequences that direct complex patterns of expression in diverse cell types, tissues and developmental phases. Expression of a typical animal gene is likely to be governed by several distinct enhancer elements that can be located in 5′ and 3′ genomic regions, as well as within intronic intervals. Metazoan cis-regulatory sequences are modular: each enhancer is responsible for a subset of the total gene expression pattern and usually mediates expression within a specific tissue or cell type, or within a developmental phase or domain. These elements are typically up to 500 bp long and contain binding sites for several distinct sequence-specific transcription factors.
The origin of organismal complexity is often thought to be the consequence of the evolution of novel gene functions subsequent to gene duplication events. According to the classical model describing the evolutionary fate of duplicate genes, one copy of the duplicated gene pair often degenerates by accumulating deleterious mutations, whereas the other copy keeps the ancestral function. This model further predicts that, very rarely, one gene copy may obtain a novel adaptive function, resulting in the preservation of both duplicates, one copy with the new function and the other preserving the ancestral function. However, in numerous cases, empirical data suggest that the fraction of genes preserved subsequent to duplication events is much higher than predicted by the classical model. In view of the regulatory complexity of eukaryotic genes, it was proposed that complementary degenerative mutations in distinct regulatory elements of duplicated genes can assist the preservation of both copies, thus providing long-term opportunities for the evolution of novel gene functions. Numerous duplicate genes have been confirmed to evolve following this model of regulatory subfunctionalization. For instance, zebrafish engrailed-1 and engrailed-1b are a pair of transcription factor genes generated by a duplication event specific to the teleost fish lineage. Expression pattern analysis revealed distinct expression domains for the zebrafish engrailed paralogs: engrailed-1 is expressed in the pectoral appendage bud, whereas engrailed-1b is expressed in the hindbrain/spinal cord region. The mouse genome harbors a single ortholog (engrailed-1) of both zebrafish genes, which is expressed in both the pectoral appendage bud and the hindbrain/spinal cord. Complementary changes in gene expression domains after gene duplication events appear to be a general rule rather than an exception, and such changes usually happen rapidly after gene duplication.
Genomic comparison of a diverse set of vertebrate species revealed many genomic intervals that have remained conserved throughout the vertebrate lineage. Some of these sequences correspond to coding genes and non-coding RNAs; however, two thirds of them are unlikely to produce a functional transcript. These sequences fall into a new category of elements, collectively called conserved non-coding elements (CNEs). These elements have been experimentally shown to harbor transcriptional regulatory elements and are thus involved in the regulation of gene expression. Therefore, comparative genomics based strategies are now being employed to predict genomic regions harboring transcriptional regulatory elements, even in the absence of knowledge about the specific characteristics of individual cis-regulatory elements.
To explore the functional significance of conserved non-coding genomic elements, Pennacchio and coworkers (2006 & 2008) carried out in vivo enhancer analysis of hundreds of human CNEs in a transgenic mouse assay using LacZ as a reporter gene [15, 17, 18]. These data confirmed the gene regulatory function of ~1000 of these sequences, which reproducibly direct reporter expression in a diverse set of body tissues at mouse embryonic day 11.5. To elucidate the significance of this in vivo characterized catalog of human enhancers in organismal development, physiology and medicine, it is essential to pinpoint the precise target gene for each of these elements. Notably, Pennacchio and coworkers (2006 & 2008) associated the gene regulatory potential of these human enhancers with the genes harboring them (intragenic) or with their immediate flanking genes (intergenic). However, empirical evidence shows that enhancer regions are often located at large distances from the transcriptional start site of the genes upon which they act. They may be located upstream or downstream of the target gene, within its introns, or in introns of unrelated neighboring genes, and can lie at a distance of 1 Mb or even greater while still being able to regulate gene expression in a tissue-specific manner. The scattering of enhancer elements over large regions in vertebrate genomes impedes attempts to assign precise target gene bodies to functionally characterized enhancer elements.
Table 1. The association of the human CNE-enhancers with their target gene bodies by comparative syntenic analyses and through comparison of the reporter expression induced by CNE-enhancers with the reported endogenous expression patterns of the neighboring genes
Columns: No. of enhancers | No. of target genes | Minimal evidence for association
Evidence categories: Paralogy mapping (duplication) | Orthology mapping (synteny) | Synteny along with expression (MGI in-situ)
Once the enhancers were assigned to their probable targets, we next sought to use these associations as a training dataset to gauge the maximum distance at which an enhancer can act upon its target gene body. Furthermore, this study aims to define the central nervous system (brain and spinal cord) specific transcription factor code.
Results and discussion
Predicting target genes for human CNE-enhancers
We devised a rule-based procedure to associate CNE-enhancers with their respective target genes (see methods). The human CNE-enhancers selected for this purpose are conserved over a long evolutionary distance, i.e. between human and teleost fish. Pennacchio and coworkers have confirmed the gene regulatory potential of these deeply conserved human elements by employing a transgenic mouse assay. The combined use of comparative genomics and expression pattern analysis allowed us to explicitly assign these human enhancers to their target gene bodies.
Among the unduplicated set of CNE-enhancers (149/192), 45 regions were assigned to a single target gene on the basis of synteny comparison alone (Table 1, for detailed list see Additional file 1: Table S1). For example, a BLAST-based search with a CNE-enhancer (hs529) residing on Hsa9 (17,322,200-17,324,371) identified putative orthologs of this human interval in the zebrafish, medaka, Fugu, stickleback and tetraodon genomes (Figure 3B). Synteny comparison of the human, zebrafish, medaka, Fugu, stickleback and tetraodon orthologous loci revealed the differential loss of the BNC2, C9orf39 and FOXE1 genes (positioned in the vicinity of the CNE-enhancer) from the corresponding loci, suggesting that they are bystanders. SH3GL2 was the only gene in this human locus that maintains physical linkage with the CNE-enhancer in all examined genomes. Therefore, the gene regulatory potential of this CNE-enhancer (CNE_SH3GL2) was assigned to the human SH3GL2 gene (Figure 3B).
For 48/149 unduplicated CNE-enhancers we noticed that more than one syntenically conserved gene (2 or 3) shows an expression pattern similar to that of the CNE-enhancer. In such cases, two or more genes are declared as targets of the relevant CNE-enhancer (Table 1). For instance, a human CNE-enhancer (hs1305) on Hsa8 (80,874,361-80,876,746) revealed conserved syntenic association with HEY1 and STMN2 in all the compared orthologous loci (Figure 4B). Furthermore, the reported endogenous expression pattern of both of these genes matches the reporter expression pattern induced by the CNE-enhancer (Additional file 3: Table S2). Therefore, we assigned both HEY1 and STMN2 as target genes for this human enhancer region (Figure 4B).
Range of action of human cis-acting sites
It is often assumed that cis-acting regulatory sites act on the nearest neighboring genes, and therefore the conventional search space for enhancers includes the immediate upstream or downstream intervals of genes of interest. However, examples exist where enhancers skip nearby promoters and act specifically on distantly located genes. For instance, tissue-specific expression of the developmentally important SHH and SOX9 genes was found to be regulated by enhancers positioned ~1 Mb away from their transcription start sites. Intriguingly, in this study a substantial proportion (31/192) of enhancer-target gene associations are established over a genomic interval of ≥ 1 Mb, supporting the view that long-range spatial interactions between regulatory sites and their target gene bodies are not rare exceptions but occur on a pervasive scale (Figure 2B and Additional file 4: Table S3). Consequently, assignment of the full complement of regulatory sites that act upon a single gene of interest can be seriously impeded by extreme site separation, and this could in turn hinder attempts to associate disease-causing non-coding mutations with the gene bodies concerned. The rigorously defined enhancer-target gene associations established in this study underscore the idea that the rule of linear proximity between regulatory sites and target genes is inadequate for the identification of cis-regulatory regions. They extend the previously established maximum search space for enhancers from a 2 Mb DNA template around the gene of interest to a 4 Mb DNA template (2 Mb on either side) around the gene of interest (Additional file 4: Table S3).
CNS specific transcriptional factor code
In higher eukaryotes the cell type specific or temporally specific influence of enhancers on their target genes is implemented through interactions of these cis-regulatory modules with TFs. The co-occurrence of a distinct set of TFs (heterotypic clustering), with each type of binding site represented many times within the same regulatory region (homotypic clustering), is a key feature of TF interaction networks in complex metazoans. An important feature of homotypic site clustering is that it facilitates cooperative binding of factors that interact with moderate or weak sites. Numerous transcription factors work in concert to regulate target genes in a developmental, cell, or tissue-specific manner. Typically, a specific type of cooperativity (a similar set of distinct TFs) is required to regulate a diverse set of genes exhibiting temporally and spatially synchronized expression (co-expressed genes) [25, 26]. For example, skeletal-muscle-specific expression of distinct gene sets has been associated with the cooperative interactions among at least five TFs: Mef-2, Myf, Sp1, SRF, and Tef. Prediction of transcription factor cooperativity has been carried out in yeast and human, but unfortunately our current knowledge about the combinations of TFs that contribute to the tissue specificity of cis-regulatory modules is limited. This in turn has limited large-scale bioinformatics studies of tissue-specific gene regulation.
Here we seek to identify TFs that act cooperatively to define the CNS (central nervous system) specificity of an enhancer. For this purpose, from the selected subset of distant-acting developmental enhancers we chose those for which reproducible CNS-specific activity has been shown in vivo in E11.5 mouse embryos. This data set consists of 159/192 elements, the majority (118/159) of which are explicitly associated with a single target gene (Table 1). Given that a typical TF binding motif can be as short as 5-8 bp, in silico matches to such short motifs occur frequently by chance alone, and many of these predicted sites are presumably non-functional. Therefore, a major challenge in the computational identification of such motifs is distinguishing functional TFBSs from spurious motif matches. In order to better define biologically relevant combinations of TFs, while analyzing each brain-specific cis-regulatory module for an input set of known TFs we focused on: (i) evolutionary conservation of each enhancer across the human and mouse lineages; (ii) conserved binding motifs that occurred more than once in an enhancer. This stringent criterion, which combines phylogenetic footprinting with the expected occurrence of homotypic interactions within typical metazoan enhancers, reveals that the brain-specific cis-regulatory modules have evolutionarily conserved binding site preferences for fourteen transcription factors: SOX5, HFH, SOX17, c-FOS, HNF3β, c-REL, MEF2, nMYC, USF, FREAC, Tal1beta-E47S, NF-kappaB, AML1 and ARNT (Additional file 5: Table S4).
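Criterion (ii) above, retaining only TFs whose conserved motif recurs within an enhancer, can be illustrated with a minimal sketch. The function name, the `min_sites` parameter and the example hit list are hypothetical and are not taken from the study; the input is assumed to be a list with one TF name per human-mouse conserved site already reported by a footprinting tool.

```python
from collections import Counter


def recurrent_conserved_tfs(conserved_hits, min_sites=2):
    """Return TFs whose conserved motif occurs at least `min_sites` times.

    `conserved_hits` holds one TF name per conserved binding site found in
    a single enhancer (assumed to be pre-filtered for conservation).
    """
    counts = Counter(conserved_hits)
    return sorted(tf for tf, n in counts.items() if n >= min_sites)


# Hypothetical conserved hits for one enhancer (illustrative only).
example_hits = ["SOX5", "HFH", "SOX5", "c-FOS", "HFH", "SOX5", "USF"]
print(recurrent_conserved_tfs(example_hits))
```

Requiring at least two conserved sites per TF is one simple way to encode the homotypic clustering expectation; the threshold is exposed as a parameter because the appropriate stringency may differ between data sets.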
The CNE-enhancers across the 14 TFs of the training data set were arranged as a data matrix, and the non-conserved/non-coding elements (Additional file 6: Table S5) across the 14 TFs of the control data set were arranged as a second matrix. The two correlation matrices of the training and control data sets, $R_X = [r_{ij}^2]$ and $R_Y = [r_{ij}^2]$, revealed the pattern of TFs existing in the two data sets (Additional file 7: Table S6). In the training data set a specific pattern of TF interactions is exposed by the formation of three distinctive groups of TFs with strong correlation within groups and poor or no correlation between groups, leaving aside two TFs: Tal1beta-E47S and MEF. These two are correlated neither with each other nor with any member of the three groups. The findings of $R_X = [r_{ij}^2]$ may be grouped into a three-cluster configuration as follows:
➢ Cluster-1: n-MYC, ARNT, USF (each $r_{ij}^2$ higher than 69%)
➢ Cluster-2: c-REL, NF-kappaB ($r^2$ is 30%)
➢ Cluster-3: FREAC, AML-1, c-FOS, HNF3-β, SOX17, HFH, SOX5 (12% ≤ $r_{ij}^2$ ≤ 54%)
The findings of $R_Y = [r_{ij}^2]$ contrast with the three-cluster structure found from $R_X = [r_{ij}^2]$. Apart from a highly correlated group of three TFs similar to cluster-1, no other group structure is seen among the remaining TFs in the control data set (Additional file 7: Table S6). A non-interactive pattern of TFs may therefore be presumed for the control data set.
Table 2. Probability table of transcription factor binding sites in training and control data sets
Columns: $P(TF_i)$ in training data set | $P(TF_i)$ in control data set
❖ Cluster-3: Talb-E47S, MEF, FREAC, AML-1, c-FOS, HNF3-β, SOX17, HFH, SOX5.
We may conclude that, in the training data set, there are three distinct clusters of internally interactive TFs that show binding site preferences for brain-specific cis-regulatory modules. The statistical significance of the pattern of TF interactions exposed in the training data set is supported by the probability table (Table 2) and by the correlation matrix interpreted as $R = [r_{ij}^2]$ (Additional file 7: Table S6), and is further supported by comparison with the control results obtained from non-conserved, non-coding elements.
List of transcription factors having over-representative occurrence in brain-specific cis-acting sites
Rows (protein class, where listed | known endogenous expression pattern):
RUNT-type transcription factor | spinal cord, hindbrain
- | spinal cord, forebrain, midbrain, hindbrain
Basic-leucine zipper (bZIP) transcription factor | CNS (components not defined yet)
Winged helix-turn-helix transcription repressor | CNS (components not defined yet)
Winged helix-turn-helix transcription repressor DNA | CNS (components not defined yet)
Winged helix-turn-helix transcription repressor DNA | spinal cord, forebrain, midbrain, hindbrain
myocyte-specific enhancer factor 2A | spinal cord, forebrain, midbrain, hindbrain
- | spinal cord, forebrain, midbrain, hindbrain
High mobility group, HMG1/HMG2 | spinal cord, forebrain, midbrain, hindbrain
High mobility group, HMG1/HMG2 | spinal cord, hindbrain, midbrain
- | spinal cord, forebrain, midbrain, hindbrain
- | spinal cord, forebrain, hindbrain
The metazoan cis-regulatory landscape is complex, with modular organization and widespread spatial distribution around the target genes. This complexity hampers attempts to localize and catalog the entire repertoire of enhancers that orchestrate spatially and temporally diverse expression patterns for a single gene of interest. Recent, relatively high-throughput studies based on the transgenic mouse assay generated a genome-wide, experimentally validated data set of hundreds of human enhancers. The regulatory activity of each of these enhancers was confirmed through enhancer-induced LacZ reporter expression in E11.5 mouse embryos, and the tissue specificity of expression was also described. Such genome-wide collections of enhancer data are expected to contribute immensely to understanding i) the structural anatomy of metazoan enhancers, ii) the mechanisms of enhancer-target gene interactions, iii) the comprehensive cataloging of genetic regulatory circuits for developmentally critical genes, and iv) the role of cis-regulatory mutations/alterations in development and disease. However, the utility of these in vivo characterized enhancers for a variety of biological applications requires their systematic association with target gene bodies. This study is an attempt to systematically associate a subset of these functionally defined enhancers with their target genes, thus establishing regulatory interactions for dozens of human genes. These explicit associations would enable the screening of this subset of enhancers for pathogenic mutations that do not affect the coding sequences of the genes concerned but disrupt the functionality of these cis-acting regulatory sites, thereby altering temporal, spatial and quantitative aspects of gene expression.
Furthermore, assigning enhancers to bona fide target genes allowed us to gauge the maximum range at which an enhancer can access its target promoter, and showed that in the human genome long-range regulatory interactions occur more frequently and involve longer distances than was previously anticipated. The maximum range of enhancer action reported in this study should serve as a guideline when analyzing disorders associated with chromosomal deletions or rearrangements, as such locus alterations are known to disrupt communication between distant regulatory sites and their target genes.
Gene regulation is exerted by cooperative interactions among TFs that bind to clusters of sites within cis-regulatory regions. Distinct cis-regulatory modules that direct reporter expression selectively in a particular cell or tissue are likely to interact with a similar set of TFs, thus defining the TFBS code for co-expressed genes. In this respect, given an abundant set of experimentally defined regulatory sequences that are sufficient to direct expression of a reporter gene in a cell-specific pattern, and a list of TFs that are known to be relevant to that tissue, it is possible to construct a biologically relevant, tissue-specific, complex heterotypic TF cluster model. Under this assumption, among the CNE-enhancers linked to their target genes we analyzed the large subset of regions that direct gene expression selectively to the CNS and defined combinatorial heterotypic interactions of multiple TFs that are likely to bind to a typical CNS-specific cis-regulatory module. This analysis not only outlines the generalized structure of a typical CNS-specific enhancer, but also establishes a TF interaction network that can be used as a training data set for large-scale identification of CNS-specific enhancers.
In vivo dataset
The experimentally verified catalog of human enhancers that is the basis of this study was obtained from the VISTA Enhancer Browser. The core dataset of the VISTA Enhancer Browser consists of experimental in vivo data on human and mouse tissue-specific enhancers. These enhancer regions were initially identified by evolutionary sequence conservation or by ChIP-seq. Subsequently these putative enhancer sequences were tested in a transgenic mouse assay to validate their in vivo function and to determine their tissue specificity. Elements that show reproducible and consistent LacZ reporter gene expression in at least three mouse embryos are presented as positive enhancer elements, whereas elements for which no reporter expression is observed among a minimum of five transgenic embryos are defined as negative. The dataset of positive elements is reported comprehensively in terms of sequence coordinates, flanking genes, annotated expression patterns and details of the reproducibility of each structure, images of individual embryos, and series of histological sections. Currently this database hosts 975 experimentally confirmed human enhancer sequences that reproducibly direct reporter expression in a diverse set of embryonic domains at embryonic day 11.5. We restricted our analyses to 192 human-teleost fish conserved sequences (Table 1, for detailed list see Additional file 1: Table S1). For each candidate enhancer element, we retrieved from the VISTA Enhancer Browser information such as genomic sequence, VISTA enhancer ID, conservation depth, names of neighboring genes, tissue specificity and image data (Additional file 3: Table S2).
Assigning the target gene to human CNE-enhancers
In order to associate each of the selected human CNE-enhancers with its bona fide target gene, we analyzed the neighboring genomic context using the UCSC and Ensembl genome browsers and drafted a locus map depicting the flanking genes spanning an interval of at least 2 Mb on either side of the CNE-enhancer (Additional file 2: Figure S1). The sequence similarity of the selected subset of CNEs between the human and teleost fish genomes suggests that they are functional in both lineages. It is therefore reasonable to speculate that the target genes would also be the same in both lineages. Given this assumption, the human synteny maps bearing these CNE-enhancers were compared with the currently available teleost genomes (zebrafish, tetraodon, stickleback, medaka and Fugu) by using the Multi-species view option of the Ensembl genome browser. This allowed us to carefully map the genomic context of evolutionarily conserved human enhancers in the corresponding zebrafish, tetraodon, medaka, stickleback and Fugu loci (Additional file 2: Figure S1). Among these anciently diverged genomes (human-teleost fish, >450 Mya), uninterrupted physical linkage between a CNE-enhancer and one or more neighboring genes was taken as evidence of functional association (Additional file 1: Table S1 and Additional file 2: Figure S1). To further confirm these associations, for one or more genes showing evolutionarily conserved physical association with a CNE-enhancer, the endogenous expression pattern of the mouse ortholog was obtained from MGI. We preferred gene expression data obtained by RNA in-situ hybridization where available. Reporter gene expression induced by the selected CNE-enhancer was also obtained from the VISTA Enhancer Browser database.
We manually compared the image data of transgenic mouse embryos expressing the LacZ reporter gene under the influence of a CNE-enhancer element with the RNA in-situ hybridization based endogenous expression data of genes residing in the neighborhood of the enhancer sequence (Additional file 3: Table S2).
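The synteny rule described above, in which a gene counts as a candidate target only if it stays physically linked to the CNE-enhancer in every compared genome, amounts to a set intersection. The sketch below is illustrative only: the function name is hypothetical, and the per-species gene lists are invented placeholders loosely modeled on the hs529 example, not the actual content of the fish loci.

```python
def conserved_linkage(human_neighbors, fish_neighbors_by_species):
    """Return human flanking genes whose orthologs stay linked to the CNE
    in every fish genome examined (candidate target genes)."""
    retained = set(human_neighbors)
    for genes in fish_neighbors_by_species.values():
        retained &= set(genes)  # drop genes lost from this species' locus
    return sorted(retained)


# Human genes flanking the enhancer (names taken from the hs529 example).
human_locus = ["BNC2", "C9orf39", "SH3GL2", "FOXE1"]

# Invented presence/absence data, for illustration only.
fish_loci = {
    "zebrafish": ["SH3GL2", "BNC2"],
    "medaka": ["SH3GL2", "FOXE1"],
    "Fugu": ["SH3GL2"],
    "stickleback": ["SH3GL2", "BNC2"],
    "tetraodon": ["SH3GL2", "C9orf39"],
}

print(conserved_linkage(human_locus, fish_loci))
```

When more than one gene survives the intersection, the expression comparison against MGI in-situ data described in the text would be the next filter; that manual step is not modeled here.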
Duplicated copies of the selected subset of CNE-enhancers (dCNEs) were identified through BLAST-based similarity searches at the Ensembl and UCSC genome browsers [55, 56]. We categorized the duplicated enhancers into those with duplicated copies only in the fish lineage (a single counterpart in human), those with duplicated copies only in human (a single counterpart in fish), and those with duplicated copies in both the fish and human lineages (Figure 2A). Duplicated CNE-enhancers could further be linked explicitly to their target genes through paralogy mapping, i.e. by identifying the genes that have paralogs in the genomic regions that harbor at least two dCNEs from the same family. Paralogy relationships among the target genes of the duplicated set of enhancers were obtained by using the paralogy prediction pipeline of the Ensembl genome browser, in which maximum likelihood phylogenetic gene trees (generated by TreeBeST) play a central role.
Estimation of range of action of human CNE-enhancers
Orthology mapping, paralogy mapping and expression pattern analysis helped in assigning bona fide target genes to a total of 192 human CNE-enhancers. This large number of enhancer-target gene associations enables us to define the genomic range of regulatory activity for human enhancer sequences. For this purpose we calculated the distance between each CNE-enhancer and the transcriptional start site of its predicted target gene and then examined the distribution of distances. We partitioned the range of enhancer action into CNE-enhancers embedded within intronic intervals of the target gene (intragenic) and CNE-enhancers whose target gene lies within the ranges 0-200 kb, 201-400 kb, 401-600 kb, 601-999 kb and >1 Mb (Figure 2B). Our data show that 36/192 (18.75%) enhancers are located within an intronic interval of the gene they regulate, 47/192 (24.48%) enhancers are within a range of 0-200 kb from their assigned gene, 32/192 (16.67%) enhancers are within a distance of 201-400 kb, 19/192 (9.90%) are positioned within 401-600 kb of their associated gene, and 27/192 (14.06%) are separated by a distance of 601-999 kb from their target. Intriguingly, 31/192 (16.15%) enhancers were found to act on their target gene body from a distance of >1 Mb (Figure 2B and Additional file 4: Table S3).
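The partitioning above can be sketched as a small binning function together with a recomputation of the reported percentages from the counts given in the text. The function name and the exact boundary handling (for example, treating exactly 1 Mb as part of the last bin) are assumptions made for illustration; only the six counts come from the study.

```python
def distance_bin(distance_bp, intragenic=False):
    """Assign an enhancer to a range-of-action class.

    `distance_bp` is the distance to the target gene's transcription start
    site in base pairs. Boundary handling is an assumption of this sketch.
    """
    if intragenic:
        return "intragenic"
    kb = distance_bp / 1000
    if kb <= 200:
        return "0-200 kb"
    if kb <= 400:
        return "201-400 kb"
    if kb <= 600:
        return "401-600 kb"
    if kb < 1000:
        return "601-999 kb"
    return ">=1 Mb"


# Counts per class as reported in the text (total of 192 enhancers).
counts = {
    "intragenic": 36,
    "0-200 kb": 47,
    "201-400 kb": 32,
    "401-600 kb": 19,
    "601-999 kb": 27,
    ">=1 Mb": 31,
}
total = sum(counts.values())
percentages = {k: round(100 * n / total, 2) for k, n in counts.items()}

for label, pct in percentages.items():
    print(f"{label}: {counts[label]}/{total} ({pct}%)")
```

Summing the counts is a quick consistency check that the six classes together account for all 192 enhancer-target associations.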
Transcription factor analysis
To establish the central nervous system (CNS) specific transcription factor (TF) code, we selected the subset of human CNE-enhancers (159/192) that were shown to drive expression in various domains of the mouse CNS (Additional file 5: Table S4). For this purpose, phylogenetic footprinting was applied to human and mouse orthologous enhancer regions to track the occurrence of evolutionarily conserved groupings of transcription factor binding sites (TFBSs) in the experimentally verified subset of brain-specific enhancers (Additional file 5: Table S4).
Mouse orthologs of the human enhancers were obtained through BLAST-based similarity searches. Human-mouse conserved transcription factor binding sites in each CNE-enhancer were detected with the computer program ConSite. The ConSite screen for conserved TFBSs was performed against the JASPAR database with an 85% conservation cutoff, a 60 bp window size and a 75% transcription factor score threshold.
To track cooperative heterotypic interactions among distinct sets of TFs within brain-specific enhancers, we employed statistical methods for their identification and verification. We formulated a multivariate data matrix with n rows (the sample of enhancers) and p columns (the TFs) for the training and control data sets (for the control data set see Additional file 6: Table S5). Because occurrences of TFs in a sample of enhancers are known not to be mutually exclusive, the occurrence of each TF was summarized by its individual probability of occurrence in the sample, $P(TF_i)$. To look for patterns and structure among the TFs, the training data matrix of 159 enhancers across 14 TFs and the control data matrix of non-conserved/non-coding elements were subjected to a two-step exploratory data analysis: computation of the probabilities of the TFs (Table 2), and computation of the correlation matrices $R_X = [r_{ij}^2]$ (lower diagonal in Additional file 7: Table S6) and $R_Y = [r_{ij}^2]$ (upper diagonal in Additional file 7: Table S6) for the training and control data sets, respectively. The probability table (Table 2) classifies TFs with $P(TF_i) < 0.5$ as members of group-1 and TFs with $P(TF_i) \geq 0.5$ as members of group-2 in the training and control data sets.
The correlation matrices of the data sets are used to define clusters of TFs that covary across all possible pairs of TFs. For this purpose, the squared correlation coefficient ($R = [r_{ij}^2]$) is interpreted, as it indicates meaningful and practical co-variation among the variables (Additional file 7: Table S6) [59, 60].
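The two exploratory steps described above can be sketched with NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the data matrix is taken to hold conserved site counts (enhancers in rows, TFs in columns), $P(TF_i)$ is assumed to be the fraction of enhancers carrying at least one site for TF i, and the toy matrix and TF labels are invented.

```python
import numpy as np


def tf_probabilities(X):
    """Fraction of enhancers (rows) with at least one site for each TF (column)."""
    return (np.asarray(X) > 0).mean(axis=0)


def squared_correlation(X):
    """Matrix of squared Pearson correlations r_ij^2 between TF columns."""
    r = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    return r ** 2


# Invented toy matrix: 6 enhancers x 3 TFs (TF_A, TF_B, TF_C), site counts.
X = np.array([
    [2, 2, 0],
    [0, 0, 1],
    [3, 3, 0],
    [0, 0, 2],
    [2, 2, 0],
    [0, 0, 0],
])

p = tf_probabilities(X)
r2 = squared_correlation(X)
print("P(TF_i):", np.round(p, 2))
print("r^2 matrix:\n", np.round(r2, 2))
```

In this toy example TF_A and TF_B always co-occur, so their squared correlation is 1, which is the kind of signal the cluster assignments rely on. Columns that are constant across all enhancers would yield undefined correlations and would need to be removed beforehand.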
Principal Component Analysis (PCA), a powerful multivariate exploratory tool, is used to identify patterns in high-dimensional (large p), interrelated data sets and to express the data sets in a way that highlights their similarities and differences. For interrelated multivariate data sets, PCA is appropriately applied by eigen analysis of the correlation matrix $R = [r_{ij}]$. PCA is therefore used here as a means of constructing an informative graphical representation of the data by projecting it onto a lower-dimensional space. In this study, the control and training data are presented in a three-dimensional (3D) subspace of the first three PCs [61, 62]. Each PC derived by eigen analysis of the correlation matrix ($R = [r_{ij}]$) is a linear combination of the original p variables (the TFs) and is uncorrelated with the other PCs, meaning that the PCs are the transformed data expressed in terms of the patterns existing in the original data set. The total number of PCs derived equals the number of original variables in the data set. The p PCs are ordered by decreasing share of the total variation in the data set. Thus the first three PCs, which capture most of the variation in the data set, are visualized in a 3D representation. The coefficients of the variables in each linear combination, i.e. in each PC, are defined as loadings. The magnitude of a loading represents the importance of the corresponding variable. Thus a 3D representation of the loadings of the first three PCs identifies any cluster structure present among the variables (the TFs), exhibiting the co-occurring pattern of TFs in the control and training data sets.
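The eigen analysis described above can be sketched as follows. This is a generic correlation-matrix PCA written with NumPy under stated assumptions: the function name is hypothetical, the input is a random placeholder matrix with the same shape as the training data (159 enhancers by 14 TFs), and no results of the study are reproduced.

```python
import numpy as np


def pca_loadings(X, n_components=3):
    """PCA by eigen analysis of the correlation matrix R = [r_ij].

    Returns the loadings (eigenvectors, one column per PC) and the share of
    total variation captured by each of the first `n_components` PCs.
    """
    R = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)   # eigh: R is symmetric
    order = np.argsort(eigvals)[::-1]      # sort PCs by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()
    return eigvecs[:, :n_components], explained[:n_components]


# Placeholder data only: random counts, 159 enhancers x 14 TFs.
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(159, 14))

loadings, explained = pca_loadings(X)
print("loadings shape:", loadings.shape)
print("variance share of first three PCs:", np.round(explained, 3))
```

Each row of `loadings` gives one TF's coordinates in the 3D loading space, so plotting the rows as points would show TFs with similar co-occurrence patterns close together. With random input, as here, no cluster structure is expected.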
The comparative analysis of control and training is of major significance in the validation of clusters of known TFs highly represented in human brain specific enhancers.
We are thankful to the NCB computer programmers Yasir Mahmood Abbasi and Aftab Alam for providing the computational facility. This research was supported by the Higher Education Commission of Pakistan.
- Maston GA, Evans SK, Green MR: Transcriptional regulatory elements in the human genome. Annu Rev Genomics Hum Genet. 2006, 7: 29-59. 10.1146/annurev.genom.7.080505.115623.
- Giangrande PH, Zhu W, Rempel RE, Laakso N, Nevins JR: Combinatorial gene control involving E2F and E Box family members. EMBO J. 2004, 23: 1336-1347. 10.1038/sj.emboj.7600134.
- Abbasi AA, Paparidis Z, Malik S, Bangs F, Schmidt A: Human intronic enhancers control distinct sub-domains of Gli3 expression during mouse CNS and limb development. BMC Dev Biol. 2010, 10: 44. 10.1186/1471-213X-10-44.
- Nolis IK, McKay DJ, Mantouvalou E, Lomvardas S, Merika M: Transcription factors mediate long-range enhancer-promoter interactions. Proc Natl Acad Sci USA. 2009, 106: 20222-20227. 10.1073/pnas.0902454106.
- Anderson E, Peluso S, Lettice LA, Hill RE: Human limb abnormalities caused by disruption of hedgehog signaling. Trends Genet. 2012, 28: 364-373. 10.1016/j.tig.2012.03.012.
- Nobrega MA, Pennacchio LA: Comparative genomic analysis as a tool for biological discovery. J Physiol. 2004, 554: 31-39.
- Goode DK, Elgar G: Capturing the regulatory interactions of eukaryote genomes. Brief Funct Genomics. 2012, Epub ahead of print.
- Levine M, Tjian R: Transcription regulation and animal diversity. Nature. 2003, 424: 147-151. 10.1038/nature01763.
- Abbasi AA: Evolution of vertebrate appendicular structures: Insight from genetic and palaeontological data. Dev Dyn. 2011, 240: 1005-1016. 10.1002/dvdy.22572.
- Abbasi AA: Molecular evolution of HR, a gene that regulates the postnatal cycle of the hair follicle. Sci Rep. 2011, 1: 32.
- Force A, Lynch M, Pickett FB, Amores A, Yan YL: Preservation of duplicate genes by complementary, degenerative mutations. Genetics. 1999, 151: 1531-1545.
- Zhang J: Evolution by gene duplication: an update. Trends Ecol Evol. 2003, 18: 292-298. 10.1016/S0169-5347(03)00033-8.
- Abbasi AA, Paparidis Z, Malik S, Goode DK, Callaway H: Human GLI3 intragenic conserved non-coding sequences are tissue-specific enhancers. PLoS One. 2007, 2: e366. 10.1371/journal.pone.0000366.
- Dermitzakis ET, Reymond A, Antonarakis SE: Conserved non-genic sequences - an unexpected feature of mammalian genomes. Nat Rev Genet. 2005, 6: 151-157.
- Pennacchio LA, Ahituv N, Moses AM, Prabhakar S, Nobrega MA: In vivo enhancer analysis of human conserved non-coding sequences. Nature. 2006, 444: 499-502. 10.1038/nature05295.
- Wasserman WW, Sandelin A: Applied bioinformatics for the identification of regulatory elements. Nat Rev Genet. 2004, 5: 276-287. 10.1038/nrg1315.
- Visel A, Minovitsky S, Dubchak I, Pennacchio LA: VISTA enhancer browser–a database of tissue-specific human enhancers. Nucleic Acids Res. 2007, 35: D88-92. 10.1093/nar/gkl822.
- Visel A, Prabhakar S, Akiyama JA, Shoukry M, Lewis KD: Ultraconservation identifies a small subset of extremely constrained developmental enhancers. Nat Genet. 2008, 40: 158-160. 10.1038/ng.2007.55.
- Blackwood EM, Kadonaga JT: Going the distance: a current view of enhancer action. Science. 1998, 281: 60-63.
- Vavouri T, McEwen GK, Woolfe A, Gilks WR, Elgar G: Defining a genomic radius for long-range enhancer action: duplicated conserved non-coding elements hold the key. Trends Genet. 2006, 22: 5-10. 10.1016/j.tig.2005.10.005.
- Woolfe A, Elgar G: Comparative genomics using Fugu reveals insights into regulatory subfunctionalization. Genome Biol. 2007, 8: R53. 10.1186/gb-2007-8-4-r53.
- Leipoldt M, Erdel M, Bien-Willner GA, Smyk M, Theurl M: Two novel translocation breakpoints upstream of SOX9 define borders of the proximal and distal breakpoint cluster region in campomelic dysplasia. Clin Genet. 2007, 71: 67-75.
- Gotea V, Visel A, Westlund JM, Nobrega MA, Pennacchio LA: Homotypic clusters of transcription factor binding sites are a key component of human promoters and enhancers. Genome Res. 2010, 20: 565-577. 10.1101/gr.104471.109.
- Segal E, Raveh-Sadka T, Schroeder M, Unnerstall U, Gaul U: Predicting expression patterns from regulatory sequence in Drosophila segmentation. Nature. 2008, 451: 535-540. 10.1038/nature06496.
- De Bosscher K, Vanden Berghe W, Haegeman G: The interplay between the glucocorticoid receptor and nuclear factor-kappaB or activator protein-1: molecular mechanisms for gene repression. Endocr Rev. 2003, 24: 488-522. 10.1210/er.2002-0006.
- Bartholdy B, Matthias P: Transcriptional control of B cell development and function. Gene. 2004, 327: 1-23. 10.1016/j.gene.2003.11.008.
- Wasserman WW, Palumbo M, Thompson W, Fickett JW, Lawrence CE: Human-mouse genome comparisons to locate regulatory sites. Nat Genet. 2000, 26: 225-228. 10.1038/79965.
- Bult CJ, Eppig JT, Kadin JA, Richardson JE, Blake JA: The mouse genome database (MGD): mouse biology and model systems. Nucleic Acids Res. 2008, 36: D724-728.
- Diez-Roux G, Banfi S, Sultan M, Geffers L, Anand S: A high-resolution anatomical atlas of the transcriptome in the mouse embryo. PLoS Biol. 2011, 9: e1000582. 10.1371/journal.pbio.1000582.
- Yoshikawa M, Senzaki K, Yokomizo T, Takahashi S, Ozaki S: Runx1 selectively regulates cell fate specification and axonal projections of dorsal root ganglion neurons. Dev Biol. 2007, 303: 663-674. 10.1016/j.ydbio.2006.12.007.
- Rankin EB, Higgins DF, Walisser JA, Johnson RS, Bradfield CA: Inactivation of the arylhydrocarbon receptor nuclear translocator (Arnt) suppresses von Hippel-Lindau disease-associated vascular tumors in mice. Mol Cell Biol. 2005, 25: 3163-3172. 10.1128/MCB.25.8.3163-3172.2005.
- Moser M, Knoth R, Bode C, Patterson C: LE-PAS, a novel Arnt-dependent HLH-PAS protein, is expressed in limbic tissues and transactivates the CNS midline enhancer element. Brain Res Mol Brain Res. 2004, 128: 141-149. 10.1016/j.molbrainres.2004.06.023.
- Ali AE, Wilson YM, Murphy M: A single exposure to an enriched environment stimulates the activation of discrete neuronal populations in the brain of the fos-tau-lacZ mouse. Neurobiol Learn Mem. 2009, 92: 381-390. 10.1016/j.nlm.2009.05.004.
- Dong M, Wu Y, Fan Y, Xu M, Zhang J: c-fos modulates brain-derived neurotrophic factor mRNA expression in mouse hippocampal CA3 and dentate gyrus neurons. Neurosci Lett. 2006, 400: 177-180. 10.1016/j.neulet.2006.02.063.
- Bakalkin G, Yakovleva T, Terenius L: NF-kappa B-like factors in the murine brain. Developmentally-regulated and tissue-specific expression. Brain Res Mol Brain Res. 1993, 20: 137-146. 10.1016/0169-328X(93)90119-A.
- O’Riordan KJ, Huang IC, Pizzi M, Spano P, Boroni F: Regulation of nuclear factor kappaB in the hippocampus by group I metabotropic glutamate receptors. J Neurosci. 2006, 26: 4870-4879. 10.1523/JNEUROSCI.4527-05.2006.
- Kalinichenko VV, Gusarova GA, Shin B, Costa RH: The forkhead box F1 transcription factor is expressed in brain and head mesenchyme during mouse embryonic development. Gene Expr Patterns. 2003, 3: 153-158. 10.1016/S1567-133X(03)00010-3.
- Gray PA, Fu H, Luo P, Zhao Q, Yu J: Mouse brain organization revealed through direct genome-scale TF expression analysis. Science. 2004, 306: 2255-2257. 10.1126/science.1104935.
- Lein ES, Hawrylycz MJ, Ao N, Ayres M, Bensinger A: Genome-wide atlas of gene expression in the adult mouse brain. Nature. 2007, 445: 168-176. 10.1038/nature05453.
- Hammock EA, Eagleson KL, Barlow S, Earls LR, Miller DM: Homologs of genes expressed in Caenorhabditis elegans GABAergic neurons are also found in the developing mouse forebrain. Neural Dev. 2010, 5: 32. 10.1186/1749-8104-5-32.
- Waite MR, Skidmore JM, Billi AC, Martin JF, Martin DM: GABAergic and glutamatergic identities of developing midbrain Pitx2 neurons. Dev Dyn. 2011, 240: 333-346. 10.1002/dvdy.22532.
- Deng Q, Andersson E, Hedlund E, Alekseenko Z, Coppola E: Specific and integrated roles of Lmx1a, Lmx1b and Phox2a in ventral midbrain development. Development. 2011, 138: 3399-3408. 10.1242/dev.065482.
- Lin X, Shah S, Bulleit RF: The expression of MEF2 genes is implicated in CNS neuronal differentiation. Brain Res Mol Brain Res. 1996, 42: 307-316. 10.1016/S0169-328X(96)00135-0.
- Lyons GE, Micales BK, Schwarz J, Martin JF, Olson EN: Expression of mef2 genes in the mouse central nervous system suggests a role in neuronal maturation. J Neurosci. 1995, 15: 5727-5738.
- Duckworth EA, Butler T, Collier L, Collier S, Pennypacker KR: NF-kappaB protects neurons from ischemic injury after middle cerebral artery occlusion in mice. Brain Res. 2006, 1088: 167-175. 10.1016/j.brainres.2006.02.103.
- Freudenthal R, Boccia MM, Acosta GB, Blake MG, Merlo E: NF-kappaB transcription factor is required for inhibitory avoidance long-term memory in mice. Eur J Neurosci. 2005, 21: 2845-2852. 10.1111/j.1460-9568.2005.04126.x.
- Zhao X, DA D, Lim WK, Brahmachary M, Carro MS: The N-Myc-DLL3 cascade is suppressed by the ubiquitin ligase Huwe1 to inhibit proliferation and promote neurogenesis in the developing brain. Dev Cell. 2009, 17: 210-221. 10.1016/j.devcel.2009.07.009.
- Campbell DB, Levitt P: Regionally restricted expression of the transcription factor c-myc intron 1 binding protein during brain development. J Comp Neurol. 2003, 467: 581-592. 10.1002/cne.10958.
- Magdaleno S, Jensen P, Brumwell CL, Seal A, Lehman K: BGEM: an in situ hybridization database of gene expression in the embryonic and adult mouse nervous system. PLoS Biol. 2006, 4: e86. 10.1371/journal.pbio.0040086.
- Azim E, Jabaudon D, Fame RM, Macklis JD: SOX6 controls dorsal progenitor identity and interneuron diversity during neocortical development. Nat Neurosci. 2009, 12: 1238-1247. 10.1038/nn.2387.
- Batista-Brito R, Rossignol E, Hjerling-Leffler J, Denaxa M, Wegner M: The cell-intrinsic requirement of Sox6 for cortical interneuron development. Neuron. 2009, 63: 466-481. 10.1016/j.neuron.2009.08.005.
- Ravanpay AC, Olson JM: E protein dosage influences brain development more than family member identity. J Neurosci Res. 2008, 86: 1472-1481. 10.1002/jnr.21615.
- Heng JI, Tan SS: Cloning and characterization of GRIPE, a novel interacting partner of the transcription factor E12 in developing mouse forebrain. J Biol Chem. 2002, 277: 43152-43159. 10.1074/jbc.M204858200.
- Prasad S, Singh K: Interaction of USF1/USF2 and alpha-Pal/Nrf1 to Fmr-1 promoter increases in mouse brain during aging. Biochem Biophys Res Commun. 2008, 376: 347-351. 10.1016/j.bbrc.2008.08.155.
- Fujita PA, Rhead B, Zweig AS, Hinrichs AS, Karolchik D: The UCSC genome browser database: update 2011. Nucleic Acids Res. 2011, 39: D876-882. 10.1093/nar/gkq963.
- Stalker J, Gibbins B, Meidl P, Smith J, Spooner W: The Ensembl web site: mechanics of a genome browser. Genome Res. 2004, 14: 951-955. 10.1101/gr.1863004.
- Vilella AJ, Severin J, Ureta-Vidal A, Heng L, Durbin R: EnsemblCompara GeneTrees: complete, duplication-aware phylogenetic trees in vertebrates. Genome Res. 2009, 19: 327-335.
- Sandelin A, Wasserman WW, Lenhard B: ConSite: web-based prediction of regulatory elements using cross-species comparison. Nucleic Acids Res. 2004, 32: W249-252. 10.1093/nar/gkh372.
- Guilford JP, Fruchter B: Fundamental statistics in psychology and education. 1978, New York: McGraw-Hill, 545 p.
- Ferguson GA, Takane Y: Statistical analysis in psychology and education. 1989, New York: McGraw-Hill, 587 p.
- Everitt BS: Multivariable modeling and multivariate analysis for the behavioral sciences. 2009, CRC Press, 320 p.
- Jolliffe IT: Principal component analysis. 2002, New York: Springer-Verlag, 487 p.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.