Gene promoters show chromosome-specificity and reveal chromosome territories in humans

Background: Gene promoters have guided evolutionary processes for millions of years. It seems that they were the main engine responsible for the integration of different mutations that were favorable under the prevailing environmental conditions. In cooperation with different transcription factors and other biochemical components, these regulatory regions dictate the synthesis frequency of RNA molecules. Predominantly in the last decade, it has become clear that nuclear organization impacts upon gene regulation. To fully understand the connections between Homo sapiens chromosomes and their gene promoters, we analyzed 1200 promoter sequences using our Kappa Index of Coincidence method.

Results: In order to measure the structural similarity of gene promoters, we used two-dimensional image-based patterns obtained through Kappa Index of Coincidence (Kappa IC) and (C+G)% values. The center of weight of each promoter pattern indicated a structural similarity between promoters of each chromosome. Furthermore, the proximity of chromosomes seems to be in accordance with the structural similarity of their gene promoters. The arrangement of chromosomes according to Kappa IC values of promoters shows a striking symmetry between the chromosome length and the structure of promoters located on them. High Kappa IC and (C+G)% values of gene promoters were also directly associated with the most frequent genetic diseases. Taking into consideration these observations, a general hypothesis for the evolutionary dynamics of the genome has been proposed. In this hypothesis, heterochromatin and euchromatin domains exchange DNA sequences according to a difference in the rate of Slipped Strand Mispairing and point mutations.

Conclusions: In this paper we showed that gene promoters appear to be specific to each chromosome. Furthermore, the proximity between chromosomes seems to be in accordance with the structural similarity of their gene promoters. Our findings are based on comprehensive data from the Transcriptional Regulatory Element Database and a new computational model built around the Kappa Index of Coincidence.


Background
Inside the body, somatic cells exercise their overall functions in G0 phase (the period between cell divisions) [1][2][3]. During this phase, individual chromosomes are impossible to distinguish by light or electron microscopy. For instance, when cells are terminally differentiated, some of them enter a permanent (quiescent) G0 phase, such as myocytes, the majority of neuronal cell types or pancreatic beta cells. Other types of cells exhibit a temporary G0 phase, such as glial cells or hepatocytes, which divide under controlled conditions. However, less is known about the precise location of chromosomes and their relationship with the inner nuclear membrane and the nuclear pores through which molecular traffic takes place. Inside the nucleus of specialized cells, spatial arrangements of chromosomes in G0 phase play an important role in the regulation of gene expression patterns [4,5]. The nucleus lacks membrane compartmentalization [6,7]. In telophase, mitotic chromosomes unfold into a chromatin state [8,9]. Immediately after the nuclear membrane is formed, heterochromatin is allocated to the nuclear periphery whereas euchromatin is generally contained towards the nuclear interior. In G0 phase, chromatin shows different states of condensation, such as constitutive heterochromatin, facultative heterochromatin and euchromatin [10,11]. Constitutive heterochromatin consists of permanently condensed DNA, usually containing multiple short repeats and low gene density. Facultative heterochromatin represents a temporary DNA condensation state, located at the surface of the heterochromatin landscape [12,13]. The active part of the nucleus (gene-rich areas), where the transcription of DNA to mRNA takes place, is represented by the euchromatin domain. The relaxed structure of euchromatin allows regulatory proteins and RNA polymerase complexes to bind to DNA for transcription initiation and mRNA elongation [14].
Euchromatin domains which are never stored as facultative heterochromatin are usually under active transcription and contain housekeeping genes, which are crucial for basic cell functions [15]. Genes embedded inside facultative heterochromatin can transit to and from euchromatin, depending on the functions that the cell needs to perform, in certain time intervals or under the action of certain external stimuli. It is recognized that many active genes that are brought into or near heterochromatin landscapes become repressed, and that their transcriptional reactivation is achieved by reallocation to the nuclear interior [16][17][18]. Nevertheless, other studies show that some genes are transcriptionally active close to the nuclear periphery [19][20][21]. Electron microscopy images show a lack of heterochromatin around nuclear pores [22]. Although active inside euchromatin, some inducible genes from the nuclear interior are relocated near nuclear pores for a fast response under the action of certain stimuli [23][24][25][26][27]. However, facultative heterochromatin represents only one of many mechanisms through which cells start or stop the expression of certain genes. Heterochromatin is also critical in morphogenesis and differentiation. In embryogenesis, chromatin establishes different structural landscapes depending on cell specialization. For instance, Hox gene clusters [28,29] are responsible for the spatial structure of the body. In humans, these genes are located on chromosome 7 (HOXA gene clusters), 17 (HOXB gene clusters), 12 (HOXC gene clusters) and 2 (HOXD gene clusters). In embryogenesis, Hox genes are brought to the surface into the euchromatin domain in order to be expressed in a sequential manner [30,31]. Polycomb-group proteins and other biochemical mechanisms reshape chromatin depending on the cell type, allowing a favorable positioning of these genes inside the euchromatin domain [32].
In terminally differentiated somatic cells, Hox genes are permanently silenced by their inclusion inside the heterochromatin domain. Moreover, modulation of gene expression through chromatin structure is not limited to single genes or gene clusters. For instance, in female morphogenesis one X chromosome is silenced through its condensation inside facultative heterochromatin [33][34][35] (the Barr body), while the active X chromosome is included in the euchromatin domain. In G0 phase, genes of common function can colocalize inside the nuclear space in order to share the same transcription machinery [36]. Thus, these genes may be incorporated into the same transcription factory or into closely neighboring transcription factories [37,38]. It appears that these active regions are positioned between chromosome territories.
In this paper we sought to identify structural features of gene promoters located on different chromosomes in the human genome. Our hypothesis was based on the premise that promoter sequences are more exposed to the biochemical transcription machinery and therefore may reflect the chromosome boundaries much better. Previous approaches to promoter analysis include motif sequences and other structural parameters, such as DNA curvature, bendability, stability, nucleosome positioning or comparison of various DNA sequences [39][40][41][42][43][44][45][46]. Nevertheless, a clear association between promoter nucleotide sequences and chromosome territories has not previously been hypothesized. The purpose of our work was to establish a possible functional significance of promoter sequences which may explain the dynamic relationship between different chromosome territories.

Methods
In our approach we used 1200 promoter sequences (50 random promoters from each chromosome) from the Transcriptional Regulatory Element Database [47,48]. We were mainly interested in the regions flanking the putative TSS, ranging from -700 b to +299 b. We used Visual Basic to develop a software program for promoter analysis, called PromKappa (Promoter analysis by Kappa). The source code of this program is provided in Additional file 1. We used a sliding window approach to extract two types of values, namely Kappa Index of Coincidence (Kappa IC) and (C+G)%.

Kappa index of coincidence
The Index of coincidence principle is based on letter frequency distributions and has been used for the analysis of natural-language plaintext in cryptanalysis [49]. Kappa Index of Coincidence is a form of Index of Coincidence used for matching two text strings. However, we managed to adapt Kappa IC for the analysis of a single DNA sequence. This adaptation of Kappa IC is used for calculating the level of "randomization" of a DNA sequence. Kappa IC is sensitive to various degrees of sequence organization such as simple sequence repeats (SSRs) or short tandem repeats (STRs) [50]. The formula for Kappa IC is shown below, where sequences A and B have the same length N. Only if an A[i] nucleotide from sequence A matches the B[i] correspondent from sequence B, then ∑ is incremented by 1. Q represents the number of letters in the alphabet (in our case Q=4).
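Consistent with the definitions above, the two-sequence Kappa IC can be written in its standard cryptanalytic form. This is a reconstruction from those definitions (the match indicator notation [A_i = B_i] is an assumption of this sketch), not a formula taken from the PromKappa source code:

```latex
\mathrm{Kappa\ IC} = \frac{\sum_{i=1}^{N} [A_i = B_i]}{N / Q},
\qquad
[A_i = B_i] =
\begin{cases}
1 & \text{if } A_i = B_i \\
0 & \text{otherwise}
\end{cases}
```

Under this form, two unrelated random sequences over a four-letter alphabet give a value near 1, while two identical sequences give Q = 4.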
With small changes, the same method for measuring the Index of Coincidence has been applied for only one sequence, in which the sequence was actually compared with itself, as shown below in the algorithm implementation.
Where N is the length of the sliding window, A represents the sliding window content, B contains all variants of sequences generated from A (from u+1 to N), C counts the number of coincidences occurring between sequence B and sequence A, and T variable counts the total number of coincidences found between sequences of B and the sequence A.
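The self-comparison described above can be sketched in Python as follows. This is a minimal illustration that reuses the variable names N, A, B, C, T and u from the text; the function name `kappa_ic_self` and the normalisation (each shifted comparison expressed as a percentage of its length, then averaged over the N-1 shifts) are assumptions and may differ from the PromKappa implementation in Additional file 1.

```python
def kappa_ic_self(window: str) -> float:
    """Self-comparison Kappa IC of one sliding window, as a percentage (0-100).

    Hypothetical sketch: the per-shift normalisation is an assumption.
    """
    a = window.upper()          # A: the sliding window content
    n = len(a)                  # N: the sliding window length
    if n < 2:
        raise ValueError("window must contain at least two nucleotides")

    t = 0.0                     # T: total over all shifted variants of A
    for u in range(1, n):
        b = a[u:]               # B: variant of A running from position u+1 to N
        # C: coincidences between B and the start of A, position by position
        c = sum(1 for x, y in zip(a, b) if x == y)
        t += 100.0 * c / len(b)  # assumed: coincidences as a percentage of len(B)

    return t / (n - 1)          # assumed: average over the N-1 shifts


if __name__ == "__main__":
    print(kappa_ic_self("AAAAAAAAAA"))   # homopolymer tract: every shift matches
    print(kappa_ic_self("ACGTTGCAAC"))   # more interspersed sequence: lower value
```

With this normalisation a perfect repeat such as a poly(dA) tract scores 100, and the score falls as the window becomes more interspersed, which matches the sensitivity to SSRs and STRs described in the text.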

Cytosine and guanine content
We extracted C+G values from each sliding window while taking into account the nucleotide frequencies of the entire promoter sequence. In the first stage, we determined the (C+G)% content of the entire promoter sequence as CG_TOT = 100 × (C+G)_TOT / (A+T+C+G)_TOT, where "TOT" (total) designates the promoter sequence, CG_TOT represents the percentage of cytosine and guanine, (A+T+C+G)_TOT represents the sum of occurrences of A, T, C and G, and (C+G)_TOT represents the sum of occurrences of C and G. In the next stage, we used the value of CG_TOT to calculate the (C+G)% content of the sliding window (SW), where CG_SW represents the percentage of cytosine and guanine in the sliding window. In this stage, the CG_SW value is relative to CG_TOT. The expression (A+T+C+G)_TOT represents the sum of occurrences of A, T, C and G from the sliding window sequence, and (C+G)_SW represents the sum of C and G occurrences in the sliding window sequence. Nevertheless, in our implementation we also included the option to extract CG_SW values without considering CG_TOT.
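As a minimal sketch of the simpler of the two options, the following Python code computes (C+G)% for a whole promoter and for each sliding window without rescaling by CG_TOT. The function names `cg_percent` and `cg_windows` are hypothetical, and the CG_TOT-relative variant is deliberately not reproduced here because its exact form is not given in the text.

```python
def cg_percent(seq: str) -> float:
    """Percentage of C and G among the A, T, C and G occurrences in seq."""
    s = seq.upper()
    total = sum(s.count(base) for base in "ATCG")   # (A+T+C+G)
    if total == 0:
        raise ValueError("sequence contains no A, T, C or G")
    return 100.0 * (s.count("C") + s.count("G")) / total   # 100 x (C+G) / (A+T+C+G)


def cg_windows(promoter: str, window: int = 30, step: int = 1) -> list:
    """(C+G)% of each sliding window, not rescaled by the promoter-wide value."""
    return [
        cg_percent(promoter[i:i + window])
        for i in range(0, len(promoter) - window + 1, step)
    ]


if __name__ == "__main__":
    toy = "AACCGGTTAACCGGTT"            # toy sequence, not a real promoter
    print(cg_percent(toy))              # promoter-wide (C+G)%
    print(cg_windows(toy, window=4, step=2))
```

Characters other than A, T, C and G (for example N) are ignored in the denominator, which is an assumption of this sketch.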

Promoter analysis
By extracting Kappa IC percentages and C+G content from a sliding window (window size of 30 nt and a step of 1 nt), we were able to measure the localized values along the promoter sequences (Figure 1A,B). Kappa Index of Coincidence values were plotted on a graph against (C+G)% values, forming a recognizable pattern for each promoter sequence (Figure 1C). The x-coordinate of each point was represented by a (C+G)% value and the y-coordinate by the corresponding Kappa IC value. As expected, a large window size produced smooth promoter patterns, whereas a small window size generated sharp and distinguishable characteristics of promoters. These patterns are composed of clusters of various sizes on the y-axis (Figure 1C and Additional file 2). The center of weight of each pattern was plotted on a graph designed to show the distribution of promoters for each chromosome. Furthermore, in order to observe the boundaries within which Homo sapiens promoters are included, we used 8,515 gene promoters from EPD [51,52] (Eukaryotic Promoter Database) to produce a more general distribution (Figure 1D and Additional file 3). In this case we used a color scheme to highlight the denser surfaces. Red areas represent clusters of similar promoters while blue areas represent unique or rare promoters.
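The construction of a promoter pattern and its center of weight can be sketched in Python as below. This is an illustration under stated assumptions: the helper names are hypothetical, the Kappa IC normalisation is the percentage form assumed for this sketch, (C+G)% is taken per window without rescaling, and the center of weight is assumed to be the arithmetic mean of the point coordinates.

```python
from statistics import mean


def kappa_ic_self(window: str) -> float:
    """Self-comparison Kappa IC as a percentage (assumed normalisation)."""
    a = window.upper()
    n = len(a)
    total = 0.0
    for u in range(1, n):
        b = a[u:]                                    # shifted variant of the window
        matches = sum(1 for x, y in zip(a, b) if x == y)
        total += 100.0 * matches / len(b)
    return total / (n - 1)


def cg_percent(window: str) -> float:
    """(C+G)% of one window, not rescaled by the promoter-wide value."""
    s = window.upper()
    return 100.0 * (s.count("C") + s.count("G")) / len(s)


def promoter_pattern(seq: str, window: int = 30, step: int = 1) -> list:
    """One (x, y) point per window: x = (C+G)%, y = Kappa IC."""
    points = []
    for i in range(0, len(seq) - window + 1, step):
        w = seq[i:i + window]
        points.append((cg_percent(w), kappa_ic_self(w)))
    return points


def center_of_weight(points: list) -> tuple:
    """Assumed here to be the mean x and mean y of all pattern points."""
    xs, ys = zip(*points)
    return mean(xs), mean(ys)


if __name__ == "__main__":
    toy = "ATATATATATGCGCGCGCGCAAAAAAAAAACCCCCCCCCCACGTACGTAC"  # toy sequence
    pattern = promoter_pattern(toy, window=30, step=1)
    print(len(pattern), center_of_weight(pattern))
```

Plotting one such center per promoter, grouped by chromosome, gives the kind of distribution described in the text.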

Results
We first investigated whether some promoter patterns occur more often on certain chromosomes. Secondly, we determined whether chromosome territories could be revealed by using Kappa IC. In the third analysis, we examined the distribution of Kappa IC values against the number of genetic diseases associated with each chromosome.  Figure 2B). The order of chromosomes by promoter Kappa Index of Coincidence is shown in Figure 2C,D. Interestingly, chromosomes X and Y contain promoters with the lowest (C+G)% and Kappa Index of Coincidence values. Promoter regions with the highest Kappa Index of Coincidence values (i.e., chromosomes 4, 5, 7 and 21) contain various SSR and STR structures (Figure 2B). This further suggests that in their evolution, promoters located on these chromosomes experienced few point mutations and accumulated more Slipped Strand Mispairing (SSM) mutations [53].
In contrast, promoter regions with the lowest Kappa Index of Coincidence values (i.e., chromosomes Y, X, 12 and 8) contain more interspersed nucleotides (A, T, C, G ≈ 25% each) and fewer SSR and STR structures (Figure 2B). Accordingly, this further suggests that in their evolution, promoters located on these chromosomes have accumulated a multitude of random point mutations, thus breaking SSR structures such as poly(dA:dT) or poly(dC:dG) tracts [54,55] into shorter elements. Although without immediate consequences, point mutations that occur in promoter regions gradually change gene expression patterns and, consequently, the relations of the affected genes within certain biological pathways.

Heterochromatin and euchromatin are two main evolutionary forces
Chromosomes such as 1, 9, 16 or the Y chromosome contain large regions of constitutive heterochromatin [56][57][58]. In terms of evolution, across generations the X chromosome is also occasionally a part of heterochromatin (the Barr body). Our results suggest that promoters located on chromosomes which contain regions frequently included in heterochromatin seem to exhibit only average to low Kappa Index of Coincidence values (Figure 2B). This further suggests that, among other roles, heterochromatin also acts as a shield for the inner core against point mutations originating from outside the nucleus. Although controversial, the "bodyguard" model [59] of heterochromatin appears to be partially true, not as a purely protective role but rather as a layered evolutionary mechanism in which some vital regions of the genome are exposed for rapid phenotypic changes (i.e., tissue-specific genes) and those regions which need less change are more protected (i.e., housekeeping genes). It is known that mammalian housekeeping genes evolve more slowly than tissue-specific genes [60]. Furthermore, it is also accepted that non-coding regions suffer more mutations than coding regions [61]. In evolutionary terms, chromatin structure may influence the distribution of point mutations or other mutational events in the promoter sequence. A chromatin-dependent distribution of point mutations can lead to a gradual shift in gene expression. Gene promoters located mainly inside the euchromatin domain remain prone to stable SSM mutations, favoring the maintenance of SSR or STR structures in the promoter regions. For instance, poly(dA:dT) tracts inside promoters were often associated with high gene expression levels, while a disruption of poly(dA:dT) tracts into shorter elements had the opposite effect [62]. Although SSM mutations may appear with an equal probability in all promoters during DNA replication, it seems that only SSRs or STRs of promoters stored inside euchromatin are preserved.
Accordingly, functional SSRs or STRs of promoters stored inside heterochromatin are gradually degraded by point mutation events. In most organisms, constitutive heterochromatin is usually associated with chromosomal areas of repetitive DNA sequences (commonly around the chromosome centromere and near telomeres), which seem to confer an overall trigger pattern for a tight colloid-like formation between nucleosomes [63,64]. However, functional areas (promoters and genes) that have a lower predisposition for tight nucleosome packing are more susceptible to point mutations inside heterochromatin than classical repetitive DNA sequences. Based on the overall promoter-chromosome specificity distributions (Figure 2), our hypothesis for a possible evolutionary dynamics of the eukaryotic nucleus would imply a permanent exchange of DNA areas between heterochromatin and euchromatin domains (Figure 3). Inside heterochromatin (Figure 3A), DNA repetitions degraded by point mutations lose their overall ability for tight nucleosome packing. Inside euchromatin (Figure 3B), SSM mutations favor DNA repetitions, which, over time, gain a predisposition for tight nucleosome packing, ultimately allowing for heterochromatin formation. Nevertheless, in such a hypothesis the selection pressure may decide the speed at which some DNA areas are brought to the surface into the heterochromatin landscapes.

Chromosome territories in humans
What surprised us in particular was the symmetry of the chromosome order when chromosomes are arranged by promoter Kappa IC values (Figure 2D, blue "amphora"-shaped semi-circles). Generally, chromosomes were numbered according to their size. In Figure 2D we show an abstracted model in which chromosomes are ordered by Kappa IC values of promoters (colored in blue). In this model, however, the blue arrows follow the order of chromosomes according to their size (starting from chromosome 4, which contains promoters with the highest Kappa IC values). Thus, the arrows that connect more distant chromosomes in this order have a proportionally larger semi-circle radius (a radius proportional to the relative distance between them). Nevertheless, the apparent 2-fold symmetry on the Y-axis (between chromosomes 4-11 and chromosomes 19-Y) further suggests that there is a correlation between chromosome length and the structure of gene promoters located on them (Figure 2D and Additional file 5). In addition, by complying with the same rules described above, when chromosomes were ordered by (C+G)% values of promoters, we could not observe any obvious symmetries (Figure 2D, red arrows). Figure 2C shows the order of chromosomes and their position relative to one another when they are arranged separately by the two values.
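The layout logic of this abstracted model can be sketched in Python as follows. This is a hypothetical illustration only: the function names are invented, the labels and values are toy placeholders rather than results from this study, and the radius is assumed to be half the distance between the two positions on the Kappa IC ordered axis (the radius of a semi-circle joining them).

```python
from __future__ import annotations


def kappa_order(mean_kappa: dict[str, float]) -> list[str]:
    """Chromosome labels sorted from the highest to the lowest mean promoter Kappa IC."""
    return sorted(mean_kappa, key=mean_kappa.get, reverse=True)


def arc_radii(order_by_kappa: list[str], order_by_size: list[str]) -> list[tuple]:
    """Semi-circle radius for each arrow joining size-consecutive chromosomes.

    Chromosomes sit at integer positions along the Kappa IC ordered axis;
    an arrow joins each chromosome to the next one in size order.
    """
    position = {label: i for i, label in enumerate(order_by_kappa)}
    arcs = []
    for first, second in zip(order_by_size, order_by_size[1:]):
        distance = abs(position[first] - position[second])
        arcs.append((first, second, distance / 2))   # assumed: radius = distance / 2
    return arcs


if __name__ == "__main__":
    # Toy placeholder values, NOT measurements from the paper.
    toy_kappa = {"chrA": 30.0, "chrB": 10.0, "chrC": 20.0}
    toy_size_order = ["chrA", "chrB", "chrC"]        # largest to smallest, hypothetical
    order = kappa_order(toy_kappa)
    print(order)
    print(arc_radii(order, toy_size_order))
```

A symmetric set of radii around the middle of the axis would correspond to the kind of 2-fold symmetry described for the Kappa IC ordering.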
Chromosomal territories have cell-type specificity [65]. Relying exclusively on sequence composition, our promoter distributions may show which chromosomes are most frequently adjacent inside the nucleus in G0 phase. The human genome codes for ~2600 transcription factors [66]. However, the number of available transcription factors (and consequently the number of transcription factories) expressed at any given time is relative to each cell type. Genes located relatively close to each other in the nuclear space have a greater probability of being incorporated into the same transcription factory [67,68]. In this regard, our results suggest that gene promoters with similar structures (i.e., similar DNA-binding sites and SSRs) seem to be included in the same transcription factories. This further implies that genes with different promoter structures, although close in the nuclear space, may be included in different transcription factories. Interestingly, the order of chromosomes by promoter Kappa IC values partially coincides with the chromosomal territories of human fibroblast nuclei in G0 phase observed by Bolzer et al. [69] (Figure 4A). The MDS (multidimensional scaling) plot from Bolzer et al. provides a 2D distance map of the mean locations of the IGCs (fluorescence intensity gravity centers) of all heterologous chromosome territories (CTs) established from 54 G0 nuclei. Here, we notice some similarity of distribution for certain groups of chromosomes, such as chromosomes 1 and 4 or chromosomes 11 (containing beta globin gene clusters) and 16 (containing alpha globin gene clusters) (Figure 4A,B). In order to obtain an overview of this correlation with the results presented by Bolzer et al. regarding the mean locations of chromosomes in G0 phase (Figure 4A), we subdivided their distribution into two main sectors.
We chose two circular perimeters: a first perimeter (perimeter 1), which incorporates the chromosomes found at the extremity of their distribution, and a smaller circular perimeter (perimeter 2), which includes the chromosomes that are closer to the zero point (the middle of the chart). In our distribution (Figure 4B), we marked all points present in perimeter 1 with green dots and all points present in perimeter 2 with red dots. We noticed that the peripheral dots (red) in our distribution correspond to the perimeter 2 area of the Bolzer et al. distribution, whereas the central dots (green) in our distribution correspond to perimeter 1 of the Bolzer et al. distribution. Furthermore, the interchromosomal contact probabilities between pairs of chromosomes presented by Lieberman-Aiden et al. [70], showing that chromosomes 16, 17, 19, 20, 21 and 22 preferentially interact with each other, were also correlated with our results. In our distribution of gene promoters, these chromosomes are located very close to each other and are approximately joined by a single diagonal line (except chromosome 22, which is slightly below chromosome 19; see Figure 4B), suggesting a similar conclusion. Although many factors may be involved, this comparison of observed vs. calculated positions suggests that the DNA sequence composition dictates the overall positions of chromosomes in G0 phase. In this regard, areas of chromosomes that contain gene promoters with common structures (i.e., Kappa IC and (C+G)% values) seem to position themselves next to each other, relative to each cell type. A more detailed distribution of promoters belonging to each chromosome is shown in Figure 5, which may further detail the chromosomal areas of interaction.

Promoter Kappa IC values vs. genetic diseases
A more intriguing association was found between the number of genetic diseases per chromosome and promoter Kappa IC and (C+G)% values (Figure 6A,B). Although the number of genetic diseases associated with individual chromosomes may exceed several hundred, we used a list of common types of genetic diseases provided by NCBI [71]. It seems that high values of Kappa IC and (C+G)% of gene promoters are directly associated with the number of classic genetic diseases. Exceptions to this relative proportion are chromosomes 21, 22 and X, which exhibit asynchronous values between Kappa IC, (C+G)% and the number of common genetic diseases per chromosome (Figure 6A,B).

Discussion
Gene promoters are located upstream of the TSS (transcription start site). A typical promoter region consists of a core promoter and regulatory domains. The association of transcription factors within a promoter precedes RNA synthesis [72]. Accordingly, the structure of a promoter is recognized by the presence of known promoter elements, such as the TATA-box, GC-box, CCAAT-box, BRE and INR box [73]. In order to elucidate evolutionary relationships, many comparisons have been made between gene promoters of different species. Nevertheless, correlations between promoters of genes located on different chromosomes of the same species have been poorly studied. In this regard, we chose a different approach to analyze promoter sequences, using two-dimensional image-based patterns obtained through Kappa Index of Coincidence (Kappa IC) and (C+G)% values [74]. Each pattern is composed of vertically aligned clusters of Kappa IC (y-axis) and (C+G)% (x-axis) values. The vertical positions of these clusters form a promoter pattern which has a specific form for each promoter sequence. Their shape is explained by the presence of different structures such as simple sequence repeats (SSRs) or short tandem repeats (STRs). In order to investigate a possible relationship between promoters of genes located on different chromosomes, we plotted the center of weight of 1200 promoter patterns (Figure 5A-X). The center of weight of each promoter pattern indicates an average between all SSRs and STRs present in the promoter sequence. An explanatory model of an image-based promoter pattern can reveal some visual insights into different promoter regions, such as the locations of all SSRs and STRs (Figure 7A-F). We have also noticed the directions and the angles of these promoter distributions, which may suggest an evolutionary tendency (Figure 1D).
The haploid human genome comprises approximately 3.2 billion base pairs of compacted DNA within a nuclear volume of approximately 1000 μm³ [75][76][77]. Nucleosomes compact and regulate access to DNA by assuming specific positions [78,79]. The interaction between nucleosomes that incorporate functional sequences located at great distances inside the nucleus is provided by a favorable positioning of other nucleosomes that incorporate non-coding sequences. Accordingly, an overall picture begins to take shape, namely that the evolutionary process cannot tolerate non-functional information. Although many studies show that refined mechanisms involved in the dynamics of the nucleus are ATP (adenosine-5'-triphosphate) dependent processes [80,81], we wondered whether self-organization processes and other biophysical phenomena could be even more involved than previously thought. Nevertheless, DNA-guided self-organization processes that may concern chromatin mobility will be of utmost importance for our understanding of the dynamics of the nucleus.
In a recent study, we suggested that eukaryotic genomes may exhibit at least 10 classes of promoters [82]. In future research we wish to highlight the distribution of these promoter classes on each chromosome. Furthermore, we are also interested in observing the differences between Kappa IC values of introns and exons related to each chromosome, in order to understand whether the relative proportions presented here will remain constant.

Conclusions
In this paper a comprehensive analysis was undertaken of promoter sequences from Homo sapiens. In our approach we used 1200 promoter sequences (50 random promoters from each chromosome) from the Transcriptional Regulatory Element Database. In order to measure the structural similarity of gene promoters, we used two-dimensional image-based patterns obtained through Kappa Index of Coincidence (Kappa IC) and (C+G)% values. The center of weight of each promoter pattern indicated an average between all SSRs and STRs present in the promoter sequence. A distribution of these average values showed that gene promoters appear to be specific to each chromosome. Furthermore, the proximity between chromosomes seems to be in accordance with the structural similarity of their gene promoters. Although chromosomes are positioned differently depending upon each cell type, they exhibit a predisposition for a standard arrangement. High Kappa IC and (C+G)% values of gene promoters were also directly associated with the most frequent genetic diseases. Taking into consideration these observations, a general hypothesis for the evolutionary dynamics of the genome has been proposed. In this hypothesis, heterochromatin and euchromatin domains exchange DNA sequences according to a difference in the rate of mutations.