Implications of human genome structural heterogeneity: functionally related genes tend to reside in organizationally similar genomic regions

Background In an earlier study, we hypothesized that genomic segments with different sequence organization patterns (OPs) might display functional specificity despite their similar GC content. Here we tested this hypothesis by dividing the human genome into 100 kb segments, classifying these segments into five compositional groups according to GC content, and then characterizing each segment within the five groups by oligonucleotide counting (k-mer analysis; also referred to as compositional spectrum analysis, or CSA), to examine the distribution of sequence OPs in the segments. We performed the CSA on the entire DNA, i.e., its coding and non-coding parts the latter being much more abundant in the genome than the former. Results We identified 38 OP-type clusters of segments that differ in their compositional spectrum (CS) organization. Many of the segments that shared the same OP type were enriched with genes related to the same biological processes (developmental, signaling, etc.), components of biochemical complexes, or organelles. Thirteen OP-type clusters showed significant enrichment in genes connected to specific gene-ontology terms. Some of these clusters seemed to reflect certain events during periods of horizontal gene transfer and genome expansion, and subsequent evolution of genomic regions requiring coordinated regulation. Conclusions There may be a tendency for genes that are involved in the same biological process, complex or organelle to use the same OP, even at a distance of ~ 100 kb from the genes. Although the intergenic DNA is non-coding, the general pattern of sequence organization (e.g., reflected in over-represented oligonucleotide “words”) may be important and were protected, to some extent, in the course of evolution.


Background
Heterogeneity of the structural characteristics of genomic sequences, such as GC content (isochores), CpG distribution, copy-number variation, and repetitive DNA content has been the subject of numerous studies for decades. Other studies have been devoted to heterogeneity of functional and evolutionary features of the genome, including protein-and non-protein-coding DNA, codon usage, developmental stage-and tissue-specificity of gene expression, distribution of conserved and ultraconserved regions, recombination and mutation hot and cold spots, and linkage disequilibrium blocks [1][2][3][4][5][6][7]. Many genomes have been sequenced and are available for further analysis. Nevertheless, the coding DNA remains the genome's most investigated component. Analyses of its structure and function have been the basis for a wide variety of studies, ranging from the analysis of functionally related gene groups and gene-alignment-based interspecies comparisons [8] to analysis of gene-adjacent regulatory sequences [9].
The simplest structural characteristic of genomic sequences is their nucleotide composition. Relatively good correspondence of nucleotide composition with the Chargaff's second parity rule [10] enables disregarding certain differences in the within-strand contents of G vs. C, and A vs. T, and limiting nucleotide-composition studies to an analysis of the molar proportion of G+C in DNA, or GC content. GC content shows high heterogeneity along the genome and correlates with many genomic features, such as recombination rate, abundance of single nucleotide polymorphisms (SNPs), and different types of repetitive elements [11][12][13]. Of special interest is the wellknown correlation of GC content with gene density [12,14]. Furthermore, GC-rich regions contain many compact genes with short introns, whereas genes in GC-poor regions tend to contain larger introns [15]. A correlation between GC content and gene expression has been found as well [11]. It is generally accepted that broadly expressed (housekeeping) genes typically reside in GCrich regions, although the correlation strength between gene-expression specificity and regional GC content may vary depending on the method used to estimate expression [16]. Furthermore, tissue-specificity of genes varies with their GC content; for example, genes specifically expressed in the central nervous system are more GC rich than housekeeping genes, whereas genes related to germ-line tissues tend to be GC-poor [17,18].
Recent studies have indicated that regulatory sequences of functional gene groups differ in the genome's GC-poor and GC-rich regions. In addition, sequences that might influence nucleosome positioning and density differ between these two contrasting regions (for review, see [19]). Moreover, different functional gene groups have contrasting base compositions [20], which might explain the relationship between genes' tissue-specificity and their local GC content. In summary, investigations have shown a correlation between isochore GC content and the resident genes nucleotide composition and functioning.
Analyses of di-and trinucleotide frequencies in five GC-isochore families of the human genome showed unexpected organizational differences between whole isochore sequences, with the corresponding intergenic and coding sequences located in different isochores, in exons and introns [21]. Similar differences were found in gene regulatory regions and in local sequences that might influence nucleosome positioning and density [19]. These differences in the abundance of short oligonucleotides might be related to chromatin organization, which itself plays a role in gene expression and replication timing. An important finding was that genome structural heterogeneity might affect the distribution of gene categories on a larger scale than the classical isochores [22].
In earlier studies, we used the oligonucleotide-counting method (k-mer analysis), referred to as compositional spectrum analysis (CSA) for alignment-free genome comparisons [23][24][25]. We recently employed this approach in an investigation of organizational heterogeneity of vertebrate genomes with a special focus on the human genome [26]. We considered two types of heterogeneity: compositional (variability of sequence nucleotide frequencies) and organizational (variability of sequence nucleotide orders). A compositional spectrum (CS) comparison of sequences with the same (or similar) GC content can detect groups of genomic segments with very different organizational patterns (OPs). We were interested in testing whether the OP of a genome region affects the type of genes residing there, i.e. whether functionally related genes tend to inhabit regions with similar OPs. To test this, we arbitrarily divided the human genome sequence into 100 kb segments and then classified them into five compositional groups according to their average GC content. For each such GC range, we identified large groups of segments that differed in their CS organization (referred to as OP groups), and compared the genes residing in segments of the same OP group; 13 of the 38 OP groups showed significant enrichment in genes connected to specific geneontology (GO) terms. Thus, one of the analyzed groups was considerably enriched in genes connected to the GO terms "mitochondrion" and "ribonucleoprotein complex". Another OP group was enriched in genes related to a few GO terms: "epithelial cell differentiation", "epithelium development", and "keratinocyte differentiation". These findings enabled us to examine the relationship between gene function and CS organization of the associated genome regions.

Results
As expected, OP variability could be detected within each compositional group of segments [26]. Correspondingly, segments from each GC range were subdivided into clusters according to OP similarity. Altogether, we identified 38 different OP groups in the five compositional groups. A substantial proportion of the segments contained protein-coding genes (Table 1; see also Additional files 1, 2, 3 and 4 in the supporting material for more details). Most of the genes were located in segments of the L2, H1, and H2 compositional groups with moderate GC content: 37-52% (we employed the names used in the literature for denoting groups with corresponding GC content). As already known and discussed [14,27], there is a strict correlation between GC content and gene density. We found that L2, H1, and H2 compositional groups had higher OP complexity than L1 and H3; therefore, they could be subdivided into more OP clusters. This last trend can be partially explained by simple combinatorial rules based on A-C-T-G nucleotide distribution [26]. We found that each OP group contained segments that were widely spread among and within different chromosomes. It is worth noting that our CSA addressed the whole segment sequence, including both the gene sequences and intergenic DNA. (see Additional file 1, 2, 3, and 4).

Different OP groups are enriched in different functional gene categories
Previous studies have clearly demonstrated that compositionally different genomic segments (GC-rich vs. GCpoor) may also differ with respect to their genes' functional categories. Thus, a considerably higher proportion of housekeeping genes was found in GC-rich vs. GC-poor human genome regions [17]. We hypothesized that genomic segments with different OPs would display functional specificity despite GC-content similarity [26]. To test this, CSA was applied to 100 kb segments in the five compositional groups with GC ranges corresponding to the classical isochores. We found that segments sharing the same OP type are significantly enriched in genes related to the same biological process (developmental, signaling, etc.), biochemical complex, or organellar components (although this did not exclude enrichment with genes connected to other categories). These relations were identified by screening 38 OP-type clusters: one-third showed significant enrichment for genes connected to specific GO terms. The main findings are illustrated by several examples from the compositional groups L2, H1, and H2 presented in Figure 1, and more comprehensive results are shown in Additional files 1, 2, 3 and 4. We also tested genome organization patterns using CSA based on the two-letter alphabet (purines, R = G or A and pyrimidines, Y = C or T). In this CSA version, the abundance of 20-mer words was counted and compared to that of 10-mer words in the full A, T, G, C alphabet [24,25]. Although the results were not identical, they are similar and highly correlated.

Repeat-masking test
Bearing in mind that repetitive DNA sequences comprise a considerable proportion of the human genome, we checked for an influence of these repeats on OP group size and GOterm enrichment by employing the RepeatMasker tool (Smit, AFA, Hubley, R & Green, P. RepeatMasker Open-3.0 1996-2010 unpublished data <http://www.repeatmasker. org>) on segments with GC = 42-47%. In addition, we conducted OP-group identification within the repeat-masked (RM) sequence, and analyzed the enrichment of the resulting OPs by genes related to the same GO terms. Differences in oligonucleotide frequencies between OP groups might come from repeat sequences; therefore, the two tests cannot be expected to generate identical patterns. Nonetheless, the results in the RM test seemed quite similar to those in the original test. In the original test, we found that 5 of the 15 OP groups from the GC = 42-47% range were enriched for 30 GO terms (connected to 186 genes); for RM sequences, 6 out of 16 OP groups were enriched for 46 GO terms (connected to 291 genes). Two out of the five OP groups that were found to be significantly enriched for specific GO terms in the original test fully coincided with those in the RM test. In total, 2/3 of the GO terms enriched in certain OP groups in the original test were also enriched in certain OP groups under the RM test. We believe that some of the repeat sequences, like other inter-genic DNA (e.g., the "genome dark matter" [1][2][3]), might be important for regulation of gene expression or other processes in which DNA sequences of nearby genes are involved (replication, DNA repair, etc.). Therefore, we decided that it would make more sense to show the CSA results obtained for the whole DNA, including the repeats.

Enrichment of OP cluster from L2 group for the GO term "mitochondrion"
A subgroup corresponding to the L2 compositional group with similar OP was termed as L2-a cluster (L2-a OP group). Out of 392 genes harbored by this group, 39 were related to the GO term "mitochondrion" (Benjamini pvalue: 7.67 × 10 −4 ) and are distributed among 34 loci of 15 chromosomes (see Figure 1). More than 900 chromosomal genes encoding human proteins are considered to be targeted to the mitochondria [28]; this number may be even larger [29]. Many of the chromosomal genes encoding for these mitochondrion-targeted proteins might have originated from the ancestral alpha-proteobacteria symbiont genome (the suggested mitochondrion progenitor) and been transferred to eukaryotic chromosomes by horizontal gene transfer (HGT) [30,31]. It is now widely accepted that genes related not only to metabolism, but also to producing the eukaryotic membranes and nucleus were transferred from the endosymbiont [32,33]. The GO term "intracellular non-membrane-bounded organelle" was also significantly enriched in the L2-a OP cluster (connected to 70 genes; Benjamini p-value: 0.00771). The GO term "nuclear envelope" was connected to 11 genes within the L2-a cluster, albeit with a higher Benjamini p-value (0.0522) indicating that among these 11 genes, no more than one might be considered a false positive (see Additional file 2). Twenty-one genes in the L2-a OP cluster were found to be connected to the GO term "ribonucleoprotein complex" (Benjamini p-value: 0.00206) and were located in 10 chromosomes (19 loci). Some of these genes' products are indeed mitochondrial ribosomal proteins: MRPS14, MRPS16, MRPS18C, MRPL42, MRPL1, MRPL39, and MRPL30. However, the group also includes non-mitochondrial genes, for example, SRA1. Transcripts of SRA1 have both coding and non-coding co-activator activities; along with SRA1 protein product SRAP, it performs mixed co-activator/repressor functions in differentiation and metabolism. Other examples are DDX4, which functions in germ-cell development, and WDR12, a WD40 repeat protein that is crucial for processing 32S precursor ribosomal RNA (rRNA) and for cell proliferation. Also worth noting is that in the L2 group, RY-based CS analysis revealed an OP cluster very similar to L2-a with respect to GO-term enrichment; 9 of the 14 GO terms that were found to be enriched in this RY OP cluster coincided with the terms enriched in the L2-a OP cluster obtained using k-mer CSA in the standard four-letter alphabet (see Additional file 2).
OP cluster from L2 group enriched in G-protein-coupled receptors (GPCRs) The cluster termed L2-h included 126 genes. Twenty eight of these genes, which were connected to GO term "G-protein-coupled receptor protein signaling pathway" (Benjamini p-value: 0.0026), were located in 11 loci on seven chromosomes (see Figure 1). Moreover, 19 of these genes were also connected to GO term "sensory perception of smell" (Benjamini p-value: 0.00323) and encoded olfactory receptors (ORs). The OR subgroup of GPCRs is one of the largest mammalian genome superfamilies. In the human genome, this group includes approximately 960 genes, although~51% of them are actually pseudogenes [34]. Each OR gene is approximately 1 kb in length, intronless, and found in clusters on almost all chromosomes. In addition, like all GPCRs, each OR gene shares a common molecular architecture consisting of seven transmembrane domains.

H1 cluster related to epithelial cell differentiation
The OP cluster H1-i included 326 genes, 21 of which were connected to the GO term "epithelial cell differentiation" (Benjamini p-value: 2.83 × 10 −9 ); this group and four additional genes were connected to the GO term "epithelium development" (Benjamini p-value: 6.78 × 10 −7 ). Although the 25 genes were located in eight loci of five different chromosomes, 18 were clustered on chromosome 1q, including 16 late cornified envelope (LCE) genes that were also connected with the GO term "keratinocyte differentiation" (Benjamini p-value: 4.07 × 10 −9 ). Cluster H1-i also included genes that have additional roles in the epithelium as well as in differentiation and maturation of other tissues (see Figure 1, Additional file 3). Some examples of these include: SPRR4, induced by ultraviolet light and other environmental stresses [35]; NOTCH2, known to delay hepatoblast maturation during early hepatic organogenesis, and JAG1 playing a role in hematopoiesis [36]; GREM1, which is involved in regulating organogenesis, body patterning, and tissue differentiation [37]; ONECUT1, which encodes a transcription factor mediating complex processes in the liver and pancreas related to cell proliferation, cell-cycle regulation, cell differentiation, and organogenesis [38], and AGT, reported to be involved in the epithelial-tomesenchymal transition in renal epithelial cells [39] and in maintaining blood pressure [40] (see Additional file 3).

A developmental OP cluster in the H2 group
One of the clusters from the H2 segment group (termed H2-a) included 606 genes, 40 of which proved to be connected to the GO term "skeletal system development" (Benjamini p-value: 9.7 × 10 −10 ). These genes were located in 18 loci on 12 chromosomes, with the largest proportion represented by HOX genes (see Figure 1, Additional file 4). HOX proteins are transcription factors (TFs) with a 60amino-acid-long DNA-binding homeodomain. They can function as enhancers or repressors, and many of them participate in morphological or developmental pattern regulation. HOX genes are located in four chromosomal loci (2q31, 7p15, 12q13, and 17q21), originating from duplication of a single ancestral cluster [41,42]. The H2-a OP cluster included additional developmental genes: PAX7, a TF gene with a paired-type homeodomain that plays a critical role during fetal development; RING1, which encodes a TF associated with the multimeric polycomb protein complex; WNT3A and WNT7A implicated in oncogenesis and in several developmental processes, including cell-fate regulation and patterning during embryogenesis; LEFT2 which plays a role in organ system developmental left-right asymmetry determination; MMP9, encoding a matrix metalloproteinase involved in embryonic development, reproduction, and tissue remodeling; NOTCH4, involved in developmental processes by controlling cell-fate decisions and interaction between physically adjacent cells; RET, playing a crucial role in neural crest development; HES7, a TF implicated in correct axial skeleton patterning; and HDAC4, encoding histone deacetylase which represses transcription when tethered to a promoter. As with the L2-a OP cluster, 13 of the 17 GO terms found to be enriched in the RY OP cluster coincided with 13 of the 24 GO terms enriched in the H2-a OP cluster, obtained using the standard four-letter alphabet (see Additional file 4).

Additional interesting cases of GO-term enrichment in specific OP clusters
The following are further interesting examples of functionally significant OP clusters from the L2 and H1 compositional segment groups. More details, including the chromosomal loci and the names of included genes, can be found in Additional files 1, 2, 3 and 4.
(i) L2-b and H1-m clusters enriched for GO terms "keratin filaments" (20 genes; Benjamini p-value: 0.00210), and "intermediate filament" (19 genes, mostly keratins; Benjamini p-value: 8.40 × 10 −9 ), respectively. This example demonstrates that functionally similar OP clusters can be found in diverse GC groups, implying that in such cases, OP rather than GC content plays a role in the positioning of the corresponding genes.

Non-randomness of enrichments
The aforementioned results prompted the question of whether similar function-related detection levels would be obtained if groups of the same size as the original OP groups were built by taking the segments at random within the same GC limits. This question was addressed in the following way. Within each of the three main GC ranges (L2: 37-42%, H1: 42-47%, and H2: 47-52%), we randomly distributed the 100-kb segments into groups with the same sizes as the 31 OP groups obtained by CS comparison. This was repeated 10 times and the resultant 310 random groups of segments were checked for enrichment in genes related to specific GO terms. Such enrichments were found in 65 of the 310 groups, which is significantly less than the corresponding proportions in real OP groups, where 13 out the 31 groups were enriched by functionally related genes (p-value = 0.0106 by the Fisher two-tailed exact test). We further compared the OP groups and random groups with respect to obtained enrichment significance by GO term (Table 2), by applying the Mann-Whitney U test [43] to compare the distribution of -log 10 (p-value) scores in OP groups and in random groups. Enrichment in the OP groups was characterized by considerably higher -log 10 (p-value) scores than in the randomly obtained groups (adjusted Z = 2.719, P = 0.0065). We also analyzed three additional characteristics of the OP and random groups, which were found to be enriched in at least one group of functionally related genes. These included: (i) the number of GO terms per "non-empty" OP group or its random analogue with the assumption that GO-term enrichment for real OP groups should be higher than for randomly formed groups; (ii) the number of segments containing functionally related genes connected with a GO term(s); here we assumed that enrichment due to high numbers of such "involved" segments reflects a positive correlation between the genes' shared functional relevance and their shared residence in genome-wide distributed segments with similar sequence organization; in random clusters on the other hand, such correlations could arise only due to closely linked (within 100 kb) genes; (iii) the ratio of the involved segments to all segments in the cluster (a normalized variant of criterion ii). For all three criteria, the randomly combined groups showed significantly lower enrichment compared to the real OP groups formed from 100 kb segments using k-mer analysis to assess sequence similarity (see Table 2).

Discussion
GC-content variation within the genome displayed in the form of isochores [14,44], and its function in human and other vertebrate genomes have been targeted in many studies. In this analysis, we focused on another aspect of genome organization: possible functional correlates of human genome "organizational heterogeneity" [26], displayed in the variation in the abundances of different oligonucleotide "words" (k-mers). In particular, we were interested in testing whether region-specific "word" usage (regional variation of the genome "accent" [25]) may have functional or evolutionary implications, or both. We used CS analysis [24] to identify clusters of 100-kb segments with similar OPs within five major GC groups (with the same GC-content as in the main isochore groups). We further looked for GO-term enrichment in each of the OP clusters. We revealed that, in many cases, OP clusters were significantly enriched in genes involved in the same biological process (developmental, signaling, etc.). There were also cases of genes within OP clusters involved in the same biochemical complex, or organelle. Moreover, organizational similarity of the clustered 100-kb segments and functional similarity of genes belonging to these segments were observed despite the dispersed genome-segment locations. As the calculations were made on the whole chromosomal sequence, the underlying OPs should include both coding and non-coding DNA (introns and inter-genic DNA), the latter being much more abundant in the genome than the former (with a relative proportion of approximately 20:1). Therefore, we concluded that there might be a tendency for genes involved in the same biological process/complex/organelle to use the same OP, even far away-up to 100 kb-from the genes. We do not know the mechanism underlying this similarity, but these sites might share preferred DNA-repair mechanisms with a resulting similar bias for specific use of nucleotides and oligonucleotide "words" [25]. Another possibility is that genes of the same OP cluster are controlled by similar regulatory sequences (see also Discussion in [26]). In this case, although the intergenic DNA is non-coding, the general sequence organization (the over-represented oligonucleotide words, or the "accent" [25]) may be very important and should be conserved. The complexity of the human genome involves many layers of large-scale duplications (segmental, chromosomal, and whole genome) and expansion of the gene repertoire of the predecessor genomes by HGT. Some of our OP clusters might indeed reflect certain events during HGT and genome-expansion periods, and the subsequent evolution of genomic regions requiring coordinated regulation: (a) HGT from the alpha-proteobacteria symbiont (mitochondria) to the primitive eukaryotes [30,31] that may, to some extent, be related to the L2-a OP cluster; (b) the emergence of vertebrates from chordates by the two-round (2R) whole genome duplication (WGD) [45][46][47] relevant to the H2-a OP-type cluster; (c) the segmental duplications that enabled adaptation of vertebrates to life out of water, including expansion of olfactory receptor genes [48][49][50] relevant to the L2-h OP cluster; and (d) expansion of gene families related to the evolution of mammalian skin (H1-i OP cluster) that is better adapted to water homeostasis than that of amphibians [51,52]. Regarding the last statement, aside from the LCE genes, GO terms related to keratins were found to be significantly enriched in some other OP clusters (L2-h, L2-f and H1-m; see section "Additional interesting cases" in the Results, and Additional files 2, 3 and 4). In the following, we discuss the possible evolutionary meaning of the corresponding findings.

Evolution of genome regions harboring "mitochondrionrelated" genes
The deepest genome expansion layer potentially reflected in the CSA results is the mitochondrion-related gene enrichment in the L2-a OP cluster. HGT from the alphaproteobacteria symbiont to its eukaryotic host-cell genome happened 2 BYA. Some of the complexities and the mixedup eukaryote genome might have resulted from the fact that bacteria that gave rise to mitochondria did not shrivel up into ATP-producing factories. Instead, many of their genes were transferred to their hosts' genomes [30][31][32][33] and, therefore, might subsequently have participated in producing the eukaryotic membranes and nucleus [32,33]. In addition to the coding-gene transfers that benefitted the host, there were probably many cases of DNA sequence insertion that could not be expressed (if the insertion was to chromosomal sites lacking the correct regulation signals for the host transcription machinery).
A preferential insertion of large, mitochondrial-origin DNA fragments (average insertion size 1.3 ± 0.73 Mb) to pericentromeric and subtelomeric regions of human chromosomes was recently suggested by Moon et al. [53]. However, as about half of the mitochondrion-related genes and pseudogenes are not clustered (they are spread over different chromosomes) [54], it seems that these genes were rearranged during the subsequent genome evolution by non-homologous end-joining repair of DNA double-strand breaks [53]. Our CS analysis was performed on 100-kb-length segments; hence both genes and intergenic DNA might be included on the same segment. Therefore, with regard to mitochondrial-origin genes, the fragments transferred from the symbiont (genes and intergenic DNA) might have conserved their OP even further away from the genes themselves, at least within the range of 100 kb. Alternatively, and more likely, the extant gene locations do not directly represent the sites of ancient HGT events, but are secondary, resulting from rearrangements and biased reinsertions into the host chromosomal sites that share a similar OP. According to our results, the OP seems to be conserved in both scenarios (direct HGT or rearrangement).
If our interpretation is correct, then analyzing OP structure of additional eukaryotic genomes should reveal enrichment for the GO term "mitochondrion" in at least one OP type. However, such an analysis is beyond the scope of the reported investigation.

OPs and WGD
The refined 2R hypothesis assumes that two rounds of WGD took place after the emergence of urochordates and before the radiation of jawed vertebrates, some 550 MYA [45][46][47]55,56]. In almost all of the debates surrounding the 2R hypothesis, the HOX gene family which follows a 4:1 rule in the number of vertebrate to invertebrate genes is used by proponents as a supporting argument [46,47,57]. The duplication of genomic loci provided increased flexibility, allowing the generation of new expression patterns, and was critical for the emergence of morphological novelties [58,59]. This "regulatory evolution" of the four HOX gene clusters involved enhancer elements distant from their target promoters [47,60,61], where potential interference with modules of ancestral control was minimized [47]. We assume that the CSA results on HOX gene enrichment in the H2-a OP cluster reflect the aforementioned large-scale genomic expansion event and that OP conservation at a distance of ≤100 kb from the gene edges following these ancient duplication events is an important tier of process regulation.
OP and segmental duplications in the vertebrates' "out of water" evolution Expansion of several gene families occurred during the emergence of the class Amphibia, in the Devonian Period, about 375 MYA. This might be represented by the "signals" from cluster L2-h with its significant olfactory receptor (OR) enrichment. It has been suggested that expansion of the OR gene family was positively selected for in amphibians evolving from the fish lineage during adaptation to terrestrial environments. This assumption is based on variation in OR gene number: about 150 in zebrafish and 15 in pufferfish [50] as compared to 665 in Xenopus tropicalis and 700-1700 in terrestrial mammals [62,63]. Most of the X. tropicalis and mammalian OR genes are class II genes [62] that might be specialized for detecting airborne odorants [49], which outnumber waterborne odorants.
The duplication of skin-related LCE genes (OP cluster H1-i) seems to have occurred during the emergence of the class Mammalia about 200-120 MYA. The LCE gene cluster on 1q21 is located within a 2-Mb region called 'the epidermal differentiation complex' which also includes additional clusters of gene families encoding major proteins of late epidermal differentiation (S100 and S100fused type proteins, involucrin, loricrin, and the SPRRs (small proline-rich proteins) -a sister protein family of LCE [52]). Such protein family-clustered organization suggests duplication in ancestors adapting to changes in terrestrial conditions in the course of evolution [51,52]. It has been suggested that in mammals, the developed "grouping" of SPRRs is better adjusted to subtle cellular and environmental stimuli than a single or a few genes, as clusters could constitute an "extended promoter" [51]. This might also be true for the LCE genes responding "group-wise" to environmental stimuli, such as calcium levels and ultraviolet light, though only LCE3B and LCE3C encode proteins involved in barrier repair after injury or inflammation [64].
Similar to the afore-discussed major examples of OPfunction correlations, the examples provided in the section "Additional interesting cases", may also reflect important events in vertebrate evolution. Thus, cases (i) and (ii) (see Results) might relate to expansion events of protein/gene families related to keratin, the key structural material making up the outer layer of skin and hair. The expansion of keratin families occurred during the emergence of the Amniotes (340-306 MYA) and played essential roles in the formation of rigid and resistant hair shafts [65]. Case (iii) is an additional example of the expansion of ORs and other GPCRs. The example in case (iv) shows that histone genes from different chromosomal loci belong to the same OP clusters. Gene duplication has prevailed as the major mechanism in providing the eukaryotic cell with the required histone number and diversity [66]. Histone variability in multicellular organisms is required to accommodate the different packing needs and gene-expression regulation in different cell types and developmental stages.

Conclusions
In many examples, the similarity of the local microgeographical "accents", that is, gene-harboring segments belonging to the same OP cluster, seems to have derived from duplications of one origin. This might be a large-scale duplication, for example, WGD presumably reflected by the H2-a OP cluster, or segmental duplications presumably exemplified by the L2-h and H1-i clusters. Duplications of minor-sized chromosomal DNA stretches are more abundant and could have occurred due to unequal crossing-over between misaligned homologous chromosomes. Regardless of the exact scenario for the duplication event, the results point to conservation of DNA sequence OPs that are distant (10-100 kb) from the genes' edges. These sequences might be important in regulation of gene expression, or in other processes in which DNA sequences of nearby genes are involved (replication, DNA repair, etc.). Our results presume that many genes belonging to the same OP cluster do not share a common origin location. As we do not know the mechanisms for the preferred use of oligonucleotide "words", we can only suggest that some of these genes belong to the same "transcription factories" and, as such, may also share a DNA-repair mechanism, such as transcriptioncoupled repair. If transcription of these genes occurs at the same time, they might share the same regulation signals: short, moderate, or even long oligonucleotide "words" at a distance from the gene edges. Similar to a recent suggestion [67] that eukaryotic species' evolutionary transitions are associated with codon bias in genes encoding functionally related proteins, we suggest that not only the coding sequences, but also sequences at a distance from the duplicated genes (e.g.,~100 kb) might share the same "accent": there could be a bias in the repertoire of oligonucleotide "words" that might have been conserved in many cases during the course of eukaryote evolution.

Calculating compositional spectra
Consider a set W, including n different words (oligonucleotides), w i , with length L from the standard DNA alphabet {A, C, T, G}. For any word w i from set W and a chosen sequence S, the observed number of matches m = m (w i ) can be calculated with a certain number of allowed replacements (r), for example, 0, 1, or 2 (r = 2). Let M = ∑ i m i . We refer to the frequency distribution F(W, S) based on frequencies f i = m i /M as the compositional spectrum (CS) of sequence S relative to the set of words W. For every set of words W, it is possible to produce a set of complementary words W′, where the word w′ n is the complementary reverse of word w n [26].

Calculating CS distances between DNA sequences
We define the difference d between two sequences, S 1 and S 2, as the distance between their spectra F(W, S 1 ) and F (W, S 2 ). We use the Spearman rank correlation coefficient r s between two CSs computed on the compared sequences as a base for calculating distance d. The inter-or intragenomic similarities and dissimilarities can be displayed by distance matrices of pairwise CS comparisons. The distance matrices can be used to select similar segments and their clustering; for example, by using the neighborjoining algorithm.

Detection of OP groups of segments
To analyze the relationship between genes located in regions with the same OP, we classified all 100-kb segments into five groups according to their average GC content. We did not use the division of sequences into known isochore families [68,69], but we did use the same "borderline" GC content values for separating and designating the obtained GC groups by the isochore family names: L1: GC content < 37%, L2: 37-42%, H1: 42-47%, H2: 47-52%, and H3: GC content >52%. We conducted neighbor-joining clustering based on between-segment dissimilarities within each GC range to obtain groups of segments with relatively similar OPs. For each OP group, we detected all genes residing in the corresponding segments, according to their start and end positions obtained from the Ensembl genome browser (http://www.ensembl.org/). To identify the OP groups enriched in genes involved in the same biological process, we compared every gene list with all genes located in the corresponding GC group using DAVID (Database for Annotation, Visualization and Integrated Discovery [70]). To account for multiple testing, Benjamini corrected p-values [71] were employed for the obtained enrichments.