The footprint of metabolism in the organization of mammalian genomes
© Berná et al; licensee BioMed Central Ltd. 2012
Received: 26 July 2011
Accepted: 8 May 2012
Published: 8 May 2012
At present, five evolutionary hypotheses have been proposed to explain the great variability of the genomic GC content among and within genomes: the mutational bias, the biased gene conversion, the DNA breakpoints distribution, the thermal stability and the metabolic rate. Several studies carried out on bacteria and teleostean fish pointed towards the critical role played by the environment, via the metabolic rate, in shaping the base composition of genomes. In mammals the debate is still open, and evidence has been produced in favor of each evolutionary hypothesis. Human genes were assigned to three large functional categories (as well as to the corresponding functional classes) according to the KOG database: (i) information storage and processing, (ii) cellular processes and signaling, and (iii) metabolism. The classification was extended to the other organisms analyzed by performing a reciprocal Blastp search and selecting the best reciprocal hit. The base composition was calculated for each sequence of the whole CDS dataset.
The GC3 level of the above functional categories increased from (i) to (iii). This specific compositional pattern was found, as a footprint, in all mammalian genomes, but not in the frog and lizard ones. Comparative analysis of human versus both frog and lizard functional categories showed that genes involved in metabolic processes underwent the highest GC3 increment. Analyzing the KOG functional classes of genes, a well-defined intra-genomic pattern was again found in all mammals. Not only genes of metabolic pathways, but also genes involved in chromatin structure and dynamics, transcription, signal transduction mechanisms and cytoskeleton showed an average GC3 level higher than that of the whole genome. In the case of the human genome, the genes of the aforementioned functional classes showed a high probability of being associated with the GC-richest chromosomal bands.
Among the different evolutionary hypotheses proposed so far, each contributing to a different extent to the compositional heterogeneity of mammalian genomes, the one based on the metabolic rate seems to play a non-negligible role. Keeping in mind similar results reported in bacteria and in teleosts, the specific compositional patterns observed in mammals highlight the metabolic rate as a unifying factor that fits a wide range of living organisms.
As recently stated by Meyer and collaborators, "structure and organization of genomes belongs among the key questions of genome biology". One of the most crucial and widely debated questions concerns the nature of the forces driving the base compositional variation among genomes. At present, as many as five evolutionary hypotheses have been proposed to explain the great variability of the genomic GC content. They can be split into two groups on the basis of the nature of the forces driving genome evolution, i.e. intra- or extra-cellular. The former group, including the mutational bias, the biased gene conversion (BGC) and the DNA breakpoints distribution (BPR) hypotheses, is mainly founded on stochastic events arising during intracellular processes, such as DNA replication, repair and recombination. The latter, including the thermal stability and the metabolic rate hypotheses, takes into account the role of adaptive processes resulting from the interaction of the organism with the surrounding environment.
In the frame of the neutral theory, the mutational bias hypothesis [3–5] was first proposed to explain the great variation of the genomic GC content among bacteria, and was later extended to higher vertebrates. In the same frame, the BGC hypothesis is based essentially on a synergy between recombination events and a biased DNA repair system [7–9]. The BPR hypothesis considers that evolutionary rearrangement breakages happen with a uniform propensity along the genome. A growing body of evidence shows a heterogeneous distribution of breakpoints in mammalian genomes, occurring more frequently in the GC-rich regions, which harbor replication origin sites and are characterized by high transcriptional activity.
From the adaptive point of view, several environmental factors significantly correlated with the DNA base composition have been reported in bacteria: competition for metabolic resources, anaerobiosis, endosymbiosis, environments/habitats, growth temperature, and "aerobic respiration". In particular, the last two papers stressed the effect on the genomic GC content of the main factors affecting the environmental dynamics: temperature and metabolism. According to the thermodynamic hypothesis, an increment of environmental or body temperature triggers a GC increment, stabilizing DNA, RNA and proteins. The hypothesis based on the metabolic rate was grounded on two DNA features, bendability and nucleosome formation potential, both significantly correlated with the GC content. More precisely, a higher DNA bendability and a lower nucleosome formation potential have both been reported to be favored by an increment of the GC content [17, 18]. Accordingly, the GC-richest genomic regions showed a high transcriptional activity [19, 20].
A preliminary analysis of human genes grouped in functional classes according to the KOG database [21, 22] showed a biased distribution of the GC3 level, which was significantly higher in the functional classes of genes involved in metabolic processes. In the present paper, the analysis of the KOG categories and functional classes of genes was extended to thirteen completely sequenced mammalian genomes, as well as to amphibian (X. tropicalis) and reptile (A. carolinensis) genomes.
The current results confirmed the previous conclusions, further stressing the role of metabolic rate in shaping the mammalian genome organization. Indeed, a compositional hierarchy among functional categories was found, and the GC3 content of the genes involved in metabolic processes was the highest in all mammalian genomes analyzed so far. Interestingly, the mammalian compositional pattern was absent in the amphibian and the reptile genomes. This finding raised critical evolutionary questions on the compositional transition from "cold- to warm-blooded vertebrates" [24–26], more precisely from amphibians/reptiles to mammals, and will be discussed in the light of the current hypotheses about genome compositional variability.
Interestingly, in all mammals the functional classes that recurrently showed a GC3 level higher than that of the whole genome were, apart from those involved in metabolic processes, those involved in: Chromatin structure and dynamics, Transcription, Signal transduction mechanisms and Cytoskeleton. In the human genome, the aforesaid functional classes showed a high probability of clustering in the GC-richest chromosomal bands. In the light of the current literature, this organization could reflect the need for a coordinated response to stress stimuli that alter the normal metabolic rate.
In the KOG database http://www.ncbi.nlm.nih.gov/COG/ [21, 22] human proteins were grouped in functional classes (denoted by capital letters in square brackets), in turn grouped in three large categories, namely: (i) information storage and processing; (ii) cellular processes and signaling; (iii) metabolism. The corresponding protein and coding sequences (CDS) were retrieved from NCBI http://www.ncbi.nlm.nih.gov using the Batch Entrez function. Proteins classified in more than one class were removed from further analysis. Genes whose function was only predicted [R] or unknown [S], representing about 19% of the whole dataset, were also removed from further analyses. In order to avoid statistical bias, the functional classes represented by fewer than a hundred sequences, namely [M], [N] and [Y], accounting overall for less than one percent of the whole dataset, were removed as well. For the sake of simplicity, the square brackets denoting the functional classes are not used in the other sections of the present paper.
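The filtering steps described above can be illustrated with a minimal Python sketch. It is not the procedure actually used for the paper: the function name `filter_kog`, the input layout (a mapping from protein identifier to a list of class letters) and the order in which the filters are applied are assumptions made for illustration only.

```python
from collections import Counter
from typing import Dict, List, Sequence


def filter_kog(
    assignments: Dict[str, List[str]],
    excluded: Sequence[str] = ("R", "S"),
    min_class_size: int = 100,
) -> Dict[str, str]:
    """Return protein -> class after the three filters described in the text.

    `assignments` maps a (hypothetical) protein identifier to the list of
    KOG class letters assigned to it.
    """
    # 1. Drop proteins classified in more than one class.
    single = {p: c[0] for p, c in assignments.items() if len(c) == 1}
    # 2. Drop poorly characterized classes (predicted-only or unknown function).
    kept = {p: c for p, c in single.items() if c not in excluded}
    # 3. Drop classes represented by fewer than `min_class_size` sequences.
    #    Counting after steps 1 and 2 is an assumption of this sketch.
    sizes = Counter(kept.values())
    return {p: c for p, c in kept.items() if sizes[c] >= min_class_size}
```

The threshold is a parameter so that the same function can be tried on small toy inputs before being applied to a full dataset.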
The whole set of human CDS, as well as that of the following species (in alphabetical order): Anolis carolinensis, Bos taurus, Dasypus novemcinctus, Equus caballus, Gorilla gorilla, Loxodonta africana, Monodelphis domestica, Mus musculus, Ornithorhynchus anatinus, Oryctolagus cuniculus, Pteropus vampyrus, Pongo pygmaeus, Spermophilus tridecemlineatus, Tursiops truncatus and Xenopus tropicalis, were retrieved from the Ensembl database http://www.ensembl.org. CDS were classified in the KOG functional classes through orthology with the human proteins. In other words, for each genome, each gene acquired the KOG classification of the corresponding human gene once the orthologous pair had been identified. In order to identify orthologs, a Perl script was written that essentially performs a reciprocal Blastp (e-value < 1e-05) and selects the best reciprocal hit (BRH).
Flanking regions (2000 bp flanking the transcript at the 5' and 3' ends) and intronic sequences of the KOG human genes were retrieved, respectively, from Ensembl http://www.ensembl.org using BioMart tools, and from UCSC http://genome.ucsc.edu/.
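The best reciprocal hit selection can be sketched as follows. This is a Python illustration, not the Perl script used for the analysis. It assumes that the two Blastp searches have already been run and parsed into tuples of (query, subject, e-value, bit score), and that the best hit is the one with the highest bit score; both the data layout and the ranking criterion are assumptions, as are all function names.

```python
from typing import Dict, Iterable, List, Tuple

Hit = Tuple[str, str, float, float]  # (query, subject, e-value, bit score)


def best_hits(hits: Iterable[Hit], max_evalue: float = 1e-5) -> Dict[str, str]:
    """For each query keep the subject with the highest bit score,
    considering only hits with e-value below `max_evalue`."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, evalue, bitscore in hits:
        if evalue >= max_evalue:
            continue
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(
    human_vs_other: Iterable[Hit],
    other_vs_human: Iterable[Hit],
    max_evalue: float = 1e-5,
) -> List[Tuple[str, str]]:
    """Return (human, other) pairs that are each other's best hit."""
    forward = best_hits(human_vs_other, max_evalue)
    backward = best_hits(other_vs_human, max_evalue)
    return sorted((h, o) for h, o in forward.items() if backward.get(o) == h)


def transfer_kog_classes(
    pairs: Iterable[Tuple[str, str]], human_kog: Dict[str, str]
) -> Dict[str, str]:
    """Give each ortholog the KOG class of its human partner."""
    return {other: human_kog[human] for human, other in pairs if human in human_kog}
```

Ties in bit score are resolved here by keeping the first hit encountered, which is a simplification.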
CodonW (1.4.4) was used to calculate the GC content (i.e. the molar ratio of guanine plus cytosine) of coding and non-coding sequences, as well as the GC3 content (i.e. the molar ratio of guanine plus cytosine at the third codon positions) of CDS. The average GC3 level was calculated for each sequence of each genome analyzed. In order to determine the statistical significance of the differences in GC3 content between the three main categories of genes, a two-tailed Mann-Whitney test was performed.
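For readers unfamiliar with these quantities, the two definitions given above can be written as a short Python sketch. It is a simplified stand-in, not a reimplementation of CodonW, whose own definitions may differ in detail (for instance in the treatment of non-degenerate or stop codons). The function names are hypothetical, and ambiguous bases are simply ignored.

```python
def gc_content(seq: str) -> float:
    """Molar ratio of G+C over the unambiguous bases (A, C, G, T) of `seq`."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else float("nan")


def gc3_content(cds: str) -> float:
    """Molar ratio of G+C at the third codon positions of an in-frame CDS.

    Trailing bases that do not complete a codon are ignored.
    """
    usable = len(cds) - len(cds) % 3
    third_positions = cds[2:usable:3]
    return gc_content(third_positions)
```

For example, the sequence ATGGCCAAA has codons ATG, GCC and AAA, with third positions G, C and A, so its GC3 is 2/3.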
A de Finetti diagram was used to assess the compositional/spatial distribution of the three categories in different genomes (see Additional file 1 for a detailed description of the analysis). Briefly, for each organism the whole GC3 range was split into three intervals of equal size, denoted as Low, Medium and High, respectively. The number of functional classes belonging to each of the three categories was counted in each interval and normalized to 1 for plotting.
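The binning and normalization step can be sketched in Python as below. The detailed procedure is described in Additional file 1, so this is only an illustration under stated assumptions: the "whole GC3 range" is taken to be the interval between the lowest and the highest class average, and all names are hypothetical.

```python
from collections import Counter
from typing import Dict, Tuple

BINS = ("Low", "Medium", "High")


def gc3_bin(value: float, lowest: float, highest: float) -> str:
    """Assign `value` to one of three equal-width intervals of [lowest, highest]."""
    width = (highest - lowest) / 3
    if width == 0:
        return BINS[0]
    # The upper bound itself falls in the last interval.
    return BINS[min(int((value - lowest) / width), 2)]


def ternary_coordinates(
    class_gc3: Dict[str, float], class_category: Dict[str, str]
) -> Dict[str, Tuple[float, ...]]:
    """For each category, return the fractions of its functional classes that
    fall in the Low, Medium and High intervals (the fractions sum to 1)."""
    lowest, highest = min(class_gc3.values()), max(class_gc3.values())
    counts: Dict[str, Counter] = {}
    for cls, value in class_gc3.items():
        category = class_category[cls]
        counts.setdefault(category, Counter())[gc3_bin(value, lowest, highest)] += 1
    return {
        category: tuple(c[b] / sum(c.values()) for b in BINS)
        for category, c in counts.items()
    }
```

Each resulting triple can be used directly as the coordinates of one point in a ternary (de Finetti) plot.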
For each species the average GC3 level of each functional class was compared with that of the genome (i.e. the average of the GC3 level calculated using all the available coding sequences of the species), and statistical significance was assessed by Student's t-test, with Bonferroni's correction (α = 0.05) for multiple comparisons. The data were shown as a butterfly plot.
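The Bonferroni step and the signed deviations shown in a butterfly plot can be sketched as follows. The sketch assumes that the per-class p-values have already been obtained from the t-test with some statistics package; the function name and the data layout are hypothetical.

```python
from typing import Dict, List, Tuple


def butterfly_rows(
    class_mean_gc3: Dict[str, float],
    genome_mean_gc3: float,
    p_values: Dict[str, float],
    alpha: float = 0.05,
) -> List[Tuple[str, float, bool]]:
    """Return (class, deviation from the genome average, significant?) rows.

    Bonferroni's correction: with m comparisons, a difference is called
    significant only if its p-value is below alpha / m.
    """
    threshold = alpha / len(p_values)
    rows = []
    for cls in sorted(class_mean_gc3):
        deviation = class_mean_gc3[cls] - genome_mean_gc3
        rows.append((cls, deviation, p_values[cls] < threshold))
    return rows
```

Classes with a positive deviation would be drawn on the positive side of the plot, and those with a negative deviation on the negative side.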
Table 1. Functional classification of human genes
INFORMATION STORAGE AND PROCESSING
[A] RNA processing and modification
[B] Chromatin structure and dynamics
[J] Translation, ribosomal structure and biogenesis
[L] Replication, recombination and repair
CELLULAR PROCESSES AND SIGNALING
[D] Cell cycle control, cell division, chromosome partitioning
[M] Cell wall/membrane/envelope biogenesis
[O] Posttranslational modification, protein turnover, chaperones
[T] Signal transduction mechanisms
[U] Intracellular trafficking, secretion, and vesicular transport
METABOLISM
[C] Energy production and conversion
[E] Amino acid transport and metabolism
[F] Nucleotide transport and metabolism
[G] Carbohydrate transport and metabolism
[H] Coenzyme transport and metabolism
[I] Lipid transport and metabolism
[P] Inorganic ion transport and metabolism
[Q] Secondary metabolites biosynthesis, transport and catabolism
POORLY CHARACTERIZED
[R] General function prediction only
Total number of genes
Average GC3 levels, standard deviation and gene number of KOG's functional categories
In the amphibian and reptile genomes the GC3 content of the Red category was never significantly different from that of the Black and Blue categories, thus the mammalian pattern (Blue < Black < Red) was not observed (Figure 2 and Additional file 2).
Since Ikemura's pioneering papers [30, 31], the GC3 content, accounting for the base composition at the third (wobble) position of codons, has generally been associated mainly with codon usage and tRNA content. Furthermore, studies primarily performed on the human genome showed that GC3 should be considered a keystone parameter for understanding genome evolution. Indeed, GC3 turned out to be significantly correlated with the amino acid frequencies, i.e. GC1 + 2, as well as with the GC content of non-coding regions, i.e. introns and flanking regions [32–36]. Recent attempts to dismiss the pivotal role of the GC3 parameter in understanding genome organization failed to take into consideration that "the use of indirect methods can lead to apparently conflicting conclusions". Recently, the role of the GC3 parameter as a genome marker was further confirmed by the unexpected finding of correlations with genome size and body mass of mammals. The subset of KOG human genes analyzed in the present paper followed, as expected, the well-established rules first described in the 1990s [32–36]. These rules held not only for the whole set of genes, but also when the genes were grouped in the KOG functional classes (see Additional file 8 for statistical reports).
In order to shed light on the debate around the evolutionary forces shaping the base composition among and within genomes [4, 6–10, 16–18], genes were classified in the three functional categories [namely: (i) information storage and processing (Blue); (ii) cellular processes and signaling (Black); (iii) metabolism (Red)] and their base compositional properties were analyzed. The results showed that within mammalian genomes the three functional categories were characterized by a different GC3 content, following the pattern Blue < Black < Red (Figures 2 and 3). It is worth stressing that, in platypus and opossum, no significant differences were observed when comparing the Black and the Blue categories (see Additional file 1). No pattern was found in the reptile and amphibian genomes (Figure 2).
It is worth recalling that the keystone of the biased gene conversion hypothesis (BGC) was the strong correlation between recombination hotspots and GC content, establishing a cause/effect link of the former on the latter. Consequently, the genomic impact of BGC would be an increment of the GC content detectable at non-synonymous sites, synonymous sites, flanking and intronic sequences. Since BGC was reported to mimic natural selection perfectly, the compositional correlations holding in the human genome [32–36], including those reported in Additional file 8, could not be considered evidence for the natural selection hypothesis.
In the light of the BGC hypothesis, the Blue < Black < Red pattern (observed in the majority of mammalian genomes) could have been explained as the result of the star-like phylogeny of mammals. However, comparative genome analyses showed that recombination hotspots are "highly mobile" and therefore not phylogenetically related. This result is further supported by the studies conducted on the fast-evolving DNA-binding domain of PRDM9, identified as a major determinant of recombination hotspots. Indeed, the sequences and the number of PRDM9 domains were reported to vary considerably among species (reviewed in ). The lack of the Blue < Black < Red pattern in both frog and lizard, and its appearance in mammalian genomes, at present remains unclear. BGC received support from the analysis of the short sequences HARs and HACNSs in the human genome. However, although BGC was reported to be a widespread process affecting all genomes, the hypothesis was unable to explain the base compositional variability among bacterial genomes.
An interesting alternative to the BGC hypothesis was that proposed by Lemaitre and colleagues, based on the analysis of the DNA breakpoint regions (BPR). Very recently, indeed, a 3D analysis of BPR showed that "two loci distant in the human genome but adjacent in the mouse genome are significantly more often observed in close proximity in the human nucleus than expected". The conservation of the Blue < Black < Red pattern among mammals, which started to diverge about 100 Mya, could probably be explained by the fact that the 3D chromatin structure could be conserved over long evolutionary distances. The divergence between amniotes and amphibians and that between mammals and lizards were estimated to be about three to four times older than the mammalian radiation (340-370 Mya and ~310 Mya, respectively). This would explain why the pattern was not conserved in reptiles and amphibians. However, according to the BPR hypothesis, evolutionary rearrangement breakages happen with a uniform propensity along the genome, leaving unexplained how the Blue < Black < Red pattern, absent in frog and lizard, could have evolved in mammals. Moreover, as far as we know, no evidence has been produced to explain the base compositional variability among bacterial genomes in the light of the BPR hypothesis.
The critical question (the Blue < Black < Red pattern) could be explained, on the contrary, by both the thermal stability and the metabolic rate hypotheses [16–18]. Indeed, in situ hybridization experiments performed on both human and amphibian (i.e. Rana esculenta) nuclei showed a comparable chromatin organization [47, 48]. In both genomes, the GC-poorest regions were found in closed chromatin structures localized at the nuclear periphery, while the GC-richest ones were found in open chromatin structures localized more internally in the nuclei [47, 48]. According to the above reports, the different temperatures at which amphibians and mammals live could induce an increment of the GC content in mammals, in order to stabilize the open chromatin structures. Alternatively, an increment of the metabolic rate, well known to be higher in mammals, should induce an increment of the GC content that increases DNA bendability and decreases the nucleosome formation potential, in order to cope with an increment of transcriptional activity [17, 18]. In this regard it should be recalled that along human chromosomes the GC content and the gene expression profiles showed a positive correlation.
Temperature and metabolic rate are well known to be strongly correlated. Therefore, disentangling the two variables is not an easy task in the light of the present data, also considering that terrestrial animals live in an environment where oxygen is not a limiting factor. The problem was recently addressed by analyzing the genomes of organisms living in aquatic habitats, where the oxygen available in the environment is limited by Henry's law. The analyses of teleostean fish genomes showed that: i) the genomic GC content of polar fish was higher than that of tropical fish; ii) a positive and significant correlation holds between GC content and metabolic rate; and iii) a negative correlation holds between environmental temperature and GC content [50, 51]. The problem was tackled in the present paper by analyzing the orthologous pairs of human/frog (H/F) and human/lizard (H/L) genes. In both cases, the highest ΔGC3 was found in the Red category, that is, the functional category grouping genes involved in metabolic processes (Figure 4). Although it does not resolve the dichotomy between temperature and metabolic rate (both increasing, indeed, from frog to human), the result was congruent with the conclusion drawn from the comparison of teleostean fish genomes [50, 51].
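The comparison of orthologous pairs can be illustrated with a minimal Python sketch. It assumes that ΔGC3 is the GC3 of the human gene minus the GC3 of its frog or lizard ortholog, averaged within each functional category; this sign convention, the averaging and the function name are assumptions of the sketch rather than a statement of the exact procedure behind Figure 4.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def mean_delta_gc3(
    pairs: Iterable[Tuple[str, float, float]]
) -> Dict[str, float]:
    """Average GC3 difference (human minus other species) per category.

    Each item of `pairs` is (category, GC3 of the human gene,
    GC3 of its ortholog in the other species).
    """
    deltas = defaultdict(list)
    for category, gc3_human, gc3_other in pairs:
        deltas[category].append(gc3_human - gc3_other)
    return {category: mean(values) for category, values in deltas.items()}
```

With such a summary, the category with the largest value is the one that underwent the highest GC3 increment between the two species.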
A detailed investigation of the distribution of the KOG functional classes revealed that the Blue < Black < Red pattern was even more multifaceted. Indeed, on the positive side of the human butterfly plot, apart from the majority of the Red bars, the B and K blue bars, as well as the T and Z black bars, were also observed (Figure 5, panel C). The above picture was not confined to the human genome, but was commonly found in all mammals. Indeed, the B and T classes were on the positive side of the butterfly plot in 93% of the cases, whereas the K and Z classes reached 100% of the cases (Figure 5, panel C). The occurrence of the bars belonging to the Red category ranged from 86% for the Q class to 100% for the G, E and P classes. Needless to say, the pattern was not found in the frog and lizard genomes. All the considerations previously drawn in the light of the different evolutionary hypotheses regarding the Blue < Black < Red pattern applied even more strongly to the pattern of functional classes clustering on the positive side of all mammalian butterfly plots (Additional files 3, 4, 5, 6, 7 and Figure 5, panel C), showing a different chromosomal distribution.
The above result deserves a more detailed discussion. As reported in Table 1, the genes belonging to the four classes were involved in the following tasks: Chromatin structure and dynamics (B), Transcription (K), Signal transduction mechanisms (T) and Cytoskeleton (Z). The fact that the GC3 content of genes belonging to the B and K classes was higher than that of the whole genome was not surprising, since an increment of the metabolic rate affects the transcription process and chromatin structure, as discussed above. More puzzling was the result regarding the T and Z classes. Recently, an interesting paper was published on the effect of estrogen exposure in mouse brain, which induced an increment of the expression level of a discrete number of genes. From those results it can be derived that, besides 39% of genes involved in metabolic processes, 18% belonged to the Z class and 25% to the T one, whereas only 6% of the genes belonged to the category grouping genes involved in information storage and processing. The report by Szego and colleagues was an interesting preliminary approach, pointing towards further investigations on the link between genome organization and the physiological reaction to stress stimuli that increase the metabolic rate. Interestingly, gene clusters for metabolic pathways have also been reported in plants (reviewed in ).
All the different evolutionary hypotheses proposed until now surely contribute, with different weights, to the compositional variability observed among and within organisms. Few, however, seem to fit the very wide range from prokaryotes to eukaryotes. Indeed, recent analyses showed that mutational bias cannot explain genome composition in bacteria (reviewed in ). The BGC hypothesis, supported by the data produced on the HARs and HACNSs sequences in mammals, also failed to explain the base compositional variability among bacterial genomes, and hardly explains the present results. The BPR hypothesis was very promising, especially in the light of the studies carried out on the conservation of the 3D chromatin structure over long evolutionary distances. However, the mechanism leading to the observed patterns in mammals remains to be clarified.
Regarding the thermodynamic hypothesis, the extensive studies carried out on bacterial genomes have been a matter of debate [57–60]. Within bacterial families, a significant positive correlation between growth temperature and GC content was observed in 9 out of 20 families. However, the positive correlation was not observed in teleostean fish genomes, where a negative one was found instead [50, 51]. Unfortunately, the present data shed no light either for or against the effect of temperature on the compositional transition from amphibians/reptiles to mammals [24–26]. On the contrary, the metabolic rate hypothesis [17, 18] explained not only both the transition and the shifting mode of evolution of vertebrate genomes [50, 51], but also the within-genome patterns shown in the present paper. Moreover, a correlation between metabolic rate and GC % has also been found in bacteria [12, 15], as well as among teleostean fish [50, 51]. It is worth recalling that, although the metabolic rate hypothesis belongs to the adaptive hypotheses, most probably there is no need to invoke the effect of positive selection. Indeed, the shift of the threshold for the "best-fit GC content" could account for the genome compositional shift observed when comparing teleostean fish living in different habitats. Natural selection has also been proposed to explain the great compositional heterogeneity of the human genome.
BGC: biased gene conversion
BPR: DNA breakpoint distribution
BRH: best reciprocal hit
GC: molar ratio of guanine plus cytosine
GC1 + 2: molar ratio of guanine plus cytosine at first and second positions
GC3: molar ratio of guanine plus cytosine at third codon positions
molar ratio of guanine plus cytosine in intronic sequences
HACNSs: human-accelerated conserved non-coding sequences
KOG: clusters of orthologous groups for eukaryotic complete genomes.
Thanks are due to Claudio Agnisola, for critically reading the manuscript, Guillermo Lamolle, for the butterfly plots, and Fernando Alvarez-Valin, for the bioinformatics facilities of the Facultad de Ciencias, Universidad de la República (Uruguay).