Human global transcriptome profile based on expression in multiple tissues
The unsupervised clustering analysis of the global expression correlation along 116 samples of 32 human normal tissues is presented in Fig. 1. This plot shows the color heatmap and dendrogram produced by the comparison of all the expressed genes (18,545) along all samples. The pair-wise distances were calculated as: (1 – Spearman correlation coefficient); that is a non-parametric approach to calculate similarity and distance. In this way, the heatmap shows a clear relationship between the samples from the same tissue (i.e. the biological replicates come together) and also presents the proximity between the tissues that have strong biological and physiological links, such as: spleen, lymph nodes and tonsils (all related to the lymphatic system); or stomach, duodenum, small intestine, colon and rectum (all related to the digestive system). A color bar with scales for the heatmap is included in Fig. 1, to indicate that dark-red corresponds to minimum distance (i.e. maximum correlation) and dark-blue to maximum distance (i.e. minimum correlation). White color corresponds to medium values and the distributions inside the color bars show that the major part of the comparisons have values around r = 0.80–0.85 (i.e. pale-red colors). The results also reveal that some tissues have transcriptomic profiles very different to the rest, being testis the most different one that produces a clearly separated branch in the dendrogram.
House-keeping and tissue-enriched genes
The transcriptomic data allowed creation of two relevant subsets of genes: House-keeping (HKg) and Tissue-enriched (TEg) genes. Figure 2a shows the number of genes expressed, with FPKM equal or higher than 1, per number of tissues. This calculation allowed the identification of 8961 ubiquitous genes present in at least one replicate of all tissues, and 7668 genes expressed in all the 116 samples (i.e. in all the biological replicates of all tissues). The intersection of these 7668 genes with a curated dataset of 3804 house-keeping genes created by Eisenberg et al. [28] gave a total of 3524 HKg (Fig. 2c), indicating a large overlap of 93 %. To identify how relevant was this, we calculated the odds ratio (OR) of such overlap and obtained a very significant value of OR = 32.09 (with 95 % confidence interval, CI, of the OR = 28.27–36.43). For the tissue-enriched genes we explored the other side of the data in Fig. 2a and considered just the genes that were expressed in only one, two or three tissues (2459 genes). We did not take only one, but also two or three tissues, because some tissues are physiologically very related and in fact presented high correlation between them, for example: colon and rectum; small intestine and duodenum, etc. These observations indicated that such tissues share a large number of common genes in their expression profile. Finally, a global comparison of the expression distributions of HKg versus TEg indicated that the Tissue-enriched genes showed significantly lower expression values than the House-keeping genes (Fig. 2b). To demonstrate this difference, we made two statistical tests, t-test and Wilcoxon. In both cases, the p-values were very low (p-value < 1e-10) and the difference between the mean expressions of TEg and HKg was 1.61 (in the log2 scale). Thus, we can conclude that this difference is not by chance but due to a real different expression regulation that should be considered in further analyses. These boxplots also show that the global variability of TEg (which presents values below 1 and above 12 in the log2 scale) was larger than the global variability of HKg. This observation indicates that the TEg present a larger range of expression values along different tissues and the HKg have a tighter regulation.
Human gene hallmarks on the evolutionary time-scale
For the evolutionary analysis, we used a phylostratigraphic method for reconstruction of macro-evolutionary trends based on the principle of “founder gene” formation [10]. Typically, these methods first identify the homologues of a given gene and then use the divergence between the two most distant to determine the gene age. Historically, such studies have been using BLAST [29] for homology searches. However, this approach was shown to introduce some biases into the analyses [30, 31]. Another approach is to use orthologous groups to determine the age of a gene. Orthologues are believed to be functionally more similar than paralogues [32] and by definition they trace back to an ancestral gene that was present in a common ancestor of the compared species [33]. The parameters used for clustering orthologous groups affect the age estimations for a gene; for example, restrictive parameters tend to limit the set of possible progenitors [9]. Nevertheless, both approaches used for dating the gene origin depend on the correct identification of homologues and/or orthologues, but in the second case the accurate reconstruction of orthologous families imposes a higher stringency and it implies a conservation along the evolutionary clades.
For our analysis, we identified the group of orthologues that corresponded to each of the 18,545 genes detected in the transcriptomic study, mapping them to the corresponding human protein-coding genes in the OMA database [20] (i.e. mapping of 18,545 genes to 17,437 proteins), and then assigning the Lowest Common Ancestor (LCA) to each human protein according to its orthologous family. The use of OMA in comparison with BLAST homology approach gives us a more detailed view of gene origin since, as we indicated above, it uses a more restrictive grouping method. Thereby, the number of genes dated on ancient clades is lower. Once we identified the LCA for each human protein-coding gene we assigned such protein/gene to the corresponding taxonomy level in the human lineage as defined in NCBI (http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=9606), that includes 31 taxonomic groups as consecutive clades from the first one, named cellular organisms, to the last one Homo sapiens. Figure 3 presents these 31 taxonomic clades placed along the time-scale (in million years ago, Mya) from the origin to present. For each one of these taxonomy levels we represented the cumulative percentage of proteins that are dated at such level. This is done in the following way: first, plotting all the genes mapped to OMA proteins (Fig. 3, black line in the graphic, that includes 17,437 proteins); second, the same plot is produced but including only the proteins that correspond to House-keeping genes (Fig. 3, blue line includes 3393 proteins, HK); third, plot including only the Tissue-enriched genes (Fig. 3, red line includes 2157 proteins, TE).
The analyses of these plots obtained with the phylostratigraphic method revealed the presence of some major differential steps on the emergence of protein-coding genes along the evolutionary time-scale from origin to present. Looking at all the expressed coding genes, we can see the global evolutionary profile of the organism (human, in this case), but along this profile we can identify some more prominent steps in the accumulated relative number of genes (genes count) along time. For example, looking at the plot of the HK a large increase is observed at the start of the curve, from the first taxonomy level (origin, Cellular organisms) to second (Eukaryota) taxonomy level (i.e. the first phylostratum Cellular organisms - Eukaryota). By contrast, the TE have the major emergence of genes much later (around the Mammalia) (Fig. 3). The complete timeline includes 31 phylogenetic clades (named in the figure legend), but by analysing these points it was possible to identify eight major steps or stage levels (named “hallmarks”) that appear on the human gene profile along evolutionary time. Moreover, we could assign the number of human protein-coding genes that emerged along each one of these eight stages for the three categories reported: all the expressed protein-coding genes mapped to OMA, the HK and the TE. The eight stage levels identified are: st1)
Cellular organisms (Prokaryota); st2)
Cellular organisms to Eukaryota; st3)
Eukaryota to Metazoa; st4)
Metazoa to Vertebrata; st5)
Vertebrata to Euteleostomi; st6)
Euteleostomi to Mammalia; st7)
Mammalia to Primates; and st8)
Primates to Homo sapiens. All the numbers corresponding to the human protein-coding genes assigned to each of these eight evolutionary hallmarks are included in Fig. 4, that indicates how many are in each stage either considering the complete human gene set or just the HK or the TE. All the information about each one of the 17,437 human protein-coding genes including the assignment to stages is also provided as Additional file 4.
The analysis of the hallmarks also reveals that the HK genes are more ancient than TE genes. The plots in Fig. 3 and the data in Fig. 4 show clear difference between them. The HK genes present a major increase or expansion in stage 2 (Prokaryota to Eukaryota), with 1009 genes and a change of ≈ 30 % with respect to the total. By contrast, the TE genes show a major increase in stage 7 (Mammalia to Primates) with 799 genes and a change of ≈ 37 %. These observations seem to indicate that house-keeping genes emerged early in evolution and are older in age, knowing that they reflect more essential and constitutive functions. The idea that gene essentiality is associated with older genes has been reported in several studies, for example on yeast [14] and mammalian genes [16]. By contrast, the observations that human tissue-specific genes had emerged later in evolution may reveal that human specific cellular or physiological roles are implemented at molecular level by the appearance of newer genes.
Gene age data comparison
As we indicated above, there are some studies that use the phylostratigraphic method to explore the age of human genes, but most of these studies use sequence similarity search (with algorithms like BLAST) to look for the oldest homologues to the human [12, 16, 34]. To compare the results on human gene age assignment done in this work with other available age assignments, we took the published data from Domazet-Lošo (2008) [12] and from Neme (2013) [34], and we represented the information about allocation to Lowest Common Ancestor (LCA) of the human genes in phylogenetic clades of the evolutionary tree. The assignments were done using 15 common phylostratum to allow the comparison of the data. The results of this comparison are included in Additional file 1: Figure S4 and they show a general similarity but some important differences. The most significant difference corresponds to the fact that both Domazet-Lošo [12] and Neme [34] placed a very large number of genes on the first stage of the evolutionary time-scale that goes from the origin of life to first cellular organisms (i.e. pre-eukaryota): 8285 of 22,845 (36 %) [12] and 7309 of 22,154 (33 %) [34]. This result denotes a bias that, as we indicated above, can be due to the methodology of using the homology search approach. In any case, the idea suggested that one third of the human proteoma may have emerged in evolutionary time before the origin of eukaryotic cells needs deeper studies and it is not what we observed in our analyses.
Another important difference is that the proposed age mapping allocates the largest number of genes, first, to the Chordata-Vertebrata-Euteleostomi phylostratums (with 5070 genes) and, second, to the Mammalia-Eutheria (with 2172 genes) (Additional file 1: Figure S4). From the evolutionary point of view these results make a lot of sense since the time-scale of life [21, 22] reveals two large expansions of the species precisely around the vertebrates time (between 600 and 400 Mya) and around the time of the mammals appearance (between 250 and 100 Mya). These expansions are well reflected in our time-scale profile (Fig. 3). Finally, it is important to indicate that the age mapping presented in our study only considers human protein-coding genes that are included in orthologous families (mapping a total of 17,437) and in this way it has a lower coverage over human genes than the other reported studies which include more than 22,000 genes in each case [12, 34].
Functional enrichment of the genes at different evolutionary hallmarks
We performed functional enrichment analyses of the sets of protein-coding genes included in each one of eight major stages found in the evolutionary study. The full results of these analyses are provided as Additional file 5. In all the stages the functional enrichment makes clear biological sense and provides a strong support to the allocation of many biological processes in evolutionary time. In brief, we comment and discuss below some interesting functions enriched in each stage.
Stage-1, from the origin of life to first cellular organisms (i.e. Prokaryota). This stage comprises the genes occurring over two major domains of life: Archaea and Bacteria. Determining the LCA, our data shows that human has 6.76 % (1178) of the protein-coding genes assigned to Prokaryotic age. Prokaryotes are organisms that lack both membrane-bound organelles and nucleus. Functional enrichment analysis showed that this stage involved many basic metabolic processes like glycolysis (GO:0006007, glucose catabolic process), the Krebs cycle (GO:0006099, tricarboxylic acid cycle), and lipid oxidation (GO:0009062, fatty acid catabolic process). The enrichment also shows the appearance of the oldest cellular organelle, the mitochondria, and the oldest macromolecular machine, the ribosome, that are well reported to be dated to Prokaryotic times.
Stage-2, Cellular organisms to Eukaryota. According to basic literature, the defining feature of eukaryotic cells is that they have membrane-bound organelles, especially the nucleus, which contains the genetic material, and is enclosed by the nuclear envelope. Protists, fungi, animals, and plants all consist of eukaryotic cells. Eukaryotic cells also contain other membrane-bound organelles such as the Golgi apparatus. Eukaryotic organisms can be unicellular or multicellular. The functional enrichment analysis for the 2178 genes that emerged along this stage showed well the formation of the principal complexes expected in Eukaryotes. In this way it is remarkable the enrichment on nuclear pore proteins, nuclear import proteins, nucleosome and chromatin proteins, as well as many proteins involved DNA and RNA activity: mRNA and rRNA processing, mRNA splicing, DNA unwinding, DNA polymerase, DNA/RNA helicase. This stage also marks in time the appearance and biogenesis of some major molecular complexes: the proteasome (GO:0005839, proteasome core complex), the spliceosome, and the ribosome (at this stage mainly the proteins of the large subunit RPLs, in contrast to the ribosomal proteins of the small subunit RPSs, that were mostly allocated to Prokaryotic age).
Stage-3, Eukaryota to Metazoa. The third stage comprises organisms from Opisthokonta and Metazoan clades with 1395 protein-coding genes (27.25 % cumulative). The Opisthokonts are a broad group of eukaryotes, including both the animal and fungi kingdoms, sometimes referred to as the “fungi/metazoan group” [35]. This stage comprises metazoan, fungal and protistan taxa, and other multicellular taxa (such as plants, and red and brown algae) [36]. They also include known fungi and/or parasites of plants like: Ascomycota, Basidiomycota, Chytridiomycetes, Glomeromycota, Microsporidia, Urediniomycetes, Ustilaginomycetes and Zygomycota [37]. According to our functional enrichment analysis, this stage involves different genes responsible for signal transduction like the GTPases. Some of the enriched terms are: pyrophosphatase activity, nucleoside-triphosphatase activity, GTP binding, transferring phosphorus-containing groups. All these functions indicate that it may be the time when phosphorus and phospate acquired a key role in protein function and regulation. Other enriched functions, like post-translational protein modification, calcium-binding EF-hand, protein transport and localization also indicate cellular protein regulation.
Stage-4, Metazoa to Vertebrata. This stage includes organisms from Eumetazoa, Bilateria, Deuterostomia, Chordata, Craniata and Vertebrata with 2333 genes (40.63 % cumulative). The main novelties of this stage are the appearance of protein kinase activity, and the presence of growth factors and some specific signaling proteins like WNT. All biochemically characterized members of the WNT superfamily encode enzymes that transfer organic acids, typically fatty acids, onto hydroxyl groups of membrane-embedded targets [38]. Other enriched terms in this stage, like sarcomere and contractile fiber part, may indicate the emergence of the muscular structures present in vertebrates [39].
Stage-5, Vertebrata to Euteleostomi. This stage represents the largest step in the human lineage according to the number of protein-coding genes assigned (5070), that correspond to a 29 % of the total. The stage comprises organisms from Gnathostomata (jawed vertebrates), Teleostomi (bony fish and tetrapods) and Euteleostomi (bony vertebrate) [40]. The enrichment analysis shows a large functional expansion including new biological systems, like the neural and the vascular-circulatory systems, represented in enriched terms like: neurogenesis, neuron differentiation, axogenesis, voltage-gated channels, neuromuscular junction development, blood vessel development, vasculature development, mesenchymal cell development and differentiation, etc. Many other genes are assigned to biological regulation and regulation of cellular processes; including cell death and apoptosis. Finally, the appearance of the large family of homeobox proteins seems to be placed at this stage.
Stage-6, Euteleostomi to Mammalia. In this stage there are organisms from Sarcopterygii (lobe-finned fishes) [41], Dipnotetrapodomorpha (new taxon from NCBI comprising lungfishes), Tetrapoda (four-legged vertebrates), Amniota (comprising the reptiles, birds and mammals that lay their eggs on land or retain the fertilized egg within the mother) and up to Mammalia clades. With 1953 genes at this stage, the human lineage achieves 80 % of its gene composition. The most relevant enriched terms are related to the hematologic system, marking the appearance of the leukocytes and the lymphocytes. Previous phylogenetic analyses based on gene expression data also placed the date of many proteins from leukocytes around the time of the mammals’ clade [42].
Stage-7, Mammalia to Primates. This stage comprises clades of Theria, Eutheria, Boreoeutheria, Euarchontoglires and Primates, representing organisms that give birth to live young without using a shelled egg up to placental mammals [43]. There are 2821 genes emerged on this stage, adding up to 97.08 % of the cummulative profile in the human gene lineage. A large amount of these genes are enriched in the terms: regulation of gene expression and transcription. Other more specific terms are related to the skin (epidermal and epithelial cell differentiation, keratinization) or with the sexual reproductive system (male gamete generation, spermatogenesis and sexual reproduction). This stage also includes a family of cytochrome P450 proteins (that are around 23) and the mammalian defensins (that are 6): DEFA1B, DEFA3, DEFA4, DEFA5, DEFA6, and DEFB4A. Defensins are a family of antimicrobial peptides and vital contributors to host immune response. Being constitutive or inducible expressed genes, they have been shown to contribute to innate host defense via direct bactericidal activity, as well as to adaptive immunity through effector and regulatory functions [44].
Stage-8, Primates to Homo sapiens. The last stage of human development, with 509 genes, presents a group of quite specific functions played by specific protein families, such as: somatotropin hormone, cytochrome P450, GTPase activator activity, defense response to fungus and bacterium provided by histatins. HIS1 and HIS3 (histatin proteins) have been found only in saliva of humans, macaques and some other primates but not in any other mammals [45]. They are a family of histidine-rich polypeptides that probably function as part of the non-immune host defense system and appeared very late in evolution [45]. Cytochromes P450 constitute a superfamily of proteins that existed in virtually all species from prokaryotes to humans. Most of these proteins in the CYP1, CYP2, CYP3 and CYP4 families encode enzymes involved in the metabolism and elimination of potential toxic compounds like drugs or foreign xenobiotics, and are inducible by various environmental stimuli [46]. This last stage includes a small subset of six cytochromes P450 that seem to be very specific of the primates-human clades: CYP2A7, CYP2C9, CYP2D6, CYP2J2, CYP2S1 and CYP3A43. The appearance late in evolution of some of these genes may reflect their functional specificity and it is known that they play a key role in human health [46].
Network analysis reveals evolutionary age conservation of coexpressed proteins
The global coexpression analysis of the human protein-coding genes allowed the construction of a network including highly correlated protein pairs. The integration of these data with the data from the evolutionary analysis –that provided the identification of the eight stages along evolutionary time– did allow mapping the age of the genes on the network according to such stages. These results are presented in Fig. 5 that shows a complex network –like a galaxy– that includes 1691 protein nodes associated by 19,615 interactions. This network corresponds to a subset of the coexpression data, build as indicated in Methods, which included 2298 proteins and 20,005 interactions (this full coexpression data is provided in Additional file 2, and the network in Additional file 3). The subset is done with only the groups that had at least five linked proteins, since we wanted to provide in Fig. 5 a visible representation of the network with a clear color mapping of the eight stages that were identified in the evolutionary study. The colors of the stages are also presented in the illustrated table, Fig. 4, to allow better identification of the number and % of proteins at each age hallmark. The analysis of the network (Fig. 5) done with the algorithm MCODE revealed the existence of 11 major subnetworks, which can be considered as major constellations in the galaxy of relational nodes. The color code of this large graph indicates that there is enrichment in certain colors in each subnetwork. To prove this in a more accurate way, we built a graphic representation for each one of the 11 subnetworks found to show the proportion of proteins assigned to each of the eight evolutionary stages with their corresponding color code (colors as in Figs. 4 and 5). This graphic is presented in Additional file 1: Figure S5, that shows each subnetwork with its specific color pattern, indicating that there are always some predominant colors: subnetwork 5 is the oldest with red predominant colors and subnetwork 11 is the newest with blue predominant colors. As a conclusion, these results revealed that in the groups of highly coexpressed proteins there is a tendency to include proteins of the same evolutionary age.
Finally, we did a functional enrichment analysis of the proteins forming the 11 subnetworks which again showed a coherent biological enrichment in specific functions: (subnetwork 1) immune response; (2) cell cycle; (3) cytoskeleton; (4) RNA splicing; (5) ribosome; (6) extracellular matrix; (7) muscle and contraction; (8) gametes and reproductive process; (9) cell junction and cell adhesion; (10) mitochondria and ATP synthesis; (11) angiogenesis and vasculogenesis. More detailed results for this analysis are included in Additional file 1: Figure S6. Thus, we observed that age-related proteins are predisposed to present expression coregulation and to have close functional links.
Combining age data, functional data and coexpression data can provide a deeper view about the links and roles of the human protein-coding genes. In this way, we observed for example that subnetwork 5 (which contains proteins related with ribosome and translation) presented as expected an overwhelming majority of ancient genes from the Prokaryotic or Eukaryotic age (stages 1 and 2). On the other hand, subnetworks 1 (immune response, leukocyte/lymphocyte activations), 8 (gametes and reproductive process) and 11 (angiogenesis and vasculogenesis) showed a higher proportion of recent genes dated after Vertebrata (Additional file 1: Figure S5). These results agree with studies based on yeast protein physical interaction networks, arguing that proteins preferentially interact with proteins of same age and origin [47]. Moreover, it was previously shown that coexpression networks can be conserved over the evolutionary history, and these genes tend to be functionally related and provide selective advantages [48]. It has been also reported that coexpression networks are found associated to functions like cell adhesion, cell cycle, DNA replication and DNA repair [49], and this is in agreement with functions found enriched in subnetworks of our analyses: subnetwork 2 and 9 (Additional file 1: Figure S5).