Differences in duplication age distributions between human GPCRs and their downstream genes from a network prospective
© Huang et al. 2009
Published: 7 July 2009
How gene duplication has influenced the evolution of gene networks is one of the core problems in evolution. Current duplication-divergence theories generally suggested that genes on the periphery of the networks were preferentially retained after gene duplication. However, previous studies were mostly based on gene networks in invertebrate species, and they had the inherent shortcoming of not being able to provide information on how the duplication-divergence process proceeded along the time axis during major speciation events.
In this study, we constructed a model system consisting of human G protein-coupled receptors (GPCRs) and their downstream genes in the GPCR pathways. These two groups of genes offered a natural partition of genes into the peripheral and the backbone layers of the network. Analysis of the age distributions of the duplication events in human GPCR and "downstream genes" gene families indicated that both experienced an explosive expansion at the time of early vertebrate emergence. However, we found that only GPCR families saw a continued expansion after early vertebrates, most prominently in several small subfamilies of GPCRs involved in immune responses and sensory responses.
In general, in the human GPCR model system, we found that the position of a gene in the gene network has a significant influence on the likelihood of fixation of its duplicates. However, for a super gene family, the influence was not uniform among subfamilies. For superfamilies such as GPCRs, whose genetic basis of expression diversity was well established in early vertebrates, continued expansions were most prominent in particular small subfamilies mainly involved in lineage-specific functions.
Gene duplications at genomic and local levels are believed to have played important roles in the evolution of vertebrates [1–4]. Waves of gene duplication events were found to have happened at approximately the time of the emergence of early vertebrates and mammals. Massive gene duplications would bring great disturbance to the gene regulatory networks in the cell. How gene duplications impacted and reshaped the gene networks is still not well understood. Nevertheless, several recent theoretical analyses have shed some light on the issue [5–8]. It was shown that the scale-free properties of the gene networks were necessary consequences under the assumption of asymmetric retention of duplicated genes in favor of the genes in the periphery of the network, which was supported by the family sizes of genes with different connectivity in genetic or protein-protein interaction (PPI) networks in yeast and worm [5, 7].
However, these studies did not provide information on how the duplication-divergence process proceeded along the time axis during major speciation events, such as the emergence of vertebrates, as their model species were all invertebrates. Meanwhile, the genetic or PPI networks offered only snapshot information about the relationship between family sizes and connectivity of genes, and this information was often found to be inaccurate. Independent evidence not directly based on genetic or PPI networks was needed for cross-examination.
In view of these problems, in this study we used human G protein-coupled receptors (GPCRs) and their downstream genes in the pathways ("downstream genes") as the model system to examine the impact of gene duplication on the evolution of genes in different layers of the network. It has been shown that the gene regulatory network roughly maps to the cellular organization, with the genes on the periphery of the cell mapping to the peripheral layer of the gene network. In this sense, GPCRs and their "downstream genes" offered a natural partition into the peripheral layer and the backbone layer of the gene network. Meanwhile, GPCRs also form one of the largest known groups of signaling proteins in mammalian genomes, and GPCR pathways cover a good portion of the gene network and influence a wide range of physiological activities such as neurotransmission, metabolism, secretion, differentiation and growth, learning and memory, and immune responses [11–13]. The results from the GPCR model system were thus highly representative of the general gene regulatory network in human cells.
In this study, we estimated the ages of the duplication events in human GPCR and "downstream genes" gene families. Comparison of the age distributions of GPCRs vs. the "downstream genes" provided a more detailed view of the duplication-divergence process along the time axis in the context of major speciation events in vertebrates. Furthermore, GPCRs were partitioned according to the GRAFS system into subfamilies, and the age distributions of major subfamilies of GPCRs were estimated and compared. We also examined the expression profiles of GPCRs and downstream genes of different duplication ages for their contribution to tissue complexity at different evolutionary stages. In general, we found that most of the GPCR pathways, which cover a substantial portion of the gene network, had been established by the time of early vertebrate emergence. Continued expansions in GPCR families were to a large extent contributed by several small subfamilies involved in immune responses and sensory responses. Our study of the GPCR pathways suggested that the position of a gene in the gene network has a great influence on the likelihood of fixation of its duplicates. However, the influence was not uniform. Instead, expansion of a large gene family may be attributed to strong expansions of some particular subfamilies, when such expansion was favorable in particular species or at particular evolutionary stages. The generality of these principles could be further examined in other super gene families when the data become available.
We studied the human non-olfactory GPCRs, which were classified based on the GRAFS classification system. The olfactory GPCRs were not included in this study, as their evolutionary patterns in mammals were peculiar. We also excluded 23 human non-olfactory GPCRs that were not classified by the GRAFS system, 11 GPCRs in the delta group of Rhodopsin that were likely misclassified, and several predicted GPCR genes that were no longer supported by current genome annotations. In total, our GPCR data set covered 52 human GPCR families containing 302 human non-olfactory GPCRs. Vertebrate and invertebrate homologues of the human GPCRs were identified and linked to the families (see Methods).
For the "downstream genes", the process was essentially the same. The ages were estimated using the "nearest neighbor clock" approach as well for each family and were summarized in Additional file 2.
We further examined the age distributions of duplication events in different subclasses of GPCRs to see whether they contributed equally to the age distribution of GPCRs. Figure 3C shows the age distributions of the GRAFS subclasses of GPCRs (with the "downstream genes (D)" as a dashed line for reference). The densities were multiplied by the number of genes in each class to reflect the differences in sizes among classes. Different classes of GPCRs contributed differently to the overall distribution of GPCRs. Rhodopsins, which make up the majority of GPCRs, had an age distribution similar to that of GPCRs overall, with the peak shifted slightly to the right, to over 600 Myrs. The Secretin and Glutamate receptor classes had an obvious expansion only between 400 and 600 Myrs. In contrast, duplication events in the Adhesion and Frizzled/Taste2 receptor classes happened more recently, mostly after 300 Myrs. In fact, the continued expansion of GPCR families after 400 Myrs was largely contributed by several small classes such as the Adhesion and Frizzled/Taste2 receptors and the chemokine receptors of the Rhodopsin class (see Additional file 1 for the actual ages). This result showed that within the GPCR superfamily, and even within the Rhodopsin subclass, the expansion of different subfamilies was asymmetric.
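The scaling step described above can be illustrated with a small sketch. The text does not state which density estimator was used for Figure 3C, so this example assumes a plain histogram density; the function name, the bin edges and the example ages are hypothetical and are not taken from the study.

```python
def scaled_histogram_density(ages, bin_edges, class_size):
    """Histogram density of duplication ages, scaled by class size.

    ages       -- duplication ages (Myr) for one GPCR class
    bin_edges  -- increasing bin boundaries (Myr); bins are [lo, hi)
    class_size -- number of genes in the class (the scaling factor)
    """
    n = len(ages)
    scaled = []
    for lo, hi in zip(bin_edges, bin_edges[1:]):
        count = sum(1 for a in ages if lo <= a < hi)
        # Unscaled density integrates to 1 over the covered range.
        density = count / (n * (hi - lo)) if n else 0.0
        # Scaling by class size makes larger classes draw larger curves.
        scaled.append(density * class_size)
    return scaled


# Hypothetical ages for a class of six genes (not data from the paper).
example_ages = [100, 150, 450, 500, 550, 700]
example_edges = [0, 200, 400, 600, 800]
example_scaled = scaled_histogram_density(example_ages, example_edges, 6)
print(example_scaled)
```

After scaling, the area under each class curve equals the class size rather than 1, which is why curves for classes of different sizes can be compared on one axis.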
In a recent report, GPCRs with non-peptide ligands were reported to have a significantly higher retention rate than GPCRs with peptide ligands after a lineage-specific whole genome duplication in the pufferfish Tetraodon nigroviridis more than 230 Myrs ago. We examined whether the same was true for GPCRs in general. Based on that report, GPCRs that bind non-peptide ligands included GPCRs in the Glutamate receptor class and the A1-A4 subclasses of the Rhodopsin class (see Additional file 1), while the remaining GPCRs bind peptide ligands. As shown in Figure 3D, more duplication events were found among the GPCRs that bind peptide ligands than among those that bind non-peptide ligands in more recent evolutionary stages (after 400 Myrs). This was in direct contrast to the results reported for the lineage-specific GPCRs in pufferfish. It may reflect that the selective pressures driving the fixation of different types of duplicated GPCRs differed among species and evolutionary stages. Indeed, Figure 3D also showed that, before 600 Myrs, more duplication events were observed in GPCRs with non-peptide ligands than in GPCRs with peptide ligands. This was not counterintuitive, as many of the ancestral species that emerged at that evolutionary stage were simple marine invertebrates.
In general, our results showed that more duplication events were found for the GPCRs than for the "downstream genes", particularly in more recent evolutionary stages when the duplication events were mostly local. This was consistent with the current theory of gene duplication and gene networks. However, we found several aspects that are not covered by the current theory. First, at the time of the emergence of early vertebrates, both GPCR and "downstream genes" families experienced explosive expansion. The asymmetric duplication-divergence process may be a good model for gene duplication at normal times, but more factors were likely in play at that evolutionary stage, when massive genomic duplications and an explosive increase in tissue complexity happened. Second, the expansion of gene families in the peripheral layer of the gene network may also be asymmetric. Certain small branches of a big family may undergo disproportionate expansion in particular species at particular evolutionary stages, as exemplified by the type 2 taste receptors in human.
Figure 4A shows the result for GPCRs. Genes duplicated in the 400-0 Myrs interval were denoted as group 400, and genes duplicated in the 800-400 Myrs interval were denoted as group 800. The tissues were ranked by the number of group 400 genes that were highly expressed in a tissue. Our result showed that expression of the more recently duplicated genes (group 400) was enriched in blood and spleen, both of which are important tissues for immune response. The chemokine receptor family contributed greatly to these two tissues. Enrichment of expression in the connective tissue was also observed, which was connected to the adhesion receptor family. On the other hand, group 800 genes were expressed in a much wider range of tissues. Interestingly, most of the GPCRs expressed in the brain were in group 800. Relatively few GPCR genes expressed in brain or nerve tissues were recently duplicated.
Figure 4B shows the result for the "downstream genes". The data series and the ranking of the tissues were defined in the same way as in Figure 4A. As shown earlier, relatively few "downstream genes" were duplicated after 400 Myrs, which was reflected in the small number of genes in group 400 in Figure 4B. Similar to GPCRs, the "downstream genes" in group 800 were expressed in a wider range of tissues. Expression in brain and nerve was again most enriched in group 800.
In general, these results on tissue distribution indicated that a substantial enrichment of both GPCRs and "downstream genes" expressed in brain and nerve tissues happened during the 800-400 Myrs interval. However, a similar surge was not observed during the 400-0 Myrs interval. Instead, an enrichment of GPCRs expressed in the immune-related tissues of blood and spleen was observed, and this again was contributed mostly by several small GPCR subfamilies.
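The grouping and ranking used for Figure 4 can be sketched as follows. This is a minimal illustration only: the function names, the input dictionaries and the gene labels are hypothetical, and the handling of the exact 400 Myr boundary is an assumption, since the text does not say which group a gene dated at exactly 400 Myrs belongs to.

```python
from collections import Counter


def age_group(age_myr):
    """Assign a duplication age (Myr) to group 400 or group 800."""
    if 0 <= age_myr <= 400:        # boundary choice is an assumption
        return "group 400"
    if 400 < age_myr <= 800:
        return "group 800"
    return None                    # older events fall outside both groups


def rank_tissues(gene_ages, high_tissues):
    """Rank tissues by the number of highly expressed group 400 genes.

    gene_ages    -- gene -> duplication age in Myr
    high_tissues -- gene -> set of tissues where the gene is highly expressed
    Returns (tissue, n_group_400, n_group_800) tuples, best-ranked first.
    """
    counts = {"group 400": Counter(), "group 800": Counter()}
    for gene, age in gene_ages.items():
        group = age_group(age)
        if group is not None:
            counts[group].update(high_tissues.get(gene, ()))
    tissues = set(counts["group 400"]) | set(counts["group 800"])
    # Sort by descending group 400 count; ties are broken alphabetically.
    ranked = sorted(tissues, key=lambda t: (-counts["group 400"][t], t))
    return [(t, counts["group 400"][t], counts["group 800"][t]) for t in ranked]


# Hypothetical genes and tissues, for illustration only.
example_ages = {"geneA": 120, "geneB": 250, "geneC": 600, "geneD": 900}
example_high = {
    "geneA": {"blood", "spleen"},
    "geneB": {"blood"},
    "geneC": {"brain"},
    "geneD": {"brain"},
}
example_ranking = rank_tissues(example_ages, example_high)
print(example_ranking)
```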
Current theories of gene duplication and gene networks suggest that genes on the periphery of the network are preferentially retained after gene duplication, in comparison to the genes that form the backbone of the network. This is in fact necessary for the gene network to remain scale-free after rounds of gene duplication. However, the data supporting these theories were mostly based on the genetic network or the PPI network in yeast and worm. Similar data on gene networks in more advanced species, including vertebrates, were not yet complete and reliable enough. Our model system of GPCRs and the "downstream genes", which took advantage of the knowledge that the gene network roughly maps to the cellular organization, offered an opportunity to gain some insight into the relationship between gene duplication and gene networks in the context of vertebrate evolution. Our result showed that GPCR families had significantly more continued expansion after 400 Myrs than the "downstream genes". Under the assumption that all genes have equal opportunities for duplication, this result confirmed that duplicated GPCR genes were preferentially fixed during the 400-0 Myrs interval, compared to the "downstream genes". However, this preferential retention was time dependent, as during the 800-400 Myrs interval we found that both GPCR and "downstream gene" families experienced explosive expansion. One explanation for the result of the 800-400 Myrs interval might be that the tissue complexity of the species also experienced an explosive increase during that interval, which might have driven fixation of duplicated genes in all layers of the network. The expression profiles of the genes with duplication ages in the 800-400 Myrs interval offered partial support for this explanation, as genes duplicated at that stage were found to be expressed broadly in a wide range of tissues in human, including brain and nerve.
Our results suggested that the gene basis of tissue diversity was largely established by gene duplications in the 800-400 Myrs interval.
There may also be a species dependence in the preference of fixation of duplicated genes within GPCRs. This was reflected in the difference in the preferential retention of GPCRs with peptide ligands vs. non-peptide ligands in human vs. in fish. In fish, the retention rate of GPCRs with non-peptide ligands was higher than that of GPCRs with peptide ligands after a whole genome duplication 230 Myrs ago. In contrast, our results showed that, among human GPCRs, more GPCRs that bind peptide ligands were fixed after 400 Myrs than GPCRs with non-peptide ligands. In human, the subfamilies that contributed most to the continued expansion of GPCRs after 400 Myrs were those involved in immune responses and sensory responses. This may reflect differences between human and fish in the environmental influences on the selective pressure that drove the fixation of GPCR duplicates.
In this study, we kept the model system simple by including only the downstream genes in the classical GPCR pathways. Many other genes that are indirectly influenced by GPCRs, mostly kinases and transcription factors in the signaling pathways, were not included in the study. However, one of our previous studies on the human tyrosine kinase superfamily found patterns similar to those of the "downstream genes".
In general, using the human GPCR model system, our results confirmed that the position of a gene in the gene network has a great influence on the likelihood of expansion of its gene family in evolution. However, we also found that the influence was asymmetric among subfamilies of GPCRs. We found that the genetic basis of expression diversity of most GPCR pathways, which cover a substantial portion of the gene network, had been established by the time of early vertebrate emergence. Continued expansions in human GPCR families were to a large extent contributed by several small subfamilies involved in immune responses and sensory responses. Exactly which subfamilies see extra expansion may be contingent on environmental factors for different species at different evolutionary stages, as exemplified by our comparison of the differences in retention preference of GPCRs binding peptide or non-peptide ligands between humans and fishes.
In the future, we shall further investigate the association between gene family expansions and GPCR-related networks [22–27]. For instance, we may study the functional divergence between the duplicated GPCR-related proteins [22, 23], the role of alternative splicing isoforms, as well as tissue-specific effects [25, 26] on gene evolution. Recently, Gu proposed an evolutionary model for the origin of modularity in a complex gene network, suggesting that new (protein-protein) interactions after gene duplication may be favored to link to the preexisting backbone of the signaling pathway, while loss of interactions is random. The vertebrate GPCR networks may be an ideal model to study the design principles of signaling networks.
The GPCR gene families analyzed in this study were classified based on the GRAFS classification system , which covers 342 human non-olfactory GPCRs in five major classes including Glutamate (15), Rhodopsin (241), Adhesion (24), Frizzled/taste2 (24), Secretin (15). The Rhodopsin class is further divided into four groups: alpha, beta, gamma and delta. The sequences of vertebrate GPCR genes were retrieved from the Hovergen database . In cases where the GRAFS families were further split in Hovergen, the split families were used for age estimation separately. The invertebrate homologues, as well as some vertebrate homologues missed by Hovergen, were identified by extensive BLAST  searching of the Swiss-prot protein database . Redundant sequences were removed either based on UniGene annotations when available, or based on chromosome positions otherwise.
The "downstream genes" families were constructed similarly. Families were constructed for each entry in Figure 1. The vertebrate homologues in the gene families were retrieved from Hovergen, while the invertebrate homologues and the vertebrate homologues missed by Hovergen were identified by homology searching in the Swiss-prot database. Redundant sequences were removed in the same way as for GPCRs.
In bifurcating trees, each new human homologue is created by a duplication event. As a result, for a gene family with n human genes, there will be n-1 duplication events. We used the "nearest-neighbor clock" approach to date the duplication events. For a gene family, we first reconstructed the phylogenetic tree. We used Clustal W to carry out multiple sequence alignment of the protein sequences of the genes in the family. Phylogenetic trees (Neighbour-joining) were reconstructed using the MEGA program. Linearized trees were also calculated using MEGA. Next, the age of a duplication event was estimated based on its distances to the nearest bracketing species-split times in the phylogenetic tree under the molecular clock hypothesis [17, 18]. We used the widely accepted species-split times for calibration, including the primate-rodent (80 Myrs ago), mammal-bird (310 Myrs ago), mammal-amphibian (350 Myrs ago), tetrapod-teleost (430 Myrs ago) and vertebrate-fly (830 Myrs ago) splits. In cases where a bracketing pair of species-split times could be found for a duplication event, the age of the event was the linear interpolation based on the distances of the event in the linearized tree to the species-split times. In cases where no bracketing species-split times could be found for a duplication event (usually for very ancient or recent duplication events), linear extrapolation was used to calculate the age.
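The interpolation and extrapolation steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name and the (node height, split age) input format are hypothetical, the node heights in the example are invented, and extrapolation is assumed to extend the line through the two nearest calibration points, which the text does not specify.

```python
from bisect import bisect_left


def date_duplication(height, calibrations):
    """Estimate the age (Myr) of a duplication node in a linearized tree.

    height       -- node height of the duplication in the linearized tree
    calibrations -- (node_height, split_age_myr) pairs for species-split
                    nodes in the same tree (hypothetical input format)
    """
    points = sorted(calibrations)
    if len(points) < 2:
        raise ValueError("need at least two calibration points")
    heights = [h for h, _ in points]
    # Index of the first calibration at or above the duplication node.
    i = bisect_left(heights, height)
    # Clamp so that (i - 1, i) is the bracketing pair when one exists,
    # otherwise the two nearest calibrations (assumed extrapolation rule).
    i = min(max(i, 1), len(points) - 1)
    (h0, a0), (h1, a1) = points[i - 1], points[i]
    if h1 == h0:
        raise ValueError("calibration heights must be distinct")
    return a0 + (height - h0) * (a1 - a0) / (h1 - h0)


# Invented node heights paired with split times quoted in the text.
example_calibrations = [(0.10, 80.0), (0.40, 310.0), (0.55, 430.0)]
print(date_duplication(0.25, example_calibrations))  # bracketed: interpolation
print(date_duplication(0.70, example_calibrations))  # unbracketed: extrapolation
```

Because the tree is linearized, height is proportional to time under the molecular clock hypothesis, which is what justifies treating the relation between height and age as a straight line between neighbouring calibration points.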
We used the EST counts in UniGene as the expression profiles of the human genes in the tissues. The ratio of the number of EST clones of a gene to the total number of clones in the library for a tissue was treated as the expression level of the gene in that tissue. For each gene, we calculated its mean expression level and the standard deviation across the 45 tissues covered in UniGene. The distribution of the expression levels of a gene across tissues generally fit a gamma distribution with small mean and large variation. We defined a gene as highly expressed in a tissue if its expression level was more than one standard deviation above its mean expression level.
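The expression-level ratio and the "highly expressed" rule can be sketched as below. The function names, the dictionary inputs and the example counts are hypothetical. The text does not say whether the sample or the population standard deviation was used, so the population form (`statistics.pstdev`) is assumed here.

```python
from statistics import mean, pstdev


def expression_levels(est_counts, library_sizes):
    """Expression level per tissue: gene EST clones / total clones.

    est_counts    -- tissue -> number of EST clones of the gene
    library_sizes -- tissue -> total number of clones in the tissue library
    """
    return {
        tissue: est_counts.get(tissue, 0) / total
        for tissue, total in library_sizes.items()
        if total > 0  # skip empty libraries to avoid division by zero
    }


def highly_expressed_tissues(levels):
    """Tissues where the level exceeds the gene's mean by more than one SD."""
    values = list(levels.values())
    threshold = mean(values) + pstdev(values)  # population SD is an assumption
    return {tissue for tissue, level in levels.items() if level > threshold}


# Hypothetical EST counts for one gene across four tissues.
example_counts = {"brain": 30, "liver": 2, "spleen": 1, "blood": 0}
example_sizes = {"brain": 10000, "liver": 10000, "spleen": 10000, "blood": 10000}
example_levels = expression_levels(example_counts, example_sizes)
print(highly_expressed_tissues(example_levels))
```

Note that the threshold is computed per gene from its own profile, so the rule flags tissues where a gene stands out relative to itself rather than relative to other genes.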
This work was partially supported by grants from Iowa State University (to X.G.), a grant from National Natural Science Foundation of China (30700140 to Z.S.), and supported by Shanghai Leading Academic Discipline Project, Project Number: B111.
This article has been published as part of BMC Genomics Volume 10 Supplement 1, 2009: The 2008 International Conference on Bioinformatics & Computational Biology (BIOCOMP'08). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/10?issue=S1.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.