A shift in understanding the evolutionary forces that shape human genome architecture took place when the retention of duplicate genes, a major factor in fostering genome complexity, was recognized to be promoted primarily by random genetic drift [1, 2]. On this view, the evolution of genetic redundancy in humans and other higher eukaryotes is enabled by subfunctionalization, a preservation process driven by mildly degenerative mutations that cause complementary loss of subfunctions in different gene copies. These typically dissimilar effects promote the separation of duplicates across cell types or developmental phases, making both copies indispensable for meeting the functional requirements of the ancestral locus. Specifically, expression-regulatory elements are lost through complementary loss-of-function mutations in paralogs, which partitions the ancestral function across tissues or developmental phases. This nonadaptive mechanism is thus essentially constructive, and is enabled by selection inefficiency, which is expected given the small size of the human population [1, 3, 4].
Yet, as shown in this work, the retention of gene duplicates through subfunctionalization must also encompass adaptive elements. Gene duplication events create dosage imbalances in the concentrations of the encoded proteins, and the deleterious effects of such imbalances can be mitigated when paralogs are physically separated by subfunctionalization. Dosage imbalances occur when protein concentration levels at specific tissue locations do not fit the stoichiometry of the complexes in which the proteins are involved [5, 6]. These complexes may be transient or obligatory with regard to maintaining the structural integrity of the protein. Therefore, dosage sensitivity, that is, the fitness impact of dosage imbalance, must be determined by the extent to which the protein relies functionally on associations.
In this work we hypothesize that duplication of dosage-sensitive genes imposes a selection pressure on the fate of the duplicates that is buffered through subfunctionalization. Thus, although it originates in random drift, subfunctionalization cannot, and in effect does not, escape selection forces; rather, it becomes adaptive by mitigating the fitness bottleneck imposed by the gene duplication event. To validate this hypothesis, we identify a molecular attribute of proteins that is indicative of their dosage sensitivity, thereby quantifying the impact of dosage imbalance effects on the evolution of genetic redundancy. This work is accordingly devoted to proving the following assertion: if subfunctionalization is indeed adaptive, its effect on paralog segregation should scale with the dosage sensitivity of the duplicated genes. As shown in this work, this is indeed the case, and the adaptive nature of subfunctionalization is thus shown to arise from the imbalance-buffering nature of the process.
Since unicellular organisms lack the buffer of expression diversification, selection pressure on duplicate genes is frequently enough to eliminate one of the duplicates, especially for genes with high dosage sensitivity. Evidence for this is the significant decrease in family size with dosage sensitivity found in unicellular eukaryotes when compared with higher eukaryotes. Thus, gene duplicates in unicellular organisms are subject to stronger purifying selection than their counterparts in multicellular eukaryotes. The aim of this work is to show that subfunctionalization is one of the buffering mechanisms that enable paralog survival in multicellular eukaryotes.
To assess the adaptive contribution to subfunctionalization, it becomes essential to introduce a molecular indicator of dosage sensitivity. As shown in previous work, dosage imbalance effects are quantified by under-wrapping (ν), a measure of the packing quality of soluble gene products that determines the extent to which the protein relies on binding partnerships to maintain its structural integrity [8–12]. Specifically, ν defines in a structure-averaged way the level of hindrance of structure-disruptive backbone hydration. This parameter can be determined directly from protein structure by identifying the percentage of backbone hydrogen bonds (BHBs) that are unburied (the so-called dehydrons) and hence poorly protected from competing hydration of the amide and carbonyl. Dehydrons constitute packing deficiencies since they are incompletely "wrapped" by the side-chain nonpolar groups that promote exclusion of surrounding water. Thus, for an individual gene, we get ν = (#dehydrons)/(#BHBs), where the quotient extends over all products (encoded proteins) of the gene. Dehydrons are markers of compulsory protein associations that play a structure-protective role by promoting their inter-molecular dehydration [9–12]. Upon protein-protein association, the side-chain nonpolar groups of the binding partner penetrate the microenvironment of the dehydron, improving its wrapping. This dehydration stabilizes the hydrogen bond, with a contribution of -3.9 kJ/mol.
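The definition ν = (#dehydrons)/(#BHBs) can be illustrated with a minimal sketch. The `GeneProduct` record, the `under_wrapping` function, and the counts below are hypothetical and introduced only for illustration. The sketch assumes that the counts are pooled over all products of a gene before dividing, which is one possible reading of "the quotient extends over all gene products"; it does not reproduce the dehydron identification itself.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class GeneProduct:
    """Hypothetical record: dehydron and BHB counts for one encoded protein."""
    name: str
    n_dehydrons: int  # backbone hydrogen bonds that are poorly wrapped
    n_bhbs: int       # all backbone hydrogen bonds of the protein


def under_wrapping(products: Iterable[GeneProduct]) -> float:
    """Return nu = (#dehydrons)/(#BHBs) for one gene.

    Assumption: counts are summed over all products before dividing.
    """
    products = list(products)
    for p in products:
        # A dehydron is a BHB, so its count cannot exceed the BHB count.
        if p.n_dehydrons < 0 or p.n_dehydrons > p.n_bhbs:
            raise ValueError(f"inconsistent counts for {p.name}")
    total_bhbs = sum(p.n_bhbs for p in products)
    if total_bhbs == 0:
        raise ValueError("nu is undefined without backbone hydrogen bonds")
    return sum(p.n_dehydrons for p in products) / total_bhbs


# Made-up counts, for illustration only.
example_gene = [GeneProduct("product_1", 10, 40), GeneProduct("product_2", 5, 60)]
nu = under_wrapping(example_gene)
print(f"nu = {nu:.2f} ({100 * nu:.0f}% of BHBs are dehydrons)")
```

Averaging the per-product ratios instead of pooling the counts would weight small and large proteins equally; which convention applies depends on the definition adopted in the cited work.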
In practice, given the dearth of reported structures relative to proteome size, dehydrons are often identified from protein sequence using machine-learning methods of inference (Materials and Methods). The rationale for this approach is that, being local indicators of structural disruption, dehydrons belong to a twilight zone between order and disorder that can be identified using a reliable sequence-based predictor of disorder propensity such as PONDR.
Recent cross-examination of structural and evolutionary data revealed that duplicates of genes encoding under-wrapped proteins are exposed to greater deleterious pressure than duplicates of genes coding for well-wrapped products. Thus, ν serves as a proxy for dosage sensitivity, as confirmed by a statistically significant negative correlation between family-averaged ν (<ν>) and family size.
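The relation between family-averaged ν and family size can be sketched as follows. The family names, the ν values, and the helper functions are hypothetical and serve only as an illustration. The Spearman rank correlation used here is an assumption chosen for the sketch, not necessarily the statistic used in the cited analysis, and no significance test is computed.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks, assigning the average rank to tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up per-gene nu values grouped by gene family, for illustration only.
families = {
    "family_A": [0.45],
    "family_B": [0.30, 0.34],
    "family_C": [0.20, 0.22, 0.24],
    "family_D": [0.10, 0.12, 0.14, 0.16, 0.18],
}
family_sizes = [len(nus) for nus in families.values()]
family_avg_nu = [mean(nus) for nus in families.values()]

rho = spearman(family_sizes, family_avg_nu)
print(f"Spearman rho between family size and <nu>: {rho:.2f}")
```

In this toy data set larger families have lower <ν> by construction, so the coefficient is negative; real data would also require a significance assessment.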
Paralog survival is dependent on ν with P < 10⁻¹⁷ in unicellular organisms, P < 10⁻⁶ in fly and worm, but P < 6.7 × 10⁻³ in human (Wilcoxon rank test). This contrast between simple and complex organisms is likely to arise from the higher complexity of expression regulation in higher eukaryotes. The translation complexity may enable a buffer to dosage imbalance not likely to be found in unicellular organisms. By focusing on evolution-related dosage imbalances, our results corroborate this hypothesis.
The validation of the results asserting the adaptive component of subfunctionalization rests squarely on the legitimacy of under-wrapping as a proxy for dosage sensitivity. Evidence inversely correlating gene family size and under-wrapping, evidence arising from analysis of the mechanisms that buffer dosage imbalances in humans, and evidence on the reliance of under-wrapped proteins on binding partnerships to maintain their structural integrity all uphold the validity of under-wrapping as a molecular indicator of dosage sensitivity. Nevertheless, a control becomes essential to validate the conclusions of this study. As it turns out, this is the same control that serves to validate the molecular marker adopted, and it arises from the following rationale: if a specific gene duplication is actually part of a macro-scale event of whole genome duplication (WGD), we expect little or no selection pressure arising from dosage sensitivity, since a WGD does not generate a dosage imbalance. Hence, the expression divergence brought about by subfunctionalization of gene duplicates arising from a WGD should result mainly from random genetic drift, with at most a minor adaptive contribution. This is indeed the case, as shown in this work.