As a protein-coding gene evolves, non-synonymous substitutions do not accumulate uniformly along its sequence. There is heterogeneity among the rates at which individual sites within a protein evolve, and part of that heterogeneity is induced by structural and functional constraints. Though the structural and functional domains that comprise proteins are contingent upon tertiary folding, there is enrichment within domains for residues that are contiguous along the primary sequence. As such, within proteins there exists rate autocorrelation that can be, and has been, exploited to annotate regions of putative importance.
In a pairwise comparison of protein-coding genes, rate heterogeneity manifests in the non-random placement of non-synonymous changes. One expects a dearth of such changes in regions of structural and functional importance and a relative excess where selection is weaker. The aggregation of changes outside of important regions may give the appearance that non-synonymous changes are clustering. We speculated that the appearance of clustering would increase with the intensity of selection, and we developed the dispersion ratio to test that hypothesis. Confirming our speculation, we found a highly significant log-linear relationship between the dispersion ratio and evolutionary rate. This relationship was robust to both the choice of species and the degree of evolutionary divergence.
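As a reading aid, the log-linear relationship described above can be written as a simple linear model on the log scale. The intercept α, slope β, and error term ε are illustrative notation introduced here, and no fitted values are implied:

```latex
\log \omega_g = \alpha + \beta \, \log \rho_g + \varepsilon_g
```

Here g indexes genes, ω is the evolutionary rate, and ρ is the dispersion ratio. Robustness to species choice and divergence would correspond to a comparable relationship holding across the pairwise comparisons examined.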
Just as purifying selection acts to cluster substitutions along the sequence of a protein, there is evidence that diversifying selection leads to clustering as well. This led us to consider the case of genes whose modes of evolution differ on sister lineages. In cases where the evolutionary trajectory spanned by a pairwise comparison contains a mixture of purifying and diversifying selection, we hypothesized that the relationship between the dispersion ratio and evolutionary rate would be altered. Having already observed that the degree to which non-synonymous changes cluster is predictive of the rate at which a protein is evolving, we reasoned that for mixed regimes such predictions would be biased downward. At least for the data we examined, this turned out to be the case: for genes under positive selection in the human lineage, the evolutionary rate estimated from a human/chimpanzee comparison was greater than what the degree of clustering would predict.
To place in perspective the contribution of the dispersion ratio as a predictor of evolutionary rate, we compared its explanatory power to that of a diverse set of protein-related attributes. In doing so, we found log(ρ) to be a highly significant and non-redundant correlate of the logarithmic rate, log(ω). The correlation between log(ρ) and log(ω), and its persistence after conditioning on other correlates of evolutionary rate, points to either a determinant of evolutionary rate that has not yet been characterized or a deficiency in the way evolutionary rate has been quantified in this particular set of studies. Whatever the case, it appears that non-synonymous clustering is a reliable, non-redundant, sequence-based predictor of ω.
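The idea of a correlation that persists after conditioning on other correlates can be sketched as a partial correlation: regress both variables on the covariates and correlate the residuals. This is a minimal illustration and not necessarily the procedure used in the analyses; the function names `residualize` and `partial_correlation` are hypothetical, and the covariate matrix stands in for whatever protein-related attributes are available.

```python
import numpy as np


def residualize(y, covariates):
    """Return residuals of y after least-squares regression on covariates.

    An intercept column is added automatically.
    """
    y = np.asarray(y, dtype=float)
    covariates = np.asarray(covariates, dtype=float)
    design = np.column_stack([np.ones(len(y)), covariates])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ coef


def partial_correlation(x, y, covariates):
    """Pearson correlation between x and y after removing the linear
    effect of the covariates from both (illustrative sketch only)."""
    rx = residualize(x, covariates)
    ry = residualize(y, covariates)
    return float(np.corrcoef(rx, ry)[0, 1])
```

In this setting, `x` would hold per-gene values of log(ρ), `y` the corresponding values of log(ω), and each column of `covariates` one other correlate of evolutionary rate. A partial correlation that stays well away from zero is what "non-redundant" means here.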
Because the dispersion ratio behaves differently under neutrality and under purifying selection, and because permutations can be used to populate a sensible null distribution, one can envision using the dispersion ratio in a test of selection. Nevertheless, we did not devise ρ as a statistic to test the behavior of individual genes, and such tests, though conceivable, would likely be underpowered and inferior to existing methods (e.g. [12, 31]). These methods, unlike ours, were specifically designed to identify the presence of clustered substitutions and test their significance against an appropriate null hypothesis about a specific gene. By contrast, we were motivated by simplicity and proposed the dispersion ratio as an intuitive means of testing the existence of genome-wide evolutionary trends, without regard to any particular gene. Other measures of clustering are likely to perform similarly, and indeed we observe similar results to those presented when ρ is replaced by a model-based measure of autocorrelation (taken from ; data not shown).
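The permutation idea mentioned above can be sketched as follows. Because the definition of ρ is not restated in this section, the sketch uses a stand-in clustering statistic, the variance-to-mean ratio of the gaps between consecutive changed sites; it is an assumption made for illustration and is not the dispersion ratio itself. The function names and parameters are likewise hypothetical.

```python
import random
from statistics import mean, pvariance


def gap_dispersion(positions):
    """Stand-in clustering statistic (NOT the paper's dispersion ratio):
    variance-to-mean ratio of gaps between consecutive changed sites.
    Larger values indicate more uneven spacing, i.e. more clustering."""
    pos = sorted(positions)
    if len(pos) < 3:
        raise ValueError("need at least three changed sites")
    gaps = [b - a for a, b in zip(pos, pos[1:])]
    return pvariance(gaps) / mean(gaps)


def permutation_pvalue(positions, length, n_perm=2000, seed=0):
    """One-sided permutation p-value for clustering.

    The null distribution is built by placing the same number of
    changes uniformly at random, without replacement, among `length`
    sites and recomputing the statistic each time.
    """
    rng = random.Random(seed)
    observed = gap_dispersion(positions)
    k = len(positions)
    hits = 0
    for _ in range(n_perm):
        shuffled = rng.sample(range(length), k)
        if gap_dispersion(shuffled) >= observed:
            hits += 1
    # Add-one correction so the p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)
```

For a single gene, `positions` would be the alignment columns carrying non-synonymous changes and `length` the number of aligned codons. With few changes per gene, the null distribution is broad, which is one reason a per-gene test of this kind would be expected to have little power.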
The intuition behind our statistic and its relationship to evolutionary rate is grounded in dependencies induced by protein tertiary structure. Though ρ is a function of sequence and not structure, the dispersion ratio, like the methods by which it was inspired (e.g. ET, ESF, SWAKK), leverages the fact that adjacent residues in the sequence are structurally proximal. It seems reasonable that a structurally informed analog of the dispersion ratio would be superior to ρ in validating the hypotheses of this manuscript, but we did not find this to be the case (data not shown). This may be due to, among other possibilities, the limited number of structures available or the manner in which we extended our statistic.
In interpreting the results presented here, it should be noted that all of our analyses were contingent upon sequence alignment. Because alignment uncertainty tends to increase with sequence divergence, to the extent that alignment errors affect neighboring sites, one expects a spurious, non-biological correlation between ω and ρ. While alignment error may indeed contribute to the signal we observe, we do not believe it plays a major role. Several of our analyses feature very closely related species whose orthologous proteins are predominantly the same length. For these proteins, the alignment is unambiguous unless both an insertion and a deletion event have occurred.