Evolutionary anatomies of positions and types of disease-associated and neutral amino acid mutations in the human genome
© Subramanian and Kumar; licensee BioMed Central Ltd. 2006
Received: 21 June 2006
Accepted: 05 December 2006
Published: 05 December 2006
Amino acid mutations in a large number of human proteins are known to be associated with heritable genetic disease. These disease-associated mutations (DAMs) are known to occur predominantly in positions essential to the structure and function of the proteins. Here, we examine how the relative perpetuation and conservation of amino acid positions modulate the genome-wide patterns of 8,627 human DAMs reported in 541 genes. We compare these patterns with 5,308 non-synonymous Single Nucleotide Polymorphisms (nSNPs) in 2,592 genes from primary SNP resources.
The abundance of DAMs shows a negative relationship with the evolutionary rate of the amino acid positions harboring them. An opposite trend describes the distribution of nSNPs. DAMs are also preferentially found in the amino acid positions that are retained (or present) in multiple vertebrate species, whereas the nSNPs are over-abundant in the positions that have been lost (or absent) in the non-human vertebrates. These observations are consistent with the effect of purifying selection on natural variation, which also explains the existence of lower minor nSNP allele frequencies at highly-conserved amino acid positions. The biochemical severity of the inter-specific amino acid changes is also modulated by natural selection, with the fast-evolving positions containing more radical amino acid differences among species. Similarly, DAMs associated with early-onset diseases are more radical than those associated with the late-onset diseases. A small fraction of DAMs (10%) overlap with the amino acid differences between species within the same position, but are biochemically the most conservative group of amino acid differences in our datasets. Overlapping DAMs are found disproportionately in fast-evolving amino acid positions, which, along with the conservative nature of the amino acid changes, may have allowed some of them to escape natural selection until compensatory changes occur.
The consistency and predictability of genome-wide patterns of disease-associated and neutral amino acid variants reported here underscore the importance of considering the evolutionary rates of amino acid positions in clinical and population genetic analyses aimed at understanding the nature and fate of disease-associated and neutral population variation. Establishing such general patterns is an early step in efforts to diagnose the pathogenic potential of novel amino acid mutations.
The association of mutations with specific human inherited diseases has been known for over five decades. These mutations can be single nucleotide changes (point mutations), insertion or deletion of nucleotides (indels), or gross chromosomal rearrangements; furthermore, they may occur in protein-coding and in non-coding regulatory regions. Of all known gene lesions associated with disease, approximately half are point mutations that change the encoded amino acid. Statistical analyses of these amino acid mutations are the most tractable, because their properties and tendencies can be predicted based on the long-term evolutionary history of their locations by comparative genomics [3–9].
Evolutionary analyses of disease-associated mutations (DAMs) have revealed a number of trends. They are over-abundant at positions that have remained unchanged in species that diverged hundreds of millions of years ago [4–6, 8, 9], and there is a general under-abundance of DAMs in positions that show any potential to change [3, 5]. DAMs are more radical in terms of the differences in their biochemical properties from the normal amino acids, as compared to the differences observed between species [3, 5, 10]. Furthermore, only a very small fraction of known DAMs are identical to the inter-specific substitutions at the same positions [5, 11–13]. In addition, the evolutionary history of amino acid positions and the long-term substitution patterns observed in the proteins have been employed with varying degrees of success in predicting the disease propensity of mutations [3, 5, 7, 9, 14–16].
However, many of the patterns mentioned above have been elucidated from the analysis of a limited number of proteins or mutations. With the recent expansion of genome and population variation data, it is now possible to establish molecular evolutionary anatomies of DAMs at a genome scale and to use different measures of the intensity of natural selection at amino acid positions over the long-term history of proteins. Although the significance of the biochemical severity of amino acid changes and their association with human diseases is well appreciated, the possible relationship between the extent of biochemical dissimilarity of DAMs and the severity of human diseases (in terms of the time of onset of diseases) needs to be further explored. Similarly, the patterns of occurrence of non-synonymous polymorphisms (nSNPs) at sites evolving under vastly different intensities of natural selection have yet to be contrasted with those seen for DAMs.
Therefore, we undertook a genomic-scale analysis to elucidate the global evolutionary trends of rare Mendelian DAMs and nSNPs present in human proteins. We specifically examined the following questions: (1) What is the relative distribution of DAMs and nSNPs at positions that evolve with different rates? (2) Does the degree of retention of amino acid positions in non-human vertebrates show a relationship with the frequency of occurrence of DAMs and nSNPs? (3) How are the allele and genotype frequencies of nSNPs modulated by evolutionary variability of amino acid positions? (4) What is the relationship between the severity of inter-specific amino acid substitutions and the level of evolutionary conservation of positions harboring them? (5) What is the relationship between the biochemical severity of DAMs and the timing of the onset of different diseases? (6) To what extent does the evolutionary rate of a position explain the observed overlap between the inter-specific substitutions and DAMs? In order to answer these questions, we compared and contrasted available information on disease-associated amino acid mutation data, human population polymorphism data, and inter-specific amino acid difference data.
The evolutionary conservation of an amino acid position in a protein was measured in two ways. First, we estimated the rate at which amino acid substitutions have occurred at each position (Rate index). Second, we assessed the existence of a position in homologous proteins in species distantly and closely related to humans (Indel index). These two indices were estimated for human proteins in which at least one DAM or nSNP has been reported in the public databanks (see Methods).
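To make the second measure concrete, the sketch below counts, for every human position in an alignment, the number of non-human sequences in which that position is present (that is, not a gap). The text does not spell out how the Indel index is scored, so the counting scheme, the function name `presence_counts`, and the toy alignment are all assumptions made for illustration only.

```python
def presence_counts(human, others, gap="-"):
    """Count, per human residue, the non-human rows that have a residue (not a gap).

    `human` and every entry of `others` are rows of one alignment and must
    have equal length. Columns where human itself has a gap are skipped,
    because they do not correspond to a human amino acid position.
    """
    counts = []
    for col, residue in enumerate(human):
        if residue == gap:
            continue
        counts.append(sum(1 for seq in others if seq[col] != gap))
    return counts


# Toy alignment rows (hypothetical): human, then mouse, chicken, fugu.
human = "MK-TAY"
others = ["MKQTAY", "MK-T-Y", "M--T-F"]
print(presence_counts(human, others))
```

A position present in all three non-human species would score 3 under this assumed scheme, and a position absent from all of them would score 0.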
[Table: Frequency of disease-associated and non-synonymous (nSNP) mutations in different evolutionary rate categories. Columns: Rate index (estimated rate range); Disease-associated genes (523), No. of positions; Other genes (2264), No. of positions.]
[Table: Frequency of disease-associated and non-synonymous (nSNP) mutations in different Indel index categories. Columns: Disease-associated genes (541), No. of positions; Other genes (2592), No. of positions.]
Opposite patterns of occurrence of DAMs and nSNPs
Earlier onset diseases associate with more radical amino acid mutations
[Table: The biochemical severity of disease-associated mutations, differences among species, and nonsynonymous mutations. Recoverable labels: Number of observations; Differences among species.]
The average biochemical severity of DAMs was also analyzed in the context of the evolutionary rate of the positions, and it does not show a significant monotonic trend (P = 0.21). In contrast, a positive relationship was observed between the evolutionary rate and the biochemical distances for inter-specific differences (Figure 3B), indicating that radical changes are better tolerated in highly variable positions than in positions with low evolutionary variability. This happens because more dissimilar amino acid changes will experience a higher intensity of purifying selection than changes that involve highly similar amino acids.
We have observed opposite patterns in the distribution of disease-associated and non-synonymous variation across amino acid positions with different evolutionary rates, as well as indel propensities. These patterns are consistent with the predictions of the neutral theory of molecular evolution, because purifying selection will eliminate mutations from functionally important positions more effectively. The rate index and the indel index produce similar trends, because natural selection will both maintain the amino acid type and retain the amino acid position among species.
It is important to note that we have considered the different types of amino acids that are associated with disease mutations at different amino acid positions, and we have not considered the population frequency of DAMs. This is because allele frequencies for the vast majority of DAMs are either very small or are not known with great precision. In order to make a direct comparison between DAMs and nSNPs, we repeated our analyses using only lower-frequency HapMap nSNPs (allele frequency < 0.10), which confirmed the patterns reported in Figure 1B (P < 0.01).
Conversely, we looked for DAMs that occur at appreciable frequencies (> 10^-6) and found them to be largely associated with late-onset diseases (post puberty). These mutations are not over-abundant at evolutionarily conserved positions (P = 0.7; 430 mutations), and they are biochemically less radical (Figure 3A). They are often associated with common diseases such as hypertension, diabetes, and osteoporosis. The late onset of these diseases will result in a small effect on fecundity, which may explain why the positions harboring these DAMs do not have evolutionary imprints similar to those observed for other DAMs.
The occurrence of homozygotes of minor alleles is also expected to correlate positively with evolutionary rates, because of the low minor allele frequencies and the heterozygous buffering effect when the minor alleles are deleterious. Therefore, we estimated the fraction of nSNPs for which the homozygous recessive genotypes occur with a non-zero frequency in the human populations examined. This proportion is the smallest for the highly conserved sites, and the highest for the most variable positions (Figure 4B).
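The proportion described above can be sketched as a simple tally per rate category. The input layout (pairs of rate category and minor-allele homozygote frequency), the function name, and the toy values are hypothetical and are not taken from the study's data.

```python
from collections import defaultdict


def fraction_with_minor_homozygotes(snps):
    """Per rate category, the fraction of nSNPs whose minor-allele homozygote
    genotype is observed at a non-zero frequency.

    `snps` is an iterable of (rate_category, minor_homozygote_frequency).
    """
    totals = defaultdict(int)
    observed = defaultdict(int)
    for category, hom_freq in snps:
        totals[category] += 1
        if hom_freq > 0:
            observed[category] += 1
    return {category: observed[category] / totals[category] for category in totals}


# Toy input (hypothetical): category 0 = most conserved, 7 = most variable.
toy = [(0, 0.0), (0, 0.0), (0, 0.01), (0, 0.0), (7, 0.05), (7, 0.0)]
print(fraction_with_minor_homozygotes(toy))
```

With real data, the expectation stated in the text is that this fraction rises from the most conserved category to the most variable one.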
The overlap between the DAMs and inter-specific differences is often considered to be caused by compensatory mutations, where mutation(s) at one site of the same or a different protein compensate for the negative effects of the other mutation [11, 13, 21]. It is clear that such mutations need to escape natural selection for a period of time before the compensatory mutations can occur. In general, this may only be possible for mutations that have very small negative fitness effects. In terms of evolutionary rate, the overabundance of overlapping DAMs in fast-evolving positions is consistent with this expectation. The biochemical difference between the overlapping DAMs and the reference human amino acid is also consistent with this requirement, because the biochemical distance of overlapping DAMs is 38% lower than that observed for all other DAMs. In fact, overlapping DAMs are 14% more conservative than even the inter-specific variation (see also [21, 22]). Even though some of the overall patterns mentioned above are consistent with the compensatory mutation hypothesis, this is by no means the only or the primary explanation, because it is unclear what fraction of overlapping DAMs can be explained by the compensatory mutation hypothesis. In the future, there will be a need to develop statistical approaches to examine contrasting hypotheses concerning the existence of overlapping DAMs, including a change in the function of the protein or of the position where the overlapping DAMs are seen.
Our results emphasize the importance of the long-term evolutionary history of the amino acid positions and their influence in modulating the short-term history of the DAMs and nSNPs. Although our studies are restricted only to the protein-coding regions, the patterns reported here will hold true for DAMs and SNPs present in non-coding DNA containing conserved regulatory regions. Future studies using more species to examine the evolutionary conservation of amino acid positions could further improve the understanding of the mutations associated with human diseases and population variations. In particular, use of only the species that are closely related to humans, such as mammals, or, more specifically, primates, will prove to be more useful due to the similarity in their physiology and metabolism.
Protein sequences of human (Homo sapiens) were obtained from GenBank build 34.1, and those of mouse (Mus musculus), chicken (Gallus gallus) and fugu (Takifugu rubripes) were obtained from Ensembl. For each human gene, the putative orthologous gene, or closest sequence homolog, in each of the other three vertebrates was identified using a local BLASTP search with the BLOSUM62 substitution matrix. The threshold score (bit score S in the BLASTP program) was set according to protein length (L): S = 150 for L ≥ 170 amino acids, S = L - 20 for 55 < L < 170, and S = 35 for L < 55 amino acids. We used a reciprocal BLASTP search in which pairs of genes were considered orthologous only if they were mutually the best matches in their respective counterpart genomes. We included only the human genes for which an orthologous counterpart was available in all three vertebrate species. Each orthologous gene set was aligned with CLUSTAL-W using default settings. Only the genes for which either disease mutation or nSNP data were available were included for further analysis. We concatenated all the disease-associated genes and all the other genes (for which SNP data were available) separately, and all the sites containing any missing data or indels were excluded from the two concatenated alignments. The rate of evolution of individual sites was estimated by Maximum Likelihood analysis using PAML with the JTT model of evolution. We used a discrete gamma model (with eight categories) to describe the distribution of evolutionary rates among sites. The complete amino acid alignments, including indels, were used for computing the indel index.
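The length-dependent score threshold and the reciprocal best-match rule described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the dictionary layout of the BLASTP results, and the choice to apply the threshold in both search directions are assumptions. The text leaves L = 55 unstated; the sketch uses S = 35 there, which is also what L - 20 gives.

```python
def bit_score_threshold(length):
    """Length-dependent BLASTP bit-score cutoff S, following the stated rule."""
    if length >= 170:
        return 150
    if length > 55:
        return length - 20
    # The text gives S = 35 for L < 55; L = 55 is an assumption (L - 20 = 35 too).
    return 35


def reciprocal_best_hits(human_to_other, other_to_human):
    """Return human -> other pairs that are mutually best matches.

    Each argument maps a query ID to (best_hit_id, bit_score, query_length).
    Applying the score threshold in both directions is an assumption.
    """
    pairs = {}
    for human_id, (hit_id, score, length) in human_to_other.items():
        if score < bit_score_threshold(length):
            continue
        back = other_to_human.get(hit_id)
        if back is None:
            continue
        back_hit, back_score, back_length = back
        if back_hit == human_id and back_score >= bit_score_threshold(back_length):
            pairs[human_id] = hit_id
    return pairs


# Toy best-hit tables (hypothetical IDs and scores).
human_to_mouse = {"H1": ("M1", 300.0, 400), "H2": ("M2", 60.0, 90), "H3": ("M1", 200.0, 250)}
mouse_to_human = {"M1": ("H1", 295.0, 410), "M2": ("H2", 58.0, 85)}
print(reciprocal_best_hits(human_to_mouse, mouse_to_human))
```

In the toy tables, H2 is dropped because its score falls below the cutoff for a 90-residue protein, and H3 is dropped because its best mouse match points back to a different human gene.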
Disease mutation and SNP data
We obtained 20,309 disease-associated mutations in 1,307 human genes from the HGMD database, and 29,856 non-synonymous SNPs in 11,753 known human genes from the HapMap project, Perlegen (March, 06), HGVBase (Human Genome Variation Database 17) and TSC (The SNP Consortium, 1). The HGVBase and TSC data were obtained through BioMart. We excluded 1,215 mutations that were present in both the DAM and the SNP data, because public SNP resources may contain disease mutations even when "healthy" individuals are screened, especially when we consider DAMs for common, late-onset diseases. (We plan to conduct an analysis of these mutations in the future.) We included only the disease mutations or nSNPs for which the wild-type amino acid and the amino acid in the human reference sequence (build 34) were the same. We were able to map 8,627 DAMs (in 541 genes) and 5,308 nSNPs (in 2,592 genes) to the concatenated alignments of the orthologous sequences using the positional information obtained from the respective databases. The allele and genotype frequencies (used in Figure 4) were available only for the HapMap data. The allele and genotype frequencies of synonymous and non-synonymous SNPs are available for four different ethnic populations (Utah Caucasian-Americans, Yoruba Africans, Han Chinese, and Tokyo Japanese), and we took the average frequencies of the four populations. The HapMap SNPs with no variation in any of the four populations (as determined from the allele frequencies) were excluded from the analysis.
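The last two steps above (averaging frequencies over the four HapMap populations and excluding SNPs with no variation in any of them) might look like the sketch below. The population keys are the standard HapMap panel codes, but the input layout, the function name, and the decision to report the averaged frequency as a minor-allele frequency are assumptions for illustration.

```python
from statistics import mean

# HapMap panel codes used as dictionary keys (input layout is assumed).
POPULATIONS = ("CEU", "YRI", "CHB", "JPT")


def averaged_minor_allele_frequency(allele_freqs):
    """Average one allele's frequency over the four populations.

    Returns None when the SNP shows no variation in any population
    (frequency 0 or 1 everywhere), mirroring the exclusion in the text.
    Otherwise returns the averaged frequency folded to the minor allele.
    """
    values = [allele_freqs[pop] for pop in POPULATIONS]
    if all(v in (0.0, 1.0) for v in values):
        return None
    avg = mean(values)
    return min(avg, 1.0 - avg)


# Toy frequencies (hypothetical).
variable = {"CEU": 0.25, "YRI": 0.5, "CHB": 0.0, "JPT": 0.25}
fixed = {"CEU": 0.0, "YRI": 0.0, "CHB": 0.0, "JPT": 0.0}
print(averaged_minor_allele_frequency(variable))
print(averaged_minor_allele_frequency(fixed))
```

A lower-frequency subset such as the one used for comparison with DAMs (allele frequency < 0.10) could then be selected by filtering on the returned value.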
We thank Drs. Alan Filipski and Christine Kuslich for providing insightful comments on a preliminary version of this manuscript, and three anonymous reviewers for their very useful constructive comments. We thank Ms. Kristi Garboushian for editorial support. This work was supported by a grant from the National Institutes of Health to SK.
- Pauling L, Itano HA, et al: Sickle cell anemia, a molecular disease. Science. 1949, 110 (2865): 543-548. 10.1126/science.110.2865.543.
- Cooper DN, Ball EV, Krawczak M: The human gene mutation database. Nucleic Acids Res. 1998, 26 (1): 285-287. 10.1093/nar/26.1.285.
- Miller MP, Kumar S: Understanding human disease mutations through the use of interspecific genetic variation. Hum Mol Genet. 2001, 10 (21): 2319-2328. 10.1093/hmg/10.21.2319.
- Miller MP, Parker JD, Rissing SW, Kumar S: Quantifying the intragenic distribution of human disease mutations. Ann Hum Genet. 2003, 67 (Pt 6): 567-579. 10.1046/j.1529-8817.2003.00072.x.
- Briscoe AD, Gaur C, Kumar S: The spectrum of human rhodopsin disease mutations through the lens of interspecific variation. Gene. 2004, 332: 107-118. 10.1016/j.gene.2004.02.037.
- Mooney SD, Klein TE: The functional importance of disease-associated mutation. BMC Bioinformatics. 2002, 3: 24. 10.1186/1471-2105-3-24.
- Ng PC, Henikoff S: Predicting deleterious amino acid substitutions. Genome Res. 2001, 11 (5): 863-874. 10.1101/gr.176601.
- Ng PC, Henikoff S: SIFT: Predicting amino acid changes that affect protein function. Nucleic Acids Res. 2003, 31 (13): 3812-3814. 10.1093/nar/gkg509.
- Sunyaev S, Ramensky V, Koch I, Lathe W, Kondrashov AS, Bork P: Prediction of deleterious human alleles. Hum Mol Genet. 2001, 10 (6): 591-597. 10.1093/hmg/10.6.591.
- Tang H, Wyckoff GJ, Lu J, Wu CI: A universal evolutionary index for amino acid changes. Mol Biol Evol. 2004, 21 (8): 1548-1556. 10.1093/molbev/msh158.
- Kondrashov AS, Sunyaev S, Kondrashov FA: Dobzhansky-Muller incompatibilities in protein evolution. Proc Natl Acad Sci U S A. 2002, 99 (23): 14878-14883. 10.1073/pnas.232565499.
- Waterston RH, Lindblad-Toh K, Birney E, Rogers J, Abril JF, Agarwal P, Agarwala R, Ainscough R, Alexandersson M, An P, Antonarakis SE, Attwood J, Baertsch R, Bailey J, Barlow K, Beck S, Berry E, Birren B, Bloom T, Bork P, Botcherby M, Bray N, Brent MR, Brown DG, Brown SD, Bult C, Burton J, Butler J, Campbell RD, Carninci P, Cawley S, Chiaromonte F, Chinwalla AT, Church DM, Clamp M, Clee C, Collins FS, Cook LL, Copley RR, Coulson A, Couronne O, Cuff J, Curwen V, Cutts T, Daly M, David R, Davies J, Delehaunty KD, Deri J, Dermitzakis ET, Dewey C, Dickens NJ, Diekhans M, Dodge S, Dubchak I, Dunn DM, Eddy SR, Elnitski L, Emes RD, Eswara P, Eyras E, Felsenfeld A, Fewell GA, Flicek P, Foley K, Frankel WN, Fulton LA, Fulton RS, Furey TS, Gage D, Gibbs RA, Glusman G, Gnerre S, Goldman N, Goodstadt L, Grafham D, Graves TA, Green ED, Gregory S, Guigo R, Guyer M, Hardison RC, Haussler D, Hayashizaki Y, Hillier LW, Hinrichs A, Hlavina W, Holzer T, Hsu F, Hua A, Hubbard T, Hunt A, Jackson I, Jaffe DB, Johnson LS, Jones M, Jones TA, Joy A, Kamal M, Karlsson EK, Karolchik D, Kasprzyk A, Kawai J, Keibler E, Kells C, Kent WJ, Kirby A, Kolbe DL, Korf I, Kucherlapati RS, Kulbokas EJ, Kulp D, Landers T, Leger JP, Leonard S, Letunic I, Levine R, Li J, Li M, Lloyd C, Lucas S, Ma B, Maglott DR, Mardis ER, Matthews L, Mauceli E, Mayer JH, McCarthy M, McCombie WR, McLaren S, McLay K, McPherson JD, Meldrim J, Meredith B, Mesirov JP, Miller W, Miner TL, Mongin E, Montgomery KT, Morgan M, Mott R, Mullikin JC, Muzny DM, Nash WE, Nelson JO, Nhan MN, Nicol R, Ning Z, Nusbaum C, O'Connor MJ, Okazaki Y, Oliver K, Overton-Larty E, Pachter L, Parra G, Pepin KH, Peterson J, Pevzner P, Plumb R, Pohl CS, Poliakov A, Ponce TC, Ponting CP, Potter S, Quail M, Reymond A, Roe BA, Roskin KM, Rubin EM, Rust AG, Santos R, Sapojnikov V, Schultz B, Schultz J, Schwartz MS, Schwartz S, Scott C, Seaman S, Searle S, Sharpe T, Sheridan A, Shownkeen R, Sims S, Singer JB, Slater G, Smit A, Smith DR, Spencer B, 
Stabenau A, Stange-Thomann N, Sugnet C, Suyama M, Tesler G, Thompson J, Torrents D, Trevaskis E, Tromp J, Ucla C, Ureta-Vidal A, Vinson JP, Von Niederhausern AC, Wade CM, Wall M, Weber RJ, Weiss RB, Wendl MC, West AP, Wetterstrand K, Wheeler R, Whelan S, Wierzbowski J, Willey D, Williams S, Wilson RK, Winter E, Worley KC, Wyman D, Yang S, Yang SP, Zdobnov EM, Zody MC, Lander ES: Initial sequencing and comparative analysis of the mouse genome. Nature. 2002, 420 (6915): 520-562. 10.1038/nature01262.
- Gao L, Zhang J: Why are some human disease-associated mutations fixed in mice?. Trends Genet. 2003, 19 (12): 678-681. 10.1016/j.tig.2003.10.002.
- Capriotti E, Calabrese R, Casadio R: Predicting the insurgence of human genetic diseases associated to single point protein mutations with support vector machines and evolutionary information. Bioinformatics. 2006, 22 (22): 2729-2734. 10.1093/bioinformatics/btl423.
- Sunyaev S, Hanke J, Aydin A, Wirkner U, Zastrow I, Reich J, Bork P: Prediction of nonsynonymous single nucleotide polymorphisms in human disease-associated genes. J Mol Med. 1999, 77 (11): 754-760. 10.1007/s001099900059.
- Kondrashov FA, Ogurtsov AY, Kondrashov AS: Bioinformatical assay of human gene morbidity. Nucleic Acids Res. 2004, 32 (5): 1731-1737. 10.1093/nar/gkh330.
- Yang Z: PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci. 1997, 13 (5): 555-556.
- Jones DT, Taylor WR, Thornton JM: The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci. 1992, 8 (3): 275-282.
- Grantham R: Amino acid difference formula to help explain protein evolution. Science. 1974, 185 (4154): 862-864. 10.1126/science.185.4154.862.
- Jimenez-Sanchez G, Childs B, Valle D: Human disease genes. Nature. 2001, 409 (6822): 853-855. 10.1038/35057050.
- Kulathinal RJ, Bettencourt BR, Hartl DL: Compensated deleterious mutations in insect genomes. Science. 2004, 306 (5701): 1553-1554. 10.1126/science.1100522.
- Ferrer-Costa C, Orozco M, Cruz XD: Characterization of Compensated Mutations in Terms of Structural and Physico-Chemical Properties. J Mol Biol. 2006.
- GenBank (Build 34.1). [ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/ARCHIVE/BUILD.34.1/]
- Ensembl. [http://www.ensembl.org]
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215 (3): 403-410. 10.1006/jmbi.1990.9999.
- Subramanian S, Kumar S: Gene expression intensity shapes evolutionary rates of the proteins encoded by the vertebrate genome. Genetics. 2004, 168 (1): 373-381. 10.1534/genetics.104.028944.
- Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994, 22 (22): 4673-4680.
- HGMD. [http://archive.uwcm.ac.uk/uwcm/mg/hgmd0.html]