Volume 14 Supplement 3
A protein domain-centric approach for the comparative analysis of human and yeast phenotypically relevant mutations
© Peterson et al.; licensee BioMed Central Ltd. 2013
Published: 28 May 2013
The body of disease mutations with known phenotypic relevance continues to increase and is expected to do so even faster with the advent of new experimental techniques such as whole-genome sequencing coupled with disease association studies. However, genomic association studies are limited by the molecular complexity of the phenotype being studied and the population size needed to have adequate statistical power. One way to circumvent this problem, which is critical for the study of rare diseases, is to study the molecular patterns emerging from functional studies of existing disease mutations. Current gene-centric analyses to study mutations in coding regions are limited by their inability to account for the functional modularity of the protein. Previous studies of the functional patterns of known human disease mutations have shown a significant tendency to cluster at protein domain positions, namely position-based domain hotspots of disease mutations. However, the limited number of known disease mutations remains the main factor hindering the advancement of mutation studies at a functional level. In this paper, we address this problem by incorporating mutations known to be disruptive of phenotypes in other species. Focusing on two evolutionarily distant organisms, human and yeast, we describe the first inter-species analysis of mutations of phenotypic relevance at the protein domain level.
The results of this analysis reveal that phenotypic mutations from yeast cluster at specific positions on protein domains, a characteristic previously revealed to be displayed by human disease mutations. We found over one hundred domain hotspots in yeast with approximately 50% in the exact same domain position as known human disease mutations.
We describe an analysis using protein domains as a framework for transferring functional information by studying domain hotspots in human and yeast and relating phenotypic changes in yeast to diseases in human. This first-of-a-kind study of phenotypically relevant yeast mutations in relation to human disease mutations demonstrates the utility of a multi-species analysis for advancing the understanding of the relationship between genetic mutations and phenotypic changes at the organismal level.
The study of human genomic variations, in particular those in protein coding regions, can lead to new hypotheses about the molecular mechanisms of human diseases and might provide critical knowledge about individual response to therapy [1, 2]. The advent of large-scale experimental techniques is providing new phenotypic associations for genomic variations [3–5]. However, genomic association studies are limited by the molecular complexity of the phenotype being studied and the cohort size needed to have adequate statistical power. One way to circumvent this problem, which is critical in the study of rare diseases, is to investigate the molecular patterns emerging from functional studies of existing disease mutations. In current large association studies, such as GWAS or upcoming whole-exome and whole-genome sequencing, this is accomplished by aggregating mutations that disrupt the same gene [6, 7], pathway, or network. In many cases, these molecular variations associated with human diseases have patterns that are similar to those producing a phenotypic change in other species. For example, the comparison between closely related species has made significant contributions to the biomedical field, such as the use of mice and rats for genetics and drug discovery. In addition, studies across species with longer evolutionary distances to human have many advantages and could bring new perspectives into the study of molecular mechanisms of human phenotypic variations. For instance, the functional analysis of variations in yeast, an organism that can be easily genetically manipulated, has shed light on variations in their human gene orthologs, as shown in McGary et al. The authors demonstrated the potential of a systematic study of phenotypes produced by variations in human and their orthologs in yeast or other distantly related species, providing novel hypotheses about human diseases, which have already resulted in valuable leads for drug discovery.
The vast majority of studies related to human disease mutations are performed by comparison of whole proteins, which here will be denoted by the genes that encode them. However, these whole-protein approaches are of limited applicability to the study of disease mutations due to the fact that they mostly fail to account for protein modularity. Most proteins contain multiple domains that can be recombined in different arrangements to create proteins with different functions [13–15]. As a consequence, not all protein regions have the same function or produce similar phenotypic changes if disrupted. Thus, the specific location of a particular mutation within the protein could be crucial to understanding the mutation's functional effect. The relevance of studying protein domains in the context of disease was also discussed by Zhong et al.  in their study of protein interactions and their relation to diseases. The authors showed that mutations resulting in complete loss of the protein product (removal of a node in the network) could be different from those disrupting only a protein region or domain (edgetic perturbations). Furthermore, Zhong et al. conclude that these edgetic perturbations can cause clinically distinct phenotypes when disrupting different protein domain regions of the same protein. Thus, a domain-centric study of disease mutations has the potential to differentiate among genomic variations by accounting for protein modularity that would have otherwise been grouped together by whole-protein studies.
To capture the disruption of domains by genetic mutations, we have previously created a database to visualize the aggregation patterns of disease mutations at the protein and domain levels for human genomics data (Domain Mapping of Disease Mutations database (DMDM), freely available at http://bioinf.umbc.edu/dmdm/). More recently, we have developed a statistical approach, the domain significance score (or DS-Score), for finding significantly mutated positions for individual protein domains. We demonstrated that significant DS-Scores indicate that a mutation at a specific position is highly likely to be a contributor to disease in any protein containing the domain in which the mutations are located. In particular, we have shown that Mendelian disease mutations form clusters at protein domain sites. In addition, results from Yue et al., Nehrt et al., and Peterson et al. have further shown that inherited and somatic cancer mutations cluster at specific sites at the protein domain level. Thus, these studies show how the domain analysis enables the discovery of domain hotspots of mutations with phenotypic relevance by aggregating mutations that share the same domain location but are localized in different genes. However, the discovery of these highly deleterious domain sites by aggregation of mutational data with known phenotypic effect is limited by the availability of such mutational data. As a result, the DS-Score method based on human data has low coverage when analyzing mutations from large-scale sequencing studies. To address this issue, more annotated disease mutation data will need to be incorporated into the analysis, preferably from other species in which the phenotypic effect of putative deleterious mutations could be experimentally tested.
In this paper, we describe the first inter-species analysis of mutations of phenotypic relevance at the protein domain level for human and yeast genomes. We perform the comparison between these species by mapping human and yeast mutations into the corresponding domain sites. Protein domains, such as those defined by CDD and Pfam, are protein sequence regions that are highly conserved across distantly related species. For instance, when comparing yeast and human domains, we estimate that 87% of all the protein domains found in yeast are also found in human while, using the Homologene database to compare genes, only 20% of the yeast genes have a human ortholog. Similarly, 58% of the human domains are shared with yeast while only 5% of the human genes have yeast orthologs. Since yeast and human analyses show a significant number of common domains, the protein domain framework facilitates the comparison of a significant number of mutations producing phenotypic changes in both species. Using a domain-centric approach, we show that phenotypically relevant mutations in yeast form hotspots at the protein domain level, and that a significant number of these hotspots map to known human disease mutations. Furthermore, our results show that the feature-based DS-Score, a modification of our statistical method that explicitly incorporates annotation from the protein domain models, was most successful at capturing functional commonalities between human and yeast mutations affecting these organisms' phenotypes.
In summary, the work described in this paper demonstrates that domain-centric, inter-species mutation analyses lead to the identification of new domain sites of relevance to human diseases even when performed among species separated by long evolutionary distances. The patterns of evolutionarily conserved and functional mutations associated with phenotypic changes emerging from this study represent a step towards a new paradigm for the analysis of large-scale genomic studies of human diseases.
Materials and methods
A human protein database containing 54,372 proteins was created with 33,963 proteins from RefSeq and 20,409 proteins from Swiss-Prot downloaded via NCBI's E-utilities. Since the RefSeq and Swiss-Prot databases contain many redundant protein entries, we selected only one representative protein for each unique Entrez gene ID, either the longest Swiss-Prot protein, or the longest RefSeq protein if no Swiss-Prot protein was listed for the gene ID. A database of 6,717 verified and hypothetical open reading frame yeast reference proteins was downloaded from the Saccharomyces Genome Database (SGD) on September 28th, 2012 (http://downloads.yeastgenome.org/sequence/S288C_reference/orf_protein/). The Homologene database was downloaded from NCBI's FTP site on September 12th, 2011. A protein domain set was obtained from the Conserved Domain Database (CDD version 2.25), which includes domains from CDD and the SMART, COG, and Pfam databases, with a total of 23,632 protein domains, 10,925 of which map to at least one human protein and 7,369 of which map to at least one yeast protein. Functional feature information was collected for CDD domains from the "cddannot.dat" file located in the CDD FTP directory (ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd), totaling 1,727 unique functional features. The non-overlapping set of human, non-synonymous disease mutations was created from the OMIM and Swiss-Prot variant databases obtained from E-utilities and UniProt's FTP directory (http://www.uniprot.org/docs/humsavar) respectively. The resulting human mutation dataset consists of 32,653 mutations related to human diseases. The set of phenotypic yeast mutations was downloaded from SGD (the phenotype_data.tab database obtained from http://www.yeastgenome.org/download-data/curation/) and was filtered to exclude records without allelic information and records listing the phenotypic change as "normal," as these records refer to mutations with no phenotypic change.
Finally, the yeast mutation database was manually curated so that each record refers to a single point mutation; records listing multiple mutations for a single phenotype were separated into multiple records. The final yeast mutation database comprises 1,490 unique mutations associated with phenotypic changes and is available upon request.
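The selection of one representative protein per gene described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `ProteinRecord` type, its field names, and the `"swissprot"`/`"refseq"` source labels are hypothetical, and ties in length are resolved by keeping the first record seen.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProteinRecord:
    # Hypothetical record layout, introduced only for this sketch.
    gene_id: str      # Entrez gene ID
    accession: str    # protein accession
    source: str       # "swissprot" or "refseq" (assumed labels)
    sequence: str     # amino acid sequence


def _rank(rec):
    # Swiss-Prot outranks RefSeq; within a source, longer sequences win.
    return (rec.source == "swissprot", len(rec.sequence))


def pick_representatives(records):
    """Return {gene_id: record}: the longest Swiss-Prot entry for each gene,
    or the longest RefSeq entry when the gene has no Swiss-Prot entry."""
    best = {}
    for rec in records:
        current = best.get(rec.gene_id)
        if current is None or _rank(rec) > _rank(current):
            best[rec.gene_id] = rec
    return best
```

Ranking on a `(is_swissprot, length)` tuple keeps the two-level preference in a single comparison.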
Mapping mutations to protein domains
Hidden Markov models for protein domains from SMART, COG, CDD, and Pfam were built using multiple sequence alignments from CDD with the hmmbuild tool (HMMer version 2.3.2). HMMer's hmmpfam tool was then used with the global option to search for complete domains in human proteins from the RefSeq and Swiss-Prot databases. Protein mutations were distributed to protein domain positions by using HMMer's alignment output for all domains aligning to the protein with an E-Value ≤ 0.001 and by assigning mutations on gap regions of the domain model to the last position before the gap. Each mutation was mapped only to the representative protein for each unique gene in the dataset. The methods for mapping domains to human proteins and disease mutations to their domain positions were previously described for our DMDM tool. After distributing each mutation to all domains that map to the protein position in which the mutation was located, 4,283 human protein domains contained at least one disease mutation and 1,687 yeast protein domains contained at least one phenotypic yeast mutation.
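The gap rule described above (a mutated residue that aligns to a gap in the domain model is assigned to the last model position before the gap) can be illustrated with a short sketch. It assumes the alignment is available as two equal-length strings, one for the domain model and one for the protein, with `-` or `.` marking gaps; the function name, its parameters, and this input layout are hypothetical and do not reproduce HMMer's actual output format.

```python
GAP_CHARS = "-."


def map_to_domain_position(model_row, protein_row, model_start,
                           protein_start, protein_pos):
    """Map a 1-based protein position to a 1-based domain model position.

    model_start and protein_start are the positions of the first aligned
    model state and protein residue. Returns None when the residue lies
    outside the aligned region or precedes any model position.
    """
    if len(model_row) != len(protein_row):
        raise ValueError("aligned rows must have equal length")
    model_pos = model_start - 1    # last model position consumed so far
    prot_pos = protein_start - 1   # last protein residue consumed so far
    for m_char, p_char in zip(model_row, protein_row):
        if m_char not in GAP_CHARS:
            model_pos += 1
        if p_char not in GAP_CHARS:
            prot_pos += 1
            if prot_pos == protein_pos:
                # On a model gap, model_pos still holds the last
                # position before the gap, which is the stated rule.
                return model_pos if model_pos >= model_start else None
    return None
```

For example, with a model row `AC..DE` aligned to protein residues 10 to 15, residues 12 and 13 fall on the model gap and are both assigned to model position 2.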
Determining the level of conservation at each domain position
where p(a_i, j) is the frequency of amino acid a_i at position j. We then estimated a threshold for identifying highly conserved positions by adding one standard deviation to the average of all AL2CO scores on all domain positions. As a result, a threshold of entropy less than or equal to 0.533 was used to determine conservation. The average entropy of each domain model was determined by estimating the mean of the entropy scores for all positions in the domain model.
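As an illustration of an entropy-based conservation score of this kind, the sketch below computes a per-column Shannon entropy from amino acid frequencies, the mean entropy of a domain model, and a threshold of mean plus one standard deviation. The exact formula, logarithm base, weighting, and normalization used with AL2CO are not given in this text, so the natural logarithm, the unweighted frequencies, and the use of the population standard deviation are all assumptions; the function names are hypothetical.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def column_entropy(column):
    """Shannon entropy (natural log, an assumption) of one alignment column.

    Gap characters are ignored; p(a_i, j) is the observed frequency of
    amino acid a_i among the remaining residues at this position.
    """
    residues = [aa for aa in column if aa not in "-."]
    if not residues:
        raise ValueError("column contains only gaps")
    total = len(residues)
    counts = Counter(residues)
    return -sum((n / total) * math.log(n / total) for n in counts.values())


def domain_entropies(alignment):
    """Per-position entropies for equal-length aligned sequences."""
    return [column_entropy(col) for col in zip(*alignment)]


def average_entropy(alignment):
    """Mean of the per-position entropies of one domain model."""
    return mean(domain_entropies(alignment))


def threshold_mean_plus_sd(scores):
    """Mean plus one (population) standard deviation of a list of scores."""
    return mean(scores) + pstdev(scores)
```

A fully conserved column has entropy 0, and a column split evenly between two residues has entropy ln 2 under these assumptions.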
Estimating domain significance scores (DS-Scores)
Mutations and protein domain redundancy
Due to large domain superfamily hierarchies and duplicate domains from different sources, a single protein mutation can be mapped to many redundant domains. As a consequence, multiple domain hotspots originating from the same cluster of mutations can be identified on different domains. On one hand, to ensure that the DS-Scores and domain hotspots are estimated and identified for all protein domain models, all mutations in our analysis were distributed to all domains that map to the mutated protein position. The results of our analysis for each individual domain model, even for those domains from superfamily hierarchies and disparate sources for which we expect high redundancy, are available upon request. On the other hand, to prevent overestimating the number of hotspots shared between yeast and human, we designed a procedure for removing redundant domain hotspots. As a result, the numbers reported for all domain hotspots and multi-species domain hotspots originate from a unique set of mutation clusters. We excluded hotspots that originated from identical sets of mutations, using the following method to select a unique representative domain for each cluster of mutations. The representative domain is used only for visualization and internal calculations and does not affect the reported results. To select the representative domain for a mutation cluster, all domains were ordered alphanumerically from lowest to highest accession identifier. Preference was given to domains that are listed first and are defined as the root of a domain hierarchy. If none of the domains in the list is the root domain of a hierarchy, the representative domain is the first domain in the list that contains known functionally annotated sites. In addition, when comparing hotspots using the multi-species DS-Score, the representative domain model is selected only among those that are shared between the species.
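The selection rule for the representative domain can be sketched as below. The `Domain` type and its flags are hypothetical, plain string ordering stands in for the alphanumeric ordering of accessions, and the final fallback (the first domain in the list when no domain is a root or carries annotated sites) is an assumption, since the text does not cover that case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Domain:
    # Hypothetical record layout, introduced only for this sketch.
    accession: str
    is_root: bool = False            # root of a domain hierarchy
    has_feature_sites: bool = False  # has annotated functional sites


def pick_representative(domains):
    """Choose one representative domain for a cluster of mutations."""
    ordered = sorted(domains, key=lambda d: d.accession)
    if not ordered:
        raise ValueError("no domains supplied")
    # First preference: the first domain in the list that is a hierarchy root.
    for dom in ordered:
        if dom.is_root:
            return dom
    # Second preference: the first domain with annotated functional sites.
    for dom in ordered:
        if dom.has_feature_sites:
            return dom
    # Assumed fallback, not stated in the text: first domain in the list.
    return ordered[0]
```

For a multi-species comparison, the input list would first be restricted to domain models shared between the species, as described above.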
Assessing the co-occurrence of human diseases and yeast phenotypes
The significance of overlapping human diseases and yeast phenotypes was calculated using a right-sided Fisher's exact test. The Fisher's test for each possible pair of human diseases and yeast phenotypic changes was estimated using the following values: the number of times the human disease and yeast phenotypic change (H and Y respectively) overlap, the number of times H overlaps with a yeast phenotype that is not Y, the number of times Y overlaps with a human disease that is not H, and the total number of overlaps between yeast and human. Human diseases and yeast phenotypes were considered to overlap if the associated mutations were found to localize at the same position of an identical domain. To avoid overestimation due to domain model redundancy, no protein mutation was counted more than once as overlapping with a single human disease or yeast phenotype.
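A right-sided Fisher's exact test on the four counts listed above can be sketched with the standard library alone. The sketch assumes that the fourth cell of the 2×2 table (overlaps involving neither H nor Y) is obtained by subtracting the other three counts from the total number of overlaps; the function and argument names are hypothetical.

```python
from math import comb


def fisher_right_sided(both, h_only, y_only, total):
    """Right-sided Fisher's exact test p-value for one (H, Y) pair.

    both   : overlaps involving disease H and phenotype Y
    h_only : overlaps of H with a yeast phenotype other than Y
    y_only : overlaps of Y with a human disease other than H
    total  : total number of yeast-human overlaps
    """
    neither = total - both - h_only - y_only   # assumed fourth cell
    if min(both, h_only, y_only, neither) < 0:
        raise ValueError("counts are inconsistent with the total")
    row_h = both + h_only    # all overlaps involving H
    col_y = both + y_only    # all overlaps involving Y
    upper = min(row_h, col_y)
    # Hypergeometric tail: probability of seeing at least `both` joint
    # overlaps when the table margins are held fixed.
    tail = sum(comb(row_h, k) * comb(total - row_h, col_y - k)
               for k in range(both, upper + 1))
    return tail / comb(total, col_y)
```

Because the test is right-sided, a small p-value indicates that H and Y co-occur at the same domain sites more often than expected from their individual overlap counts.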
Distribution of mutations in protein domains
Distribution of yeast and human mutations at functional and conserved sites
Total phenotypically relevant protein mutations
Total phenotypically relevant protein mutations inside of domain regions
Total phenotypically relevant domain mutations
Domain positions with at least one mutation
Domain mutations at functional feature domain sites
3,992 (36%, p-value: ≈ 0)
58,096 (18%, p-value: ≈ 0)
Domain mutations at conserved domain sites
5,950 (45%, p-value: ≈ 0)
152,524 (47%, p-value: ≈ 0)
Transferring mutational information across species through protein domains
Yeast and human mutations form hotspots at protein domain positions
Using a variation of the DS-Score in which we emphasize the similarity of domain sites with the same functional annotation as depicted in Figure 2, we studied the feature-based domain hotspots in yeast and human. Our results show that yeast mutations on domains formed 791, 869, and 1,022 feature-based domain hotspots when using DS-Score thresholds greater than or equal to 1.6, 1.3, and 1.0 respectively, while human mutations yielded 3,197, 3,446, and 3,968 feature-based domain hotspots for these thresholds.
Distribution of phenotypic and disease mutations at conserved and functional domain sites
Distribution of yeast and human domain hotspots at functional features and conserved sites
Position-based domain hotspots at conserved sites
Feature-based domain hotspots at conserved sites
Position-based domain hotspots at functional features
Feature-based domain hotspots at functional features
Phenotypically relevant mutations tend to cluster at domain positions in yeast and human
Shared and unique domain hotspots between the yeast and human datasets.
Domain Hotspot Count (1.6)
Domain Hotspot Count (1.3)
Domain Hotspot Count (1.0)
Position-based domain hotspots only found in yeast
Position-based domain hotspots only found in human
Feature-based domain hotspots only found in yeast
Feature-based domain hotspots only found in human
Position-based domain hotspots shared between yeast and human
Feature-based domain hotspots shared between yeast and human
Multi-species domain hotspots
Domain Hotspot Count (1.6)
Domain Hotspot Count (1.3)
Domain Hotspot Count (1.0)
Total multi-species position-based domain hotspots
Total multi-species feature-based domain hotspots
Multi-species position-based domain hotspots not identified in yeast or human
Multi-species feature-based domain hotspots not identified in yeast or human
Linking domain hotspots with mutations across organisms
Mapping of domain hotspots from yeast to known disease mutations in human.
Domain Hotspot Count (1.6)
Domain Hotspot Count (1.3)
Domain Hotspot Count (1.0)
Position-based domain hotspots in yeast
Feature-based domain hotspots in yeast
Position-based domain hotspots in yeast that hit at least one human mutation
54 (53.5%, p-value: 2e-26)
56 (49.1%, p-value: 6e-25)
65 (48.8%, p-value: 5e-28)
Feature-based domain hotspots in yeast that hit at least one human mutation
562 (71.1%, p-value ≈ 0)
592 (68.1%, p-value ≈ 0)
666 (65.2%, p-value ≈ 0)
Our findings highlight the advantages of using protein domains to transfer information related to genetic mutations across species. We show that protein domain models provide a powerful framework for aggregating known phenotypically relevant mutation data across large evolutionary distances, i.e., between human and yeast. As a model organism, yeast is highly studied, well annotated, and easy to manipulate genetically. Thus, it is advantageous to transfer known information from genetic disruptions in yeast for analyzing human mutations. To infer relationships between mutations in different organisms, most studies use orthologous genes as reference to analyze mutations. However, our analysis shows that yeast and human data share more common protein domains than they do orthologous genes. As a result, we show that mutations in both the yeast and human databases are better mapped across organisms when using shared protein domains than when using orthologous genes. For instance, we found that of the 40% of the human mutations that can be related to yeast, only 9% are through gene orthology while 39% can be related using a protein domain framework with an overlap of 5% of mutations that can be related by either domain or gene comparisons. This suggests that transferring mutational information by common protein domains not only vastly increases the number of mutations that can be transferred but also loses very few mutations that would have otherwise been transferred using only gene orthologs. The latter corresponds to, for instance, human disease mutations in genes for which there is a yeast ortholog but located outside a protein domain (only 1% of the human disease mutations in our analysis).
Additionally, the domain approach allows the aggregation of mutations from multiple genes in each organism and the identification of relations between mutations located in non-orthologous genes by their functional annotation, which would normally be missed when analyzing the problem using a gene-centric approach.
Our study of phenotypically relevant mutations using a protein domain framework confirms that both yeast and human mutations show a significant tendency to fall within conserved and annotated functional protein domain sites. This is in agreement with the conclusions of Miller et al. In their study, the authors analyzed human disease mutations in seven disease-associated genes: cystic fibrosis transmembrane conductance regulator (CFTR), glucose-6-phosphate dehydrogenase (G6PD), neural cell adhesion molecule L1 (L1CAM), phenylalanine hydroxylase (PAH), paired box 6 (PAX6), the X-linked retinoschisis gene, and a gene associated with tuberous sclerosis (TSC2). From the study of mutations in these seven genes and their conservation across 20 organisms, including human, the authors concluded that these mutations are in highly conserved protein positions. Here, we reach similar conclusions, but we estimated conservation based on the protein domain models and not at the gene level. Additionally, our findings at the domain level are consistent with Mooney et al. The authors conducted a study on a set of 231 human genes with known disease mutations, showing that human disease mutations are statistically more likely to be localized within conserved or functionally relevant positions. To summarize, our domain-centric analysis confirms findings from gene-centric studies about enrichment of human disease mutations with respect to conserved and functionally annotated sites while identifying the same characteristics for phenotypically relevant mutations in yeast.
To analyze and compare yeast and human mutations we used the DS-Score method and identified domain hotspots of human and yeast phenotypically relevant mutations. The DS-Score method was previously developed by our team to study human disease mutations and was modified in this work to include mutations from both species, resulting in the identification of multi-species domain hotspots. We also adapted the method for a multi-species analysis by removing redundant domain hotspots. As an extreme example of the effect of domain redundancy, a single cluster of mutations in the yeast IRE1 gene was propagated to over a hundred domains within the catalytic protein kinase domain family (cl09925 from the CDD database), resulting in 120 domain hotspots having originated from the same cluster of mutations. Similarly, domains from multiple sources (such as an identical domain from the CDD and Pfam databases) could yield redundant domain hotspot counts. These redundant domain hotspots are correctly estimated and are of great relevance for the analysis of mutations in the context of individual domains. However, when comparing two species using redundant domain hotspots, if the cluster of mutations in the kinase family happens to be common to both species, we would reach the conclusion that there were 120 additional hotspots in common between yeast and human. To avoid overestimation of clusters of mutations that are aggregated at the domain level, we defined domain hotspots as those having originated from a unique cluster of mutations and applied this method to the comparison of position-based and feature-based domain hotspots in both species. Using the catalytic protein kinase domain family as an example, each of the 120 domains in which the hotspot was found will retain this information, but only one representative hotspot, cd00180, which is at the top of the hierarchy in that kinase family from CDD, was considered for the final domain hotspot count for each species.
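Counting hotspots by the cluster of mutations they originate from, rather than by domain, can be sketched as follows. The input layout (tuples of domain accession, domain position, and a set of mutation identifiers) and the function name are hypothetical; the sketch only shows that hotspots built from identical mutation sets collapse into one entry.

```python
def group_hotspots_by_cluster(hotspots):
    """Group hotspots that originate from identical sets of mutations.

    hotspots: iterable of (domain_accession, domain_position, mutations),
    where mutations is any iterable of hashable mutation identifiers.
    Returns {frozenset(mutations): [(accession, position), ...]}.
    """
    clusters = {}
    for accession, position, mutations in hotspots:
        # frozenset makes the key independent of the order of mutations.
        key = frozenset(mutations)
        clusters.setdefault(key, []).append((accession, position))
    return clusters
```

The number of keys in the returned dictionary is the non-redundant hotspot count, while each value keeps the full list of domains on which the same cluster was observed.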
In this first-of-a-kind study of yeast mutations at the domain level, we demonstrate that phenotypically relevant mutations in yeast cluster at the domain level just as human disease mutations do, forming yeast and human domain hotspots that are the focus of this study. The hotspots in yeast present the same patterns as human domain hotspots in terms of enrichment at protein domain sites that are conserved and also at sites with known functional annotation. Neither the yeast nor the human DS-Scores were found to correlate with conservation (as measured by entropy of the domain site), making the DS-Score method a complement to other methods for prioritization of mutations with putative phenotypic relevance, such as SIFT, which use conservation as the principal feature for their predictions.
Selected functional feature sites containing domain hotspots in yeast and human
Functional Feature Name
Position-based Domain Hotspots in Yeast
Position-based Domain Hotspots in Human
ABC transporter signature motif
Activation loop (A-loop)
ATP binding site
Ca2+ binding site
GTP/Mg2+ binding site
Substrate binding site
An example of a domain with several domain hotspots in human and yeast that highlights the advantages of using feature-based domain hotspots is the Ras-like GTPase domain (cd00882 from the CDD database). While no position-based domain hotspots in the Ras-like GTPase domain from yeast and human are located at the same domain position, two hotspots are located at domain sites with the same functional annotation. The GTP/Mg++ binding site (highlighted in orange in the structure of the domain shown in Figure 2) contains position-based domain hotspots at position ten for the yeast mutations and at position five for the human mutations. The yeast hotspot in position ten is an example of mutations related at the domain level, originating from several genes, ARF2, SEC4, GTR1, and NOG1, that may not have been identified without analyzing mutations at the domain level due to the low sequence similarity of NOG1 (i.e., when using BLAST with E-Value ≤ 10⁻³). The yeast mutations in position ten of cd00882 in genes ARF2, SEC4, GTR1, and NOG1 were associated with increased sensitivity to cold, decreased rates of cytokinesis, decreased nutrient uptake, and inviability due to malformation of the large ribosomal subunit, respectively. In human, this domain position on the ARL6 gene contains a mutation associated with Bardet-Biedl syndrome type 3 but not a domain hotspot. On the other hand, in a different position of the binding site, position five of the GTP/Mg++ binding region of the Ras-like GTPase domain contains a human hotspot that aggregates several positions in human genes that have been heavily studied due to their prominence in human diseases.
This domain position corresponds to position 12 of both the human HRAS and KRAS genes, from which many mutations have been implicated in diseases such as Costello syndrome [43–46] and congenital myopathy [47, 48] and have also been found to be mutated frequently in somatic tumor samples from patients with follicular thyroid carcinoma, pancreatic carcinoma, and Schimmelpenning-Feuerstein-Mims syndrome, as well as bladder, lung, and gastric cancers. While both HRAS and KRAS belong to the same protein family and are thus often implicated in the same studies, domain position five also aligns to position 38 of a gene from a different family, GNAT1, which is not similar in sequence to HRAS (i.e., HRAS-GNAT1 E-value of 0.53 using BLAST) or KRAS (i.e., KRAS-GNAT1 BLAST E-value of 0.42). The GNAT1 mutation has been associated with congenital stationary night blindness. Additionally, using our feature-based domain hotspots, we were able to identify other mutations in the GTP/Mg++ binding pocket that were not members of position-based domain hotspots in either organism. These mutations, sharing common functional annotation with position-based domain hotspots in both species, have been associated with autoimmune lymphoproliferative syndrome, somatic pilocytic astrocytoma, Noonan syndrome, and chylomicron retention disease. Thus, by extrapolating hotspots in human and yeast to common functional feature positions, we were able to identify a common functional disruption of the GTP/Mg++ binding pocket that causes different phenotypes when mutated in different genes sharing the same domain in the same organism as well as across organisms.
Human diseases and yeast phenotypic changes that co-occur at domain sites
Yeast Phenotypic Change
Number of Co-occurrences
Wilson's disease (WD) (OMIM:277900)
Gain of function; metal resistance: increased (PMID: 10743563)
6 (p-value: 2e-14)
Hereditary non-polyposis colorectal cancer type 2 (OMIM:609310)
Mutation frequency: increased (PMID: 16492773)
5 (p-value: 1e-13)
Susceptibility to Breast-Ovarian Cancer, Familial (OMIM:604370)
Reduction of function; protein/peptide accumulation: increased (PMID: 10218484)
4 (p-value: 4e-11)
Nemaline myopathy type 3 (OMIM: 161800)
Conditional; protein/peptide modification: absent (PMID: 16221887)
4 (p-value: 4e-11)
Familial hyperinsulinemic hypoglycemia type 1 (OMIM: 256450)
Reduction of function; replicative lifespan: decreased (PMID: 21931558)
8 (p-value: 1e-10)
Costello syndrome (OMIM:190020)
6 (p-value: 1e-09)
Methemoglobinemia, type 1 (OMIM:250800)
Reduction of function; heat sensitivity: increased (PMID: 19194512)
4 (p-value: 8e-09)
Crouzon syndrome (OMIM: 123500)
Resistance to chemicals: decreased (PMID: 17237519)
6 (p-value: 8e-09)
Kallman syndrome 2 with bimanual synkinesia (OMIM: 136350)
Resistance to chemicals: increased (PMID: 1715094)
4 (p-value: 4e-08)
Friedreich Ataxia (OMIM: 229300)
Protein activity: decreased (PMID: 19884169)
3 (p-value: 1e-06)
Conclusion and future work
This first-of-a-kind study demonstrates the aggregation of mutations from species spanning large evolutionary distances such as yeast and human. Using the DS-Score method as the framework for the integration of molecular characteristics, such as domain location and functional annotation of phenotypically relevant mutations, we were able to identify common mutation patterns from two distantly related species. The domain-centric approach introduced in this paper provides an ideal framework for the analysis of mutational data across species since the number of mutations that can be related from one species to another is much higher than what could be related through gene orthology. The feature-based method to compare mutations across species introduced here represents a unique way to integrate functional annotation of domains into the statistical analysis of mutations, shown here to be extremely advantageous for capturing similarities between mutations across distantly related species. This analysis also suggests that the approach is useful in relating phenotypes from yeast and human resulting from a particular pattern of mutations, such as being localized at the same domain position or functional site. We plan to perform a detailed analysis of the molecular basis of these related phenotypes. Hypotheses derived from this analysis have great potential for discovering new relationships between pathways and networks in both species. In addition, we plan to extend this study to other species to identify patterns across more closely related species including mouse, and to increase the number of known phenotypically relevant domain hotspots by including all mutational data available for a wide range of organisms.
The authors would like to acknowledge Amy Voltz for her valuable comments and her help in obtaining the yeast protein mutation data. This work was supported by National Institutes of Health (NIH) grants 1K22CA143148 to MGK (PI) and R01LM009722 to MGK (co-investigator), and by an American Cancer Society ACS-IRG grant to MGK (PI).
The publication costs for this article were funded by the above grant.
This article has been published as part of BMC Genomics Volume 14 Supplement 3, 2013: SNP-SIG 2012: Identification and annotation of SNPs in the context of structure, function, and disease. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/14/S3
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.