In silico comparative characterization of pharmacogenomic missense variants
- Biao Li†1,
- Chet Seligman†1,
- Janita Thusberg1,
- Jackson L Miller1,
- Jim Auer1,
- Michelle Whirl-Carrillo2,
- Emidio Capriotti3,
- Teri E Klein2 and
- Sean D Mooney1Email author
© Li et al.; licensee BioMed Central Ltd. 2014
Published: 20 May 2014
Missense pharmacogenomic (PGx) variants refer to amino acid substitutions that potentially affect the pharmacokinetic (PK) or pharmacodynamic (PD) response to drug therapies. The PGx variants, as compared to disease-associated variants, have not been investigated as deeply. The ability to computationally predict future PGx variants is desirable; however, it is not clear what data sets should be used or what features are beneficial to this end. Hence we carried out a comparative characterization of PGx variants with annotated neutral and disease variants from UniProt, to test the predictive power of sequence conservation and structural information in discriminating these three groups.
126 PGx variants of high quality from PharmGKB were selected and two data sets were created: one set contained 416 variants with structural and sequence information, and, the other set contained 1,265 variants with sequence information only. In terms of sequence conservation, PGx variants are more conserved than neutral variants and much less conserved than disease variants. A weighted random forest was used to strike a more balanced classification for PGx variants. Generally structural features are helpful in discriminating PGx variant from the other two groups, but still classification of PGx from neutral polymorphisms is much less effective than between disease and neutral variants.
We found that PGx variants are much more similar to neutral variants than to disease variants in the feature space consisting of residue conservation, neighboring residue conservation, number of neighbors, and protein solvent accessibility. Such similarity poses great difficulty in the classification of PGx variants and polymorphisms.
In humans missense variants that introduce amino acid substitutions in proteins have received extensive study as the rich knowledge we have accumulated on protein function and structure greatly facilitates such investigation. One intensive research area in this field currently focuses on identifying and annotating variants associated with important functional or phenotypic consequences [1, 2]. In the latter, particularly for disease variant prediction, numerous powerful predictive tools using a variety of feature sets have been proposed. For instance, SIFT  largely explored sequence conservation information, PolyPhen2  used physicochemical and crystal structure properties, SNPs&GO included functional information , and MutPred  integrated a comprehensive list of changes derived from protein sequence. These tools all operate similarly, a classification model based on two curated data sets, e.g., disease and neutral variant data (the latter being usually referred as polymorphism) is constructed, or alternatively, from functional and non-functional variant data. Compared with the many efforts invested in predicting disease variants , the prediction of missense PGx variants by similar approaches has not been studied nearly as thoroughly.
PGx variants may affect one's response to a drug through influencing either PK or PD machinery. The clinical procedure for identifying PGx variants usually involves measuring drug concentration and several response parameters within a cohort, such as blood pressure or body temperature, during a period of time . Also PGx variants can be identified from GWAS studies or through candidate gene approaches in large retrospective cohorts. However these approaches are either time-consuming or resource-intensive, and computational alternatives are highly valuable and may aid in characterization of whole genome sequence interpretation . An attempt to build a counterpart to MutPred in making missense PGx variant prediction revealed two issues.
The first issue revolves around the usefulness of the recognized features that have proven to be beneficial in disease-neutral variant classification. One of the salient distinctions between disease and neutral variants is that disease variants generally occur on much conserved sites whereas the positions of neutral variants lack such restriction [3, 10]. Therefore, it is not a surprise to see that protein sequence conservation serves as the most important single feature in discriminating disease variants from neutral ones . Additionally, structural information, especially conservation of neighboring residues in 3D protein structures, has been demonstrated to be very useful in classifying catalytic residues [11, 12] and other functional sites [13, 14]. Unlike disease or other functional variants, PGx variants have undergone much less selective pressure as the corresponding evolutionary environment--modern medicine--is relatively recent. Hence, PGx variants may not possess the same strong sequence conservation patterns as disease variants do.
The second issue is that it is not clear about what variants may constitute a biologically sound non-PGx group. The selection of the negative set is potentially affected by the presence of functional variants among the polymorphisms . The most prominent data repository for PGx variants is PharmGKB (The Pharmacogenomics Knowledgebase; http://www.pharmgkb.org)  which has curated about 5,000 variants from more than 900 genes. Only a small portion of these variants is missense. The relatively small coverage of PharmGKB suggests that available PGx variants are less representative and many variants outside PharmGKB at present may exhibit PGx effect in the future studies.
To address the above issues, we collected a set of high-quality and likely causative PGx variants from PharmGKB and performed comparison of PGx with neutral and disease variants.
We applied a rigorous process to search for high-quality and likely causative PGx variants in PharmGKB (see Methods), and identified 55 PD and 71 PK missense variants with high confidence. Due to the small size of these two variant sets, we merged them into one group to increase statistical power in the analysis. One important issue in the classification of functional variants is the selection of an accurate negative set of non-functional ones. At the moment there is no common practice to address this issue. Thus, in this work we consider all the variants annotated in UniProt  as "Polymorphism" are functionally neutral. Accordingly, in the rest of the paper we use the word neutral to represent the missense variants annotated in UniProt as "Polymorphism".
Protein and variant distribution across protein types.
Feature generation from protein structure and sequence
Sequence conservation has been shown to be powerful in discrimination of disease and neutral variants [3, 10, 18–20]. Here, we measured sequence conservation for each amino acid site by calculating a normalized evolutionary rate from a global multiple species alignment in UCSC genome browser . Because an earlier version of such data had been successfully explored in concentrating on benign with damaging missense variants , we followed this approach but used the Rate4Site tool  for higher accuracy of conservation estimation. Through searching the Protein Data Bank (PDB) , we identified 31 proteins with X-ray structures that covered a total of 416 variants (187 neutral, 174 disease, and 55 PGx). The following features based on sequence conservation and protein structure were derived for 416 variants: average conservation scores of neighboring residues in 3D, number of neighbors, accessibility, and secondary structure. Other features, which could be generated from sequence, include predicted B-value, protein stability, hydrophobic property, and change in molecular weight, where B-value is a normalized B-factor of C-α of each residue and reflects the structural flexibility around it . Two sets of variants were created: a structure data set with 416 variants and 10 features, and a sequence data set with 1,265 variants and six features (see Additional files 1 and 2 respectively).
Feature difference among three groups of variants
Features showing different distribution among three groups of variant.
6.1 × 10-9
7.6 × 10-8
4.8 × 10-6
1.2 × 10-5
7.4 × 10-25
Classification performance measurements from two-group comparison (%).
Allele frequency of three groups of variants
Comparison of predicted pathogenesis among three groups of variants
Although our sequence and structural feature-based random forest models are less successful in classifying PGx and neutral variants than in classifying disease and the other two groups of variants, it may be informative to bring in some well-established tools for pathogenesis prediction to this task. We then applied three widely-used pathogenesis prediction tools, i.e., SIFT , PolyPhen2 , and MutPred , to all 1,265 variants (see Additional file 3). Unlike PolyPhen2 and MutPred, SIFT outputs for each non-synonymous variant a score measuring the level of tolerance and thus a lower score denotes higher probability of being pathogenetic (Figure 3B). We observed that both neutral and PGx variants are significantly less pathogenetic than disease variants (disease vs. neutral: P = 6.3 × 10-53; disease vs. PGx: P = 6.7 × 10-13). Both PolyPhen2 and MutPred generate probability-like score in estimating how likely a variant is to be pathogenetic, and both tools show similar score distributions among disease, neutral, and PGx variants (Figure 3C and Figure 3D), where scores for disease variants are significantly higher than scores for either neutral or PGx variants. Across all three tools, the average score for PGx variants sits between those for neutral and disease variants, and one plausible explanation is that PGx variants could possess certain weak deleterious effects compared to neutral variants but these effects are far less severe than effects of disease variants.
Classification performance measurements from three tools (%).
From this study PGx variants appear at sites with relatively low sequence conservation and are common in global human populations, and their physicochemical properties resemble those of neutral variants. An appealing explanation for such observations may be that PGx variants have come to be under selective pressure very recently in evolutionary history. The story, however, may be more complex than this as the finial phenotypic effect of PGx variants is not as straightforward as that of disease variants. While typically variants involved in Mendelian diseases cause loss of function and manifest narrow phenotypic spectra, PGx variants have subtler effects on the PK and PD of drugs. For instance, only 6% of disease variants in current study have been annotated with more than one disease, while about half of PGx variants are related to two or more drugs and their effects depend on the very drugs and the conditions they target. Another thread to make things more complex is that the exact effects of a PGx variant may largely depend on the drug dosage and increments in the plasmatic concentration of the drug resulting from such a variant may have divergent effects.
Imbalanced data are common in biomedical research. In our data, the ratio of the number of variants from the most frequent group to the least frequent is 3.4 in the structure data set and 5.2 in the sequence data set. Without tuning the classification algorithms, which by default assume a balanced training set, we observe a higher error rate in the minority group. Unbalanced performance on the two subsets is due to the higher contribution of the most abundant class to the overall accuracy. In two-group discrimination, we achieved approximately balanced sensitivity and specificity across six classification settings by properly assigning higher weight to the PGx variant group. The benefit of such weighting was more noticeable in three-group discrimination. Without modification to the default random forest, the PGx group in the sequence data is almost completely misclassified, rendering the entire classification approach far less sensible.
In current study sequence conservation and structural information have shown limited capability in discriminating PGx variants from neutral variants. One approach to improve classification without a dramatic increase in the size of the PGx variant dataset might lie in integrating novel features, such as those derived from 3D structure. One of such potential alternative features is free energy change (∆∆G) caused by a missense variant, which can be well estimated by FoldX  through protein structural calculations. A recent study in the context of dynamic pathway analysis found that ∆∆G calculated from FoldX is directly linked to cell growth and such link can be expressed in ordinary differential equations accurately . Thus ∆∆G may serve as an indicator to the potential influence a variant could cause. Comparing with the classification of disease and neutral variants, the similar problem between PGx and neutral variants is less amenable to otherwise well-established methodologies. While structure-based features are not as accessible as those derived from sequences, we did observe that they help improve the performance of classification slightly. This observation may in turn suggest that exploring more structure-related feature might prove to be profitable.
We have determined that PGx variants possess a higher degree of similarity to neutral genetic polymorphisms than to disease variants in the feature space consisting of residue conservation, neighboring conservation, number of neighbors, and accessibility. Such similarity poses a great difficulty in the classification of PGx and neutral variants. Future studies focusing on predicting PGx variants may need a better definition of non-PGx variants or to explore novel features outside those that have been demonstrated to be useful in disease-neutral variant classification.
All protein variants annotated as associated with a PGx trait were extracted from PharmGKB (http://www.pharmgkb.org). Two investigators then examined each variant further through curating the original publication from PharmGKB for a causative relationship between the variant and PK/PD effects. The selected variants were then additionally confirmed by PharmGKB scientific curators.
Calculating residue conservation
We downloaded from UCSC genome browser multiple alignments of 45 vertebrate genomes with Human that includes amino acid alignment from hg19/GRCh37 RefSeq genes (http://hgdownload.soe.ucsc.edu/goldenPath/hg19/multiz46way/alignments/refGene.exonAA.fa.gz) . We then applied the tool Rate4Site  to the multiple protein sequence alignment to estimate the relative evolutionary rate for each site, with the supplied phylogenetic tree that was used to calculate PhastCons scores in UCSC genome browser at the nucleotide level. The mapping between RefSeq and UniProt sequences was carried out by the global alignment program ggsearch in FASTA .
Three-dimensional structures were downloaded from the PDB . For each protein, we chose the structure with the largest coverage or the highest resolution if multiple structures with the same coverage were available, and then searched for neighboring residues and extracted secondary structure for each variant based on such structure. To define the neighboring residues to a given amino acid in a protein structure we followed the definition of Cilia & Passerini . There every residue is first represented by the centroid of its side-chain heavy atoms and then any residue falling in a sphere with a radius of eight angstroms centered on the given residue is defined as a neighbor. For glycine, the alpha-carbon was used as the centroid. Secondary structure and accessibility (in terms of number of water molecules) of each variant were directly read from the output of DSSP [29, 30] as applied to the corresponding PDB file. The PDB residue sequences were mapped to the UniProt sequences as we did in mapping the RefSeq and UniProt sequences. Protein stability changes due to variants were estimated by MUpro  which estimates both energy changes and the reliability of underlying prediction.
All computation was carried out in R (http://www.r-project.org). The Kendall correlation coefficient and test was computed by the function cor.test, and Wilcoxon rank sum test was computed by the function wilcox.test. Area under the curve (AUC) was calculated through the R package ROCR . Classification performance measurements are defined below:
Sensitivity = TP/(TP + FN)
Specificity = TN/(TN + FP)
Accuracy = (Sensitivity + Specificity)/2
Precision = TP/(TP + FP)
F-measure = 2 × Precision × Sensitivity/(Precision + Sensitivity)
Weighted random forest
The source code for implementing the random forest was obtained from Leo Breiman's web site (http://www.stat.berkeley.edu/~breiman/RandomForests/cc_examples/prog.f) and compiled as a single program for each experiment by the GNU Fortran compiler with the following parameters: -std=legacy -ffixed-form -ffixed-line-length-none -fno-range-check. The out-of-bag predictions from each experimented were weighted internally according to supplied control parameters and normalized to produce prediction scores, which were then used in computing individual measurements.
We thank two anonymous reviewers for their helpful suggestions that substantially improved this manuscript. This work is supported by NIH R01 LM009722 (PI: Mooney), NIH U54-HG004028 (PI: Musen), NIH T32-AG000266 (PIs: Campisi and Ellerby), NIH UL1DE019608 supporting the Interdisciplinary Research Consortium on Geroscience (PI: Lithgow), NIH RL9AG032114 (U54 Geroscience), the NCBO, the Buck Trust and NIH NIGMS R24 GM061374 (PIs: Altman and Klein). EC is supported by start-up funds from the Department of Pathology at the University of Alabama at Birmingham.
The publication costs for this article were funded by a grant from NIH R01 LM009722 (PI: Mooney).
This article has been published as part of BMC Genomics Volume 15 Supplement 4, 2014: SNP-SIG 2013: Identification and annotation of genetic variants in the context of structure, function, and disease. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/15/S4
- Fernald GH, Capriotti E, Daneshjou R, Karczewski KJ, Altman RB: Bioinformatics challenges for personalized medicine. Bioinformatics. 2011, 27 (13): 1741-1748.PubMedPubMed CentralView ArticleGoogle Scholar
- Capriotti E, Nehrt NL, Kann MG, Bromberg Y: Bioinformatics for personal genome interpretation. Briefings in bioinformatics. 2012, 13 (4): 495-512.PubMedPubMed CentralView ArticleGoogle Scholar
- Ng PC, Henikoff S: Predicting deleterious amino acid substitutions. Genome research. 2001, 11 (5): 863-874.PubMedPubMed CentralView ArticleGoogle Scholar
- Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, Gerasimova A, Bork P, Kondrashov AS, Sunyaev SR: A method and server for predicting damaging missense mutations. Nature methods. 2010, 7 (4): 248-249.PubMedPubMed CentralView ArticleGoogle Scholar
- Calabrese R, Capriotti E, Fariselli P, Martelli PL, Casadio R: Functional annotations improve the predictive score of human disease-related mutations in proteins. Human mutation. 2009, 30 (8): 1237-1244.PubMedView ArticleGoogle Scholar
- Li B, Krishnan VG, Mort ME, Xin F, Kamati KK, Cooper DN, Mooney SD, Radivojac P: Automated inference of molecular mechanisms of disease from amino acid substitutions. Bioinformatics. 2009, 25 (21): 2744-2750.PubMedPubMed CentralView ArticleGoogle Scholar
- Genomes Project C, Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, Kang HM, Marth GT, McVean GA: An integrated map of genetic variation from 1,092 human genomes. Nature. 2012, 491 (7422): 56-65.View ArticleGoogle Scholar
- Wu R, Lin M: Statistical and computational pharmacogenomics. 2009, Boca Raton: Chapman & Hall/CRCGoogle Scholar
- Lahti JL, Tang GW, Capriotti E, Liu T, Altman RB: Bioinformatics and variability in drug response: a protein structural perspective. Journal of the Royal Society, Interface / the Royal Society. 2012, 9 (72): 1409-1437.PubMedPubMed CentralView ArticleGoogle Scholar
- Kumar S, Suleski MP, Markov GJ, Lawrence S, Marco A, Filipski AJ: Positional conservation and amino acids shape the correct diagnosis and population frequencies of benign and damaging personal amino acid mutations. Genome research. 2009, 19 (9): 1562-1569.PubMedPubMed CentralView ArticleGoogle Scholar
- Cilia E, Passerini A: Automatic prediction of catalytic residues by modeling residue structural neighborhood. BMC bioinformatics. 2010, 11: 115-PubMedPubMed CentralView ArticleGoogle Scholar
- Xin F, Myers S, Li YF, Cooper DN, Mooney SD, Radivojac P: Structure-based kernels for the prediction of catalytic residues and their involvement in human inherited disease. Bioinformatics. 2010, 26 (16): 1975-1982.PubMedPubMed CentralView ArticleGoogle Scholar
- Janda JO, Meier A, Merkl R: CLIPS-4D: a classifier that distinguishes structurally and functionally important residue-positions based on sequence and 3D data. Bioinformatics. 2013Google Scholar
- Capriotti E, Altman RB: Improving the prediction of disease-related variants using protein three-dimensional structure. BMC bioinformatics. 2011, 12 (Suppl 4): S3-PubMedPubMed CentralView ArticleGoogle Scholar
- Bromberg Y, Kahn PC, Rost B: Neutral and weakly nonneutral sequence variants may define individuality. Proceedings of the National Academy of Sciences of the United States of America. 2013, 110 (35): 14255-14260.PubMedPubMed CentralView ArticleGoogle Scholar
- Whirl-Carrillo M, McDonagh EM, Hebert JM, Gong L, Sangkuhl K, Thorn CF, Altman RB, Klein TE: Pharmacogenomics knowledge for personalized medicine. Clinical pharmacology and therapeutics. 2012, 92 (4): 414-417.PubMedPubMed CentralView ArticleGoogle Scholar
- UniProt C: Activities at the Universal Protein Resource (UniProt). Nucleic acids research. 2014, 42 (1): D191-198.Google Scholar
- Bottema CD, Ketterling RP, Ii S, Yoon HS, Phillips JA, Sommer SS: Missense mutations and evolutionary conservation of amino acids: evidence that many of the amino acids in factor IX function as "spacer" elements. American journal of human genetics. 1991, 49 (4): 820-838.PubMedPubMed CentralGoogle Scholar
- Ng PC, Henikoff S: Predicting the effects of amino acid substitutions on protein function. Annual review of genomics and human genetics. 2006, 7: 61-80.PubMedView ArticleGoogle Scholar
- Capriotti E, Calabrese R, Casadio R: Predicting the insurgence of human genetic diseases associated to single point protein mutations with support vector machines and evolutionary information. Bioinformatics. 2006, 22 (22): 2729-2734.PubMedView ArticleGoogle Scholar
- Meyer LR, Zweig AS, Hinrichs AS, Karolchik D, Kuhn RM, Wong M, Sloan CA, Rosenbloom KR, Roe G, Rhead B, et al: The UCSC Genome Browser database: extensions and updates 2013. Nucleic acids research. 2013, 41 (Database): D64-69.PubMedPubMed CentralView ArticleGoogle Scholar
- Pupko T, Bell RE, Mayrose I, Glaser F, Ben-Tal N: Rate4Site: an algorithmic tool for the identification of functional regions in proteins by surface mapping of evolutionary determinants within their homologues. Bioinformatics. 2002, 18 (Suppl 1): S71-77.PubMedView ArticleGoogle Scholar
- Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic acids research. 2000, 28 (1): 235-242.PubMedPubMed CentralView ArticleGoogle Scholar
- Radivojac P, Obradovic Z, Smith DK, Zhu G, Vucetic S, Brown CJ, Lawson JD, Dunker AK: Protein flexibility and intrinsic disorder. Protein science : a publication of the Protein Society. 2004, 13 (1): 71-80.View ArticleGoogle Scholar
- Chen C, Liaw A, Breiman L: Using random forest to learn imbalanced data. University of California, Berkeley. 2004Google Scholar
- Benedix A, Becker CM, de Groot BL, Caflisch A, Bockmann RA: Predicting free energy changes using structural ensembles. Nature methods. 2009, 6 (1): 3-4.PubMedView ArticleGoogle Scholar
- Cheng TM, Goehring L, Jeffery L, Lu YE, Hayles J, Novak B, Bates PA: A structural systems biology approach for quantifying the systemic consequences of missense mutations in proteins. PLoS computational biology. 2012, 8 (10): e1002738-PubMedPubMed CentralView ArticleGoogle Scholar
- Pearson WR, Lipman DJ: Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences of the United States of America. 1988, 85 (8): 2444-2448.PubMedPubMed CentralView ArticleGoogle Scholar
- Joosten RP, te Beek TA, Krieger E, Hekkelman ML, Hooft RW, Schneider R, Sander C, Vriend G: A series of PDB related databases for everyday needs. Nucleic acids research. 2011, 39 (Database): D411-419.PubMedPubMed CentralView ArticleGoogle Scholar
- Kabsch W, Sander C: Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 1983, 22 (12): 2577-2637.PubMedView ArticleGoogle Scholar
- Cheng J, Randall A, Baldi P: Prediction of protein stability changes for single-site mutations using support vector machines. Proteins. 2006, 62 (4): 1125-1132.PubMedView ArticleGoogle Scholar
- Sing T, Sander O, Beerenwinkel N, Lengauer T: ROCR: visualizing classifier performance in R. Bioinformatics. 2005, 21 (20): 3940-3941.PubMedView ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.