Volume 14 Supplement 3
Identifying Mendelian disease genes with the Variant Effect Scoring Tool
© Carter et al.; licensee BioMed Central Ltd. 2013
Published: 28 May 2013
Whole exome sequencing studies identify hundreds to thousands of rare protein coding variants of ambiguous significance for human health. Computational tools are needed to accelerate the identification of specific variants and genes that contribute to human disease.
We have developed the Variant Effect Scoring Tool (VEST), a supervised machine learning-based classifier, to prioritize rare missense variants with likely involvement in human disease. The VEST classifier training set comprised ~ 45,000 disease mutations from the latest Human Gene Mutation Database release and another ~45,000 high frequency (allele frequency > 1%) putatively neutral missense variants from the Exome Sequencing Project. VEST outperforms some of the most popular methods for prioritizing missense variants in carefully designed holdout benchmarking experiments (VEST ROC AUC = 0.91, PolyPhen2 ROC AUC = 0.86, SIFT4.0 ROC AUC = 0.84). VEST estimates variant score p-values against a null distribution of VEST scores for neutral variants not included in the VEST training set. These p-values can be aggregated at the gene level across multiple disease exomes to rank genes for probable disease involvement. We tested the ability of an aggregate VEST gene score to identify candidate Mendelian disease genes, based on whole-exome sequencing of a small number of disease cases. We used whole-exome data for two Mendelian disorders for which the causal gene is known. Considering only genes that contained variants in all cases, the VEST gene score ranked dihydroorotate dehydrogenase (DHODH) number 2 of 2253 genes in four cases of Miller syndrome, and myosin-3 (MYH3) number 2 of 2313 genes in three cases of Freeman Sheldon syndrome.
Our results demonstrate the potential power gain of aggregating bioinformatics variant scores into gene-level scores and the general utility of bioinformatics in assisting the search for disease genes in large-scale exome sequencing studies. VEST is available as a stand-alone software package at http://wiki.chasmsoftware.org and is hosted by the CRAVAT web server at http://www.cravat.us
The identification of mutations and genes underlying human genetic disease continues to be a very active area of research. Advances in DNA sequencing technology have made it possible to rapidly identify all genetic variants in an individual exome. This large-scale enumeration of variants poses considerable difficulty for the identification of disease-causing variants, as they must be singled out from among a large pool of candidates.
Rare non-synonymous single nucleotide variants (NS-SNVs) that alter protein sequence are particularly strong candidates for disease-causing variants [1–3]. Experimental assessment of protein activity for mutated proteins is very difficult, and is further impeded by the large number of NS-SNVs revealed by exome sequencing studies; in general, sequencing identifies 3000 - 6000 NS-SNVs per exome. This has motivated the development of many statistical and computational methods for evaluating the functional impact of non-synonymous changes on proteins. The methods fall broadly into two categories, those that score mutations on the basis of biological principles (SIFT , MutationAssessor , MAPP , PANTHER , among others), and methods that use existing knowledge about the functional effects of mutations in the form a training set for supervised machine learning (PolyPhen2 , SNAP , SNPs3D , MutPred , MutationTaster , among others) . These methods are known to perform well at distinguishing Mendelian disease mutations from common single nucleotide polypmorphisms  and usually offer either a numeric score that represents the predicted functional impact of an amino acid substitution, or a probability that the substitution is deleterious to the protein. Mutation scores can be used to substantially reduce the number of candidate disease-causing mutations detected in exome sequencing studies, but additional evidence is still needed to identify the causal mutation unequivocally. One recent strategy that has had considerable success in identifying both causal genes and mutations underlying Mendelian disorders is to assess several disease exomes together in order to reduce the number of candidate disease genes. Sequencing several exomes from unrelated individuals with the same Mendelian disorder can substantially reduce the list of candidate mutations/genes since all cases are expected to result from mutations affecting the same gene. Ng et al. sequenced the exomes of four individuals with Miller syndrome  and four individuals with Freeman Sheldon syndrome  to identify the subset of genes mutated in all individuals in each group. In both studies, mutations in the remaining genes were filtered against dbSNP and variants detected in 8 individuals from the HapMap project in order to eliminate neutral human variation. Using this strategy, Ng et al. were able to identify DHODH, encoding the enzyme dihydroorotate dehydrogenase, as the causal gene underlying the four Miller syndrome exomes and MYH3, encoding myosin-3, as the gene underlying the four Freeman Sheldon syndrome exomes. MYH3 had previously been identified as the gene underlying Freeman Sheldon syndrome via careful selection of candidate causal genes for sequencing based on similarity to other Mendelian disorders with known genetic causes . The work by Ng et al. demonstrated that a data-driven approach can be useful in identifying disease genes when there is no a priori list of candidate genes.
We postulated that it should be possible to identify causal genes on the basis of enrichment for functional mutations across disease exomes without filtering out common variants first. Our reasoning was that enrichment for functional mutations in a gene underlying a common phenotype would provide a stronger signal than functional mutations occurring in genes unrelated to the phenotype. To test this hypothesis, we applied our supervised machine learning-based method for predicting functional mutations to the variants observed in the four Miller syndrome exomes and three of the four Freeman Sheldon syndrome exomes described by Ng et al. Our Variant Effect Scoring Tool (VEST) was trained on a positive class of missense variants from the Human Gene Mutation Database  (2012v2) and a negative class of common missense variants detected in the Exome Sequencing Project population . VEST performed well at identifying functional mutations in careful benchmark experiments, with an area under the Receiver Operating Characteristic of 0.92. One difference between VEST and other existing methods for predicting functional mutations is that VEST uses a statistical hypothesis testing framework to assign p-values to predictions. In order to identify the causal genes, we aggregated VEST p-values for all mutations by gene across all disease exomes to create a gene-level statistic. Ranking genes based on this statistic placed both MYH3 and DHODH among the top 2 candidate causal genes.
Our gene prioritization approach differs from current statistical methods for gene burden testing (reviewed in [19, 20]) in that we do not require a control population to prioritize genes, although a matched control population, if available, is useful for eliminating genes that are artifactually significant as a result of sequencing and variant calling errors. In addition, we show that the causal genes can be identified without including allele frequency information, although we expect that using allele frequency information to filter the list of variants detected in exome sequencing should make it easier to detect disease genes.
The effectiveness of gene burden algorithms is often evaluated on the basis of statistical power to detect simulated disease genes . We evaluated our gene score using this approach by simulating disease genes over a range of parameters representative of real exome sequencing data. We also performed simulations to evaluate the potential consequences of VEST misclassification error for correct disease gene prioritization. The results of the simulations support the potential of our p-value-based gene score to correctly identify Mendelian disease genes using VEST functional predictions.
Results and discussion
The Variant Effect Scoring Tool (VEST) is a new method for prioritizing missense mutations that alter protein activity. VEST uses a supervised machine learning algorithm, Random Forest [22, 23], to identify likely functional missense mutations. The training set is a positive class of missense variants from the Human Gene Mutation Database and a negative class of common missense variants detected in the Exome Sequencing Project (ESP) population. An allele frequency of 1% or higher is considered common for this work. All mutations are described by a set of 86 quantitative features.
Although the VEST training set is designed to identify missense mutations that alter protein activity, its construction includes the additional context of selective difference between disease and common variants. The disease class from HGMD is enriched for variants under purifying selection, whereas the neutral class from ESP common variants includes variants under positive selection in the human population in addition to truly neutral variants . If we assume that the VEST classifier can generalize to mutations not in its training set, it should be able to take a set of variants from a sequenced exome and prioritize functional mutations which are similar to disease mutations over those which are similar to neutral mutations or to functional mutations under positive selection.
We compared VEST to two popular methods for predicting functional mutations, SIFT 4.0  and PolyPhen2  using a simple holdout strategy to achieve an unbiased comparison. The holdout strategy is necessary to ensure that trained classifiers do not have an unfair advantage in predicting benchmark set mutation class membership as can occur when benchmark mutations are included in the classifier training set. PolyPhen2 uses supervised machine learning to train a classifier, while SIFT4.0 does not. It was therefore necessary to design a benchmark set that did not include any mutations in the PolyPhen training set. Once again, we assumed that a higher correlation between the feature vectors of mutations in the same gene could lead to an advantage for correctly predicting mutation class, so we constructed our benchmark set such that only mutations in genes that had no mutations in the PolyPhen2 classifier training set were selected for the benchmark set.
Benchmark test set
Benchmark training set
Benchmark mutation coverage.
Classifier performance. Classifier performance on the benchmark test set. Area under the curve (AUC) statistics for Receiver Operating Characteristics and Precision-Recall curves are shown
Performance on benchmark
Performance on benchmark (subset scored by all methods)
Predicting functional mutations with VEST
VEST assigns variants a score between 0 and 1, where 1 indicates a confident prediction of a functional mutation. We use a statistical hypothesis testing framework to assess the significance of each score and adjust for multiple testing as described previously . Briefly, a set of missense mutations representative of neutral human variation provide a null model for testing the null hypothesis that a given VEST score indicates a neutral mutation. If the VEST score falls into the right tail (since large VEST scores indicate a prediction of disease) we can uses the null model to assess the probability of a neutral mutation from our null model receiving a VEST score of the same magnitude by chance. In general we expect to make a large number of comparisons to the null distribution because the number of missense mutations detected in exome sequencing is large, therefore we estimate a false discovery rate (FDR) cutoff for each p-vale using the Benjamini-Hochberg method . The p-value and FDR associated with each mutation can be used to determine a score cutoff for accepting a functional prediction. The precise cutoff is subjective, allowing the user to determine an acceptable tradeoff between the number of candidate functional mutations identified, and the estimated rate of false positive predictions. For this work, we modeled neutral missense mutations using common variants (AF ≥ 1%) observed in the 1000 Genomes Project , and used the VEST score distribution for these mutations to estimate p-values.
We attempted to improve VEST classifier performance using feature selection to remove features that were uninformative for discriminating between disease and neutral mutation classes. The predictive power of individual features was assessed using mutual information (MI), which provides a measure of the amount by which the knowledge provided by the feature decreases the uncertainty about the mutation class label . To acquire an unbiased estimate of improved performance, we partitioned the training set into 3 parts: one for estimating mutual information for the 86 features, one for training a classifier with selected featuers, and one for evaluating the resulting change in classifier performance. We selected 47 features based on a threshold of MI ≥ 0.001 bits. Using only the 47 selected features for classifier training gave a slight boost to sensitivity, but reduced classifier specificity (ROC AUC = 0.94 PR AUC = 0.89). MI thresholds of MI ≥ 0 bits and MI ≥ 0.01 bits were also tested (ROC AUC = 0.81 PR AUC = 0.72 and ROC AUC = 0.94 PR AUC = 0.88 respectively). Based on these findings, we elected to use the full set of 86 features for VEST classifier construction.
Gene prioritization in Mendelian Disorders
Mendelian exome processing
Variant calling results.
Common mutated genes
Miller Syndrome 1
Miller Syndrome 2
Miller Syndrome 3
Miller Syndrome 4
Freeman Sheldon 1
Freeman Sheldon 2
Freeman Sheldon 3
Mendelian exome variant scoring
We mapped variants detected in the four Miller syndrome exomes and the three Freeman Sheldon syndrome exomes onto proteins. Variants that resulted in an amino acid substitution were grouped into missense mutations and truncating mutations. For this work, four types of mutation were considered truncating: single nucleotide changes resulting in nonsense and nonstop mutations, frameshift mutations, and deletions eliminating a splice site. Variants resulting in silent changes or in-frame insertions and deletions were discarded. Missense mutations detected in each exome were scored with VEST. Truncating mutations were not initially considered since VEST is designed specifically for missense mutations. For the purpose of identifying DHODH and MYH3 from the missense mutations detected in the disease exomes, we trained two custom VEST classifiers. The first classifier used a training set filtered to eliminate all MYH3 mutations; and the second was filtered to eliminate all DHODH mutations. Using these classifiers to score missense mutations ensured that DHODH and MYH3 mutations did not receive artificially significant VEST scores because of overlap with the training set. This was necessary to ensure an unbiased assessment of our gene score for identifying candidate causal genes.
Selecting a method to aggregate VEST scores by gene
Gene rank and p-value assigned to causal gene in two Mendelian disorders.
Mean VEST Rank
Mean VEST score
Freeman Sheldon syndrome/MYH3
Freeman Sheldon syndrome/MYH3
Dominant model including truncating
Freeman Sheldon syndrome/MYH3
Recessive model including truncating
Accounting for Mendelian disease models
Mendelian disease gene detection under locus heterogeneity.
MYH3 mutation VEST scores
Fisher Rank (MYH3)
Stouffer Rank (MYH3)
Fisher Rank (DHODH)
Stouffer Rank (DHODH)
DHODH mutation VEST scores
Fisher Rank (MYH3)
Stouffer Rank (MYH3)
Fisher Rank (DHODH)
0.33, 0.98, 0.859
0.33, 0.952, 0.523
0.33, 0.98, 0.859
Including truncating mutations
where AF is allele frequency and M represents the VEST scores of all missense mutations observed in the sequenced exomes. Thus, low frequency truncating events received high VEST scores. We then estimated p-values for the truncating mutation scores using the 1000 Genomes null model. We then recalculated gene scores using p-values for both missense and truncating mutations. When truncating mutations were included, DHODH and MYH3 dropped in rank among prioritized genes to 3 and 12 respectively (Table 5). This is not surprising, since a number of the truncating mutations are present at low frequency or result from what appear to be private mutations and therefore receive VEST scores close to 1. It should be noted that indel calling is difficult  and indel calling algorithms are thought to have a high false positive rate for calling frameshift events . The possibility that novel frameshift mutations detected by variant calling pipelines may be enriched for false mutations suggests that it will be important to develop a better scoring algorithm for truncating mutations when including them for gene prioritization.
Effect of null distribution on gene score
MYH3 and DHODH rankings using different null models.
Swissprot null model
1000 genomes null model
Complete genomics null model
Interestingly, the null model based on SwissProt polymorphisms showed more enrichment for functional VEST scores than the other null models. This suggests that some SwissProt polymorphisms may not be functionally neutral, or that VEST misclassifies a subset of them because of some difference between the SwissProt polymorphisms and the common variants used to train the VEST classifier.
Assessing the power of the gene score to detect disease genes
Contrasting gene scores in Freeman Sheldon syndrome and Miller syndrome
Mutations detected in exome sequencing include artifactual mutations resulting from DNA sequencing and variant calling errors. These false variants can confound the search for true disease causing mutations/genes. Since sequencing errors are not associated with disease phenotype, it is often possible to filter them out by comparing to a control population. Specifically, systematic sequencing error should be apparent because it will result in similar rates of false mutations in both disease exomes and controls, while the true causal mutations should be present in only the disease exomes. To effectively use this approach, both disease exomes and control exomes must be processed using the same DNA sequencing technologies and variant calling pipelines .
Top genes in Freeman Sheldon syndrome exomes after comparison to Miller syndrome Exomes.
Whole exome sequencing studies for the characterization of disease variation uncover thousands of candidate causal mutations per individual. In this work, we have developed a new, accessible tool, VEST, for prioritizing functional mutations among those detected through exome sequencing. We show that simple aggregation of VEST prediction p-values across genes and disease exomes can be used to identify disease genes in Mendelian disorders. In two Mendelian disorders with known causation, our gene score places the causal genes among the top 5 candidate genes with no prior filtering to reduce the list of candidates, and in the top 2 when only genes that meet Mendelian criteria are considered. This serves as proof of principle that bioinformatics methods for scoring mutations can be used directly for disease gene discovery in exome sequencing studies, through a simple application that could easily be automated. Disease related mutations and genes are more difficult to identify in complex disorders because there is less concentration of causal variants in individual genes, disease causing variants often have variable penetrance and environmental factors can greatly modify disease risk. The approach described here may nonetheless help to identify genes, or pathways consisting of multiple genes, that are enriched for functional mutations among individuals with complex disorders. Our method is distinct from existing methods that rely on differing allele frequencies in cases versus controls to identify causal genes. Allele frequencies are not required in our approach, but can be used to filter variants prior to VEST scoring and p-value aggregation. In addition, while matched controls are useful for removing genes that are functionally impaired but not associated with the specific phenotype, our approach is still useful when controls are not available. Further experiments are planned to determine the power of our approach for identifying genes involved in complex disease.
Read mapping and variant calling
Four Miller syndrome exomes  and three Freeman-Sheldon syndrome exomes  were acquired in the form of short read archive (SRA) files from dbGaP  (Miller syndrome study accession: phs000244.v1.p1 Freeman Sheldon syndrome study accesion: phs000204.v1.p1) and converted to fastq format using fastq-dump version 2.1.12 from the SRA Toolkit http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=show&f=software&m=software&s=software. The Miller syndrome and Freeman-Sheldon syndrome 76bp single end Illumina Genome Analyzer reads were pre-processed using the fastx toolkit  to only include reads with a Phred-like consensus quality 20 for 50 percent of the bases. The reads were then mapped to the reference genome (UCSC hg19) using the Burrows-Wheeler Aligner (BWA v0.6.1) . The resulting outputs were sorted and indexed using samtools . Unmapped reads and duplicates were then removed using the Bamtools  and picard software http://picard.sourceforge.net. The aligned positions with a Phred-like consensus quality value 30 were retained for downstream variant calling. The Genome Analysis Toolkit (v1.6-11-g3b2fab9) then realigned intervals around potential indels and re-calibrated the base quality scores [48, 49]. GATK Analysis was limited to the human exome from the list of RefSeq exons plus an additional 10 bp at each splice site. GATK's Unified Genotyper module performed variant calling resulting in variant call format (VCF) files. The resulting VCF files (v4.1) were filtered through GATK's variant filter module using hard filter settings. SNP filtering removed calls based of the following criteria: (i) MQ 0 ≥ 4&&((MQ 0 = (1.0 * DP)) > 0.1); (ii) QUAL < 30.0 || QD < 5.0 || H Run > 5 || SB >-0.10); and (iii) cluster window size 10. Phasing for the SNP calls was determined using the GATK ReadBackPhasing algorithm restricted to the exon ranges at a phasing quality threshold of 20.0. Indel filtering removed calls based on the following criteria (i) MQ 0 ≥ 4&&((MQ 0/(1.0 * DP)) > 0.1); (ii) SB ≥ -1.0; (iii) QUAL < 10.
Single nucleotide variants were mapped onto coding regions using the SNVBox genomic coordinate mapping tool and designated as missense, nonsense, nonstop or silent. Indels were mapped onto RefSeq proteins using the refGene table in the UCSC Genome Browser . Indels were designated inframe, frameshift or splice altering.
VEST classifier training
A Random Forest classifier (ntrees = 1000, mtry = 9) was trained on 47724 missense mutations directly implicated in human inherited disease from the Human Gene Mutation Database (HGMD Professional v2012.2), and 45818 likely neutral missense mutations from the Exome Sequencing Project (ESP6500 accessed 07/2012) using parf Parallel Random Forest software http://code.google.com/p/parf/. Each mutation was described by a vector of 86 quantitative features available through the SNVBox database . HGMD disease mutations were filtered so as to exclude polymorphisms, low-confidence disease mutations and common variants (AF ≥ 1%). This filtering strategy removed 3592 polymorphisms, 4739 low confidence mutations and 492 common variants. An additional 14181 variants that did not cause missense changes were also removed. Twenty-three of the remaining variants did not have a RefSeq accession associated with the variant, 321 consisted of different genomic events that resulted in the identical missense mutation and 47 could not be annotated with features from the SNVBox because the transcript identifier was not supported. ESP6500 mutations were filtered to remove rare variants (AF < 1%), and any mutations occurring at the same codon as an HGMD disease mutation were dropped. Removing overlap with HGMD resulted in 689823 missense variants, only 46303 of which were present at (AF < 1%) and mapped onto a RefSeq NM identifier. An additional 485 ESP6500 mutations could not be annotated with features from SNVBox because the transcript identifier was not supported.
The training set was partitioned in to 3 approximately equal parts, each with a balanced number of mutations from HGMD and ESP6500. The three partitions were constrained such a that all mutations in the same gene were included in the same partition. We used mutual information between feature values and class labels estimated from the first partition to select all features that provided at least 0.01 bits, 0.001 bits or any positive amount of information about the label. We then trained a VEST classifier using partition 2 and the subset of features selected using partition 1. The 3rd partition was scored with the classifier to determine whether classifier performance improved relative to a classifier trained using all the features.
Estimating classifier generalization error
To assess VEST's generalization error, we used 5-fold cross-validation. The VEST training set was divided into 5 partitions such that each partition contained a balanced number of disease and neutral mutations. All mutations occurring in the same gene were constrained to occur in the same partition in order to avoid overly optimistic estimates of the generalization error. We did not constrain homologous genes to occur in the same partition. Receiver Operator Characteristic (ROC) and Precision-Recall (PR) curves were constructed from the predicted class labels and the AUC statistic was used as a measure of classifier performance.
Benchmark Construction and Benchmark Variant Scoring with SIFT4.0 and PolyPhen2
For benchmark set construction, we downloaded the PolyPhen2.2.2 training set release dated December 2011 (http://genetics.bwh.harvard.edu/pph2/dokuwiki/downloads). HUGO gene symbols were associated with PolyPhen2 training set variants using a local mirror of the UCSC Genome Browser . The VEST training set was then partitioned into a benchmark test set and a benchmark training set. Any mutation in the VEST training set that occurred in a gene represented in the PolyPhen2 training set was placed into the benchmark training set, and all remaining mutations were placed in the benchmark test set. More HGMD disease mutations overlapped the PolyPhen2 training set than neutral mutations, so additional mutations were removed from the benchmark test partition and placed in the benchmark training partition until a balanced number of each class was reached. To ensure unbiased benchmarking, we moved all mutations in the same gene together such that no gene was split across the benchmark training and test partitions. A VEST classifier was trained on the benchmark training set and used to score the benchmark test set. Genomic coordinates in hg19 for the benchmark test set missense mutations were submitted to the PolyPhen2 batch webserver http://genetics.bwh.harvard.edu/pph2/bgi.shtml as well as to a local installation of SIFT4.0. Since PolyPhen2 offers 2 trained classifiers, HumDiv and HumVar, this procedure was repeated twice.
Statistical hypothesis testing framework and empirical null distributions
VEST assigns missense mutations a score between 0 and 1, representing the fraction of decision trees in the Random Forest that voted for the disease mutation class. We used an empirical null model to estimate p-values for each missense mutation based on VEST scores. We constructed three empirical null distributions consisting of CHASM scores for 13639 polymorphisms from the Swissprot variant pages, 28509 variants from the 1000 Genomes Project with AF ≥ 1%, and 3421 high quality variants from the Complete Genomics diversity panel with allele count ≥5in the 69 individuals (Complete Genomics Assembly Software Version 2.0, CGATools version 1.6). P-values for each variant were estimated as the fraction of null distribution VEST scores greater than or equal to the VEST score of the variant. Each empirical null was filtered to remove overlap with the VEST training set. When scoring multiple missense mutations (n > 10), false discovery rates were estimated using the Benjamini-Hochberg method.
Truncating mutation scores
Scores for truncating mutations were calculated as 1-allele frequency multiplied by the maximum VEST score for any missense mutation. For this work, four types of mutation were considered truncating: single nucleotide changes resulting in nonsense and nonstop mutations, frameshift mutations, and deletions eliminating a splice site. We used the maximum allele frequency for the variant in the 1000 Genomes dataset, the Exome Sequencing Project European-American dataset or the Exome Sequencing Project African-American dataset.
We assessed the sensitivity of the gene scoring methods to three parameters: the magnitude of VEST scores for functional mutations, the number of mutations per gene and the fraction of mutations per gene using power simulations. We generated a population of 1000 disease genes by randomly selecting Z-scores from a null distribution (the standard normal distribution) and an alternative distribution (the standard normal distribution with a mean shift) to represent neutral and functional mutations respectively. We used effect size to represent the magnitude of VEST scores. Here, effect size is the distance between the mean of a Z-score distribution for neutral mutations and the mean of a Z-score distribution for functional mutations. If disease mutations receive strong functional scores with VEST, the distances between the means of these distributions will be large, whereas for modestly scoring functional mutations, the distance will be smaller. We tested mean shifts of 0.5. 1.0, 1.5 and 2.0 and varied the number of mutations from 0 to 50. The fraction of mutations sampled from the alternative distribution was varied from 20% to 100% at intervals of 20%. Z-scores sampled for each gene were aggregated according to Stouffer's method, or converted to p-values then aggregated according to Fisher's method, and the resulting statistic was used to assign the gene a p-value. Power was estimated as the fraction of the disease genes that received a p-value < 0.05. To simulate the effect of variant misclassification on the gene score power to detect disease genes, we simulated genes with misclassification rates based on VEST generalization error estimates. We selected four points along the ROC curve generated using 5-fold gene holdout cross validation of the VEST classifier (TPR = 60%, FPR = 5%; TPR = 70%, FPR = 10%; TPR = 80%, FPR = 15%; TPR = 90%, FPR = 25%)(see Methods: Estimating classifier generalization error).
We tested the sensitivity of our gene scoring method to locus heterogeneity by simulating a set of normal exomes and introducing disease mutations into distinct genes in subsets of the exomes. In order to ensure that simulated exomes reciprocated the variability in mutation count per gene observed in real sequencing data, we used gene-specific distributions of mutation counts from the Freeman Sheldon syndrome and Miller syndrome exomes. Approximately 7000 distinct genes were observed to harbor mutations when all seven real exomes were considered. To simulate a new exome, we sampled a gene-specific mutation count for each of the ~7000 distinct genes and then, for each gene, generated that number of non-disease associated mutations by sampling Z-scores from the standard normal distribution. Next, we selected a gene in each simulated exome to represent a Mendelian disease gene, and sampled an additional Z-score from a mean-shifted standard normal distribution to represent a disease mutation. For recessive cases, two additional mutations were generated per disease gene. Mutation scores were then aggregated for each gene, resulting in a gene level p-value and a gene rank. We ran each simulation 100 times, and varied the number of exomes created, the number of disease genes, and the disease model (dominant versus recessive).
Estimating effect size from the VEST training set
In order to estimate the expected effect size for functional mutations versus neutral mutations, we determined Z-score distributions for the disease and neutral classes of the VEST training set using VEST scores acquired by 5-fold gene holdout cross-validation. The 1000 Genomes null model was used to estimate a p-value for each mutation. The p-values were then converted to Z-scores and the distributions for disease and neutral class mutations compared. We found that the distance between the means of the two distributions was 1.85, with the mean of the neutral class Z-scores at 0.12 and the mean of the disease class Z-scores at -1.73.
Mendelian gene scores
For each Mendelian disorder dataset, we pooled all variants detected in the exomes, then grouped all variants by gene. Many Mendelian disorders are monogenic, so we removed genes that were not mutated in all case exomes. This assumption can be relaxed to include all genes or genes mutated in at least a subset of disease exomes if locus heterogeneity is suspected. Miller syndrome has previously been observed to follow a recessive inheritance pattern; therefore we also filtered the Miller syndrome data to remove genes that did not show evidence of mutation in both copies in all exomes. Evidence that both copies of a gene were mutated included homozygous mutations, or two heterozygous mutations in the same gene. Where possible, phasing information was used to rule out multiple heterozygous mutations if they occurred in the same copy of the gene.
Contrasting gene scores across groups
We used R statistical software to fit a linear model to 1640 genes that were scored in both Freeman Sheldon syndrome and Miller syndrome exomes. Freeman Sheldon syndrome gene p-values generated with the Stouffer Z-score method were used as the dependent variable, and Miller syndrome gene p-values generated with the Stouffer Z-score were used as the independent variable. The residuals resulting from the linear model were used to assess which of the top ranked genes were more likely to result from sequencing artifacts either due to the sequencing technology or because of properties of the reference genome. A large positive residual indicated p-values that were larger in Freeman Sheldon syndrome than would have been expected by the corresponding p-values in the Miller syndrome data while large negative residual indicated p-values in Freeman Sheldon syndrome that were much smaller than would have been expected in Miller data.
Funding: National Institutes of Health CA 152432, National Science Foundation DBI 0845275
We would like to thank Dr. Joel Bader and Noushin Niknafs for contributing ideas and scientific discussion.
The publication costs for this article were funded by grants from National Institutes of Health CA 152432 and National Science Foundation DBI 0845275.
This article has been published as part of BMC Genomics Volume 14 Supplement 3, 2013: SNP-SIG 2012: Identification and annotation of SNPs in the context of structure, function, and disease. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/14/S3
- Kryukov G, Pennacchio L, Sunyaev S: Most rare missense alleles are deleterious in humans: implications for complex disease and association studies. The American Journal of Human Genetics. 2007, 80 (4): 727-739. 10.1086/513473.View ArticlePubMed
- Thusberg J, Vihinen M: Pathogenic or not? And if so, then how? Studying the effects of missense mutations using bioinformatics methods. Human mutation. 2009, 30 (5): 703-714. 10.1002/humu.20938.View ArticlePubMed
- Cooper G, Shendure J: Needles in stacks of needles: finding disease-causal variants in a wealth of genomic data. Nature Reviews Genetics. 2011, 12 (9): 628-640. 10.1038/nrg3046.View ArticlePubMed
- Kumar P, Henikoff S, Ng P: Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm. Nature protocols. 2009, 4 (7): 1073-1081.View ArticlePubMed
- Reva B, Antipin Y, Sander C: Predicting the functional impact of protein mutations: application to cancer genomics. Nucleic Acids Research. 2011, 39 (17): e118-e118. 10.1093/nar/gkr407.PubMed CentralView ArticlePubMed
- Stone E, Sidow A: Physicochemical constraint violation by missense substitutions mediates impairment of protein function and disease severity. Genome research. 2005, 15 (7): 978-986. 10.1101/gr.3804205.PubMed CentralView ArticlePubMed
- Thomas P, Campbell M, Kejariwal A, Mi H, Karlak B, Daverman R, Diemer K, Muruganujan A, Narechania A: PANTHER: a library of protein families and subfamilies indexed by function. Genome research. 2003, 13 (9): 2129-2141. 10.1101/gr.772403.PubMed CentralView ArticlePubMed
- Adzhubei I, Schmidt S, Peshkin L, Ramensky V, Gerasimova A, Bork P, Kondrashov A, Sunyaev S: A method and server for predicting damaging missense mutations. Nature methods. 2010, 7 (4): 248-249. 10.1038/nmeth0410-248.PubMed CentralView ArticlePubMed
- Bromberg Y, Rost B: SNAP: predict effect of non-synonymous polymorphisms on function. Nucleic acids research. 2007, 35 (11): 3823-3835. 10.1093/nar/gkm238.PubMed CentralView ArticlePubMed
- Yue P, Moult J: Identification and analysis of deleterious human SNPs. Journal of molecular biology. 2006, 356 (5): 1263-1274. 10.1016/j.jmb.2005.12.025.View ArticlePubMed
- Li B, Krishnan V, Mort M, Xin F, Kamati K, Cooper D, Mooney S, Radivojac P: Automated inference of molecular mechanisms of disease from amino acid substitutions. Bioinformatics. 2009, 25 (21): 2744-10.1093/bioinformatics/btp528.PubMed CentralView ArticlePubMed
- Schwarz J, Rödelsperger C, Schuelke M, Seelow D: MutationTaster evaluates disease-causing potential of sequence alterations. Nature methods. 2010, 7 (8): 575-576. 10.1038/nmeth0810-575.View ArticlePubMed
- Ng P, Henikoff S: Predicting the effects of amino acid substitutions on protein function. Annu Rev Genomics Hum Genet. 2006, 7: 61-80. 10.1146/annurev.genom.7.080505.115630.View ArticlePubMed
- Ng S, Buckingham K, Lee C, Bigham A, Tabor H, Dent K, Huff C, Shannon P, Jabs E, Nickerson D et al: Exome sequencing identifies the cause of a mendelian disorder. Nature genetics. 2009, 42: 30-35.PubMed CentralView ArticlePubMed
- Ng S, Turner E, Robertson P, Flygare S, Bigham A, Lee C, Shaffer T, Wong M, Bhattacharjee A, Eichler E et al: Targeted capture and massively parallel sequencing of 12 human exomes. Nature. 2009, 461 (7261): 272-276. 10.1038/nature08250.PubMed CentralView ArticlePubMed
- Toydemir R, Rutherford A, Whitby F, Jorde L, Carey J, Bamshad M: Mutations in embryonic myosin heavy chain (MYH3) cause Freeman-Sheldon syndrome and Sheldon-Hall syndrome. Nature genetics. 2006, 38 (5): 561-565. 10.1038/ng1775.View ArticlePubMed
- Stenson P, Mort M, Ball E, Howells K, Phillips A, Thomas N, Cooper D et al: The human gene mutation database: 2008 update. Genome Med. 2009, 1: 13-10.1186/gm13.PubMed CentralView ArticlePubMed
- Exome Variant Server, NHLBI GO Exome Sequencing Project (ESP), Seattle, WA. [http://evs.gs.washington.edu/EVS/]
- Luo L, Boerwinkle E, Xiong M: Association studies for next-generation sequencing. Genome research. 2011, 21 (7): 1099-1108. 10.1101/gr.115998.110.PubMed CentralView ArticlePubMed
- Zeggini E: Next-generation association studies for complex traits. Nature genetics. 2011, 43 (4): 287-10.1038/ng0411-287.PubMed CentralView ArticlePubMed
- Liu DJ, Leal SM: A novel adaptive method for the analysis of next-generation sequencing data to detect complex trait associations with rare variants due to gene main effects and interactions. PLoS Genet. 2010, 6 (10): e1001156-10.1371/journal.pgen.1001156.PubMed CentralView ArticlePubMed
- Amit Y, Geman D: Shape quantization and recognition with randomized trees. Neural computation. 1997, 9 (7): 1545-1588. 10.1162/neco.1922.214.171.1245.View Article
- Breiman L: Random forests. Machine learning. 2001, 45: 5-32. 10.1023/A:1010933404324.View Article
- Tennessen J, Madeoy J, Akey J: Signatures of positive selection apparent in a small sample of human exomes. Genome research. 2010, 20 (10): 1327-1334. 10.1101/gr.106161.110.PubMed CentralView ArticlePubMed
- Capriotti E, Altman R: A new disease-specific machine learning approach for the prediction of cancer-causing missense variants. Genomics. 2011
- Fawcett T: ROC graphs: Notes and practical considerations for researchers. ReCALL. 2004, 31 (HPL-2003-4): 1-38.
- Davis J, Goadrich M: The relationship between Precision-Recall and ROC curves. Proceedings of the 23rd international conference on Machine learning. 2006, ACM, 233-240.
- Carter H, Chen S, Isik L, Tyekucheva S, Velculescu V, Kinzler K, Vogelstein B, Karchin R: Cancer-specific high-throughput annotation of somatic mutations: computational prediction of driver missense mutations. Cancer research. 2009, 69 (16): 6660-10.1158/0008-5472.CAN-09-1133.PubMed CentralView ArticlePubMed
- Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological). 1995, 289-300.
- Durbin R, Altshuler D, Abecasis G, Bentley D, Chakravarti A, Clark A, Collins F, Francisco M, Donnelly P, Egholm M et al: A map of human genome variation from population-scale sequencing. Nature. 2010, 467 (7319): 1061-1073. 10.1038/nature09534.View ArticlePubMed
- Battiti R: Using mutual information for selecting features in supervised neural net learning. Neural Networks, IEEE Transactions on. 1994, 5 (4): 537-550. 10.1109/72.298224.View Article
- Mailman M, Feolo M, Jin Y, Kimura M, Tryka K, Bagoutdinov R, Hao L, Kiang A, Paschall J, Phan L et al: The NCBI dbGaP database of genotypes and phenotypes. Nature genetics. 2007, 39 (10): 1181-1186.PubMed CentralView ArticlePubMed
- Fisher R, Bennett J, Yates F: Statistical Methods, Experimental Design, And Scientific Inference: A Re-issue Of Statistical Methods For Research Worker. 1990
- Stouffer S, Suchman E, Devinney L, Star S, Williams R: The American soldier: adjustment during army life.(Studies in social psychology in World War II, Vol. 1.). 1949
- Albers C, Lunter G, MacArthur D, McVean G, Ouwehand W, Durbin R: Dindel: Accurate indel calls from short-read data. Genome research. 2011, 21 (6): 961-973. 10.1101/gr.112326.110.PubMed CentralView ArticlePubMed
- Challis D, Yu J, Evani U, Jackson A, Paithankar S, Coarfa C, Milosavljevic A, Gibbs R, Yu F: An integrative variant analysis suite for whole exome next-generation sequencing data. BMC bioinformatics. 2012, 13: 8-10.1186/1471-2105-13-8.PubMed CentralView ArticlePubMed
- Yip Y, Famiglietti M, Gos A, Duek P, David F, Gateau A, Bairoch A: Annotating single amino acid polymorphisms in the UniProt/Swiss-Prot knowledgebase. Human mutation. 2008, 29 (3): 361-366. 10.1002/humu.20671.View ArticlePubMed
- Drmanac R, Sparks A, Callow M, Halpern A, Burns N, Kermani B, Carnevali P, Nazarenko I, Nilsen G, Yeung G et al: Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays. Science. 2010, 327 (5961): 78-81. 10.1126/science.1181498.View ArticlePubMed
- R Development Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria 2008, [ISBN 3-900051-07-0], [http://www.R-project.org]
- Taub M, Bravo H, Irizarry R: Overcoming bias and systematic errors in next generation sequencing data. Genome medicine. 2010, 2 (12): 1-5.View Article
- Johnson J, Gibbs J, Van Maldergem L, Houlden H, Singleton A: Exome sequencing in Brown-Vialetto-van Laere syndrome. American journal of human genetics. 2010, 87 (4): 567-10.1016/j.ajhg.2010.05.021.PubMed CentralView ArticlePubMed
- Hoischen A, Van Bon B, Gilissen C, Arts P, Van Lier B, Steehouwer M, De Vries P, De Reuver R, Wieskamp N, Mortier G et al: De novo mutations of SETBP1 cause Schinzel-Giedion syndrome. Nature genetics. 2010, 42 (6): 483-485. 10.1038/ng.581.View ArticlePubMed
- Jia P, Li F, Xia J, Chen H, Ji H, Pao W, Zhao Z: Consensus Rules in Variant Detection from Next-Generation Sequencing Data. PloS one. 2012, 7 (6): e38470-10.1371/journal.pone.0038470.PubMed CentralView ArticlePubMed
- Pearson W, Wood T, Zhang Z, Miller W: Comparison of DNA sequences with protein sequences. Genomics. 1997, 46: 24-36. 10.1006/geno.1997.4995.View ArticlePubMed
- Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009, 25 (14): 1754-1760. 10.1093/bioinformatics/btp324.PubMed CentralView ArticlePubMed
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R et al: The sequence alignment/map format and SAMtools. Bioinformatics. 2009, 25 (16): 2078-2079. 10.1093/bioinformatics/btp352.PubMed CentralView ArticlePubMed
- Barnett D, Garrison E, Quinlan A, Strömberg M, Marth G: BamTools: a C++ API and toolkit for analyzing and managing BAM files. Bioinformatics. 2011, 27 (12): 1691-1692. 10.1093/bioinformatics/btr174.PubMed CentralView ArticlePubMed
- McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M et al: The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome research. 2010, 20 (9): 1297-1303. 10.1101/gr.107524.110.PubMed CentralView ArticlePubMed
- DePristo M, Banks E, Poplin R, Garimella K, Maguire J, Hartl C, Philippakis A, Del Angel G, Rivas M, Hanna M et al: A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature genetics. 2011, 43 (5): 491-498. 10.1038/ng.806.PubMed CentralView ArticlePubMed
- Kent W, Sugnet C, Furey T, Roskin K, Pringle T, Zahler A, Haussler D et al: The human genome browser at UCSC. Genome research. 2002, 12 (6): 996-1006.PubMed CentralView ArticlePubMed
- Wong W, Kim D, Carter H, Diekhans M, Ryan M, Karchin R: CHASM and SNVBox: toolkit for detecting biologically important single nucleotide mutations in cancer. Bioinformatics. 2011, 27 (15): 2147-2148. 10.1093/bioinformatics/btr357.PubMed CentralView ArticlePubMed
- Fujita P, Rhead B, Zweig A, Hinrichs A, Karolchik D, Cline M, Goldman M, Barber G, Clawson H, Coelho A et al: The UCSC genome browser database: update 2011. Nucleic acids research. 2011, 39 (suppl 1): D876-D882.PubMed CentralView ArticlePubMed
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.