EFIN: predicting the functional impact of nonsynonymous single nucleotide polymorphisms in human genome
© Zeng et al.; licensee BioMed Central Ltd. 2014
Received: 15 September 2013
Accepted: 4 June 2014
Published: 10 June 2014
Predicting the functional impact of amino acid substitutions (AAS) caused by nonsynonymous single nucleotide polymorphisms (nsSNPs) is becoming increasingly important as more and more novel variants are being discovered. Bioinformatics analysis is essential to predict potentially causal or contributing AAS to human diseases for further analysis, as for each genome, thousands of rare or private AAS exist and only a very small number of which are related to an underlying disease. Existing algorithms in this field still have high false prediction rate and novel development is needed to take full advantage of vast amount of genomic data.
Here we report a novel algorithm that features two innovative changes: 1. making better use of sequence conservation information by grouping the homologous protein sequences into six blocks according to evolutionary distances to human and evaluating sequence conservation in each block independently, and 2. including as many such homologous sequences as possible in analyses. Random forests are used to evaluate sequence conservation in each block and to predict potential impact of an AAS on protein function. Testing of this algorithm on a comprehensive dataset showed significant improvement on prediction accuracy upon currently widely-used programs. The algorithm and a web-based application tool implementing it, EFIN (Evaluation of Functional Impact of Nonsynonymous SNPs) were made freely available (http://paed.hku.hk/efin/) to the public.
Grouping homologous sequences into different blocks according to the evolutionary distance of the species to human and evaluating sequence conservation in each group independently significantly improved prediction accuracy. This approach may help us better understand the roles of genetic variants in human disease and health.
KeywordsCoding mutation nsSNP Amino acid conservation Functional impact Evolutionary distance
Rapid development in sequencing technology and decrease in cost has made it possible to sequence large number of samples on the entire genome or exome. Each individual carries thousands of rare or even private mutations [1, 2]. For cancer tissues, a large number of somatic mutations may also exist in a given tumour. Apparently not all the mutations play functional roles in a disease, and the ones that do may also have different functional impact. It is usually not feasible to characterize large number of mutations experimentally. Analysing ever increasing sequence data in silico first and identifying a small number of mutations that are more likely to be involved in diseases for further analysis or experimental characterization is an important task in today’s genetic and genomic studies. Analysing the mutation/polymorphism profiles of protein coding genes in the general population may also help us to better understand the evolutionary history of the genes and their functional roles in human health and diseases.
A number of programs have been developed during the last decade or so and they have played a vital role in predicting the functional impact of human mutations in various genetic, genomic studies [2–12]. Amino acid sequence conservation during evolutionary courses, potential protein structural changes, database annotations, and physicochemical properties of the amino acids involved are among the many features considered in various programs for prediction of functional impact of an amino acid substitution (AAS). Of which, conservation among homologous sequences is considered the most important piece of information in determining prediction accuracy, including programs that take into account of various other features .
Despite the enormous progress in this field, accurately predicting damaging AAS from neutral changes is still a challenging task, and most programs have high false positive and false negative rate. Not being able to make use of the full spectrum of homologous sequences and to make use of the information accurately is considered major drawbacks of these algorithms.
In this study, we developed a novel algorithm and a web-based application executing it, EFIN (Evaluation of Functional Impact of Nonsynonymous SNPs), to try to make more thorough and more accurate use of conservation information in predicting functional impact of an AAS. Comparison of our program with widely-used programs on various datasets demonstrated significant improvement of EFIN upon others on prediction accuracy.
The work flow of EFIN
Step 1. Building multiple sequence alignment on the query sequence
EFIN uses BLAST algorithm [13, 14] to identify proteins homologous to the query sequence by searching the UniRef100 database from UniProt (http://www.uniprot.org/). The full length of a query sequence is used for building the multiple sequence alignment (MSA), with a cutoff e-value of 0.0001 and a maximum retrieval of 5,000 sequences for each query process. This is to ensure that the sequences considered are true homologs and an adequate number of such sequences are considered for conservation analysis.
The homologous sequences are then sorted in descending order by their normalized alignment score (NAS), which is the alignment score divided by that of the query sequence itself. The alignment score is an output from BLAST calculated by the amino acid similarity matrix Blosum62. The normalization of alignment score makes it possible to compare proteins of various lengths (see Additional file 1: Method A). The NAS often accurately reflects evolutionary distance between the protein considered and its human ortholog.
Step 2. Grouping the homologous sequences into blocks
The sequences in MSA are first annotated with taxonomy information of the source species, downloaded from the Taxonomy database  in NCBI (ftp://ftp.ncbi.nih.gov/pub/taxonomy/). Based on the taxonomy classification and the evolutionary distance to human, all the sequences in MSA are then grouped into five ortholog blocks and a paralog block. Namely, in descending order, the primate block, the non-primate mammal block, the non-mammal vertebrate block, the invertebrate block, the other species block (including plants, bacteria, and fungi, etc.), and the paralog block.
Briefly, the NAS-sorted MSA is examined starting from the top, usually starting with human sequences and sequences from primates. The first sequence belonging to a different species category (usually from a non-primate mammal, and so on so forth) is marked. According to the evolutionary distance to human, we classified all species into 5 species categories which are primate with human included, non-primate mammal, non-mammal vertebrate, invertebrate and other species such as plant, bacteria, and fungi. The first sequence of each species category in the MSA is the sequence from that category with highest alignment score. Namely, the first primate sequence, the first non-primate mammal sequence, the first non-mammal vertebrate sequence, the first invertebrate sequence, and the first “other species” sequence such as those from plant, bacteria, and fungi.
All the sequences between the two ‘first sequences’ in MSA are then grouped into one block. For example, all sequences between first non-primate mammal sequence (inclusive) and first non-mammal vertebrate sequence (exclusive) are grouped into the non-primate mammal block. The sequences within an ortholog block that should not belong to the same species group are also tagged and moved to the paralog block. For example, protein TP63 and TP73 are paralogs of TP53, and these sequences are usually mixed with orthologs in the MSA for TP53 due to their sequence similarities. TP63 from human (H2QNY5) has comparable sequence similarity to human TP53 to that between human TP53 and fish TP53 (such as H2LPP5). Thus in this process, these TP63 and TP73 sequences will be removed from the non-mammal vertebrate block and will be placed in the paralog block. A mathematical description and an example are also included in the Additional file 1: Methods G and H, respectively.
Step 3. Evaluation of sequence conservation
The features used by EFIN
Value and range
Reference amino acid (AAref)
The reference amino acid of the query position
Mutant amino acid (AAmut)
The mutant amino acid of the query position
Frequency of reference amino acid (Fref)
Frequency of reference amino acid at the query position in each block
interval [0,1], with 1 means perfect conservation of reference amino acid
Frequency of mutant amino acid (Fmut)
Frequency of mutant amino acid at the query position in each block
interval [0,1], with 1 means that all sequences have the mutant amino acid at the position
Shannon Entropy (H)
Shannon entropy in each block at the query position
interval [0,4.322], 0 means no diversity and larger number means more diversity at the position
NAS of the first sequence in each block (NASfirst)
Normalized alignment score of the first sequence in each block.
interval [0,1], while 1 means identical sequence to the query human protein
Number of sequences in each block (No_all)
Number of total sequences in each block
Interval [0,5000], while 5000 is the cutoff for each MSA
Number of sequences which cover the query position in each block (No_qp)
Number of sequences that cover the query position in each block
Interval [0,5000], while 5000 is the cutoff for each MSA
No_qp/ No_all (RatioNN)
The ratio of No_qp and No_all
Lowest conserved block
The lowest block for which all sequences, together with all the sequences in upper blocks, have the reference amino acid perfectly conserved.
Ordinal (primate block, Non-primate mammal block, non-mammal vertebrate block, invertebrate block, other species block)
Frequencies of reference and mutant amino acid in a block
Where n is the total number of sequences in that block and Ai represents the amino acid of the i th sequence at query position of the alignment in that block. “I” is the indication function.
p(a) stands for the frequency of amino acid a at the query position for sequences in a given block, which is calculated in formula (1). AA represents the set of 20 amino acids in human proteins.
Evolving rate of a protein
Proteins may evolve at a different rate during evolutionary courses. Some proteins, such as those involved in immune responses, may evolve much faster compared to structural proteins and other housekeeping molecules. Comparison of sequence conservation during evolutionary courses among different proteins may suggest different functional constraints. We used the difference of NAS between the query sequence (human) and the first sequence in each block (except the primate block for which the human sequence is almost always the first sequence) to represent the evolutionary distance between human and other species in different blocks. A larger distance between these sequences compared with those of other proteins might indicate that the protein has been evolving faster, and novel functions may have been developed among the more advanced species. Thus, similar conservation level for proteins that are evolving at a different rate may suggest different functional implications.
Number of sequences in each block
This variable describes the total number of sequences in a block. Since some sequences in a block may only partially align with the query protein and may not cover the query position, the number of query-position-covering sequences in each block is a different variable and is also considered here. The ratio of the two (RatioNN) may suggest functional implications of the protein sequences surrounding the query position. Positions outside functional domains tend to have lower RatioNNs than those falling in domains since sequences in functional domains tend to be better conserved.
The lowest conserved block
The lowest conserved block is defined as the block for which all its sequences, together with all sequences in upper blocks have the reference amino acid perfectly conserved. Additional file 1: Method H gives an example of the lowest conserved block.
Step 4. Using random forests as a classifier to evaluate AAS
Random Forests  are used as a classifier to distinguish neutral and damaging AAS using the features listed above. Random Forests are an ensemble learning method for classification (and regression), which constructs a multitude of decision trees and outputs the prediction as the majority vote (and average result for regression) from all individual trees. Random Forests is implemented by randomForest package (http://cran.r-project.org/web/packages/randomForest/index.html) in R (http://www.r-project.org/), and the detailed method is also included in Additional file 1: Method D.
Training and testing datasets
Two datasets were used to train EFIN. UniProt-Swiss-Prot Protein Knowledgebase (referred as Swiss-Prot dataset in this paper) is a dataset containing both known neutral polymorphisms and disease-related mutations, (humsavar.txt, http://www.uniprot.org/docs/humsavar) . It is one of the most comprehensive human protein variant databases and contains 37,331 neutral mutations/polymorphisms and 22,617 disease-related mutations (as of release in January 2013). Potential functional impact of the variants in Swiss-Prot dataset was determined based on literature reports on probable disease associations.
HumDiv , a training and testing dataset used in PolyPhen-2, is a dataset with more extreme cases for damaging mutations/polymorphisms compared with Swiss-Prot dataset (http://genetics.bwh.harvard.edu/pph2/dokuwiki/downloads). It contains 5,322 damaging variants known to be causal to human Mendelian diseases, and 7,070 amino acids that only differ between human proteins and their closely related mammalian orthologs, which are considered to be neutral. Damaging variants in the Swiss-Prot dataset include not only casual variants for Mendelian disease but also variants associated with complex diseases, thus covers a much broader range of variants that may have a functional impact. Comparing with Swiss-Prot dataset-trained EFIN, HumDiv dataset trained-EFIN is much more specific on identifying damaging mutations relevant to Mendelian disease. More detailed information about these two datasets can also be found in the Additional file 1: Method C and Additional file 1: Table S2.
EFIN outputs both a prediction (neutral or damaging for the AAS in question) and a score for each query, with the latter being the probability that a mutation is neutral. The smaller the score, the more likely that the query mutation is a damaging mutation. The prediction is based on comparison between the score and the cutoff value. An AAS is considered to be damaging if its EFIN score is smaller than the optimal cutoff value. The optimal cutoff value was determined to obtain the lowest misclassification rate (highest accuracy) for each training set, which is 0.6 for Swiss-Prot-trained EFIN, and 0.28 for HumDiv-trained EFIN in this study.
Distance between adjacent sequences within or between blocks
Evolving rate of different proteins
Relationship between the lowest conserved block and functional impact of an AAS
Comparison of EFIN with other programs
We compared EFIN with the updated versions of five widely-used programs: SIFT (the release of April 2011) , PolyPhen-2 (version 2.1.0 released in May 2011) , MutationTaster , GERP++ (released around May 2011) , and phyloP . SIFT program is based on the evolutionary conservation of amino acids within protein families. PolyPhen-2  used eight sequence-based and three structure-based features to predict the impact of a mutation/polymorphism and used naive Bayes as a statistical classifier. MutationTaster considered series of information such as sequence conservation, splicing-site changes, loss of protein features and changes that might affect the amount of mRNA. Furthermore, we also compared EFIN’s performance with GERP++ and phyloP. These two toolsets detect conserved regions and sites by phylogenetic models built according to DNA sequence alignments.
Predictions and scores for tested AAS from SIFT and PolyPhen-2 were obtained using their web-based programs. Scores from GERP++ and phyloP, were retrieved from an annotation database, dbNSFP (v2.0 released in Februrary 2013) . MutationTaster scores and predictions were also extracted from dbNSFP, which were originally queried from the website of MutationTaster. PhyloP scores in dbNSFP were extracted from placental mammal subset of pre-computed scores  in the UCSC genome browser website. The value 0 was used as a cutoff to distinguish ‘damaging mutations’ from neutral ones for both phyloP score and GERP++ RS score. Detailed method for retrieving prediction and score from these applications is described in Additional file 1: Method B.
Comparison of algorithms tested on Swiss-Prot dataset*
Comparison of EFIN with PolyPhen-2 on a subset of Swiss-Prot variants
Tools (Training set)
Testing set: Swiss-Prot dataset with HumDiv mutations excluded (36998 neutral variants, 17643 damaging variants)*
Testing set: Swiss-Prot dataset with HumVar mutations excluded (15819 neutral variants, 2284 damaging variants)*
EFIN (Swiss-Prot ∩ HumVar)**
Sequence conservation during evolutionary courses is probably the most important piece of information for evaluating potential functional impact of an AAS. It is also the most widely used information in such efforts. However, how to make full and accurate use of such information is not without difficulties. Most programs measure conservation by counting the sequences that the reference amino acid is conserved, but selecting only representative sequences as counting can be easily skewed by uneven representation of sequences from different species and contamination of databases. This affects the accurate use of conservation information since evolutionary distance is not taken into full account, and limiting the number of sequences used unavoidably causes loss of valuable information.
A common ancestor of human and other primates dates back 23 Myr, and it was 95 Myr for all the mammals and 416 Myr for non-mammal vertebrates, and much farther for invertebrates and other species . For paralogs, the evolutionary distances vary depending on when the new paralogs were originated in evolutionary history. Conservation for a protein among species of different evolutionary distances implies dramatic difference in constraints on protein function and should be taken into full account in functional prediction.
Although programs such as PolyPhen-2  uses information from distant species in evaluating amino acid conservation and MAPP  assigns different weights to homologous sequences according to a phylogenetic tree structure, they both have limitations either by not taking full account of evolutionary distance among proteins or by using only limited number of sequences. Making more accurate use of information on sequence conservation among species of different evolutionary distance probably explains most of the improvement by EFIN, which is fulfilled by a block-wise structure. By calculating the conservation features block-wise, the classifier will learn to treat the conservation (or lack of it) from different species differently. The observation that the NAS differences of adjacent sequences in the same block are significantly smaller than those between two adjacent blocks in MSA strongly supports the use of block-wise structure in conservation evaluation (Figure 2). We tested the effect of block-wise features by removing the block structure and treating all the sequences in MSA as a single block. This did result in a dramatic decrease in performance (Accuracy dropped from 0.837 to 0.787; precision decreased from 0.867 to 0.820; sensitivity changed from 0.864 to 0.838; and specificity decreased from 0.795 to 0.707).
Both SIFT and MAPP emphasized the importance of using only remote orthologs but not paralogs in their evaluation, even when this means that there were fewer sequences to be analysed . For PolyPhen (also PolyPhen-2), however, it was argued that using paralogs improves the accuracy of prediction . Here we tried to separate the paralogs from orthologs by grouping the paralogs into a separate block. Thus the pieces of conservation information from both groups can be utilized, yet they were treated differently to reflect potentially different implications on conservation from the two groups.
Overrepresentation of protein sequences from a single species can significantly affect prediction accuracy. For example, Miller et al.  described a potential problem for SIFT when a protein was overwhelmingly represented by repeated sequencing and reporting to databases, and the impact of this can be minimized by the block-wise structure adopted in EFIN. Similarly, damages caused by database contamination can also be minimized by this block-wise structure and evaluation process, as the information is extracted from multiple layers independently. For example, human somatic mutations in genes such as TP53 do appear in the ‘nr’ database from NCBI, which may give inaccurate information on sequence conservation (or lack of it). The block-wise structure used by EFIN minimizes the damage caused by repeated sequencing or contamination of databases. As a result, as many homologous sequences as possible can be used during evaluation to maximize information gain.
By searching for one of the most comprehensive protein database (UniRef100 in UniProt), and using a very inclusive threshold to maximize the number of sequences used, EFIN ensures a thorough use of information from homologs, which is quite different from most existing programs. SIFT selects sequences by adding the most similar sequence from a protein database iteratively to a growing collection until conservation in the conserved regions decreases. PolyPhen  identifies homologs of the input sequences via BLAST  search of the ‘nr’ database, and uses a clustering algorithm and only considers the homologs that belong to a compact cluster that includes the analysed sequence. MutationTaster  makes multiple sequence alignment based on amino acid sequences from only ten other species (from chimpanzee to worm). MAPP also only uses a limited number of orthologs in its prediction . The number of sequences used for each query by EFIN is significantly higher than those used by other programs, which is probably an important factor for an improved performance.Different proteins may evolve at a different rate (as exemplified by the two proteins, GNAS and IL10RA shown in Figure 3). A protein may evolve at a higher rate compared to other proteins, suggesting development of new functions in order to adapt to new environment or to meet functional requirement of the new species. On the other hand, for proteins such as GNAS, there is nearly 80% sequence conservation between the human protein and its orthologs in invertebrate, consistent with a basic house-keeping function for the molecule. Therefore, measuring evolutionary distance among different blocks for each query may be more meaningful than using split time between species during evolutionary courses in general. NAS of the first sequence in each block was used in EFIN to estimate the evolutionary rate and to improve prediction accuracy. Removal of this feature without changing any other features did have a negative effect on prediction accuracy (data not shown).
Theoretically, adding more features may further increase the prediction accuracy of the program. Some features, such as physico-chemical property of amino acid variants and structural information, were used successfully before [5, 24]. Although these features are useful when used alone, they did not significantly improve prediction accuracy after the block-wise conservation evaluation was adopted, suggestive of information overlapping. Detailed evaluation of structural changes is also difficult and may only apply to limited number of proteins with known structures at present. Although detailed consideration of protein features may improve prediction accuracy, as a general prediction tool, making use of only conservation information is a balanced approach between efficiency and accuracy of a program.
SIFT, PolyPhen-2, MAPP and other programs are usually applicable to any proteins for which homologous sequences are available, while EFIN at its present form can only apply to human proteins. Focusing on human proteins, however, has given EFIN the flexibility to make full use of the detailed species information of the homologs. Since EFIN is based on a machine learning method, it is possible or even desirable to use tailor-made training set for different purposes. For example, certain dataset may be more suited for detection of susceptibility variants, while applying to single gene disorders, training set from known causal mutation databases such as HumDiv, OMIM or HGMD  may work better. A selected subset of known mutations could also be used for mutation evaluation for a given gene or a group of genes, although this may mean a much smaller training set.
Thus, building on previous endeavours and progresses, EFIN made further improvement on prediction accuracy in this research area of ever increasing importance and may help move us one step closer towards accurate functional prediction of amino acid substitutions in a protein. By no means that EFIN is designed to replace other programs, and it is more of an addition to the set of available tools that have been already widely used in this field. Using various tools to collectively predict the functional impact of a mutation has been proposed by various software [20, 26–28]. The novel approach of block-wise evaluation proposed by EFIN may have added benefits that may help us better understand human variants in our genome.
To predict damaging AAS, we developed a novel algorithm that tries to make full use of sequencing conservation information by dividing homologous sequences into five ortholog blocks and a paralog block. We used a number of conservation features derived from these blocks and a random forest machine learning method for classification of damaging and neutral amino acid changes. As presented by its receiver operating characteristic curve, the performance of EFIN compared favourably to that of a number of popular prediction tools.
This work is supported by GRF grants HKU 784611 M, 783813 M (to WY), HKU 770411 M (to YLL), and S Kee Yee Medical Foundation general award 211203 (to BHYC, YLL, and WY)
- Cargill M, Altshuler D, Ireland J, Sklar P, Ardlie K, Patil N, Shaw N, Lane CR, Lim EP, Kalyanaraman N, Nemesh J, Ziaugra L, Friedland L, Rolfe A, Warrington J, Lipshutz R, Daley GQ, Lander ES: Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nat Genet. 1999, 22 (3): 231-238. 10.1038/10290.PubMedView ArticleGoogle Scholar
- Ng PC, Henikoff S: Predicting deleterious amino acid substitutions. Genome Res. 2001, 11 (5): 863-874. 10.1101/gr.176601.PubMed CentralPubMedView ArticleGoogle Scholar
- Ng PC, Henikoff S: SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Res. 2003, 31 (13): 3812-3814. 10.1093/nar/gkg509.PubMed CentralPubMedView ArticleGoogle Scholar
- Cooper GM, Stone EA, Asimenos G, Green ED, Batzoglou S, Sidow A: Distribution and intensity of constraint in mammalian genomic sequence. Genome Res. 2005, 15 (7): 901-913. 10.1101/gr.3577405.PubMed CentralPubMedView ArticleGoogle Scholar
- Stone EA, Sidow A: Physicochemical constraint violation by missense substitutions mediates impairment of protein function and disease severity. Genome Res. 2005, 15 (7): 978-986. 10.1101/gr.3804205.PubMed CentralPubMedView ArticleGoogle Scholar
- Ng PC, Henikoff S: Predicting the effects of amino acid substitutions on protein function. Annu Rev Genom Hum Genet. 2006, 7: 61-80. 10.1146/annurev.genom.7.080505.115630.View ArticleGoogle Scholar
- Lee TC, Lee ASG, Li KB: Incorporating the amino acid properties to predict the significance of missense mutations. Amino Acids. 2008, 35 (3): 615-626. 10.1007/s00726-008-0087-9.PubMedView ArticleGoogle Scholar
- Kumar P, Henikoff S, Ng PC: Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm. Nat Protoc. 2009, 4 (7): 1073-1082.PubMedView ArticleGoogle Scholar
- Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, Gerasimova A, Bork P, Kondrashov AS, Sunyaev SR: A method and server for predicting damaging missense mutations. Nat Methods. 2010, 7 (4): 248-249. 10.1038/nmeth0410-248.PubMed CentralPubMedView ArticleGoogle Scholar
- Davydov EV, Goode DL, Sirota M, Cooper GM, Sidow A, Batzoglou S: Identifying a high fraction of the human genome to be under selective constraint using GERP++. PLoS Comput Biol. 2010, 6 (12): e1001025-10.1371/journal.pcbi.1001025.PubMed CentralPubMedView ArticleGoogle Scholar
- Schwarz JM, Rodelsperger C, Schuelke M, Seelow D: MutationTaster evaluates disease-causing potential of sequence alterations. Nat Methods. 2010, 7 (8): 575-576. 10.1038/nmeth0810-575.PubMedView ArticleGoogle Scholar
- Gonzalez-Perez A, Lopez-Bigas N: Improving the assessment of the outcome of nonsynonymous SNVs with a consensus deleteriousness score, Condel. Am J Hum Genet. 2011, 88 (4): 440-449. 10.1016/j.ajhg.2011.03.004.PubMed CentralPubMedView ArticleGoogle Scholar
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215 (3): 403-410. 10.1016/S0022-2836(05)80360-2.PubMedView ArticleGoogle Scholar
- Altschul SF, Madden TL, Schaffer AA, Zhang JH, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25 (17): 3389-3402. 10.1093/nar/25.17.3389.PubMed CentralPubMedView ArticleGoogle Scholar
- Sayers EW, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Edgar R, Federhen S, Feolo M, Geer LY, Helmberg W, Kapustin Y, Landsman D, Lipman DJ, Madden TL, Maglott DR, Miller V, Mizrachi I, Ostell J, Pruitt KD, Schuler GD, Sequeira E, Sherry ST, Shumway M, Sirotkin K, Souvorov A, Starchenko G, Tatusova TA, et al: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2009, 37: D5-D15. 10.1093/nar/gkn741.PubMed CentralPubMedView ArticleGoogle Scholar
- Capra JA, Singh M: Predicting functionally important residues from sequence conservation. Bioinformatics. 2007, 23 (15): 1875-1882. 10.1093/bioinformatics/btm270.PubMedView ArticleGoogle Scholar
- Shannon CE, Weaver W, Blahut RE, Hajek B: The Mathematical Theory of Communication, vol. 117. 1949, Urbana: University of Illinois pressGoogle Scholar
- Breiman L: Random forests. Mach Learn. 2001, 45 (1): 5-32. 10.1023/A:1010933404324.View ArticleGoogle Scholar
- Yip YL, Scheib H, Diemand AV, Gattiker A, Famiglietti LM, Gasteiger E, Bairoch A: The Swiss-Prot variant page and the ModSNP database: a resource for sequence and structure information on human protein variants. Hum Mutat. 2004, 23 (5): 464-470. 10.1002/humu.20021.PubMedView ArticleGoogle Scholar
- Liu X, Jian X, Boerwinkle E: dbNSFP: a lightweight database of human nonsynonymous SNPs and their functional predictions. Hum Mutat. 2011, 32 (8): 894-899. 10.1002/humu.21517.PubMed CentralPubMedView ArticleGoogle Scholar
- Pollard KS, Hubisz MJ, Rosenbloom KR, Siepel A: Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res. 2010, 20 (1): 110-121. 10.1101/gr.097857.109.PubMed CentralPubMedView ArticleGoogle Scholar
- Benton MJ, Donoghue PCJ: Paleontological evidence to date the tree of life. Mol Biol Evol. 2007, 24 (1): 26-53.PubMedView ArticleGoogle Scholar
- Miller MP, Kumar S: Understanding human disease mutations through the use of interspecific genetic variation. Hum Mol Genet. 2001, 10 (21): 2319-2328. 10.1093/hmg/10.21.2319.PubMedView ArticleGoogle Scholar
- Ramensky V, Bork P, Sunyaev S: Human non-synonymous SNPs: server and survey. Nucleic Acids Res. 2002, 30 (17): 3894-3900. 10.1093/nar/gkf493.PubMed CentralPubMedView ArticleGoogle Scholar
- Krawczak M, Cooper DN: The human gene mutation database. Trends Genet. 1997, 13 (3): 121-122. 10.1016/S0168-9525(97)01068-8.PubMedView ArticleGoogle Scholar
- Li MX, Gui HS, Kwan JS, Bao SY, Sham PC: A comprehensive framework for prioritizing variants in exome sequencing studies of Mendelian diseases. Nucleic Acids Res. 2012, 40 (7): e53-10.1093/nar/gkr1257.PubMed CentralPubMedView ArticleGoogle Scholar
- Wang K, Li M, Hakonarson H: ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 2010, 38 (16): e164-10.1093/nar/gkq603.PubMed CentralPubMedView ArticleGoogle Scholar
- Zhang L, Zhang J, Yang J, Ying D, Lau YL, Yang W: PriVar: a toolkit for prioritizing SNVs and indels from next-generation sequencing data. Bioinformatics. 2013, 29 (1): 124-125. 10.1093/bioinformatics/bts627.PubMedView ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.