 Research article
 Open Access
Assessing models for genetic prediction of complex traits: a comparison of visualization and quantitative methods
BMC Genomics volume 16, Article number: 405 (2015)
Abstract
Background
In silico models have recently been developed to predict which genetic variants are most likely to contribute to the risk of a complex trait given their functional characteristics. However, there has been no comprehensive review of which predictive accuracy measures and data visualization techniques are most useful for assessing such models.
Methods
We assessed the performance of the models for predicting risk using various methodologies, including receiver operating characteristic (ROC) curves, histograms of classification probability, and the novel use of the quantile-quantile plot. These measures have variable interpretability depending on factors such as whether the dataset is balanced in terms of the number of genetic variants classified as risk variants versus those that are not.
Results
We conclude that the area under the curve (AUC) is a suitable starting place, and for models with similar AUCs, violin plots are particularly useful for examining the distribution of the risk scores.
Background
The risk of developing a complex trait is influenced by many genetic variants, possibly hundreds, in combination with environmental factors. Genome-wide association studies (GWAS) have had success in identifying some of the genetic risk factors involved in complex traits, but more remain to be discovered. Recently, there have been several in silico attempts at utilizing epigenetic and genomic data to prioritize genetic risk variants. These methods simultaneously incorporate multiple lines of genomic and epigenomic data to identify potential risk variants from all variants [1–6]. These data tend to consist of imbalanced classes: a very high proportion of non-risk variants (“non-hits”) and a small proportion of risk variants (“hits”). This class imbalance, and other factors unique to genetic data (for instance, linkage disequilibrium and allele frequency), warrant caution when interpreting the results of predictive accuracy measures applied to such models.
A variety of predictive accuracy measures and data visualization techniques have been used (Table 1) to assess these models for prioritizing genetic variants. An example is the area under the curve (AUC) from the receiver operating characteristic (ROC) curve, which is generally accepted as a measure of how closely the prediction values reflect the true class. Such methods have previously been employed to predict the diagnosis of an individual (risk of developing Type II diabetes [7–9], for example), but have only recently been applied to predict whether genetic variants are likely to be risk variants.
We will utilize test set data from a regularized logistic model that predicts genetic risk variants on the basis of a large multivariate functional dataset [1]. We investigate the utility of several approaches for assessing predictive accuracy and data visualization. Based on observations from this work we conclude with suggested guidelines to aid researchers when assessing models for genetic variant prediction.
Three broad categories of predictive accuracy measures will be discussed: (1) concepts in describing predictive accuracy, including the ROC curve, AUC and the confusion matrix; (2) visualization of the distribution of prediction values; and (3) statistical tests. All the methods described below were conducted in R, version 3.0.2 [10–13]. See Table 2. Sample R code is available in Additional file 1. Code and data to reproduce the results in this paper are provided in Additional file 2. Further details are embedded in the Results.
Methods
Dataset and models
The example dataset and model have been described in detail previously [1] and are only described briefly here. Genetic variants from common genotyping arrays were annotated for 14 functional characteristics (twelve binary and two quantitative), many of which are from the Encyclopedia of DNA Elements (ENCODE) Project, with data from various cell types merged (unweighted) into a single variable for each characteristic. All functional characteristics could be presented in a binary presence/absence format with the exception of two types of conservation scores, which remained on a quantitative scale. A regularized logistic model, capable of handling correlated predictor variables, was used. A random 60 % of the genetic variants were assigned to the training set to determine the parameters of the model, and the remaining variants were reserved for the independent test set to evaluate the accuracy of the model. All models produced a prediction value ranging from 0 to 1 for each genetic variant, with values close to 1 implying a high probability of the variant contributing to risk. Due to the unbalanced nature of the data, a weighting procedure that equalizes the importance of hits and non-hits in the training set was employed. Hits were weighted by (N_hits + N_nonhits)/(2 N_hits) and all non-hits by (N_hits + N_nonhits)/(2 N_nonhits), where N_hits and N_nonhits denote the number of hits and non-hits, respectively, in the training set [1]. Without this weighting scheme, all variants are assigned low prediction values, although the model still retains comparable overall accuracy. Overall accuracy may not be representative of accuracy within classification groups, which is the main problem with unbalanced data. As well as using the weighting scheme to ameliorate this issue in our example data, we discuss other matters to be considered in relation to the accuracy and data visualization methods described.
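The weighting scheme above can be sketched in a few lines. This is a minimal illustration in Python rather than the R used in the paper, and the label vector is hypothetical:

```python
import numpy as np

def class_weights(y):
    """Per-observation weights that equalize the total contribution of the
    two classes: w_hit = (N_hits + N_nonhits) / (2 * N_hits), and the
    analogous expression for non-hits."""
    y = np.asarray(y)
    n_hit = int((y == 1).sum())
    n_non = int((y == 0).sum())
    n = n_hit + n_non
    w_hit = n / (2.0 * n_hit)
    w_non = n / (2.0 * n_non)
    # one weight per variant, ready to pass to a weighted model fit
    return np.where(y == 1, w_hit, w_non)

# hypothetical imbalanced training labels: 5 hits, 95 non-hits
y = np.array([1] * 5 + [0] * 95)
w = class_weights(y)
# after weighting, each class contributes half of the total weight
```

With 5 hits and 95 non-hits, each hit receives weight 10 and each non-hit roughly 0.53, so both classes sum to 50 out of a total weight of 100.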
For model 1, variants were classified as hits if present in the genome-wide association study (GWAS) Catalogue published by the National Human Genome Research Institute [14], downloaded on August 6, 2013. The GWAS Catalogue reports variants found to be associated with a disease or quantitative trait in a GWAS at a p-value < 1 × 10^-6. Variants not present in the Catalogue but present on common genotyping arrays were assumed to be non-hits. Three alternative classifiers were used to designate hits: a more stringent p-value < 5 × 10^-8 (model 2), and the same p-value cutoff restricted to phenotype-specific subsets of hits, namely an autoimmune (model 3) and a brain-related (model 4) analysis.
In our previous work, six models were created using the alterations to the classifier described above. The four assessed here are the two models with the highest AUCs (models 2 and 3) and the two with the lowest AUCs (models 1 and 4). (See Table 3 for descriptive statistics for the test sets of the various models.)
Ethical approval was not required for this study.
Results
Concepts in describing predictive accuracy
The confusion matrix
Predictive accuracy is derived from a confusion matrix (Fig. 1). The cells in the diagonal of the matrix are the correctly identified genetic variants. (See Chapter 4 in “An Introduction to Statistical Learning with Applications in R” [15] and Chapter 11 in “Statistical Learning for Biomedical Data” [16] for more details.) The effects of unbalanced data in unweighted models can be detected in such a matrix: there would be a much larger proportion of negatives than positives. The effects on the false positive rate (FPR), true negative rate (TNR), positive predictive value (PPV), and negative predictive value (NPV) are described in further detail below. The confusion matrix itself is not often studied, as it represents the data at only one threshold. However, both the ROC curve and the PPV and NPV build on it to assess model accuracy.
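The rates derived from the confusion matrix can be computed directly from its four cells. A Python sketch with hypothetical labels and predictions:

```python
import numpy as np

def confusion_rates(y_true, y_pred):
    """Cells of the confusion matrix and the rates discussed in the text
    (labels: 1 = hit, 0 = non-hit)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # diagonal: correct hits
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # diagonal: correct non-hits
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return {"TPR": tp / (tp + fn),   # sensitivity
            "FPR": fp / (fp + tn),   # 1 - specificity
            "TNR": tn / (fp + tn),
            "PPV": tp / (tp + fp),   # precision
            "NPV": tn / (tn + fn)}

# toy imbalanced example: 3 true hits among 8 variants
rates = confusion_rates([1, 1, 1, 0, 0, 0, 0, 0],
                        [1, 1, 0, 1, 0, 0, 0, 0])
```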
Receiver operating characteristic curves and area under the curve
The use of ROC curves is a common way of assessing binary outcome models [17]. ROC curves offer a global summary of model performance at all possible cutoffs of the prediction values for defining the two classes. In this way, the ROC is a summary of the model’s overall performance. ROC curves reflect the columns of the confusion matrix by plotting the FPR (equivalent to 1 − TNR) against the true positive rate (TPR), with the advantage of depicting these values at every threshold for defining a hit. An AUC = 0.5 means that the predictive accuracy of the model is no better than chance, whereas an AUC = 1 implies perfect predictive accuracy. (See Chapter 4 in “Road to Statistical Bioinformatics” [18] and Chapter 11 in “Statistical Learning for Biomedical Data” [16] for more details.)
There is typically not just one confusion matrix (see previous section): there is one for each point along the x-axis of the ROC curve. Thus, in the context of a model that outputs prediction values on a continuous scale rather than as binary categories (e.g. a logistic regression model, among others), one needs to decide at what probability level one “declares” a hit to be a hit. One could use the arbitrary value of 0.5 as the cutoff to separate hits from non-hits, but there are other possible probability thresholds, all of which are summarized by the ROC curve. That is the conceptual difference between the AUC (an average over all possible thresholds) and the confusion matrix itself (the ROC “frozen” at one particular probability threshold).
It should be noted that unless a weighting scheme such as the one we employed in our modeling is used, or an equal subset of both classes is chosen, ROC curves can present an overly optimistic view of performance for unbalanced data [17]. If the model simply assigns almost all variants to the non-hit class, it can still appear to do well, for instance with an AUC much larger than 0.5. In this way, the larger class (non-hits) can overwhelm the smaller class (hits). The TPR thus tends to be low throughout the thresholds.
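The AUC can be sketched via its rank interpretation: the probability that a randomly chosen hit receives a higher prediction value than a randomly chosen non-hit, with ties counting one half. The simulated scores below are hypothetical, chosen only to mimic an imbalanced dataset:

```python
import numpy as np

def auc_from_scores(scores_hit, scores_non):
    """AUC as P(score of a random hit > score of a random non-hit),
    with ties counted as one half; equivalent to the Mann-Whitney U
    statistic divided by the number of hit/non-hit pairs."""
    d = np.asarray(scores_hit, float)[:, None] - np.asarray(scores_non, float)[None, :]
    return (np.sum(d > 0) + 0.5 * np.sum(d == 0)) / d.size

rng = np.random.default_rng(0)
non = rng.uniform(0.0, 0.6, size=950)   # many non-hits with low scores
hit = rng.uniform(0.2, 1.0, size=50)    # few hits, shifted upward
auc = auc_from_scores(hit, non)         # around 0.83 for these distributions
```

Note that the 19:1 class imbalance does not bias the rank-based AUC itself; the imbalance problems discussed above arise in quantities tied to class proportions, such as overall accuracy and PPV.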
In the example data, the AUCs of two of the models (the autoimmune and the all-phenotype model for the high confidence hits) were very similar and reasonably good (between 0.67 and 0.71) (see Fig. 2). The AUCs for the other two models (the all-phenotype model using all Catalogue hits and the brain-related model) were also similar to each other, but poor (less than 0.61). Thus, the AUC seems to categorize models as either good or poor, but is not particularly useful for finer discrimination between models. (See Chapter 11 in “Statistical Learning for Biomedical Data” [16] for details on the limitations of ROC curves.) Below we demonstrate that additional investigation provides further insight into the results.
Positive and negative predictive values
The rows of the confusion matrix are represented by the PPV and NPV. The PPV is the probability that a variant classified as a hit is a true hit, and the NPV is the probability that a variant classified as a non-hit is a true non-hit, at any one given threshold. (See Chapter 4 in “Road to Statistical Bioinformatics” [18] for details.) PPV and NPV are also affected by the class imbalance inherent in real genetic association data. The effect of imbalanced data on PPV and NPV has been previously described [19]. In scenarios where the negative class is larger than the positive class, the NPV is inflated and the PPV is lowered compared to the corresponding model where the class sizes are equal and the negative and positive classes have the same rate of correct predictions [19]. These values are best estimated when there are equal numbers of observations in each category [19]. The issue is that cell sizes of the confusion matrix can become too small for the smaller class (hits). One needs to ensure that there is a large enough number of hits and/or non-hits per cell in the confusion matrix to draw conclusions; otherwise, results will be driven by a very small, unrepresentative subset of the data. For the models considered here, only the two all-phenotype analyses had an adequate number of samples in each cell, and thus PPV and NPV were only calculated for those models. The NPV tended to be high (>0.899) at all the various prediction value thresholds chosen to define the two classes (see Table 4). However, it is the accuracy of predicting the hits, not the non-hits, that is of interest in this work. Hence, the PPV provides the more interesting results. Overall, the all-phenotype analysis using all hits in the GWAS Catalogue produced the highest PPVs as the threshold for declaring a positive hit increased. The highest PPV (30.4 %) was achieved for this model at the threshold defining hits as those variants with prediction values greater than 0.7.
The PPV results conflict with the AUC results. For the two all-phenotype models, the one with the higher AUC (the model for the GWAS hits in the Catalogue with the stringent p-value cutoff) had overall lower PPV compared to the model using all GWAS hits in the Catalogue. NPV results for the two models were similar, but the model based on all GWAS hits in the Catalogue had slightly lower NPV compared to the stringent p-value model.
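A minimal sketch of PPV and NPV at a chosen cutoff, of the kind swept across thresholds above (the labels and prediction values are hypothetical):

```python
import numpy as np

def ppv_npv(y_true, scores, threshold):
    """PPV and NPV when variants with prediction values above the
    threshold are declared hits (labels: 1 = hit, 0 = non-hit)."""
    y_true = np.asarray(y_true)
    pred = np.asarray(scores, float) > threshold
    tp = np.sum(pred & (y_true == 1)); fp = np.sum(pred & (y_true == 0))
    tn = np.sum(~pred & (y_true == 0)); fn = np.sum(~pred & (y_true == 1))
    # guard against empty cells, which occur easily for the smaller class
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    npv = tn / (tn + fn) if (tn + fn) else float("nan")
    return ppv, npv

y = [1, 1, 0, 0, 0]
s = [0.9, 0.4, 0.8, 0.3, 0.2]
p0, n0 = ppv_npv(y, s, 0.5)  # one row of a Table-4-style threshold sweep
```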
Visualization of the distribution of prediction values
Histograms
Next, class separation was investigated through histograms of the prediction values output by the models, which display differences in the density distributions of the two classes. Known hits were plotted in black and non-hits in grey on the same plot, with the y-axis showing probability density rather than raw counts; this masks the data imbalance and thus allows comparison between the two classes. The all-phenotype model with high confidence hits (Fig. 3) and the autoimmune model showed the most evidence of having two separate distributions. Although the distributions of the prediction values for the hits and the non-hits overlap, the distribution of the non-hits has the majority of its values closer to the 0 end of the prediction value range. Confirming the AUC results, the brain-related model and the all-phenotype model using all Catalogue hits (Fig. 3) do poorly with regard to class separation. As always, caution is warranted, since the visualized distributions differ depending on the bin size chosen (compare Fig. 3 to Fig. 4). For histograms with a larger bin size, differences in the distributions of hits and non-hits at a finer scale are less apparent, and the distributions look more similar than when a smaller bin size is used.
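The per-class density normalization can be sketched as follows; the simulated prediction values are hypothetical stand-ins for model output:

```python
import numpy as np

rng = np.random.default_rng(1)
non_hits = rng.beta(1, 6, size=2000)  # mass near 0, as for most non-hits
hits = rng.beta(2, 3, size=60)        # far fewer variants, shifted upward

bins = np.linspace(0, 1, 21)          # bin width 0.05; try 11 for width 0.1
dens_non, _ = np.histogram(non_hits, bins=bins, density=True)
dens_hit, _ = np.histogram(hits, bins=bins, density=True)
# density=True normalizes each class separately so that each histogram
# integrates to 1, masking the 2000-vs-60 imbalance in the comparison
```

Re-running with the coarser bins shows the effect discussed above: fine-scale differences between the two densities are smoothed away.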
Box and whisker plots
Box plots were constructed to visually compare the distributions of the hits versus the non-hits in an alternate way (Fig. 5). These plots visually depict much of the descriptive data present in Table 3, notably differences in the median between the two classes. Again, the data imbalance is masked, as the summaries presented in the plot are computed within each class. As visualized in the histograms, the box plots also showed that for all of the models the distributions of the prediction values for the hits and non-hits overlapped, but to different degrees. The plots for the brain-related model and the all-phenotype model for all variants in the GWAS Catalogue had many outliers for both classes, signifying that both hits and non-hits included variants with predictions a large distance from those of other variants in the respective class. Additionally, the mean prediction scores for the hits and the non-hits appear very close for the all-phenotype model for all variants in the GWAS Catalogue.
Violin plots
Violin plots combine the density differences depicted in the histograms and the median differences depicted in the box plots into one plot, summarizing both in a single display. Furthermore, the smoothed density outline is comparable to a histogram with infinitesimally small bins. See Fig. 6.
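Under the hood, a violin plot pairs a kernel density estimate (the smooth outline) with box-plot summary statistics. A sketch of that computation, with hypothetical prediction values:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)
hits = rng.beta(2, 3, size=80)
non_hits = rng.beta(1, 6, size=2000)

grid = np.linspace(0, 1, 200)
summaries = {}
for name, vals in [("hits", hits), ("non-hits", non_hits)]:
    dens = gaussian_kde(vals)(grid)                  # smooth violin outline
    q1, med, q3 = np.percentile(vals, [25, 50, 75])  # box-plot component
    summaries[name] = (dens, med)
# the violin for a class is dens mirrored about a vertical axis, with the
# median and quartiles drawn inside it
```

Plotting libraries wrap exactly this pairing; the point of the sketch is that no binning choice is required, only a kernel bandwidth.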
Quantilequantile plots
A final visualization method, the quantile-quantile plot, was explored (see Fig. 7). The quantile-quantile plot is often used in the context of GWAS, but it also has the potential to be useful as a predictive accuracy measure. Instead of plotting expected versus observed p-values on the axes, as is done in GWAS, we plotted prediction values for non-hits on the x-axis and prediction values for hits on the y-axis. Plotted in this way, the plot compares the quantiles of the hits to those of the non-hits. When the data points deviate above the diagonal, the hits have higher prediction values than the non-hits in that quantile. Due to the limited number of hits, the quantile-quantile plots for the phenotype-specific analyses produced a staircase pattern. This pattern suggests two characteristics: those models are assigning the same prediction value to several variants, and there are not enough hits to create a smooth curve. The former could be due to different variants having been assigned identical or similar functional characteristics: the models are binning variants together and are not able to differentiate them on a finer scale. The small sample size for the phenotype-specific analyses makes it difficult to draw conclusions from those quantile-quantile plots. For the two all-phenotype analyses, the quantile-quantile plots supported the findings from the other visualization methods: the high confidence all-phenotype analysis separated hits from non-hits better than the analysis based on all hits from the GWAS Catalogue. For the all-phenotype model based on the high confidence hits, the distribution consistently deviated above the diagonal, demonstrating that the hits had higher prediction values than non-hits in the same quantiles. The all-phenotype analysis based on all hits in the GWAS Catalogue produced a quantile-quantile plot that closely followed the diagonal for prediction values less than 0.6. This range contained most of the data, since the histograms showed that the distribution of the prediction values is skewed such that most of the data fall in the lower percentiles. The distribution deviated from the diagonal roughly in the prediction value range of 0.6 to 0.7.
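The construction described above amounts to plotting matched quantiles of the two classes against each other. A sketch with hypothetical prediction values:

```python
import numpy as np

def qq_points(hits, non_hits, n_quantiles=100):
    """x = quantiles of the non-hit prediction values, y = matching
    quantiles of the hit prediction values; points above the diagonal
    mean hits score higher at that quantile."""
    probs = np.linspace(0, 1, n_quantiles)
    return np.quantile(non_hits, probs), np.quantile(hits, probs)

rng = np.random.default_rng(3)
qx, qy = qq_points(hits=rng.beta(2, 3, size=60),
                   non_hits=rng.beta(1, 6, size=2000))
frac_above = np.mean(qy > qx)  # fraction of quantiles above the diagonal
```

With only 60 hits, the hit quantiles repeat values, which is exactly the staircase pattern noted for the phenotype-specific analyses.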
Statistical tests
Hypergeometric test
The hypergeometric test was used to identify significant enrichment of hits compared to non-hits in particular prediction value bins, splitting the data into bins of width 0.05 ranging from less than 0.35 up to 0.95. For each model, 13 tests were therefore performed, one per prediction value bin. Based on the resulting contingency tables, significant enrichment of hits was seen for all of the models in at least one bin above 0.55 (with significant p-values ranging from 0.01 to 5.58 × 10^-29), while no enrichment (all p-values greater than 0.2) was seen in bins below 0.55.
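A per-bin enrichment test of this kind can be sketched with scipy's hypergeometric distribution; the counts below are hypothetical:

```python
from scipy.stats import hypergeom

def bin_enrichment_p(n_total, n_hits_total, n_in_bin, n_hits_in_bin):
    """One-sided hypergeometric p-value for enrichment of hits in a bin:
    P(X >= n_hits_in_bin) when n_in_bin variants are drawn without
    replacement from a pool containing n_hits_total hits."""
    return hypergeom.sf(n_hits_in_bin - 1, n_total, n_hits_total, n_in_bin)

# hypothetical counts: 10,000 variants with 200 hits overall;
# a high-prediction bin holds 100 variants, 15 of them hits
p_enriched = bin_enrichment_p(10_000, 200, 100, 15)   # far above the
# expected 2 hits per 100 variants, so strongly significant
p_expected = bin_enrichment_p(10_000, 200, 100, 2)    # at expectation,
# clearly non-significant
```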
CochranMantelHaenszel test
Another test investigated was the asymptotic generalized Cochran-Mantel-Haenszel test, which tests the independence of two possibly ordered factors (prediction values of hits vs. non-hits). As with the hypergeometric test, a contingency table of hits and non-hits stratified by prediction value was created, splitting the data into bins of width 0.05 ranging from less than 0.35 up to 0.95. Rather than a single test per prediction value bin, as with the hypergeometric test, the generalized Cochran-Mantel-Haenszel test is a single omnibus test per model that looks for a trend across the span of prediction values. Similar to the other statistical tests explored in this section, significant p-values were produced for all models (p < 5.3 × 10^-9).
Mann–Whitney U test
A two-sided Mann–Whitney U test can be used to determine whether the distribution of the prediction values for the hits differs significantly from that of the non-hits. The Mann–Whitney U tests whether the ranks of the variants in the hit and non-hit sets differ. Significant p-values were obtained for all analyses, including those with poor AUCs and poor class separation; most notably, the all-phenotype analysis not restricted to the high confidence hits had a Mann–Whitney p-value of 7.17 × 10^-50. It was hypothesized that this significant p-value was due to the class imbalance and/or outliers. To explore these hypotheses, in one test only a random subset of non-hits equal in size to the number of hits was selected for the Mann–Whitney U test, and in another test the outliers were removed. In both situations, the p-values tended to remain highly significant (Table 5).
The significant Mann–Whitney U p-values do not necessarily suggest that the hits and non-hits are well separated by their prediction values. Instead, the p-values highlight differences in ranks between the hits and the non-hits, which may or may not imply class separation. We plotted the hits and non-hits according to their ranks. In all of the plots, the non-hits follow a uniform distribution, whereas the hits follow a different, roughly negatively skewed distribution (Fig. 8). Thus, as with enrichment according to the hypergeometric test, and the Cochran-Mantel-Haenszel test for independence, differences in rank according to the Mann–Whitney U are not particularly informative with regard to class separation between the hits and non-hits according to their prediction values.
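The point that significance does not equal separation can be demonstrated with simulated data: two heavily overlapping score distributions still yield a tiny Mann–Whitney p-value, both on the full imbalanced data and after subsampling the non-hits, mirroring the Table 5 check. All values below are hypothetical:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(4)
non_hits = rng.beta(1, 6, size=5000)   # severe imbalance, mass near 0
hits = rng.beta(2, 5, size=200)        # shifted, but heavily overlapping

u, p = mannwhitneyu(hits, non_hits, alternative="two-sided")
# p is tiny even though the two densities overlap substantially, so a
# significant rank test is not by itself evidence of clean separation

# subsampling non-hits to the size of the hit class, as in Table 5,
# typically leaves the p-value highly significant as well
sub = rng.choice(non_hits, size=hits.size, replace=False)
u2, p2 = mannwhitneyu(hits, sub, alternative="two-sided")
```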
The statistical tests mentioned above do not explicitly measure class separation between hits and non-hits based on their prediction values, which is a key outcome when investigating the predictive accuracy of models for variant prioritization. The hypergeometric test assesses enrichment of hits, the Mann–Whitney U tests for differences in ranks between the hits and non-hits, and the generalized Cochran-Mantel-Haenszel test evaluates independence of the hits and non-hits. Thus, significant p-values from these statistical tests cannot alone be taken as proof of class separation or model performance.
Discussion
In this review we summarized various predictive accuracy measures related to the confusion matrix, visualization methods, and some statistical tests. These methods were described in the context of genetic models for prediction of risk variants in complex traits, in which a class imbalance between the hits and non-hits is often inherent.
The choice of predictive accuracy measures was partially motivated by the measures found in the publications described in the Background, as well as by other measures. Note that two of the papers mentioned there [3, 5] focused on investigating enrichment or depletion of disease- or trait-associated variants with particular functional and genomic features. Since the predictive accuracy measures in those papers did not relate to an output of a prediction value for each variant, those methods were not discussed further.
In summary, the investigation above emphasizes the importance of visualizing the underlying distributions of the classes. The ROC curve is a good starting place, but visualization methods, especially violin plots, are valuable for differentiating models with similar AUCs. A downside of histograms is that the interpretation of the results may vary depending on the bin size. Box plots, in turn, do not offer any information about density. Violin plots, on the other hand, are able to show density without the need for binning while also depicting the summary statistics that would be seen in a box plot (for instance, [20]). Caution is needed when drawing conclusions about model performance from p-values, such as those from the Mann–Whitney U test: significant p-values cannot necessarily be attributed to good separation between hits and non-hits. Visualizing the class distributions seems to be the most informative approach for determining predictive accuracy in these scenarios.
Conclusions
All of the papers mentioned in the introduction apply their model(s) to real data to assess the accuracy of identifying diseaserelevant genetic variants. Predictive accuracy measures and visualization of the prediction values can only show model performance in theory. When evaluating model performance it is also vital to assess the model in real applications.
Abbreviations
AUC: Area under the curve
FNR: False negative rate
FPR: False positive rate
GWAS: Genome-wide association study
HGMD: Human gene mutation database
NPV: Negative predictive value
PPV: Positive predictive value
ROC: Receiver operating characteristic curve
TNR: True negative rate
TPR: True positive rate
References
1. Gagliano SA, Barnes MR, Weale ME, Knight J. A Bayesian method to incorporate hundreds of functional characteristics with association evidence to improve variant prioritization. PLoS ONE. 2014;9:e98122.
2. Iversen ES, Lipton G, Clyde MA, Monteiro AN. Functional annotation signatures of disease susceptibility loci improve SNP association analysis. BMC Genomics. 2014;15:398.
3. Kindt AS, Navarro P, Semple CA, Haley CS. The genomic signature of trait-associated variants. BMC Genomics. 2013;14:108.
4. Kircher M, Witten DM, Jain P, O’Roak BJ, Cooper GM, Shendure J. A general framework for estimating the relative pathogenicity of human genetic variants. Nat Genet. 2014;46:310–5.
5. Pickrell JK. Joint analysis of functional genomic data and genome-wide association studies of 18 human traits. Am J Hum Genet. 2014;94:559–73.
6. Ritchie GRS, Dunham I, Zeggini E, Flicek P. Functional annotation of non-coding sequence variants. Nat Methods. 2014;11:294–6.
7. Xu M, Bi Y, Xu Y, Yu B, Huang Y, Gu L, et al. Combined effects of 19 common variations on type 2 diabetes in Chinese: results from two community-based studies. PLoS One. 2010;5:e14022.
8. Lango H, Palmer CNA, Morris AD, Zeggini E, Hattersley AT, McCarthy MI, et al. Assessing the combined impact of 18 common genetic variants of modest effect sizes on type 2 diabetes risk. Diabetes. 2008;57:3129–35.
9. Janipalli CS, Kumar MVK, Vinay DG, Sandeep MN, Bhaskar S, Kulkarni SR, et al. Analysis of 32 common susceptibility genetic variants and their combined effect in predicting risk of Type 2 diabetes and related traits in Indians. Diabet Med. 2012;29:121–7.
10. R Core Team. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2008. ISBN 3-900051-07-0.
11. Sing T, Sander O, Beerenwinkel N, Lengauer T. ROCR: visualizing classifier performance in R. Bioinformatics. 2005;21:3940–1.
12. Lemon J. Plotrix: a package in the red light district of R. R News. 2006;6:8–12.
13. Hothorn T, Hornik K, van de Wiel MA, Zeileis A. A Lego system for conditional inference. Am Stat. 2006;60:257–63.
14. Hindorff LA, Junkins HA, Hall PM, Mehta JP, Manolio TA. A catalog of published genome-wide association studies. 2010. Available at: www.genome.gov/gwastudies.
15. James G, Witten DM, Hastie T, Tibshirani R. An Introduction to Statistical Learning with Applications in R. New York, NY: Springer; 2013.
16. Malley JD, Malley KG, Pajevic S. Statistical Learning for Biomedical Data. Cambridge: Cambridge University Press; 2011.
17. Davis J, Goadrich M. The relationship between Precision-Recall and ROC curves. In: Proceedings of the 23rd International Conference on Machine Learning (ICML ’06). New York, NY, USA: ACM; 2006. p. 233–40.
18. Lee JK. Road to statistical bioinformatics. In: Lee JK, editor. Statistical Bioinformatics. Hoboken, NJ: John Wiley & Sons, Inc.; 2010. p. 1–6.
19. Vihinen M. How to evaluate performance of prediction methods? Measures and their interpretation in variation effect analysis. BMC Genomics. 2012;13 Suppl 4:S2.
20. Gagliano SA, Ravji R, Barnes MR, Weale ME, Knight J. Circumstantial evidence? Comparison of statistical learning methods using functional annotations for prioritizing risk variants. bioRxiv. 2014:011445.
Acknowledgements
SAG is funded by the Peterborough K.M. Hunter Graduate Studentship, and is also funded through the Armstrong Family via the CAMH Foundation. SAG is a fellow of CIHR STAGE (Strategic Training for Advanced Genetic Epidemiology) – CIHR Training Grant in Genetic Epidemiology and Statistical Genetics. This research was supported by funding from Training grant GET101831. JK is funded by charitable donors at the CAMH Foundation: http://www.supportcamh.ca. JK holds the Joanne Murphy Professor in Behavioural Science. ADP held a Canada Research Chair in the Genetics of Complex Diseases.
Author information
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
SAG conceived of the study, performed the statistical analyses, participated in the interpretation of the data, and drafted the manuscript. ADP participated in the design of the study and the interpretation of the data, and helped to draft the manuscript. MEW participated in the design of the study and the interpretation of the data. JK participated in the design of the study and the interpretation of the data, and helped to draft the manuscript. All authors read and approved the final manuscript.
Additional files
Additional file 1: Sample R code to perform the tests mentioned in this paper, with MyData.txt, sample output data from a model on which to run the code.
Additional file 2: Codeforpaper.R, R code to reproduce the results in this paper, plus the data files it requires (Autoimmunetestset.csv, Braintestset.csv, Nonpheno5e8testset.csv, NonphenoallCattestset.csv). Each data file contains five columns: the identifier for the genetic variant, base position, chromosome number, the classifier (hit = 1, non-hit = 0), and the prediction value.
Rights and permissions
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Gagliano, S.A., Paterson, A.D., Weale, M.E. et al. Assessing models for genetic prediction of complex traits: a comparison of visualization and quantitative methods. BMC Genomics 16, 405 (2015). https://doi.org/10.1186/s12864-015-1616-z
Keywords
 Predictive accuracy
 Genetic prediction
 Receiver operating characteristic curve