How accurate can genetic predictions be?
© Dreyfuss et al.; licensee BioMed Central Ltd. 2012
Received: 4 November 2011
Accepted: 1 July 2012
Published: 24 July 2012
Pre-symptomatic prediction of disease and drug response based on genetic testing is a critical component of personalized medicine. Previous work has demonstrated that the predictive capacity of genetic testing is constrained by the heritability and prevalence of the tested trait, although these constraints have only been approximated under the assumption of a normally distributed genetic risk distribution.
Here, we mathematically derive the absolute limits that these factors impose on test accuracy in the absence of any distributional assumptions on risk. We present these limits in terms of the best-case receiver-operating characteristic (ROC) curve, consisting of the best-case test sensitivities and specificities, and the AUC (area under the curve) measure of accuracy. We apply our method to genetic prediction of type 2 diabetes and breast cancer, and we additionally show the best possible accuracy that can be obtained from integrated predictors, which can incorporate non-genetic features.
Knowledge of such limits is valuable in understanding the implications of genetic testing even before additional associations are identified.
Accurate pre-symptomatic prediction of disease and drug response is a vital component of personalized medicine, which could allow for improved clinical decision-making and targeted prevention strategies, easing both the burden and costs of disease. Already, several companies offer consumers personalized risk assessments, lifestyle recommendations, and 'nutraceuticals' based on their genetic profiles. Unfortunately, most genetic factors associated with common traits explain only a small portion of the phenotypic variance (the “missing heritability” problem), making genetic prediction currently difficult. Investment into studies that assay rare variants and the use of informative polymorphisms that do not individually pass stringent statistical tests of association can improve the accuracy of predictions, but the extent to which predictions can be improved is unclear. Thus, identifying the bounds on the accuracy of predictive genetic testing based on readily-known disease parameters (such as prevalence and heritability) can be an invaluable planning tool.
Although the accuracy of a medical test can be measured in many ways, the concepts of sensitivity and specificity are paramount. Frequently, the test result is continuous (e.g. the individual’s predicted risk), while the clinical decision and true outcome are binary (e.g. either the person will get sick or not), so that different thresholds of the test result yield different pairs of sensitivity and specificity. The receiver operating characteristic (ROC) curve depicts this tradeoff between sensitivity and specificity across all possible thresholds, and the area under this curve (AUC) is the most widely used metric to summarize the accuracy of a test. An AUC of 1 indicates perfect prediction, while an AUC of 0.5 represents random guessing.
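The relationship between thresholds, sensitivity/specificity pairs, and the AUC can be illustrated with a minimal sketch. The scores and outcomes below are hypothetical values chosen for illustration, and the function names are not taken from the paper. The AUC is computed here through its standard pairwise interpretation: the probability that a randomly chosen case receives a higher score than a randomly chosen control, with ties counted as one half.

```python
def sensitivity_specificity(scores, labels, threshold):
    """Classify score >= threshold as positive; return (sensitivity, specificity)."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def auc(scores, labels):
    """Probability that a random case outscores a random control (ties count 1/2)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for c in cases:
        for d in controls:
            if c > d:
                wins += 1.0
            elif c == d:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Hypothetical predicted risks and true outcomes (1 = affected), for illustration only.
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]

# Each threshold gives a different sensitivity/specificity pair (one ROC point).
for t in (0.3, 0.5, 0.7):
    print(t, sensitivity_specificity(scores, labels, t))
print("AUC:", auc(scores, labels))
```

Lowering the threshold raises sensitivity at the expense of specificity, which is the tradeoff that the ROC curve traces out.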
Evidence that a bound on maximum predictive accuracy exists can be found in heritability. The heritability of a trait (in the broad-sense) is the proportion of phenotypic variation in the population that can be attributed to genetic variation; that is, it reflects the contribution of genetic factors relative to environmental ones. Narrow-sense heritability measures the corresponding quantity for additive genetic variance only, which excludes genetic effects such as dominance and epistasis. The heritability of binary phenotypes can be computed directly on the observed binary scale. However, it may also be calculated on a liability scale, where it is assumed that an individual has the binary trait if their risk exceeds a threshold. Both types of heritability can be estimated using family-based studies, such as twin studies, and the two scales can be mapped to each other.
The impact of heritability on genetic test accuracy can be seen by examining its two extremes: a trait that has 100% heritability, such as a Mendelian trait, can be predicted with certainty from the genotype; in contrast, a trait with 0% heritability is not influenced by genetic factors, and thus genetic tests cannot produce any useful information. Previous ground-breaking works have investigated the bounds that prevalence and heritability impose on predictive accuracy using simulations, analytical results utilizing genotype relative risks and their frequencies, and analytical approximations under the assumption of a normally distributed liability [12, 13]. Here, we mathematically elucidate the absolute bounds on the specificities, sensitivities, and AUC for genetic testing given any values of heritability and prevalence of the tested trait, without making any assumptions about the risk distribution.
Common complex traits are typically the combined effect of genetic and environmental factors. Since no practical predictor can account for all factors and their interactions, clinical prediction can at best assign probabilistic risks rather than deterministic outcomes. Viewed on the population level, these risk assignments can be seen as comprising a risk distribution, which is an estimate of the population’s true risk distribution. Maximal predictive accuracy occurs when the estimated risk matches the true risk.
On the observed binary scale, heritability can be written in terms of the genetic risk distribution as

$$h^2 = \frac{\frac{1}{n}\sum_{i=1}^{n}\left(\mathrm{risk}_i - \overline{\mathrm{risk}}\right)^2}{\overline{\mathrm{risk}}\,\left(1 - \overline{\mathrm{risk}}\right)} \qquad (1)$$

where i = 1,…,n indexes people, n is the sample size, $\mathrm{risk}_i$ is individual i’s genetic risk (i.e. the conditional probability of the trait given genes), and $\overline{\mathrm{risk}}$ is the average genetic risk, which equals the average population risk (see Methods). The meaning of risk depends on the context: for instance, when the phenotype is current disease status, the average risk in the population is its prevalence, whereas in prediction of lifetime illness, risk is the lifetime risk of disease. (When possible, we nonetheless opt for the term prevalence.) Equation 1 mathematically expresses that heritability is the proportion of phenotypic variance explained by the genetic risk distribution.
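As a minimal sketch of this relationship, the function below computes heritability on the binary scale as the variance of individual genetic risks divided by the phenotypic variance. It assumes that the phenotypic variance of a binary trait equals the average risk times one minus the average risk; the function name and the example risk vectors are hypothetical.

```python
def binary_scale_heritability(risks):
    """Variance of genetic risk divided by the variance of the binary phenotype."""
    n = len(risks)
    mean_risk = sum(risks) / n
    # Variance explained by genetic risk.
    risk_variance = sum((r - mean_risk) ** 2 for r in risks) / n
    # Assumed variance of a binary (0/1) phenotype with this mean.
    phenotypic_variance = mean_risk * (1.0 - mean_risk)
    return risk_variance / phenotypic_variance


# Hypothetical extremes, for illustration only.
fully_determined = [1.0, 0.0, 0.0, 0.0]   # every risk is 0 or 1
no_genetic_signal = [0.25, 0.25, 0.25, 0.25]  # everyone has the same risk

print(binary_scale_heritability(fully_determined))
print(binary_scale_heritability(no_genetic_signal))
```

The two example vectors share the same average risk but sit at opposite ends of the heritability range, which mirrors the two extremes discussed earlier: a fully determined trait and a trait with no genetic contribution.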
Breast cancer has the same maximal AUC as T2D, albeit with a distinct ROC curve. Breast cancer was found to have a prevalence of 4%, and we calculated its heritability on the binary scale to be 11% (see Methods), which yields a maximum AUC of 89%. Although this is the same maximum AUC as for T2D, the sensitivity/specificity pairs for breast cancer (Figure 3) are not identical to those for T2D, owing to the different disease parameters. For example, to reach a specificity of 99%, sensitivity cannot exceed 24%, which is substantially lower than the corresponding maximal sensitivity of T2D when specificity is 99%. The divergence of these two ROC curves as specificity approaches 100% illustrates the importance of identifying the maximal ROC curve, rather than relying on the maximal AUC alone.
Heritability is the proportion of phenotypic variance explained by all genetic factors, but our analytic approach can treat the proportion of phenotypic variance explained by any particular set of factors. If the proportion of phenotypic variance explained by a particular set of genes is known, that proportion of variance explained could be substituted for heritability in our model. For instance, if a subset of genes could explain 50% of the genetic variance of T2D (i.e. explain 13% of phenotypic variance), then the maximum achievable AUC of this subset would be 80%.
Our method can also be applied in elucidating the maximum accuracy of predictors that integrate features such as gene expression, de novo mutation, body mass index, and lifestyle (which are not fully inherited). The proportion of variance explained by such an integrated predictor can then be greater than heritability. When there are no gene-environment interactions, this difference is the proportion of phenotypic variation that these features explain independently of genes. For example, weekly physical activity can explain 4% of phenotypic variance of T2D (see Methods), is moderately heritable, and was found to not interact with well-known gene variants in T2D. Accordingly, the proportion of variance explained by the integrated predictor composed of genomic profile and physical activity does not increment by the full 4% beyond the heritability of T2D. If the proportion of T2D variance that physical activity explains independently of genes were known to be only 3%, say, then the integrated predictor’s maximum AUC would be calculated based on a proportion of variance explained of 29% (sum of 26% and 3%), which yields a maximum AUC of 90%. If, however, we did not have an estimate for the proportion of T2D variance that physical activity explains independently of genes, then we could conservatively use 4% in the previous calculation, yielding a similar AUC. This analysis applies to predictors based on non-genetic features that are supplemented by genetics. In general, the estimation of the proportion of variance explained by integrated predictors is complicated by the interaction of genetic and non-genetic features; our method can nonetheless be applied when the interaction can be estimated or bounded. Note that genetic testing alone can still accurately predict outcome for some small, extreme risk groups (such as those with highly penetrant variants), but such a test will not benefit the general population without both a high sensitivity and specificity.
Our results are general and apply to any binary trait, and they rely on only two commonly estimated parameters. Although the quality of the results is only as good as the estimates of prevalence and heritability for the population in question, our method allows for ranges of prevalences and heritabilities to be considered, which can provide important insight into predictive accuracies. Nonetheless, care must be taken when applying these statistics, as different estimates apply in different situations. For example, in assessing limits to the prediction of lifelong risk, lifelong risk estimates should be used in place of prevalence estimates. In particular, the ballooning lifelong risk of T2D in the USA implies that genetic prediction of lifetime T2D will become more difficult.
Using this approach, one may evaluate a proposed GWAS based on parameters such as sample size and the number of loci sampled.
Heritability estimates for any binary trait can be used by our method. Broad-sense heritability estimates are needed to cap predictive accuracy, since genetic predictors can exploit dominance and epistatic interactions not measured by narrow-sense heritability estimates. However, if a genetic predictor is constructed as an additive model in line with the assumptions of narrow-sense heritability, then its maximum accuracy can be calculated using narrow-sense heritability; thus, these estimates can also be used, albeit with a slightly different interpretation. Heritability estimates on the normal liability scale can be used after they are transformed to the observed binary scale, e.g. using the method proposed by Dempster and Lerner [8, 9]. Heritability on the binary scale can be sensitive to prevalence, but its use avoids the assumption of normally-distributed liability, which requires that the trait be affected by many genes, all with small effect (normally-distributed liability effectively requires a purely unimodal genetic risk distribution). In fact, when variants with particularly large effects do exist (such as APOE in Alzheimer’s disease, BRCA1 and BRCA2 in breast and ovarian cancer, and LRRK2 in Parkinson’s disease), previous authors have suggested simulations in lieu of their analytical approximation. Moreover, because liability cannot be measured, the distributional assumptions on liability are virtually untestable.
Our maximal ROC curves (Figure 3) can be substantially higher than those given by the beta distribution, which is an accurate proxy for multiple previous reports [10, 12, 13], indicating that the maximal accuracies of genetic prediction may be substantially higher than previously thought. This difference highlights the importance that the risk distribution can have in the power of genetic prediction. Furthermore, as we are only now beginning to uncover the risk distributions of common complex diseases, it seems important to understand the absolute, distribution-independent limits on genetic-test accuracy, which we present here.
We have given exact limits on genetic prediction for any binary trait imposed by the epidemiological parameters of prevalence and heritability. Knowledge of these limits can help delineate the maximal benefits associated with genetic testing, which can allow for cost-benefit analyses, regulation, and clinical guidelines regarding genetic testing even before additional associations are identified. We have also illustrated how these limits can help us prioritize the allocation of research resources, by showing how they can assist in the prioritization and design of future association studies. The calculations presented in this paper could further be used to mitigate the possibility of investing in the development of a genetic test which could never be accurate enough to be of clinical relevance.
To optimize over the set of risk distributions subject to the disease parameters of average risk and proportion of variance explained (PVE), we modeled a categorical distribution (which resembles a histogram) with b + 1 bins located at 0, 1/b, 2/b, … , 1 representing risk groups, so i/b represents the conditional probability of disease given a set of factors for individuals in risk group i (e.g. people in the 1/b group have risk 1/b). An example of such a distribution is depicted in Figure 1. The probability that someone falls into bin i is $p_i$, where the $p_i$'s (for i = 0,…,b) sum to one. We restrict the average risk (e.g. prevalence) and PVE (e.g. heritability) using two observations. (1) By the law of total probability, the unconditional probability of disease is simply the mean of the conditional risk distribution, i.e. it is equal to the average risk. (2) The PVE relates to the risk distribution through Equation 1. (Equation 1 can be understood as the $R^2$ from the regression: binary phenotype = risk + error, where risk is a probability.)
Here, $n_j$ individuals have risk j/b, i.e. they are assigned to risk group (histogram bin) j, and $p_j = n_j/n$ is the probability that a random individual is assigned to risk group j.
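The two constraints on the binned risk distribution can be sketched as follows. This is an illustrative sketch only: the function names and the example counts are hypothetical, and the PVE is computed as the variance of the binned risks divided by k(1 - k), where k is the average risk, which is an assumption consistent with the R² interpretation given above.

```python
def bin_probabilities(counts):
    """Convert counts n_j per risk group into probabilities p_j = n_j / n."""
    n = sum(counts)
    return [n_j / n for n_j in counts]


def average_risk(p):
    """Mean of the conditional risks j/b, weighted by the bin probabilities."""
    b = len(p) - 1
    return sum(p_j * j / b for j, p_j in enumerate(p))


def proportion_variance_explained(p):
    """Variance of the binned risks divided by the assumed phenotypic variance k(1 - k)."""
    b = len(p) - 1
    k = average_risk(p)
    variance = sum(p_j * (j / b - k) ** 2 for j, p_j in enumerate(p))
    return variance / (k * (1.0 - k))


# Hypothetical counts for b = 4 (risk groups 0, 1/4, 1/2, 3/4, 1).
counts = [50, 0, 50, 0, 0]
p = bin_probabilities(counts)

print("average risk:", average_risk(p))
print("PVE:", proportion_variance_explained(p))
```

Any candidate vector p can be checked against a target average risk and PVE in this way before it is considered in the optimization.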
With this model of the risk distribution and constraints, we can identify the best-case AUC and optimal sensitivity/specificity pairs using the procedures detailed below. Because these procedures associate a single genetic risk distribution with the best-case AUC and a potentially different risk distribution with each optimal sensitivity/specificity pair, it is possible that only some of these sensitivity/specificity pairs may be realizable for a single trait in practice. Consequently, these sensitivity/specificity pairs cannot be used directly to derive the maximal AUC.
The numerator of this expression can be conveniently represented as $p^{T} Q p + b^{2} k$, where Q is a symmetric matrix whose entry at row i and column j is $-j(b + i)/2$ for $i \geq j$.
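The quadratic form can be checked numerically against a direct pairwise computation of the AUC. The sketch below is illustrative and rests on two assumptions that are not stated in the text as extracted: the AUC is taken to be the probability that a random case falls in a higher risk bin than a random control (ties counted as one half), and the denominator of the AUC expression is taken to be b²k(1 - k). Function names and the example distribution are hypothetical.

```python
def auc_direct(p):
    """AUC from bin probabilities p[0..b] by comparing cases and controls bin by bin."""
    b = len(p) - 1
    k = sum(p[i] * i / b for i in range(b + 1))
    # Distribution of risk bins among cases and among controls.
    case = [p[i] * (i / b) / k for i in range(b + 1)]
    ctrl = [p[i] * (1 - i / b) / (1 - k) for i in range(b + 1)]
    total = 0.0
    for i in range(b + 1):
        total += 0.5 * case[i] * ctrl[i]      # ties count one half
        for j in range(i):
            total += case[i] * ctrl[j]        # case in a higher bin than control
    return total


def auc_quadratic(p):
    """AUC as (p^T Q p + b^2 k) / (b^2 k (1 - k)); the denominator is an assumption."""
    b = len(p) - 1
    k = sum(p[i] * i / b for i in range(b + 1))
    quad = 0.0
    for i in range(1, b + 1):                 # bin 0 is not part of the vector p here
        for j in range(1, b + 1):
            hi, lo = max(i, j), min(i, j)
            q_ij = -lo * (b + hi) / 2.0       # symmetric: entry -j(b + i)/2 for i >= j
            quad += p[i] * q_ij * p[j]
    return (quad + b * b * k) / (b * b * k * (1 - k))


# Hypothetical distribution over risk groups 0, 1/2, 1 (b = 2).
p = [0.5, 0.25, 0.25]
print(auc_direct(p), auc_quadratic(p))
```

The agreement of the two functions on the same vector p is a useful sanity check when implementing the objective for a quadratic programming solver.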
where the sum of the $p_i$'s (for i = 1,…,b) must not exceed 1, and each $p_i$ is bounded between 0 and 1.
The parameters k, PVE, and b are predefined constants. Note that for b = 100, as well as for all the values of b we examined, Q is negative definite, so that this is a convex program. Hence, there are efficient solution methods to identify the global maximum. Using the quadprog package in the R software, we solved this program for values of k and PVE with b = 100. When b = 1000, all maximal AUCs shown in Figure 2 change by less than 0.01%. In fact, using b = 10 does not change any of these maximal AUCs more than 1% from that calculated with b = 1000. Note also that given an estimated risk distribution vector p, a researcher can directly calculate the AUC from the objective function. To calculate the AUC of the beta distribution for given levels of k and PVE, we discretized the beta distribution with shape parameters α = k(1/PVE − 1) and β = (1 − k)(1/PVE − 1), which uniquely satisfy the constraints.
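That the beta shape parameters k(1/PVE − 1) and (1 − k)(1/PVE − 1) reproduce the target average risk and PVE can be verified from the standard formulas for the mean and variance of a beta distribution. The sketch below is illustrative; the function names are hypothetical, and the PVE is again assumed to be the risk variance divided by k(1 - k). The inputs are the breast cancer values quoted in the text (prevalence 4%, binary-scale heritability 11%).

```python
def beta_shape_parameters(k, pve):
    """Shape parameters k(1/PVE - 1) and (1 - k)(1/PVE - 1)."""
    scale = 1.0 / pve - 1.0
    return k * scale, (1.0 - k) * scale


def beta_mean_and_pve(alpha, beta):
    """Mean of Beta(alpha, beta) and its variance as a fraction of mean * (1 - mean)."""
    total = alpha + beta
    mean = alpha / total
    variance = alpha * beta / (total ** 2 * (total + 1.0))
    return mean, variance / (mean * (1.0 - mean))


# Prevalence and binary-scale heritability for breast cancer, as quoted in the text.
alpha, beta = beta_shape_parameters(0.04, 0.11)
mean, pve = beta_mean_and_pve(alpha, beta)
print(alpha, beta)
print(mean, pve)
```

The names `alpha` and `beta` are used in the code only to avoid confusion with the number of bins b.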
We optimized sensitivity for any given value of specificity, average risk, proportion of variance explained, and threshold using a linear programming model. This was implemented in the lpSolve package in R using 1000 bins. We then optimized the sensitivities over the thresholds to obtain the maximal sensitivity for every specificity, average risk, and proportion of variance explained.
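For a fixed threshold, the quantities entering such a linear program can be evaluated directly from a binned risk distribution. The sketch below is not the authors' linear program; it only evaluates sensitivity and specificity for one candidate distribution, under the assumption that individuals in risk bins at or above the threshold bin are classified as positive. The function name and example values are hypothetical.

```python
def sensitivity_specificity_from_bins(p, t):
    """Sensitivity and specificity when bins t..b are called positive.

    p[i] is the probability of risk group i, whose risk is i/b.
    """
    b = len(p) - 1
    k = sum(p_i * i / b for i, p_i in enumerate(p))  # average risk
    # Cases captured above the threshold, as a fraction of all cases.
    sensitivity = sum(p[i] * (i / b) for i in range(t, b + 1)) / k
    # Controls below the threshold, as a fraction of all controls.
    specificity = sum(p[i] * (1 - i / b) for i in range(t)) / (1 - k)
    return sensitivity, specificity


# Hypothetical distribution over risk groups 0, 1/2, 1 (b = 2).
p = [0.5, 0.25, 0.25]
for t in range(3):
    print(t, sensitivity_specificity_from_bins(p, t))
```

Once the average risk k is held fixed, both sums are linear in p, which is consistent with the use of a linear programming model for each threshold.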
To calculate the proportion of T2D variance explained by physical activity we used Equation 1, where the risk distribution was defined by the prevalence and the relative risks of exercise. To calculate the heritability of breast cancer on the binary scale we used twice the difference in correlation between monozygotic and dizygotic twin pairs, where correlations were computed on binary outcomes from 44,788 pairs of Nordic twins.
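The twin-based calculation can be sketched as follows. The pair counts are hypothetical and are not the Nordic twin data. The correlation is computed on binary outcomes by treating the two twins in a pair symmetrically (each pair is counted in both orders), which is an assumption about how the correlation was obtained; the heritability is then twice the difference between the monozygotic and dizygotic correlations, as described above.

```python
def twin_correlation(both_affected, one_affected, neither_affected):
    """Correlation of binary outcomes within twin pairs, twins treated symmetrically."""
    pairs = both_affected + one_affected + neither_affected
    # Proportion of affected individuals among all twins.
    q = (2 * both_affected + one_affected) / (2 * pairs)
    # Probability that both twins in a pair are affected.
    joint = both_affected / pairs
    return (joint - q * q) / (q * (1 - q))


def heritability_from_twins(r_mz, r_dz):
    """Twice the difference between monozygotic and dizygotic correlations."""
    return 2 * (r_mz - r_dz)


# Hypothetical counts of pairs: (both affected, exactly one affected, neither affected).
r_mz = twin_correlation(30, 140, 830)
r_dz = twin_correlation(20, 160, 820)
print(r_mz, r_dz, heritability_from_twins(r_mz, r_dz))
```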
AUC: area under ROC curve
T2D: type 2 diabetes
GWAS: genome-wide association study
PVE: proportion of variance explained
We dedicate this to Marco Ramoni, who tragically passed away in June 2010.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.