Example showing positive selection predictors and eQTLs for a single gene. This figure shows the input to the conditional logistic regression for a single gene (Cholesterol Ester Transfer Protein, or CETP) from the Zeller 2010 data set. Its purposes are to illustrate the prediction problem within a single gene (stratum) and to showcase the relative degree of serial auto-correlation (smoothness) of the different predictors. At the top of the figure, each SNP is indicated with a symbol reflecting its location within the gene. Only one SNP, rs1532625, is an eQTL in the Zeller data set (indicated with a large orange hourglass symbol), and it happens to lie in an intron; the two major splice isoforms of CETP are illustrated at the bottom of the figure for reference. rs1532625 does NOT show any particular sign of being under positive selection. The conditional logistic regression used in this manuscript is fit to eQTLs such as rs1532625, with genes such as CETP treated as individual strata (equivalent to 1-to-many case-control matching in a clinical trial). In this data set CETP contains only a single eQTL, but this is not always the case. Four predictors used in this manuscript are scaled to empirical Z-scores so that they fit on the same chart; a fifth potential predictor (composite of multiple signals, cms), very powerful in other contexts, is also shown to illustrate the issue with auto-correlation. Conditional logistic models depend on a degree of independence among the predictors. Because the cms score (blue line) has such strong serial auto-correlation (as would any positive selection measure that is smoothed in a window of any size), it is not independent of the within-gene location (symbols at the top), which is used as an independent predictor. Even Fst (the green line) shows too much serial auto-correlation for the model to converge in the Mangravite data set, which was part of the motivation for developing H|H.
The other three positive selection measures, including H|H, are highly non-smooth, so they can be fit to logistic models where individual strata contain short regions of DNA.
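The smoothness contrast described in this caption can be illustrated with a minimal sketch on synthetic data. Nothing below comes from the Zeller or Mangravite data sets: the per-SNP score is simulated noise, the 25-SNP window is arbitrary, and the helper names `empirical_z` and `lag1_autocorr` are hypothetical. The sketch also assumes that "empirical Z-score" means centering and scaling by the sample mean and standard deviation, which may differ from the procedure actually used in the manuscript.

```python
import numpy as np

def empirical_z(x):
    """Center and scale by the sample mean and standard deviation (assumed definition)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=1)

def lag1_autocorr(x):
    """Lag-1 serial auto-correlation of a sequence ordered by genomic position."""
    d = np.asarray(x, dtype=float)
    d = d - d.mean()
    return float(np.dot(d[:-1], d[1:]) / np.dot(d, d))

rng = np.random.default_rng(0)

# Hypothetical per-SNP score with no smoothing (synthetic, not real data).
raw = rng.normal(size=500)

# The same score averaged in a sliding window of 25 SNPs, mimicking a windowed statistic.
window = 25
smoothed = np.convolve(raw, np.ones(window) / window, mode="valid")

r_raw = lag1_autocorr(empirical_z(raw))
r_smooth = lag1_autocorr(empirical_z(smoothed))

print(f"lag-1 auto-correlation, unsmoothed: {r_raw:.2f}")
print(f"lag-1 auto-correlation, smoothed:   {r_smooth:.2f}")
```

In this toy setting the windowed score has a lag-1 auto-correlation close to 1, while the unsmoothed score stays near 0. A score that barely changes between adjacent SNPs therefore carries almost the same information as position within the gene, which is the dependence problem the caption attributes to cms.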