- Open Access
Conditional entropy in variation-adjusted windows detects selection signatures associated with expression quantitative trait loci (eQTLs)
BMC Genomics volume 16, Article number: S8 (2015)
Over the past 50,000 years, shifts in human-environmental or human-human interactions shaped genetic differences within and among human populations, including variants under positive selection. Shaped by environmental factors, such variants influence the genetics of modern health, disease, and treatment outcome. Because evolutionary processes tend to act on gene regulation, we test whether regulatory variants are under positive selection. We introduce a new approach to enhance detection of genetic markers undergoing positive selection, using conditional entropy to capture recent local selection signals. Results We use conditional logistic regression to compare our Adjusted Haplotype Conditional Entropy (H|H) measure of positive selection to existing positive selection measures. H|H and existing measures were applied to published regulatory variants acting in cis (cis-eQTLs), with conditional logistic regression testing whether regulatory variants undergo stronger positive selection than the surrounding gene.
These cis-eQTLs were drawn from six independent studies of genotype and RNA expression. The conditional logistic regression shows that, overall, H|H is substantially more powerful than existing positive-selection methods in identifying cis-eQTLs against other Single Nucleotide Polymorphisms (SNPs) in the same genes. When broken down by Gene Ontology, H|H predictions are particularly strong in some biological process categories, where regulatory variants are under strong positive selection compared to the bulk of the gene, distinct from those GO categories under overall positive selection. . However, cis-eQTLs in a second group of genes lack positive selection signatures detectable by H|H, consistent with ancient short haplotypes compared to the surrounding gene (for example, in innate immunity GO:0042742); under such other modes of selection, H|H would not be expected to be a strong predictor.. These conditional logistic regression models are adjusted for Minor allele frequency(MAF); otherwise, ascertainment bias is a huge factor in all eQTL data sets. Relationships between Gene Ontology categories, positive selection and eQTL specificity were replicated with H|H in a single larger data set. Our measure, Adjusted Haplotype Conditional Entropy (H|H), was essential in generating all of the results above because it: 1) is a stronger overall predictor for eQTLs than comparable existing approaches, and 2) shows low sequential auto-correlation, overcoming problems with convergence of these conditional regression statistical models.
Our new method, H|H, provides a consistently more robust signal associated with cis-eQTLs compared to existing methods. We interpret this to indicate that some cis-eQTLs are under positive selection compared to their surrounding genes. Conditional entropy indicative of a selective sweep is an especially strong predictor of eQTLs for genes in several biological processes of medical interest. Where conditional entropy is a weak or negative predictor of eQTLs, such as innate immune genes, this would be consistent with balancing selection acting on such eQTLs over long time periods. Different measures of selection may be needed for variant prioritization under other modes of evolutionary selection.
In the post-genome era, major research initiatives are directed towards identifying non protein-coding genetic variants with an effect on the expression of nearby genes: cis-expression quantitative trait loci (cis-eQTLs) [1, 2]. In parallel, comparative analysis  of humans and other primates, as well as human populations, has revealed categories of genes evolving under different selective pressures [4, 5]. For example, the phylogenetic p-values program phyloP  has been used to show that disease markers in genes involved in central nervous system (CNS) developmental defects  have undergone positive selection in the human lineage. Humans have more fixed genetic differences in CNS development genes, compared to our close primate relatives; therefore, phyloP identifies mutations in these CNS development genes. The interpretation is that these mutations conferred differential fitness in the primate ancestors of modern humans, causing many mutations in such genes to rapidly accumulate and fix. Conversely, other genes and especially regulatory regions show human-lineage specific purifying selection , leading to a loss of ancestral diversity. Therefore, it might be expected that different selection measures would be more or less useful in different genes or regulatory features.
In this report, we explore positive selection signatures not in disease genes/markers arising from our pre-human ancestors, but in the cis-eQTLs within/adjacent to genes. Positive selection in this report is confined to a relatively recent timeframe, and is controlled for the human population positive selection level on the entire gene and sub-gene location (exon, intron, intergenic region, flanking within 1kB of promoter or terminator, or untranscribed region/UTR). The positive selection signals used here are measured using allele frequencies and linkage-related properties within and between human populations via the results of the 1,000 Genomes Project . Detecting a controlled association between positive selection and cis-eQTLs is technically challenging from a statistics standpoint, because it introduces considerable dependencies (co-linearity) between the prediction (positive selection, allele frequency, sub-gene region) and stratification (gene) variables. The effort is justified because eQTLs, both cis  and trans  have shown considerable relevance to human disease. Methods to prioritize such variants have considerable practical consequences, since most disease heritability is driven by regulatory markers outside of coding regions [12, 13].
In order to fit statistical models of cis-eQTLs within genes, we introduce a new measure of positive selection, termed Adjusted Conditional Haplotype Entropy (H|H). Conditional entropy is commonly used in information/linguistics approaches , but also sees use in diverse bioinformatics applications [15, 16]. Among existing measures, this conditional entropy is most closely related to the integrated haplotype score (iHS) of Voight et al. , however, in the results below we will show that H|H provides two crucial advantages. First, H|H is a significantly stronger predictor of cis-eQTLs than other measures, and second, H|H values can vary widely even between nearby markers. H|H is a measure of haplotypes but it depends on an individual marker: for example, a minor variant leading to a selective sweep of the pre-existing major haplotype would result in a very low conditional entropy and thus a very high H|H score, but it would be in linkage to other major haplotype markers that would have very low H|H scores. This divergence between nearby markers and even markers in moderate linkage allows for models incorporating H|H to converge when models utilizing other positive selection measures do not.
Results and discussion
Overall research design
The chief goal of this manuscript is to identify measures of positive selection which are able to pick out cis-eQTLs, against a background of non-eQTL markers within an individual gene. However, such measures of positive selection are not uniform throughout the genome. In particular, positive selection signatures are concentrated in certain gene ontology (GO)  categories. Therefore, different measures of positive selection may be appropriate to pick out eQTLs in different GO categories, and this appropriateness could easily be counter-intuitive - genes under positive selection could have a high background level of positive selection, meaning positive selection measures might not be useful in picking out eQTLs within genes which are highly selected overall. Therefore, the secondary goal of this manuscript is to identify differences in which positive selection measures may be most appropriate to identify eQTLs in different GO categories. To test for such difference, we add interaction terms our models; this is not a GO enrichment test. GO enrichment tests implemented in panther  are also reported in the supplement (in additional file 1), but these may not have the desired interpretation. This manuscript also reports a new measure of positive selection which is designed to support this application, chiefly by being as non-smooth as possible (and thus able to distinguish eQTLs from non-eQTLs which are close by on the chromosome.) Figure 1 illustrates the underlying problem: the goal is to find a statistical model which uses positive selection measures to distinguish eQTLs (the one orange hourglass, in Figure 1) from all of the non-eQTL markers in the same gene (all other symbols). The conditional logistic regression adjusts for the relevative prevalance of eQTLs in different genes (for example, the gene shown in Figure 1 only has one eQTL in the Zeller 2010 data set.)Our statistical approach has two phases. In the first (hypothesis-generation) phase, seven modestly-powered QTL data sets from either Lymphoblastic Cell Lines (LCLs), Monocytes or human liver [20–25] are used to pick prediction variables, and, categories of genes in which positive selection scores differentiate cis-eQTLs from the surrounding gene. In the statistical tests of this hypothesis-generation phase, other measures of positive selection where either not strong predictors of eQTLs; or, attempts to utilize measures other than H|H failed because the statistical models did not converge. However, because we cannot guarantee that the statistical model generated by this first stage is not subject to over-fitting, or alternatively to cherry-picking, we test that statistical model in a second phase. In this second (hypothesis-testing) phase, we test specific biological findings in the much larger cis-eQTL data set which powers the blood eQTL browser described by Westra et al. .
In both phases, conditional logistic regression is used to fit statistical models to the different cis-eQTL data sets. GO categories and other measures individually identified in the hypothesis-generation phase are tested as a group in the hypothesis-testing phase.
Hypothesis-testing phase in the Chicago QTL Browser data
For purposes of hypothesis generation, seven diverse eQTL data sets (drawn from six independent high-throughput studies) were chosen from those deposited in the Chicago QTL genome browser  (http://hsb.upf.edu/), and reduced to their overlap with the set of Genotype-Tissue Expression (GTEx)  candidate QTLs. These candidate QTLs are a superset including all GTEx QTLs as well as other non-QTL markers for which GTEx performed statistical tests; these are chosen by GTEx as a representative panel to test for effects, and not because of prior evidence that they are indeed eQTLs. Genes (or intergenic regions, defined here as more than 1kB from any promoter or transcriptional terminator) containing no eQTLs were removed independently in each data set. The focus is on hypothesis generation, parameter selection and model specification, rather than on producing a rigorous finding. These seven eQTL data sets were chosen because they contained at least 10,000 eQTLs after overlap with the GTEx candidates. The seven data sets chosen are listed in Table 1 along with summary statistics of those data sets after they were reduced to the overlap with the GTEx set of candidate QTLs.
When statistical models were fit to these individual data sets, the results were quite heterogeneous (see below in this section). Therefore, GO biological-process categories were used to test whether the inter-data set heterogeneity might be explained by differences in the underlying biology of the genes for which eQTLs were detected by the different approaches. Alternative, non-exclusive explanations for heterogeneity among these data sets include: source tissue differences, participant population differences, and sample size/ascertainment bias differences, any or all of which might reasonably affect the extent to which eQTLs appear more positively selected than nearby markers. In order to produce a preliminary hypothesis on this question, each gene was assigned to a single biological-process GO category, chosen among the available GO assignments (see methods), with genes assigned to multiple GO categories assigned to the single GO category containing closest-to-125 ENCODE  genes. This is an ad hoc approach that produces single assignments to biologically relevant GO categories; it was driven by inspection of the GO categories of most interest to our research group. Size-filtering of some kind is standard in any study of GO categories. Note that this is not a GO enrichment test, and is neither intended nor expected to give a valid GO enrichment result, but to test for an interaction between GO category and our QTL predictors. So, for example, these results cannot be interpreted to indicate in which GO categories QTLs are especially common. The reduction to a single GO assignment for each gene is both a simplifying assumption and a technical requirement for the type of model fitting that is used in the hypothesis-generation phase; the GO category size of 125 was chosen by inspection - GO categories containing more than 200 genes were generally not biologically informative (e.g. the most extreme case, GO 0008150, biological process), while GO categories below size 50 are unlikely to have enough representatives in our data to detect an association, especially an association which is reproduced in several data sets containing a non-overlapping combination of genes. Small GO categories are often not-represented at all in many of our data sets. This procedure will be biased to detect effects from larger GO categories, but the statistical test should still be valid, in the same way that a non-uniformly distributed error rate (e.g. in a GWAS, assigning 50% of the errors to the 5% of the markers which are in coding regions)
This may introduce a bias into the types of GO categories which are selected in this hypothesis-generation phase; yet another reason why the hypothesis-testing phase was required. The GO assignment reduction is not carried out in the hypothesis-testing phase, so the hypothesis-testing phase should be free of any bias introduced by this favoritism.
In each of the seven data sets, conditional logistic regression was used to fit six statistical models, each with gene/intergenic region as a stratification variable, which are given in Table 2.
Except for our Adjusted Haplotype Conditional Entropy (H|H), all of the values in the formulas for Table 2 are taken from the 1,000 Genomes Selection Browser, and converted to a sum of logs (P value when reported, but MAF is similarly transformed although it is not a p-value) across thousand genomes populations on a per-SNP basis. Identification as an eQTL is treated as a random variable, with a logistic distribution fit on a per-gene basis to an effect from one or more of the following variables that are included in the formulas in Table 2: Adjusted Haplotype Conditional Entropy (H|H, the new measure introduced in this manuscript), the absolute value of the differential allele frequency (|ΔDAF|), the differential allele frequency (ΔDAF), the change in integrated haplotype heterozygosity (ΔiHH), the coancestry coefficient (F-statistic or Fst ) the integrated haplotype score (iHS), the minor allele frequency (MAF), and the location within the gene of an individual SNP (either flanking, UTR, intronic or exonic; or, for intergenic regions, intergenic); if an individual parameter is fit independently in each singletized GO term, it is appended to indicate such an interaction (with × GO).
The results of the fits for the formulas in Table 2 are given in Table S1 (in the supplement, additional file 1). The parameter values in Table S1 are summarized and interpreted in the next few paragraphs, and typical values for the Z-scores of the individual components are given in Table 3. As shown in Table 3 H|H is generally a much stronger predictor than other positive selection measures, with much higher Z-scores of consistent sign; ΔiHH and iHS are both weaker predictors, and the relationship between the sign of the value and whether or not a marker is an eQTL actually changes between data sets.
Minor allele frequency is the strongest predictor, and heavily modulates the predictive power of all other predictors (results not shown; this is an expected consequence of sample size constraints/ascertainment bias within the individual source studies.) Otherwise, only one pair of predictors interact (change substantially in predictive power when both are included): |ΔDAF| and Fst. For |ΔDAF|, two Z-scores are given in Table 3: without (before the slash) and with (after the slash) Fst in the model. Fst is a good predictor but does not converge on the Mangravite data set, indicative that the measure is overly smooth (see Figure 1).
Model 6 is underlined in the list above because it is used for hypothesis generation; Model 6 includes only our Adjusted Haplotype Conditional Entropy (fit on a per GO term basis), Minor Allele Frequency (likewise), and Differential Allele Frequency (fit to each data set as a whole). Other positive selection measures either were weak predictors in one or more data sets (ΔiHH and iHS; |ΔDAF| unless Fst is also included); or, caused a failure of the model to converge in some data sets (Fst, ΔDAF if included as an interaction term), because of excessive non-independence (referred to as co-linearity in the logistic regression literature ) with either the other predictors or with the stratification variable. ΔDAF is a strong predictor and we were able to include ΔDAF in models that did converge; ΔDAF serves a crucial purpose in our models because it differentiates the effects of population bottlenecks from selective sweeps. However, unfortunately a model including a ΔDAF interaction with GO terms was tried but did not converge, so ΔDAF was fit globally. Notably, of the positive selection measures, only our Adjusted Haplotype Conditional Entropy is a strong predictor which also converges in all of these data sets when fit on a per-GO term basis.
Other positive selection measures that report values for contiguous blocks of gene sequence are highly useful in gene-prioritization applications; however, they are generally unsuited for the challenge addressed here, which is to prioritize variants within genes, and would be expected to have severe convergence issues, so they were not considered (Figure 1 illustrates this issue with the Composite of Multiple Signals). Future applications of this method would be in a candidate gene context, however in order to assess the predictive power of the measure, whole genome results are used. A positive selection measure that is constant across an individual gene may identify genes that are in turn eQTL rich, but this would tend to produce a failure to converge. In any case, in the currently available data, differences in eQTL richness between genes cannot be distinguished from ascertainment bias. The conditional information haplotype score used here was developed in part to address these technical issues, which are related to the initial observation made by biologists in our group - that a marker in a long haplotype might be only a few hundred bases away from a second marker with which the long haplotype is in weak linkage. In these data, only our new conditional haplotype information has the properties that it is both a strong predictor for eQTLs and allows a convergent statistical fit in relevant cases. Therefore, Model 6 was chosen to carry on to the hypothesis generation step.
GO ID's at high and low extremes of selection score in hypothesis-generation data set
A heuristic score was used to combine the results of the fit to Model 6, for each GO ID in each of the seven data sets. To generate this heuristic score from the fit of Model 6 to each permutation, the effect size is moved one standard error towards 0 (minimum of 0) and then summed across all 7 data sets. There is an overall association between our H|H and eQTLs; by using the effect size rather than the Z-score, we avoid identifying individual GO IDs solely because they contain many genes.
In order to establish preliminary confidence bounds on the range of variability for these prediction by GO category interactions, Model 6 was also fit to 100 randomly permuted versions of the data sets above. In each permutation, assignments between genes assigned to each single, medium-sized GO category, were randomized. In each permutation, each randomized medium-sized GO category contained the same number of member genes as the corresponding original GO category. By generating the same heuristic score (sum of effect sizes) in each permutation, a confidence interval was obtained.
The effect sizes and standard errors for the GO categories outside of the 90% confidence interval (see Meta-analysis of conditional logistic regression regression results for details and justification of this overall approach) are given in Table 4. The following GO categories, hence Group 1, were below the 90% confidence interval from the permutations (that is, in these categories, overall the Conditional Haplotype Entropy was low for eQTLs compared to other GO categories):
http://amigo.geneontology.org/amigo/term/GO:0006367 transcription initiation from RNA polymerase II promoter
http://amigo.geneontology.org/amigo/term/GO:0006396 RNA processing
http://amigo.geneontology.org/amigo/term/GO:0008544 epidermis development
http://amigo.geneontology.org/amigo/term/GO:0042742 defense response to bacterium
While the following GO categories, hence Group 2, were above the 90% confidence interval from the permutations (that is, in these categories, overall the Conditional Haplotype Entropy, which is generally higher for eQTLs, was especially-relatively-high for eQTLs compared to other GO categories):
http://amigo.geneontology.org/amigo/term/GO:0001501 skeletal system development
http://amigo.geneontology.org/amigo/term/GO:0006869 lipid transport
http://amigo.geneontology.org/amigo/term/GO:0006936 muscle contraction
http://amigo.geneontology.org/amigo/term/GO:0007186 G-protein coupled receptor signaling pathway
http://amigo.geneontology.org/amigo/term/GO:0009653 anatomical structure morphogenesis
http://amigo.geneontology.org/amigo/term/GO:0016567 protein ubiquitination
http://amigo.geneontology.org/amigo/term/GO:0018108 peptidyl-tyrosine phosphorylation
http://amigo.geneontology.org/amigo/term/GO:0051056 regulation of small GTPase mediated signal transduction
Group 1 includes bacterial defense genes which might be expected to experience balancing selection[35, 36], which would preserve ancient diversity, allowing time for recombination to produce short haplotypes; since the epidermis serves a largely infection-defense role, we speculate that balancing selection also acts there in order to maintain within-population diversity in pathogen environment. For the other categories in Group 1, we propose simply that the last 50,000 years is too short a time frame for many beneficial mutations to arise in the highly-conserved core functions of molecular biology.
Group 2 is of particular interest because it includes so many current and emerging drug targets[38, 39], as well as the lipid transport genes which play such a large role in western lifestyle diseases[40, 41]. Follow-up work will focus on utilizing these predictions to enrich the variant discovery activities of the Pharmacogenomics Research Network (PGRN).
An additional GO term (Cellular Metabolic Process, GO:0044237) was also above the confidence interval but we chose to exclude it from the hypothesis-testing phase, because by our approach it produces a conglomerate taxon of genes that are simply not well-understood enough to be assigned to any GO ID with fewer members. This is not intended to be a rigorous statistical test of these associations; instead, these two categories are tested in a larger, more extensively genotyped hypothesis-testing phase, described next.
Hypothesis-testing phase in the Westra data set/Blood eQTL Browser
The following variant of Model 6 was then fit to a single, substantially larger data set of eQTLs from whole blood, as a hypothesis-testing phase:
This hypothesis-testing differed from Model 6 in the hypothesis-generation phase above in that:
1) intergenic regions (defined as more than 1kB from an adjacent gene) were associated with GO IDs, and thus groups, of both their flanking genes,
2) all genes associated with the GO IDs in Group 1 and/or Group 2 were included (and not just those that were not also assigned to a second GO ID which had closer-to-125 members), in order to avoid any biases that might have been introduced by our size-filtering,
3) the interaction terms were fit to each of Group 1 and Group 2 (and independently to the few genes and intergenic regions that were a member of GO IDs in both),
4) the data set itself includes more eQTLs as a fraction of all markers, including eQTLs having effect sizes too small to be detected in the Chicago eQTL browser data sets used for the hypothesis-generation phase.
In this test, shown in Table 5Group 1 and Group 2 do show statistically significant differences in the effect size for our Conditional Haplotype Entropy (H|H) measure (two-tailed P < 0.012 extrapolated from the standard error of the two t-test results from the conditional logistic regression). This confirms the biological findings outlined in the previous section, although Group 1 eQTLs shows no association with H|H at all (rather than a negative association) and the association in Group 2 is somewhat smaller than it was in the hypothesis-generation phase data. These differences may be attributable to some degree of cherry-picking in the hypothesis-generation phase (Group 2 would be expected to include GO IDs in which the effect size for H|H appeared strong by chance, implying that there are many more categories of GO IDs for which a significant difference in H|H effect size exists), or may imply that H|H is a predictor only for strong eQTLs (i.e. having a large effect on expression), which are in turn readily detected in the smaller data used in the validation phase above but swamped by weaker eQTLs in the larger validation data set. These possibilities will need to be distinguished in future work.
GO enrichment test on normalized likelihoods
The previous results arise from incorporation of interaction terms into regression methods, and do not utilize existing approaches for GO enrichment. In the supplement (additional file 1), we present the results of an alternative approach where positive selection methods are added to the logistic model, the improvement in model prediction is assessed on a per-gene basis, and PantherDB is used to test for an association between model improvement and biological function GO category. This method finds GO categories containing relatively few genes and a significant interaction is not detected in the hypothesis-testing data set. A supplementary report on this alternative approach can be found in additional file 1.
Our Adjusted Conditional Haplotype Entropy (H|H) will be distrubted as UCSC Genome Browser tracks from the Center for Pharmacogenomics webpage (http://pharmacogenomics.osu.edu/). Conditional entropy indicative of a selective sweep is a strong predictor of eQTLs for genes in some medically interesting biological processes but not in others. In categories of genes where H|H is a weak or negative predictor of eQTLs, such as innate immune system genes, this would be consistent with balancing selection acting on such eQTLs. The evolutionary signatures underlying eQTLs in different biological processes, populations and systems vary widely.
GTEx candidate SNPs
The GTEx consortium has filtered human genetic variation for excess linkage disequilibrium, producing a list of 6,820,472 candidate variations for which GTEx plans to perform statistical tests in their tissues, which we downloaded from their ftp site on July 5, 2014.
First, because this list of SNPs has been filtered for D' values, the risk of over-counting in our statistical model fits is reduced. Second, we anticipate that tissue-specific QTLs identified by GTEx will be the most relevant data for future applications that may incorporate selection as a component.
Adjusted conditional entropy in 1,000 genomes data
For these calculations, 1,000 genomes data release 3 (Nov. 23, 2010) was used, including all single nucleotide polymorphisms (but not other types of variation) to which a reference SNP cluster (rs number) had been assigned. Our method for finding positively-selected (long) haplotypes is based on information theoretical considerations, in particular the Khinchin axiom for the Shannon entropy, also known as the (Shannon) entropy decomposition formula; for the purposes of this manuscript, the Khinchin axiom is used only to justify the summation of conditional entropy in different populations of roughly equal size. The formulas below describe the decomposition of the entropy for a population of haplotypes (in a given window), into a total entropy and a conditional entropy associated with a minor variant - the conditional entropy associated with the major variant would also be a part of this decomposition but is not used for our score. Frequency-dependent terms, the total entropy, the conditional entropy for the minor variant, and also any population-specific effects are all combined into an ad-hoc linear Adjusted Conditional Haplotype Entropy (H|H), which has a mean of 0 in each 1,000 genomes population.
The difference between the total entropy and the conditional entropy given the minor variant quantifies the influence of the minor variant on the diversity of the haplotypes in the window. The larger the difference between entropies, the more conserved (longer) the haplotype that carries the minor variant, consistent with a recent selective sweep.
First, independently in each population, a window is established around each marker SNP, m (where m must be a GTEx Candidate SNP, defined above, but the SNPs in the window are not so constrained):
(meaning: bpmin and bpmax are chosen such-that the sum is at least 10 but otherwise as small as possible).
Where Hi is the Shannon entropy at each SNP in the window (here using the natural log as the base) in the corresponding 1,000 genomes population. The sum of Hi in the window would correspond to entropy of the window if the individual positions were all mutually independent. The value of 10 was chosen arbitrarily in order to make the results computationally tractable.
This gives a window that will be shorter around markers in more variable regions of the genome, and for the same markers will be shorter in more genetically diverse populations; this is intended to account for background differences in haplotype lengths between populations and regions.
Within this window, each 1,000 genomes chromosome is assigned to a haplotype, containing all the markers from bpmin to bpmax excluding the original marker m. Given that each haplotype J is observed in some fraction n*J of the 1,000 genomes participants in population p, we define H*pm:
(this is the true entropy of the window - zero minus the sum over all haplotypes J in the population of the haplotype frequency times the log haplotype frequency). Further, given that each haplotype J is observed in some fraction of the n'J of the 1,000 genomes participants in population p who are carrying the minor variant at m, we define H'pm:
Finally, these entropies are summed across all 1,000 genomes populations, and adjusted in order to account for the linear contribution of the log of the minor allele frequency (note that this factor also replaces the frequency weight which would apply for the true entropy decomposition) to give the Adjusted Haplotype Conditional Entropy (H|H); Np is the number of chromosomes in population p, NMp is the number of major allele carrying chromosomes in population p, and mp is the number of minor allele carrying chromosomes in population p.
αp, βMp, and βmp are chosen empirically so that the average value of this sum, in each 1,000 genomes population, would be 0; see Table 6 below. The intent is that the adjusted conditional entropy would be more independent of both the minor allele frequency and the number of populations in which the SNP is found, both of which are better expressed using the minor allele frequency and differential allele frequency measures.
Chicago eQTL browser data sets
These data sets were used only in the hypothesis-generation phase.
Via the blood eQTL browser maintained by the authors (http://genenetwork.nl/bloodeqtlbrowser/) of a study primarily focused on potential disease consequences for trans-eQTLs, we obtained cis-eQTLs from a very large (5,311 participants in their discovery phase, with an additional 2,775 participants in their replication phase) study of expression genetics in whole human blood. These cis-eQTLs are used only in our hypothesis-testing phase.
1,000 Genomes Selection Browser results
A variety of diversity and positive selection genetics measures are made available via the 1,000 Genomes Selection Browser. The entire data set was downloaded on January 10, 2014.
Of these, ΔDAF, |ΔDAF|, ΔiHH, Fst , and iHS, along with minor allele frequency (MAF) were used in at least one model in the hypothesis-generation phase. In the final model and in the hypothesis-testing phase, only ΔDAF and MAF were used. Positive selection measures that report results in windows (including the Composite of Multiple Signals score, distributed elsewhere) were not used because preliminary results indicated that the gene-based Conditional logistic regression would not converge.
Covariates incorporated using Annovar and the UCSC Table Browser
The results from the previous sections were matched to genes and within-gene location using Annovar which in-turn utilizes 2013 ENCODE data. Once mapped to gene symbols, gene ontology IDs are mapped using the UCSC table browser with tables downloaded on or after September 25, 2014.
To account for differential eQTL frequencies in different locations within genes, the within-gene locations included in the Annovar distribution were converted into four distinct location categories as follows:
exonic, ncRNA_exonic exonic
UTR3, UTR5, ncRNA_UTR3, ncRNA_UTR5 UTR
intronic, splicing, ncRNA_intronic, ncRNA_splicing intronic
upstream, downstream flanking
For a SNP in multiple categories, the category higher in the above list is chosen only (e.g. a SNP that is both in an exon and upstream of the promoter for a non-coding gene is classified as exonic only). Upstream and downstream are defined as within 1kB of the most distal promoter and transcriptional terminators defined in ENCODE; this 1kB distance is the default for Annovar.
In the hypothesis generation phase, each gene was assigned to a single GO ID in the biological function category; when a gene is assigned to multiple categories (as is generally the case), it was instead assigned only to the category that was closest to containing exactly 125 genes (with ties broken for smaller categories, so a GO ID with 100 genes would be preferred to a GO ID with 150). Once this was resolved, all GO ids with 50 or fewer assigned genes, along with all intergenic regions, were assigned to a large "OTHER" category.
In the hypothesis testing phase, this mapping to a single GO id is not used. Instead, genes were assigned either to one of the Group 1 GO IDs in a group for which our haplotype score had a positive association with eQTL status in the validation/model selection phase (see Conditional logistic regression below,) to a Group 2 GO IDs for which our haplotype score had a negative association with eQTL status in the validation/model selection phase, to GO categories in both groups, or to neither group; eQTLs in intergenic regions were assigned to the group(s) corresponding to the closest flanking genes.
Conditional logistic regression
Conditional logistic regression, as implemented in the survival package for the statistical programming language R, is used for all statistical model fitting. The notation used in the results closely matches the R syntax, with gene assignments from Annovar (see Covariates incorporated using Annovar and the UCSC Table Browser) used as the stratification variable. Each intergenic region is treated as an individual and distinct gene for stratification purposes.
Meta-analysis of conditional logistic regression results
In the Conditional logistic regression above, gene by gene variability plays a significant role. Therefore, in order to identify GO categories in which the Adjusted Haplotype Conditional Entropy reproducibly reported different extremes, the 1-to-1 mapping of genes to GO IDs was permuted 100 times (with genes in the OTHER category, including intergenic regions and genes mapped to GO IDs with fewer than 50 members after 1-to-1 mapping, not permuted). Within each of the 100 permutations, the statistical model (Model 6, see Results) was fit to the permuted data.
In both the original data and in each permutation, the following score was summed across the seven data sets:
Where β is the effect size, and σ is the standard error in the effect size. The 100 permutations of 94 such GO IDs produced 9,400 such sums - these were used to produce a 90% confidence interval, and GO IDs in the original data for which the sum was outside of this confidence interval, were continued onto the hypothesis-testing phase.
Supplementary enrichment tests
As a supplement to the analysis reported in the main text, an additional round of analysis, testing for enrichment of extreme changes in likelihood when models incorporate additional positive selection predictions according to GO IDs, is reported in additional file 1.
Brem RB, Yvert G, Clinton R, Kruglyak L: Genetic Dissection of Transcriptional Regulation in Budding Yeast. Science. 2002, 296 (5568): 752-755. 10.1126/science.1069516.
Bryois J, Buil A, Evans DM, Kemp JP, Montgomery SB, Conrad DF, et al: Cis and Trans Effects of Human Genomic Variants on Gene Expression. PLoS Genet. 2014, 10 (7): e1004461-10.1371/journal.pgen.1004461.
Felsenstein J: Phylogenies and the Comparative Method. Am Nat. 1985, 125 (1): 1-15. 10.1086/284325.
Fagny M, Patin E, Enard D, Barreiro LB, Quintana-Murci L, Laval G: Exploring the Occurrence of Classic Selective Sweeps in Humans Using Whole-Genome Sequencing Data Sets. Mol Biol Evol. 2014, 31 (7): 1850-1868. 10.1093/molbev/msu118.
O'Bleness M, Searles VB, Varki A, Gagneux P, Sikela JM: Evolution of genetic and genomic features unique to the human lineage. Nat Rev Genet. 2012, 13 (12): 853-866. 10.1038/nrg3336.
Pollard KS, Hubisz MJ, Rosenbloom KR, Siepel A: Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res. 2010, 20 (1): 110-121. 10.1101/gr.097857.109.
Vissers LELM, de Ligt J, Gilissen C, Janssen I, Steehouwer M, de Vries P, et al: A de novo paradigm for mental retardation. Nat Genet. 2010, 42 (12): 1109-1112. 10.1038/ng.712.
Ward LD, Kellis M: Evidence of abundant purifying selection in humans for recently acquired regulatory functions. Science. 2012, 337 (6102): 1675-1678. 10.1126/science.1225057.
1000 Genomes Project Consortium, Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, et al: An integrated map of genetic variation from 1,092 human genomes. Nature. 2012, 491 (7422): 56-65. 10.1038/nature11632.
Zhang X, Gierman HJ, Levy D, Plump A, Dobrin R, Goring HH, et al: Synthesis of 53 tissue and cell line expression QTL datasets reveals master eQTLs. BMC Genomics. 2014, 15: 532-10.1186/1471-2164-15-532.
Westra H-J, Peters MJ, Esko T, Yaghootkar H, Schurmann C, Kettunen J, et al: Systematic identification of trans eQTLs as putative drivers of known disease associations. Nat Genet. 2013, 45 (10): 1238-1243. 10.1038/ng.2756.
Sadee W, Hartmann K, Seweryn M, Pietrzak M, Handelman SK, Rempala GA: Missing heritability of common diseases and treatments outside the protein-coding exome. Hum Genet. 2014, 133 (10): 1199-1215. 10.1007/s00439-014-1476-7.
Bromberg Y, Capriotti E: SNP-SIG 2013: from coding to non-coding-new approaches for genomic variant interpretation. BMC Genomics. 2014, 15 (Suppl 4): S1-10.1186/1471-2164-15-S4-S1.
Bratko A, Filipič B, Cormack GV, Lynam TR, Zupan B: Spam Filtering Using Statistical Data Compression Models. J Mach Learn Res. 2006, 7: 2673-2698.
Kaitchenko A: Algorithms for estimating information distance with application to bioinformatics and linguistics. Canadian Conference on Electrical and Computer Engineering, 2004. 2004, 4: 2255-2258.
Nalbantoglu ÖU, Russell DJ, Sayood K: Data Compression Concepts and Algorithms and Their Applications to Bioinformatics. Entropy. 2009, 12 (1): 34-52. 10.3390/e12010034.
Voight BF, Kudaravalli S, Wen X, Pritchard JK: A Map of Recent Positive Selection in the Human Genome. PLoS Biol. 2006, 4 (3): e72-10.1371/journal.pbio.0040072.
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al: Gene Ontology: tool for the unification of biology. Nat Genet. 2000, 25 (1): 25-29. 10.1038/75556.
Mi H, Muruganujan A, Casagrande JT, Thomas PD: Large-scale gene function analysis with the PANTHER classification system. Nat Protoc. 2013, 8 (8): 1551-1566. 10.1038/nprot.2013.092.
Mangravite LM, Engelhardt BE, Medina MW, Smith JD, Brown CD, Chasman DI, et al: A statin-dependent QTL for GATM expression is associated with statin-induced myopathy. Nature. 2013, 502 (7471): 377-380. 10.1038/nature12508.
Montgomery SB, Sammeth M, Gutierrez-Arcelus M, Lach RP, Ingle C, Nisbett J, et al: Transcriptome genetics using second generation sequencing in a Caucasian population. Nature. 2010, 464 (7289): 773-777. 10.1038/nature08903.
Schadt EE, Molony C, Chudin E, Hao K, Yang X, Lum PY, et al: Mapping the Genetic Architecture of Gene Expression in Human Liver. PLoS Biol. 2008, 6 (5): e107-10.1371/journal.pbio.0060107.
Stranger BE, Nica AC, Forrest MS, Dimas A, Bird CP, Beazley C, et al: Population genomics of human gene expression. Nat Genet. 2007, 39 (10): 1217-1224. 10.1038/ng2142.
Veyrieras J-B, Kudaravalli S, Kim SY, Dermitzakis ET, Gilad Y, Stephens M, Pritchard JK: High-Resolution Mapping of Expression-QTLs Yields Insight into Human Gene Regulation. PLoS Genet. 2008, 4 (10): e1000214-10.1371/journal.pgen.1000214.
Zeller T, Wild P, Szymczak S, Rotival M, Schillert A, Castagne R, et al: Genetics and Beyond - The Transcriptome of Human Monocytes and Disease Susceptibility. PLoS One. 2010, 5 (5): e10693-10.1371/journal.pone.0010693.
Pybus M, Dall'Olio GM, Luisi P, Uzkudun M, Carreño-Torres A, Pavlidis P, et al: 1000 Genomes Selection Browser 1.0: a genome browser dedicated to signatures of natural selection in modern humans. Nucleic Acids Res. 2013, 42 (Database issue): D909-D909.
Lonsdale J, Thomas J, Salvatore M, Phillips R, Lo E, Shad S, et al: The Genotype-Tissue Expression (GTEx) project. Nat Genet. 2013, 45 (6): 580-585. 10.1038/ng.2653.
Rosenbloom KR, Sloan CA, Malladi VS, Dreszer TR, Learned K, Kirkup VM, et al: ENCODE Data in the UCSC Genome Browser: year 5 update. Nucleic Acids Res. 2013, 41 (Database issue): D56-D63.
Alexa A, Rahnenfuhrer J: topGO: Enrichment Analysis for Gene Ontology. R Package Version 2.18. 0. 2010
Hofer T, Ray N, Wegmann D, Excoffier L: Large allele frequency differences between human continental groups are more likely to have occurred by drift during range expansions than by selection. Ann Hum Genet. 2009, 73 (1): 95-108. 10.1111/j.1469-1809.2008.00489.x.
Weir BS, Hill WG: Estimating F-statistics. Annu Rev Genet. 2002, 36: 721-750. 10.1146/annurev.genet.36.050802.093940.
Jennrich RI, Ralston ML: Fitting Nonlinear Models to Data. Annu Rev Biophys Bioeng. 1979, 8: 195-238. 10.1146/annurev.bb.08.060179.001211.
Simonson TS, Yang Y, Huff CD, Yun H, Qin G, Witherspoon DJ, et al: Genetic Evidence for High-Altitude Adaptation in Tibet. Science. 2010, 329 (5987): 72-75. 10.1126/science.1189406.
Grossman SR, Shylakhter I, Karlsson EK, Byrne EH, Morales S, Frieden G, et al: A Composite of Multiple Signals Distinguishes Causal Variants in Regions of Positive Selection. Science. 2010, 327 (5967): 883-886. 10.1126/science.1183863.
Azad AK, Sadee W, Schlesinger LS: Innate immune gene polymorphisms in tuberculosis. Infect Immun. 2012, 80 (10): 3343-3359. 10.1128/IAI.00443-12.
Van der Hoorn RAL, De Wit PJGM, Joosten MHAJ: Balancing selection favors guarding resistance proteins. Trends Plant Sci. 2002, 7 (2): 67-71. 10.1016/S1360-1385(01)02188-4.
Boube M, Joulia L, Cribbs DL, Bourbon H-M: Evidence for a Mediator of RNA Polymerase II Transcriptional Regulation Conserved from Yeast to Man. Cell. 2002, 110 (2): 143-151. 10.1016/S0092-8674(02)00830-9.
Cohen P: Protein kinases -- the major drug targets of the twenty-first century?. Nat Rev Drug Discov. 2002, 1 (4): 309-315. 10.1038/nrd773.
Neubig RR, Siderovski DP: Regulators of G-Protein signalling as new central nervous system drug targets. Nat Rev Drug Discov. 2002, 1 (3): 187-197. 10.1038/nrd747.
O'Keefe JH, Cordain L: Cardiovascular Disease Resulting From a Diet and Lifestyle at Odds With Our Paleolithic Genome: How to Become a 21st-Century Hunter-Gatherer. Mayo Clin Proc. 2004, 79 (1): 101-108. 10.4065/79.1.101.
Hu FB: Globalization of Diabetes The role of diet, lifestyle, and genes. Diabetes Care. 2011, 34 (6): 1249-1257. 10.2337/dc11-0442.
Giacomini KM, Brett CM, Altman RB, Benowitz NL, Dolan ME, Flockhart DA, et al: The Pharmacogenetics Research Network: From SNP Discovery to Clinical Drug Response. Clin Pharmacol Ther. 2007, 81 (3): 328-345. 10.1038/sj.clpt.6100087.
Karolchik D, Barber GP, Casper J, Clawson H, Cline MS, Diekhans M, et al: The UCSC Genome Browser database: 2014 update. Nucleic Acids Res. 2014, 42 (Database issue): D764-D770.
Ong RT-H, Wang X, Liu X, Teo Y-Y: Efficiency of trans-ethnic genome-wide meta-analysis and fine-mapping. Eur J Hum Genet. 2012, 20 (12): 1300-1307. 10.1038/ejhg.2012.88.
Lewontin RC: The Interaction of Selection and Linkage. I. General Considerations; Heterotic Models. Genetics. 1964, 49 (1): 49-67.
Consortium T 1000 GP: A map of human genome variation from population-scale sequencing. Nature. 2010, 467 (7319): 1061-1073. 10.1038/nature09534.
Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K: dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 2001, 29 (1): 308-311. 10.1093/nar/29.1.308.
Khinchin AI: Mathematical Foundations of Information Theory. 1957, Courier Dover Publications, 434 ():
Kolmogorov A: On the Shannon theory of information transmission in the case of continuous signals. Inf Theory IRE Trans On. 1956, 2 (4): 102-108.
Wang K, Li M, Hakonarson H: ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 2010, 38 (16): e164-e164. 10.1093/nar/gkq603.
Karolchik D, Hinrichs AS, Furey TS, Roskin KM, Sugnet CW, Haussler D, Kent WJ: The UCSC Table Browser data retrieval tool. Nucleic Acids Res. 2004, 32 (Database issue): D493-D496.
Therneau TM, T L: (original S->R port and maintainer until: 2009). Survival: Survival Analysis. 2014,
The authors acknowledge that this material is based upon work supported by or in part by NIGMS grant U01 092655 to the XGEN consortium, and by NSF agreement no. 0931642 to the Mathematical Biosciences Institute. Publication of this work was supported in part by an allocation of computing time from the Ohio Supercomputer Center. None of these funding agencies had any part in the research presented here, or in the preparation thereof.
We would like to thank Drs. Bromberg and Capriotti for their work in organizing the ISMB special interest group session.
This article has been published as part of BMC Genomics Volume 16 Supplement 8, 2015: VarI-SIG 2014: Identification and annotation of genetic variants in the context of structure, function and disease. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/16/S8.
The authors declare that they have no conflicts of interest in relation to this SNP-SIG article.
The original concept was developed by SKH, MS, RMS, KH, DW and WS. Experiments were designed and carried out by SKH, MS, KH, MP, ADJ and AK. All authors contributed towards the writing of the manuscript; all authors have read and approved the manuscript.
SKH§: co-designed the new method, performed the analysis, interpreted the results, and co-wrote the manuscript.
MS: co-designed the new method, assisted with analysis and interpretation, and assisted with the manuscript.
RMS: co-designed the new method and assisted with the manuscript. KH: assisted with the analysis and interpretation, and assisted with the manuscript.
DW: assisted with the method design and the interpretation. MP: assisted with the analysis and interpretation.
ADJ: contributed to the analysis, aided the interpretation, and assisted with the manuscript.
AK: contributed to the analysis, aided the interpretation, and assisted with the manuscript.
WS: co-designed the new method and co-wrote the manuscript.
Electronic supplementary material
About this article
Cite this article
Handelman, S.K., Seweryn, M., Smith, R.M. et al. Conditional entropy in variation-adjusted windows detects selection signatures associated with expression quantitative trait loci (eQTLs). BMC Genomics 16, S8 (2015). https://doi.org/10.1186/1471-2164-16-S8-S8
- Positive Selection
- Selective Sweep
- Conditional Entropy
- Conditional Logistic Regression