TSABL: Trait Specific Annotation Based Locus predictor

Background The majority of Genome Wide Associate Study (GWAS) loci fall in the non-coding genome, making causal variants difficult to identify and study. We hypothesized that the regulatory features underlying causal variants are biologically specific, identifiable from data, and that the regulatory architecture that influences one trait is distinct compared to biologically unrelated traits. Results To better characterize and identify these variants, we used publicly available GWAS loci and genomic annotations to build 17 Trait Specific Annotation Based Locus (TSABL) predictors to identify differences between GWAS loci associated with different phenotypic trait groups. We used a penalized binomial logistic regression model to select trait relevant annotations and tested all models on a holdout set of loci not used for training in any trait. We were able to successfully build models for autoimmune, electrocardiogram, lipid, platelet, red blood cell, and white blood cell trait groups. We used these models both to prioritize variants in existing loci and to identify new genomic regions of interest. Conclusions We found that TSABL models identified biologically relevant regulatory features, and anticipate their future use to enhance the design and interpretation of genetic studies. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-022-08654-x.


Background
Genome-wide association studies (GWAS) are a widely used method to identify genomic regions associated in a population with a phenotype of interest, and have successfully detected thousands of loci across the human genome, each of which contains a set of linked variants that are potentially causal [1][2][3][4][5][6][7][8][9][10][11][12][13][14][15][16]. Unfortunately, identifying the specific causal variant(s) and underlying mechanism at associated loci is far more challenging, and for many genomic regions there exist more potential causal variants in credible sets than can be experimentally tested [17][18][19]. While multi-ancestry fine-mapping can help to narrow the set of putative variants [20], this approach is limited to loci which share genetic architecture and genetic variation across populations, and still may not resolve causal variant(s) within a locus. Therefore, an orthogonal method for prioritizing variants in GWAS loci would be beneficial. Additionally, as more well-powered GWAS reveal more loci, a straightforward method to prioritize the likelihood of variants contributing to a specific disease will aid in locus discovery.
While pleiotropy is pervasive across the human genome [21], the contribution of individual variants depends on the biological relatedness of the traits under consideration. For example, the loci, causal genes, and variants associated with circulating plasma lipid levels might reasonably be expected to differ from those that modify electrocardiogram (ECG) traits, as the biological basis for these traits involve different tissues (liver vs. heart), different physiologies (lipid metabolism and trafficking vs. cardiac electric conductance), and different genes (e.g., SORT1 [22] vs. NOS1AP [23]). Presumably, then, with sufficient association data and a diverse and dense collection of functional annotations across the genome, it might be possible to train a model to predict loci -and perhaps even variants -that relate to specific sets of traits relative to all others. Ideally, such a predictor would learn and select biologically relevant annotations from a set of all possible genomics features, thereby also providing informative hypotheses for why particular loci or variants are predicted to be "trait-relevant". Current methods to predict if a variant is likely functional for any trait (such as CADD [24], GWAVA [25], and Deep-SEA [26]) provide a score based on likelihood of function across multiple (unrelated) phenotypes and diseases.
Previously, we built platelet and red blood cell models using those GWAS as positives and random matched genomic regions as controls [27]. While this was successful, we found the models to not be trait selective. Here, we instead sought to develop a trait specific annotation based locus (TSABL) predictor which would discriminate loci associated with a specific trait (or collection of related traits) from all other complex trait associated loci identified by GWAS. We collected association results from 42 GWAS-scanned phenotypes [1][2][3][4][5][6][7][8][9][10][11][12][13][14][15][16], coarsely grouping them into 17 related trait groups to test and validate prediction models (Supplementary Table 1). We then applied a penalized binomial logistic regression model (LASSO) to each trait group using publicly available genome wide features such as DNase hypersensitivity and histone or transcription factor ChIP-seq (Supplementary Table 2), along with gene expression, phastCon score, and size of locus. The six successful TSABL models were able to separate regions associated with their trait group from other GWAS loci and identified sensible biological features for the phenotypes in focus. We then applied these models to fine mapping and locus discovery paradigms, both of which may lead to translationally relevant trait-specific biological mechanisms.

Results
To develop our approach, we needed to define the set of traits, identify positive and negative loci for those traits, and obtain a collection of genomic features to discriminate between these labels. For traits to study, we focused on 42 phenotypes obtained from curated GWAS and the NHGRI-EBI GWAS catalog that we collapsed into 17 related trait groupings (Methods, Supplementary Table 1). Using all GWAS lead variants identified, we first clumped variants into loci by linkage disequilibrium (using European LD R 2 > 0.7). Next, for each trait group we defined positive loci as those which contained at least one variant with an associated P-value < 5 × 10 -8 with at least one of the phenotypes in the trait group. Negative loci for each trait group were defined as those without any variant even modestly associated (P-value > 0.05) with one or more of the phenotypes in the trait group. We utilized genome-wide annotations from multiple public resources, including nearest gene tissue-specific expression and epigenomic features, tagging these features across positive and negative examples if the features overlapped with any variant in the locus.
We then generated predictive models for each of the 17 trait groups using a penalized binomial logistic regression model (LASSO) [28], using cross-validation for training and estimating final prediction accuracy on a holdout set not used in training any model (Methods, Supplementary Fig. 1). We defined a model as informative if the area under the curve (AUC) was ≥ 0.7 and the difference between training and holdout AUCs was ≤ 0.02 (implying only a small level of overfitting). Based on these criteria, models trained for autoimmune, electrocardiogram (ECG), lipid, platelet, red blood cell (RBC), and white blood cell (WBC) traits passed, and were the focus of further investigation (Fig. 1a, Table 1).
To assess why these 6 traits produced successful models while the other 11 failed, we considered whether trait models with more loci with coding variants, more positive loci or more features were more predictive (Supplementary Fig. 2). We found a significant relationship between percentage of loci containing coding variants and holdout set AUC, suggesting that even though our models are training on annotations we consider to be non-coding in nature, traits with more loci containing coding variants are easier for our models to predict. We did not find an obvious relationship between either number of loci used to train models or number of annotations selected and holdout set AUC, suggesting that our inability to predict these traits is not based on limitations in availability of positive/negative examples or the total number of available features. The implication may be that the catalog of existing features available for modelling insufficiently captures differences between these other 11 traits and the spectrum of complex traits more generally.
We considered two metrics of model specificity. First, trait specific models should be more informative for prediction relative to functional variant prediction methods which were designed to be trait agnostic. For the trait groups we assembled, the holdout set AUCs for CADD, GWAVA, and DeepSEA were all close to 0.5, indicating that none were informative for discriminating amongst trait-associated GWAS loci (Fig. 1B, Table 1). Second, trait specific predictors should perform the best on the phenotype collections from which they were trained. Looking at comparative model performance, we observed that all six selected models were better at predicting their own holdout set compared to others (Fig. 2, left panels and Supplementary Fig. 3). These results indicate that these six prediction models can discriminate between traits using GWAS loci alone, and that the power to make predictions is not due to intrinsic functionality of GWAS loci in general that would be captured by trait agnostic methods.
Given the pervasive nature of pleiotropy, we would expect that predictors for traits that are related to one another -but perhaps not so closely as to be grouped together -should be partially cross-predictive. We observed this to be true for the RBC model, which could be used to predict the platelet holdout set with an AUC of 0.79 ( Fig. 2A, left panel). To rule out that this similarity was driven by loci that are positive for both traits, we removed all loci from the holdout set that were positive in both the "in-focus" trait (here, platelets) and any other model, then reassessed the holdout AUC for each model. Even after removing overlapping loci, the RBC model still predicted the platelet holdout set well, indicating the prediction accuracy is not driven by overlapping loci (RBC AUC of 0.76 on platelets holdout set; Fig. 2A, right panel). In contrast, the ECG model was the only one that predicted the ECG holdout set reasonably, while the autoimmune, lipids, platelets, RBC and WBC models all had AUC < 0.55 on the ECG holdout (Fig. 2B).
Looking at the annotations selected as important for our six successful models, we found them to be largely    Table 3). All three blood traits (platelets, RBC, and WBC) selected mainly whole blood annotations that positively rank active regions (active histone marks, gene expression, and transcription factor (TF) binding) and negatively rank repressive histone marks (H3K9me3) (Fig. 3A, Supplementary Fig. 4B and C). By contrast, the ECG model selected only heart tissue annotations (Fig. 3B), and the lipids model chose predominantly liver tissue features (Fig. 3C). We also identified features that distinguish these traits from each other. For example, only the models for the immune traits autoimmune and WBC selected spleen gene expression. We next used the framework of stratified LDScore regression [29] to evaluate if the scores emitted by our models could be used as an annotation to explain partitioned heritability. With our six added annotations derived from our six selected models, we tested 103 annotations in total and so considered enrichments with P-value < 4.8 × 10 -4 significant by Bonferroni correction. We found enrichment in GWAS for all successful models with summary statistics available (autoimmune, lipids, platelets, RBC, WBC) (Supplementary Table 4). To test whether this enrichment was driven by the lead GWAS SNPs that the models were based on, we also removed these lead GWAS SNPs and their LD proxies (EUR R 2 LD > 0.7) from the dataset and reassessed enrichment. We found that the lipids, platelets, and WBC models were significant in all individual phenotype association data, and the RBC model was significant for 3 of 4 phenotype GWAS (Supplementary Table 5). Perplexingly, the autoimmune model was not significant for any phenotype GWAS after lead SNPs were removed, but the related trait models for WBC and platelets were significant for all 3 autoimmune phenotype GWAS. We expect that the autoimmune trait group failed this test due to a lack of power, as it had the fewest positive loci of the tested successful models. The enrichment we see in LDScore partitioned heritability demonstrates that our trait-and cell type-specific modeling approach can define target loci for related disease phenotypes, facilitating downstream validation and biological/mechanistic understanding.

Application-locus discovery
We next applied our six predictive models to all SNPs identified in phase III of the 1000 Genomes project.  Table 6). To find genes associated with these SNPs, we first identified the 436 to 2,137 genes overlapping these variants, of which 153 to 879 were outside of identified GWAS regions (defined as 1 MB window surrounding lead SNPs; Supplementary Table 6). For example, the lipids model identified 18,847 SNPs, with 458 overlapping genes, of which 167 were outside known GWAS regions. While we found many SNPs near established GWAS variants, we were interested here in the regions not previously identified by GWAS analyses, and our models highlight several points of potentially interesting biology. For the lipids model, we identified a cluster of 10 SNPs overlapping DHCR24, a known cholesterol biosynthesis gene [30], and a cluster of 63 SNPs overlapping PLIN2, a gene involved in lipid globule storage [31], both of which are reasonable biological candidates for lipid related phenotypes. For the platelets model, we identified  [32], and 7 SNPs overlapping PXN, which is involved in immunity response [33], both of which may plausibly impact platelet phenotypes. Additionally, we found that FDR < 0.01 variants of the platelets, RBC, and WBC models were enriched for being within 25 kb of a newly published set of blood traits GWAS [34] (Supplementary Fig. 5).
To assess whether the trait associated SNPs mapped onto established biological knowledge, we used GREAT [35] analysis to examine both the human gene ontology (GO) biological process and mouse single knockout (KO) phenotypes associated with these SNPs. We used both the full set of trait associated SNPs and also removed all variants within 500 KB of identified GWAS regions (non-GWAS variants) (Supplementary Table 7). For the full set of trait associated SNPs, all models returned mouse KO phenotypes consistent with our expectations. For example, mouse KOs for genes overlapping our autoimmune associated SNPs are enriched for phenotypes such as "abnormal T cell morphology" and "decreased lymphocyte cell number. " GO biological processes were similarly consistent with expectations for autoimmune, ECG, lipids, and WBC trait models, while the platelets and RBC traits returned mostly generic results (for example, "RNA processing"). The non-GWAS variants analyses were similar, except that the lipids and RBC models did not return any results for either the mouse KO or GO biological processes analyses.

Application -fine mapping
Functional cellular follow-up studies require one to prioritize GWAS loci, genes and variants within them that are likely causal. To assess the usefulness of our model scores for individual SNPs within GWAS loci, we identified SNPs in credible sets that exceed FDR thresholds of 0.10, 0.25, or 0.50 of the relevant model for phenotypes where appropriate summary statistics were available (autoimmune, platelets, RBC and WBC, Table 2, Supplementary Tables 8 and 9). At FDR 0.10, our models prioritize a subset of SNPs of interest in 27% of multi-SNP credible sets; at FDR 0.25, 52% of credible sets see prioritization. For many of these loci the number of SNPs highly prioritized is substantially less than the size of the credible set. For example, lead rs10486483 from Crohn's Disease had a credible set of 85 variants, prioritizing using an FDR 0.10 threshold leads to a list of 3 variants to test; or lead rs2382817 in Inflammatory Bowel Disorder, where 5 variants were prioritized from a credible set of 68. While a set threshold is useful for summarizing, it is not necessary in practice, and variants in a credible set can simply be ordered by their TSABL score for assessment, facilitating immediate integration into functional follow up study pipelines.

Discussion
This approach is limited by both the GWAS data and genome wide features available. Surprisingly, we were unable to adequately model some complex traits (Table 1). However, we expect that additional genomic features will improve some models; for example, in the dataset used for this analysis there are 642 annotations from blood tissue but only 31 from bone (Supplementary Table 2) and it seems likely that part of the reason the height model was not well specified is that it lacked key genomic annotations relevant to bone growth. Additionally, the rapidly expanding repertoire of single cell data sets may better parse complex tissue types (e.g., pancreas) and capture data relevant to rare cell types that are underrepresented in current datasets. Future studies will be designed to incorporate these data types, which are only currently emerging.
We selected a modelling strategy to allow for improved biological interpretability, but this presumably comes at a cost of accuracy. We intentionally selected a variety of traits with very different underlying physiology, and we found that the traits well described by our models are linked largely to single cell types, namely blood, liver, or heart. We do not successfully model traits such as Body Mass Index (BMI) or Type 2 Diabetes (T2D), both of which involve many tissues and have disease subtypes that are not well captured in current GWAS analyses [36][37][38]. We expect that analyzing these phenotypes as a single cohort may serve to obfuscate rather than clarify, and anticipate that incorporating more granularity in phenotypes (type of T2D or severity of BMI, for example) will improve model performance. Pleiotropy presents another challenge to our modeling approach, but is inherently captured in any genomic data set. It is plausible that more sophisticated approaches with this in mind might be anticipated to improve performance (e.g., multi-label classification). Moreover, approaches that captured uncertainty around labelled examples (e.g., "noisy" labels) might also be helpful to address these challenges as well as expand the use of both positive and negative examples. Additional processing of genome wide features, both to account for correlation and to reduce the feature space prior to model building (e.g., stability selection) may also improve results.
An important limitation to acknowledge concerning our locus discovery and fine mapping applications is that both are applied to variant level data, while our models were trained on locus level data that incorporates feature overlaps from multiple variants within a locus. While ideally we would train a model on individual variants before applying it to individual variants, the reality of GWAS data means that the pool of validated causal variants is still inadequate to this task. Identifying causal variants from other sources introduces additional sources of bias which we judged to be larger than that introduced by shifting from locus to variant.
A logical point of comparison for TSABL is to another strategy used for fine mapping or locus discovery. The most similar is the fGWAS method, which is designed to jointly consider GWAS and genomic annotations [39]. The main distinction between these models is that fGWAS considers trait loci compared to all regions of the genome (regardless of their association with a complex trait) while TSABL specifically uses established GWAS loci as the negative set. In practice, what this means is that TSABL models found differences between traits, rather than annotations that may be held in common between all GWAS loci. While both approaches are useful, identifying trait specific annotations will allow for the translation of GWAS loci into functional biological hypotheses, facilitating disease treatment and variantbased drug discovery.

Conclusions
With current pitfalls in fine mapping and locus discovery, our approach can distinguish amongst complex traitrelated groups applied to 17 sets of traits. Our modeling process, built using publicly available GWAS and annotation data, selects and outputs features and variants relevant to underlying disease biology. These models predict differences between traits that currently available SNP ranking schemes do not, and the overlaps we see among models are consistent with known biological links between the traits assessed. Our modelling strategy makes no assumptions about number of causal variants in a GWAS locus, is expandable to include new genome wide features as they are developed, and can be applied to locus discovery problems or to prioritization after fine-mapping for functional validation efforts.

General data sources and analysis
All chromosome:position coordinates given in this paper refer to genome build hg19. We downloaded refseq gene exon coordinates for build hg19 from ENSEMBL [40] at http:// grch37. ensem bl. org/ bioma rt/ on 10/24/17, and used the extreme exon boundaries as gene boundaries. We assessed annotations for rsIDs of interest using ANNOVAR version 2016Feb01 [41] and regulomeDB [42].

GWAS trait grouping
Full listing of GWAS phenotypes included in each of the 17 trait groups are found in Supplementary Table 1. We grouped GWAS together based on known biological links. The traits that include multiple GWAS analysis are autoimmune, blood pressure, electrocardiogram (ECG), lipids, platelets, red blood cell (RBC), type 2 diabetes (T2D), and white blood cell (WBC), For autoimmune, we used the five most correlated traits (psoriasis, Crohn's disease, ulcerative colitis, Behcet's disease and ankylosing spondylitis) in an analysis of autoimmune disease correlation [43] plus inflammatory bowel disorder due to its similarity to ulcerative colitis and Crohn's disease [10]. For T2D, we also included fasting glucose and fasting insulin results from the NHGRI-EBI catalog, under the reasoning that either or both are used in the clinical diagnosis of T2D. For the remaining trait groups, all GWAS phenotypes included were assessed in the same paper and are highly correlated.
For all variants, we identified hg19 positions using Ensembl biomart [40], and removed variants if they could not be assigned a position. We identified loci of interest by collecting all variants with R 2 ≥ 0.7 to a GWAS variant in European 1000 genomes phase III [44] data using PLINK 1.90Beta4.5 [45]. Variants were considered part of a locus if they had R 2 ≥ 0.7 with any other variant in the locus. Final variant list includes 715,404 variants assigned to 32,713 loci and can be found in Supplementary Table 10.

Feature identification and processing
Gappedpeak files for histone ChIP-seq experiments and the "hotspot.fdr0.01.broad" files for DNase assays were downloaded on 02/10/16 from the consolidated epigenomes section of the Roadmap Epigenomics Project [46] portal. Uniform DNase files were downloaded 03/28/16 from http:// hgdow nload. cse. ucsc. edu/ golde nPath/ hg19/ encod eDCC/ wgEnc odeAw gDnas eUnif orm/. Homo sapiens FAIRE, transcription profiling by array, and ChIP-seq data were downloaded 01/06/16 from the ENCODE [47] project portal. We included data as provided for experiment accession numbers with a single file, and used the intersection if multiple files were provided. For transcription profiling by array data, we selected data labeled as "filtered transcribed fragments" if available. For ChIP-seq files, if the experiment accession had a file labeled "optimal idr threshold" or "replicated peaks" we used that file. For histone data, if both broadpeak and narrowpeak files were available, we created gappedpeak files. If only one peak type was available, we included that file. For all ChIP-seq targets that were not histones, narrowpeak files were used if available and broadpeak files otherwise.
Feature overlaps with GWAS loci were identified using bedtools2v2.25.0 intersect [48]. For a given locus, a feature was coded as 1 if any variant in the locus overlapped the feature, and a 0 if not. For indel variants, overlap was counted only with the start position of the variant.
To build the features for tissue-specific gene expression, we utilized previously calculated t-statistics for specific expression [49]. The nearest gene to each locus was determined as the gene containing a corresponding tissue-specific expression t-statistic with a start position nearest to the locus center. The locus was assigned a 1 in a tissue if the nearest gene had a t-statistic ranking in the top ten percent for that tissue and a 0 otherwise, creating 53 tissue-specific nearest gene expression features.
Finally, we counted the number of variants in a locus and included normalized variant number as a possible feature.

Benchmarking scores
We downloaded CADD [24] scores for 1000 genomes phase 3 variants from http:// cadd. gs. washi ngton. edu/ downl oad on 02/25/16. We downloaded GWAVA [25] scores on 02/25/2016 from ftp:// ftp. sanger. ac. uk/ pub/ resou rces/ softw are/ gwava/ v1.0/ annot ated/ GWAVA_ db_ csv. tgz and used tss_score for all analyses. For both CADD and GWAVA, a higher score indicates a greater probability of functionality, so if the downloaded database had multiple scores for a position, we used the highest score provided. If they did not have a score for a position, it was set to 0. We use the highest variant score as the locus score for both CADD and GWAVA.
We used the standalone deepSEA-0.94 [26] program to get deepSEA scores for our variants. Unlike the previous scores, a smaller e value corresponds to a greater probability of functionality, so we use the lowest variant deepSEA score as the score for the locus. For comparison purposes, we subtract this value from 1 so that it is on the same scale as the other metrics.

Building models
In order to accurately compare models across traits on independent holdout sets, we wanted to ensure that our holdout sets would not contain loci used for training, and vice versa. Therefore, we used the following schema to divide loci ( Supplementary Fig. 1). We considered a locus as being positive for a trait if at least one variant in the locus was associated with the trait and negative otherwise. We separated all loci into 2/3 training and 1/3 holdout. This separation was done semi-randomly, with the following conditions enforced: 1) for each trait, maintained a 2/3 train, 1/3 holdout ratio for positive loci, and 2) maintained native distribution of variant count in loci (for example if there were 60 loci containing 10 variants each, 40 were assigned to training and 20 to holdout). Furthermore, we removed any locus with > 1000 variants (1 locus removed). In this way we created a separate set of holdout loci that were not used to build any of the trait models and so could be used to compare model predictions across traits.
For each trait, we processed the training and holdout sets to identify positive and negative loci. First, we identified positive loci where at least one variant had P-value < 5 × 10 -8 with a phenotype in the trait group. From the remaining loci, we removed those with P-value < 0.05 in the associated GWAS (if summary statistics were available -no removals for ECG or psoriasis GWAS were possible) to create a negatives pool. Finally, we used only positive loci where we could match 6 (5 for height) negative loci with similar locus variant counts: within 10 of a locus with variant count < 100 or within 50 of a locus with variant count > = 100. This ensured that our positive and negative locus pools were roughly matched for locus size. Using these processed positive and negative locus sets for each trait, we built models using the glmnet 28 package in R, using binomial logistic regression with LASSO regularization and maximizing the AUC. The feature pool included 2305 ENCODE and Epigenome Roadmap annotations, 53 tissue-specific nearest gene expression annotations, the 100way Phastcon score, and normalized variant number. We used tenfold cross validation to select lambda.s1e, performed 15 trials of model building, and selected the median training AUC model as the trait model for analysis. We reasoned that selecting the median model preserved the goal of running multiple trials of cross validation modellingreducing overfitting and bias due to data imbalances.

Model analysis
General model performance was assessed using the ROCR package [52] in R on the holdout data & compared to the GWAVA, CADD, and DeepSEA predictors. We considered a model successful if the holdout AUC was > 0.7 and the difference between training and testing AUC was < 0.02; this includes models describing autoimmune, ECG, lipid, platelet, RBC, and WBC traits. Trait specific model performance was assessed by comparing the six selected model predictions on each holdout set. FDR thresholds used in further analysis were calculated from holdout datasets only.
Due to the nature of GWAS, there are loci tagged as positive across multiple traits. It is possible that these loci were driving the correlation between models for different traits. To test this, we removed all loci that are positive in the holdout set for the trait of interest and any other well modeled trait, creating a non-overlapping holdout set. We then compared ROC plots of the prediction of different models on the all and no overlap holdouts (Fig. 2,  Supplementary Table 1).

Genome wide SNP scores
Used 1000 genomes phase III downloaded on 03/03/2016 from ftp:// ftp. 1000g enomes. ebi. ac. uk/ vol1/ ftp/ relea se/ 20130 502/. Removed all variants that were not SNPs or that did not have rsID, leaving 78,017,615 SNPs. Calculated scores for individual SNPs via bedtools intersect using bed files used to make models, and recalculated gene expression annotation using single SNPs instead of loci. We then identified SNPs with FDR ≤ 0.01 for each model, excluding all SNPs used as positives in model building (Supplementary Table 6).
To calculate enrichment in the most recent blood traits GWAS [34], we used the fGWAS [39] platform. We downloaded the multi-ancestry MR-MEGA summary statistics for RBC, HCT, MCV, RDW, Neutro, Mono, Baso, Eosin, PLT and MPV phenotypes from http:// www. mhi-human genet ics. org/ en/ resou rces/ on 08/13/21. We filtered these results to remove variants at the same genomic coordinates, keeping the most common variant and calculated nearest distance to an FDR < 0.01 variant in the corresponding TSABL model (RBC: RBC, HCT, MCV, RDW; WBC: Neutro, Mono, Baso, Eosin; platelets: PLT, MPV). We ran fGWAS using distance intervals of 25, 50 and 500 kb (ranging from most to least stringent) and found 25 kb to have the best enrichment for all phenotypes ( Supplementary Fig. 5).
We used the Genomic Regions Enrichment of Annotations Tool (GREAT) [35] to interpret these scores using version 4.0.4 found at http:// bejer ano. stanf ord. edu/ great/ public/ html/ index. php. SNPs with FDR < 0.01 as described above were submitted using the whole genome as background to assess genomic enrichment. We found that the tool did not function if more than 150,000 SNPs were submitted, so for datasets with > 150,000 SNPs (ECG, WBC), a random subset of 150,000 was submitted. To assess if enrichments changed when variants linked to known GWAS variants were removed, we removed any variants within 500 KB of a lead GWAS hit for the trait, again submitting a random 150,000 if the SNP list exceeded (ECG) this threshold. For all analysis, we recorded the GO biological processes and mouse single KO phenotype results. We note that the analyses that returned no results were all assessing < 7,000 SNPs, and expect that they failed due to having too few input regions. Full lists of variants and results can be found in Supplementary Table 7.

Credible sets
We used available summary statistics (Supplementary Table 1) to calculate credible sets for GWAS of interest. We limited our analysis to loci used for modelling where the lead SNP was present in the summary statistic data (this eliminated ECG, for which no summary statistics were available, and lipids, which was missing summary statistics for many lead variants). We identified variants of interest within a 1 MB window surrounding lead SNPs, limiting to variants with R 2 ≥ 0.5 to the lead variant in European 1000 genomes phase III [44] data using PLINK 1.90Beta4.5 [45] These variants were assessed using R package corrcov [53] to create 95% confidence interval credible sets (Supplementary Table 8). To calculate percentage of credible sets our models reduce SNP number for, at each FDR threshold we found the number of credible sets where that threshold yielded at least 1, but less than the maximum number of SNPs and divided by the number of credible sets with more than 1 SNP.
Partitioned heritability for 20 relevant GWAS (Supplementary Table 4) as per https:// github. com/ bulik/ ldsc/ wiki/ Parti tioned-Herit abili ty. For models with some GWAS summary statistics (autoimmune, lipids, platelets, RBC, WBC), we also removed SNPs in LD R 2 > 0.7 with lead SNPs (Supplementary Table 45) from both annot and plink files, then recalculated LD Scores and repartitioned heritability to test how much heritability is driven by lead SNPs (Supplementary Table 5). To build models using GWAS loci, we grouped lead variants by LD, then split into training and holdout sets semi-randomly, enforcing an even distribution of positive loci and maintaining the distribution of variant counts between the sets. For modelling, we included only positive loci which had a sufficient number of negative loci with similar locus variant count, which was enforced in both training and holdout sets. After models were built, they were assessed by building ROC plots in the holdout sets. Supplementary Figure 2. Holdout AUCs vs # Selected Features and # Groups used for Training. Model AUC in holdout set plotted against A) number of features selected by the model B) number of positives groups used for training C) % total groups containing a coding variant and D) % training groups containing a coding variant. Panels C and D include a linear model best fit line as well as significance. In all panels, models which passed selection criteria (AUC ≥ 0.7, AUC change between training and holdout ≤ 0.02) are green while models which were not selected are purple.