Skip to main content

TSABL: Trait Specific Annotation Based Locus predictor



The majority of Genome Wide Associate Study (GWAS) loci fall in the non-coding genome, making causal variants difficult to identify and study. We hypothesized that the regulatory features underlying causal variants are biologically specific, identifiable from data, and that the regulatory architecture that influences one trait is distinct compared to biologically unrelated traits.


To better characterize and identify these variants, we used publicly available GWAS loci and genomic annotations to build 17 Trait Specific Annotation Based Locus (TSABL) predictors to identify differences between GWAS loci associated with different phenotypic trait groups. We used a penalized binomial logistic regression model to select trait relevant annotations and tested all models on a holdout set of loci not used for training in any trait. We were able to successfully build models for autoimmune, electrocardiogram, lipid, platelet, red blood cell, and white blood cell trait groups. We used these models both to prioritize variants in existing loci and to identify new genomic regions of interest.


We found that TSABL models identified biologically relevant regulatory features, and anticipate their future use to enhance the design and interpretation of genetic studies.

Peer Review reports


Genome-wide association studies (GWAS) are a widely used method to identify genomic regions associated in a population with a phenotype of interest, and have successfully detected thousands of loci across the human genome, each of which contains a set of linked variants that are potentially causal [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]. Unfortunately, identifying the specific causal variant(s) and underlying mechanism at associated loci is far more challenging, and for many genomic regions there exist more potential causal variants in credible sets than can be experimentally tested [17,18,19]. While multi-ancestry fine-mapping can help to narrow the set of putative variants [20], this approach is limited to loci which share genetic architecture and genetic variation across populations, and still may not resolve causal variant(s) within a locus. Therefore, an orthogonal method for prioritizing variants in GWAS loci would be beneficial. Additionally, as more well-powered GWAS reveal more loci, a straightforward method to prioritize the likelihood of variants contributing to a specific disease will aid in locus discovery.

While pleiotropy is pervasive across the human genome [21], the contribution of individual variants depends on the biological relatedness of the traits under consideration. For example, the loci, causal genes, and variants associated with circulating plasma lipid levels might reasonably be expected to differ from those that modify electrocardiogram (ECG) traits, as the biological basis for these traits involve different tissues (liver vs. heart), different physiologies (lipid metabolism and trafficking vs. cardiac electric conductance), and different genes (e.g., SORT1 [22] vs. NOS1AP [23]). Presumably, then, with sufficient association data and a diverse and dense collection of functional annotations across the genome, it might be possible to train a model to predict loci – and perhaps even variants – that relate to specific sets of traits relative to all others. Ideally, such a predictor would learn and select biologically relevant annotations from a set of all possible genomics features, thereby also providing informative hypotheses for why particular loci or variants are predicted to be “trait-relevant”. Current methods to predict if a variant is likely functional for any trait (such as CADD [24], GWAVA [25], and DeepSEA [26]) provide a score based on likelihood of function across multiple (unrelated) phenotypes and diseases.

Previously, we built platelet and red blood cell models using those GWAS as positives and random matched genomic regions as controls [27]. While this was successful, we found the models to not be trait selective. Here, we instead sought to develop a trait specific annotation based locus (TSABL) predictor which would discriminate loci associated with a specific trait (or collection of related traits) from all other complex trait associated loci identified by GWAS. We collected association results from 42 GWAS-scanned phenotypes [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16], coarsely grouping them into 17 related trait groups to test and validate prediction models (Supplementary Table 1). We then applied a penalized binomial logistic regression model (LASSO) to each trait group using publicly available genome wide features such as DNase hypersensitivity and histone or transcription factor ChIP-seq (Supplementary Table 2), along with gene expression, phastCon score, and size of locus. The six successful TSABL models were able to separate regions associated with their trait group from other GWAS loci and identified sensible biological features for the phenotypes in focus. We then applied these models to fine mapping and locus discovery paradigms, both of which may lead to translationally relevant trait-specific biological mechanisms.


To develop our approach, we needed to define the set of traits, identify positive and negative loci for those traits, and obtain a collection of genomic features to discriminate between these labels. For traits to study, we focused on 42 phenotypes obtained from curated GWAS and the NHGRI-EBI GWAS catalog that we collapsed into 17 related trait groupings (Methods, Supplementary Table 1). Using all GWAS lead variants identified, we first clumped variants into loci by linkage disequilibrium (using European LD R2 > 0.7). Next, for each trait group we defined positive loci as those which contained at least one variant with an associated P-value < 5 × 10–8 with at least one of the phenotypes in the trait group. Negative loci for each trait group were defined as those without any variant even modestly associated (P-value > 0.05) with one or more of the phenotypes in the trait group. We utilized genome-wide annotations from multiple public resources, including nearest gene tissue-specific expression and epigenomic features, tagging these features across positive and negative examples if the features overlapped with any variant in the locus.

We then generated predictive models for each of the 17 trait groups using a penalized binomial logistic regression model (LASSO) [28], using cross-validation for training and estimating final prediction accuracy on a holdout set not used in training any model (Methods, Supplementary Fig. 1). We defined a model as informative if the area under the curve (AUC) was ≥ 0.7 and the difference between training and holdout AUCs was ≤ 0.02 (implying only a small level of overfitting). Based on these criteria, models trained for autoimmune, electrocardiogram (ECG), lipid, platelet, red blood cell (RBC), and white blood cell (WBC) traits passed, and were the focus of further investigation (Fig. 1a, Table 1).

Fig. 1
figure 1

Model AUCs. A Training vs Holdout AUCs. Dotted grey line shows equality. Selected models (holdout AUC ≥ 0.7) shown in green, unselected in purple. B Individual holdout AUC plots for selected models, compared to CADD (pink), GWAVA (cyan), DeepSEA (brown)

Table 1 All Traits Results Summary

To assess why these 6 traits produced successful models while the other 11 failed, we considered whether trait models with more loci with coding variants, more positive loci or more features were more predictive (Supplementary Fig. 2). We found a significant relationship between percentage of loci containing coding variants and holdout set AUC, suggesting that even though our models are training on annotations we consider to be non-coding in nature, traits with more loci containing coding variants are easier for our models to predict. We did not find an obvious relationship between either number of loci used to train models or number of annotations selected and holdout set AUC, suggesting that our inability to predict these traits is not based on limitations in availability of positive/negative examples or the total number of available features. The implication may be that the catalog of existing features available for modelling insufficiently captures differences between these other 11 traits and the spectrum of complex traits more generally.

We considered two metrics of model specificity. First, trait specific models should be more informative for prediction relative to functional variant prediction methods which were designed to be trait agnostic. For the trait groups we assembled, the holdout set AUCs for CADD, GWAVA, and DeepSEA were all close to 0.5, indicating that none were informative for discriminating amongst trait-associated GWAS loci (Fig. 1B, Table 1). Second, trait specific predictors should perform the best on the phenotype collections from which they were trained. Looking at comparative model performance, we observed that all six selected models were better at predicting their own holdout set compared to others (Fig. 2, left panels and Supplementary Fig. 3). These results indicate that these six prediction models can discriminate between traits using GWAS loci alone, and that the power to make predictions is not due to intrinsic functionality of GWAS loci in general that would be captured by trait agnostic methods.

Fig. 2
figure 2

Model AUCs on same holdout sets. Each panel shows the named holdout set: A Platelets and B ECG. AUCs for all six selected models are shown on all plots, with the left (All) plot having all loci in that holdout and the right (No Overlap) having only those loci not positive in multiple models

Given the pervasive nature of pleiotropy, we would expect that predictors for traits that are related to one another – but perhaps not so closely as to be grouped together – should be partially cross-predictive. We observed this to be true for the RBC model, which could be used to predict the platelet holdout set with an AUC of 0.79 (Fig. 2A, left panel). To rule out that this similarity was driven by loci that are positive for both traits, we removed all loci from the holdout set that were positive in both the “in-focus” trait (here, platelets) and any other model, then reassessed the holdout AUC for each model. Even after removing overlapping loci, the RBC model still predicted the platelet holdout set well, indicating the prediction accuracy is not driven by overlapping loci (RBC AUC of 0.76 on platelets holdout set; Fig. 2A, right panel). In contrast, the ECG model was the only one that predicted the ECG holdout set reasonably, while the autoimmune, lipids, platelets, RBC and WBC models all had AUC < 0.55 on the ECG holdout (Fig. 2B).

Looking at the annotations selected as important for our six successful models, we found them to be largely in line with known trait biology (Fig. 3, Supplementary Fig. 4, Supplementary Table 3). All three blood traits (platelets, RBC, and WBC) selected mainly whole blood annotations that positively rank active regions (active histone marks, gene expression, and transcription factor (TF) binding) and negatively rank repressive histone marks (H3K9me3) (Fig. 3A, Supplementary Fig. 4B and C). By contrast, the ECG model selected only heart tissue annotations (Fig. 3B), and the lipids model chose predominantly liver tissue features (Fig. 3C). We also identified features that distinguish these traits from each other. For example, only the models for the immune traits autoimmune and WBC selected spleen gene expression.

Fig. 3
figure 3

Selected Model Feature Coefficients. A Platelets, B ECG, and C Lipids model coefficients, sorted by tissue type and value. GE – Gene Expression

We next used the framework of stratified LDScore regression [29] to evaluate if the scores emitted by our models could be used as an annotation to explain partitioned heritability. With our six added annotations derived from our six selected models, we tested 103 annotations in total and so considered enrichments with P-value < 4.8 × 10–4 significant by Bonferroni correction. We found enrichment in GWAS for all successful models with summary statistics available (autoimmune, lipids, platelets, RBC, WBC) (Supplementary Table 4). To test whether this enrichment was driven by the lead GWAS SNPs that the models were based on, we also removed these lead GWAS SNPs and their LD proxies (EUR R2 LD > 0.7) from the dataset and reassessed enrichment. We found that the lipids, platelets, and WBC models were significant in all individual phenotype association data, and the RBC model was significant for 3 of 4 phenotype GWAS (Supplementary Table 5). Perplexingly, the autoimmune model was not significant for any phenotype GWAS after lead SNPs were removed, but the related trait models for WBC and platelets were significant for all 3 autoimmune phenotype GWAS. We expect that the autoimmune trait group failed this test due to a lack of power, as it had the fewest positive loci of the tested successful models. The enrichment we see in LDScore partitioned heritability demonstrates that our trait- and cell type-specific modeling approach can define target loci for related disease phenotypes, facilitating downstream validation and biological/mechanistic understanding.

Application—locus discovery

We next applied our six predictive models to all SNPs identified in phase III of the 1000 Genomes project. From ~ 78 million scored SNPs, we predicted between 7,366 and 485,424 trait associated SNPs per model that exceeded a FDR < 0.01 threshold (Supplementary Table 6). To find genes associated with these SNPs, we first identified the 436 to 2,137 genes overlapping these variants, of which 153 to 879 were outside of identified GWAS regions (defined as 1 MB window surrounding lead SNPs; Supplementary Table 6). For example, the lipids model identified 18,847 SNPs, with 458 overlapping genes, of which 167 were outside known GWAS regions. While we found many SNPs near established GWAS variants, we were interested here in the regions not previously identified by GWAS analyses, and our models highlight several points of potentially interesting biology. For the lipids model, we identified a cluster of 10 SNPs overlapping DHCR24, a known cholesterol biosynthesis gene [30], and a cluster of 63 SNPs overlapping PLIN2, a gene involved in lipid globule storage [31], both of which are reasonable biological candidates for lipid related phenotypes. For the platelets model, we identified a cluster of 8 SNPs overlapping JARID2, a gene known to be involved in hematopoiesis [32], and 7 SNPs overlapping PXN, which is involved in immunity response [33], both of which may plausibly impact platelet phenotypes. Additionally, we found that FDR < 0.01 variants of the platelets, RBC, and WBC models were enriched for being within 25 kb of a newly published set of blood traits GWAS [34] (Supplementary Fig. 5).

To assess whether the trait associated SNPs mapped onto established biological knowledge, we used GREAT [35] analysis to examine both the human gene ontology (GO) biological process and mouse single knockout (KO) phenotypes associated with these SNPs. We used both the full set of trait associated SNPs and also removed all variants within 500 KB of identified GWAS regions (non-GWAS variants) (Supplementary Table 7). For the full set of trait associated SNPs, all models returned mouse KO phenotypes consistent with our expectations. For example, mouse KOs for genes overlapping our autoimmune associated SNPs are enriched for phenotypes such as “abnormal T cell morphology” and “decreased lymphocyte cell number.” GO biological processes were similarly consistent with expectations for autoimmune, ECG, lipids, and WBC trait models, while the platelets and RBC traits returned mostly generic results (for example, “RNA processing”). The non-GWAS variants analyses were similar, except that the lipids and RBC models did not return any results for either the mouse KO or GO biological processes analyses.

Application – fine mapping

Functional cellular follow-up studies require one to prioritize GWAS loci, genes and variants within them that are likely causal. To assess the usefulness of our model scores for individual SNPs within GWAS loci, we identified SNPs in credible sets that exceed FDR thresholds of 0.10, 0.25, or 0.50 of the relevant model for phenotypes where appropriate summary statistics were available (autoimmune, platelets, RBC and WBC, Table 2, Supplementary Tables 8 and 9). At FDR 0.10, our models prioritize a subset of SNPs of interest in 27% of multi-SNP credible sets; at FDR 0.25, 52% of credible sets see prioritization. For many of these loci the number of SNPs highly prioritized is substantially less than the size of the credible set. For example, lead rs10486483 from Crohn’s Disease had a credible set of 85 variants, prioritizing using an FDR 0.10 threshold leads to a list of 3 variants to test; or lead rs2382817 in Inflammatory Bowel Disorder, where 5 variants were prioritized from a credible set of 68. While a set threshold is useful for summarizing, it is not necessary in practice, and variants in a credible set can simply be ordered by their TSABL score for assessment, facilitating immediate integration into functional follow up study pipelines.

Table 2 Credible set improvements


This approach is limited by both the GWAS data and genome wide features available. Surprisingly, we were unable to adequately model some complex traits (Table 1). However, we expect that additional genomic features will improve some models; for example, in the dataset used for this analysis there are 642 annotations from blood tissue but only 31 from bone (Supplementary Table 2) and it seems likely that part of the reason the height model was not well specified is that it lacked key genomic annotations relevant to bone growth. Additionally, the rapidly expanding repertoire of single cell data sets may better parse complex tissue types (e.g., pancreas) and capture data relevant to rare cell types that are underrepresented in current datasets. Future studies will be designed to incorporate these data types, which are only currently emerging.

We selected a modelling strategy to allow for improved biological interpretability, but this presumably comes at a cost of accuracy. We intentionally selected a variety of traits with very different underlying physiology, and we found that the traits well described by our models are linked largely to single cell types, namely blood, liver, or heart. We do not successfully model traits such as Body Mass Index (BMI) or Type 2 Diabetes (T2D), both of which involve many tissues and have disease subtypes that are not well captured in current GWAS analyses [36,37,38]. We expect that analyzing these phenotypes as a single cohort may serve to obfuscate rather than clarify, and anticipate that incorporating more granularity in phenotypes (type of T2D or severity of BMI, for example) will improve model performance.

Pleiotropy presents another challenge to our modeling approach, but is inherently captured in any genomic data set. It is plausible that more sophisticated approaches with this in mind might be anticipated to improve performance (e.g., multi-label classification). Moreover, approaches that captured uncertainty around labelled examples (e.g., “noisy” labels) might also be helpful to address these challenges as well as expand the use of both positive and negative examples. Additional processing of genome wide features, both to account for correlation and to reduce the feature space prior to model building (e.g., stability selection) may also improve results.

An important limitation to acknowledge concerning our locus discovery and fine mapping applications is that both are applied to variant level data, while our models were trained on locus level data that incorporates feature overlaps from multiple variants within a locus. While ideally we would train a model on individual variants before applying it to individual variants, the reality of GWAS data means that the pool of validated causal variants is still inadequate to this task. Identifying causal variants from other sources introduces additional sources of bias which we judged to be larger than that introduced by shifting from locus to variant.

A logical point of comparison for TSABL is to another strategy used for fine mapping or locus discovery. The most similar is the fGWAS method, which is designed to jointly consider GWAS and genomic annotations [39]. The main distinction between these models is that fGWAS considers trait loci compared to all regions of the genome (regardless of their association with a complex trait) while TSABL specifically uses established GWAS loci as the negative set. In practice, what this means is that TSABL models found differences between traits, rather than annotations that may be held in common between all GWAS loci. While both approaches are useful, identifying trait specific annotations will allow for the translation of GWAS loci into functional biological hypotheses, facilitating disease treatment and variant-based drug discovery.


With current pitfalls in fine mapping and locus discovery, our approach can distinguish amongst complex trait-related groups applied to 17 sets of traits. Our modeling process, built using publicly available GWAS and annotation data, selects and outputs features and variants relevant to underlying disease biology. These models predict differences between traits that currently available SNP ranking schemes do not, and the overlaps we see among models are consistent with known biological links between the traits assessed. Our modelling strategy makes no assumptions about number of causal variants in a GWAS locus, is expandable to include new genome wide features as they are developed, and can be applied to locus discovery problems or to prioritization after fine-mapping for functional validation efforts.


General data sources and analysis

All chromosome:position coordinates given in this paper refer to genome build hg19. We downloaded refseq gene exon coordinates for build hg19 from ENSEMBL [40] at on 10/24/17, and used the extreme exon boundaries as gene boundaries. We assessed annotations for rsIDs of interest using ANNOVAR version 2016Feb01 [41] and regulomeDB [42].

GWAS trait grouping

Full listing of GWAS phenotypes included in each of the 17 trait groups are found in Supplementary Table 1. We grouped GWAS together based on known biological links. The traits that include multiple GWAS analysis are autoimmune, blood pressure, electrocardiogram (ECG), lipids, platelets, red blood cell (RBC), type 2 diabetes (T2D), and white blood cell (WBC), For autoimmune, we used the five most correlated traits (psoriasis, Crohn’s disease, ulcerative colitis, Behcet’s disease and ankylosing spondylitis) in an analysis of autoimmune disease correlation [43] plus inflammatory bowel disorder due to its similarity to ulcerative colitis and Crohn’s disease [10]. For T2D, we also included fasting glucose and fasting insulin results from the NHGRI-EBI catalog, under the reasoning that either or both are used in the clinical diagnosis of T2D. For the remaining trait groups, all GWAS phenotypes included were assessed in the same paper and are highly correlated.

GWAS loci identification & processing

We downloaded the NHGRI-EBI GWAS catalog v1.0 2019–06-20 from on 06/21/19 and processed GWAS results by removing the following associations:

  1. 1)

    with no listed P-value or a P-value > 5e-8

  2. 2)

    with no rsID listed or multiple rsIDs listed

  3. 3)

    listed as on the X chromosome, no chromosome, or multiple chromosomes

  4. 4)

    with descriptions indicating they were interaction effects with another locus

  5. 5)

    located in the HLA region (chr6:29,670,261–33,104,175; hg19 coordinates)

Remaining associations were grouped by rsIDs, resulting in 53,640 variants.

We also included lead variants from GWAS for specific phenotypes associated with our 17 trait groups, regardless of their inclusion in the GWAS catalog (autoimmune [3, 10], breast cancer [14], birthweight [6], bone mineral density [15], body mass index [8], blood pressure [11], coronary artery disease [4], electrocardiogram [5], height [8], lipids [12], age at menarche [9], platelets [1], red blood cells [1], schizophrenia/bipolar disorder [2], type 2 diabetes [13], white blood cells [1], and waist hip ratio adjusted for BMI [16]; Supplementary Table 1). Literature curated GWAS variants were processed in the same way as catalog variants so that our input was a list of variant rsIDs.

For all variants, we identified hg19 positions using Ensembl biomart [40], and removed variants if they could not be assigned a position. We identified loci of interest by collecting all variants with R2 ≥ 0.7 to a GWAS variant in European 1000 genomes phase III [44] data using PLINK 1.90Beta4.5 [45]. Variants were considered part of a locus if they had R2 ≥ 0.7 with any other variant in the locus. Final variant list includes 715,404 variants assigned to 32,713 loci and can be found in Supplementary Table 10.

Feature identification and processing

Gappedpeak files for histone ChIP-seq experiments and the “hotspot.fdr0.01.broad” files for DNase assays were downloaded on 02/10/16 from the consolidated epigenomes section of the Roadmap Epigenomics Project [46] portal. Uniform DNase files were downloaded 03/28/16 from Homo sapiens FAIRE, transcription profiling by array, and ChIP-seq data were downloaded 01/06/16 from the ENCODE [47] project portal. We included data as provided for experiment accession numbers with a single file, and used the intersection if multiple files were provided. For transcription profiling by array data, we selected data labeled as “filtered transcribed fragments” if available. For ChIP-seq files, if the experiment accession had a file labeled “optimal idr threshold” or “replicated peaks” we used that file. For histone data, if both broadpeak and narrowpeak files were available, we created gappedpeak files. If only one peak type was available, we included that file. For all ChIP-seq targets that were not histones, narrowpeak files were used if available and broadpeak files otherwise.

Feature overlaps with GWAS loci were identified using bedtools2v2.25.0 intersect [48]. For a given locus, a feature was coded as 1 if any variant in the locus overlapped the feature, and a 0 if not. For indel variants, overlap was counted only with the start position of the variant.

To build the features for tissue-specific gene expression, we utilized previously calculated t-statistics for specific expression [49]. The nearest gene to each locus was determined as the gene containing a corresponding tissue-specific expression t-statistic with a start position nearest to the locus center. The locus was assigned a 1 in a tissue if the nearest gene had a t-statistic ranking in the top ten percent for that tissue and a 0 otherwise, creating 53 tissue-specific nearest gene expression features.

We downloaded 100way phastCon [50] scores for the human genome on 04/26/16 from and used the bigWigAverageOverBed [51] tool from the UCSC toolkit to extract scores for single SNPs. We use the highest variant 100way phastCon score as the score for the locus.

Finally, we counted the number of variants in a locus and included normalized variant number as a possible feature.

The complete list of 2305 ENCODE and Roadmap Epigenomics features, nearest gene tissue specific expression, phastcon, and variant count used in this paper is found in Supplementary Table 2. Tables used for modelling are provided as Supplementary Tables 11121314151617181920212223242526272829303132333435363738394041424344.

Benchmarking scores

We downloaded CADD [24] scores for 1000 genomes phase 3 variants from on 02/25/16. We downloaded GWAVA [25] scores on 02/25/2016 from and used tss_score for all analyses. For both CADD and GWAVA, a higher score indicates a greater probability of functionality, so if the downloaded database had multiple scores for a position, we used the highest score provided. If they did not have a score for a position, it was set to 0. We use the highest variant score as the locus score for both CADD and GWAVA.

We used the standalone deepSEA-0.94 [26] program to get deepSEA scores for our variants. Unlike the previous scores, a smaller e value corresponds to a greater probability of functionality, so we use the lowest variant deepSEA score as the score for the locus. For comparison purposes, we subtract this value from 1 so that it is on the same scale as the other metrics.

Building models

In order to accurately compare models across traits on independent holdout sets, we wanted to ensure that our holdout sets would not contain loci used for training, and vice versa. Therefore, we used the following schema to divide loci (Supplementary Fig. 1). We considered a locus as being positive for a trait if at least one variant in the locus was associated with the trait and negative otherwise. We separated all loci into 2/3 training and 1/3 holdout. This separation was done semi-randomly, with the following conditions enforced: 1) for each trait, maintained a 2/3 train, 1/3 holdout ratio for positive loci, and 2) maintained native distribution of variant count in loci (for example if there were 60 loci containing 10 variants each, 40 were assigned to training and 20 to holdout). Furthermore, we removed any locus with > 1000 variants (1 locus removed). In this way we created a separate set of holdout loci that were not used to build any of the trait models and so could be used to compare model predictions across traits.

For each trait, we processed the training and holdout sets to identify positive and negative loci. First, we identified positive loci where at least one variant had P-value < 5 × 10–8 with a phenotype in the trait group. From the remaining loci, we removed those with P-value < 0.05 in the associated GWAS (if summary statistics were available – no removals for ECG or psoriasis GWAS were possible) to create a negatives pool. Finally, we used only positive loci where we could match 6 (5 for height) negative loci with similar locus variant counts: within 10 of a locus with variant count < 100 or within 50 of a locus with variant count >  = 100. This ensured that our positive and negative locus pools were roughly matched for locus size. Using these processed positive and negative locus sets for each trait, we built models using the glmnet28 package in R, using binomial logistic regression with LASSO regularization and maximizing the AUC. The feature pool included 2305 ENCODE and Epigenome Roadmap annotations, 53 tissue-specific nearest gene expression annotations, the 100way Phastcon score, and normalized variant number. We used tenfold cross validation to select lambda.s1e, performed 15 trials of model building, and selected the median training AUC model as the trait model for analysis. We reasoned that selecting the median model preserved the goal of running multiple trials of cross validation modelling – reducing overfitting and bias due to data imbalances.

Model analysis

General model performance was assessed using the ROCR package [52] in R on the holdout data & compared to the GWAVA, CADD, and DeepSEA predictors. We considered a model successful if the holdout AUC was > 0.7 and the difference between training and testing AUC was < 0.02; this includes models describing autoimmune, ECG, lipid, platelet, RBC, and WBC traits. Trait specific model performance was assessed by comparing the six selected model predictions on each holdout set. FDR thresholds used in further analysis were calculated from holdout datasets only.

Due to the nature of GWAS, there are loci tagged as positive across multiple traits. It is possible that these loci were driving the correlation between models for different traits. To test this, we removed all loci that are positive in the holdout set for the trait of interest and any other well modeled trait, creating a non-overlapping holdout set. We then compared ROC plots of the prediction of different models on the all and no overlap holdouts (Fig. 2, Supplementary Table 1).

Genome wide SNP scores

Used 1000 genomes phase III downloaded on 03/03/2016 from Removed all variants that were not SNPs or that did not have rsID, leaving 78,017,615 SNPs. Calculated scores for individual SNPs via bedtools intersect using bed files used to make models, and recalculated gene expression annotation using single SNPs instead of loci. We then identified SNPs  with FDR ≤ 0.01 for each model, excluding all SNPs used as positives in model building (Supplementary Table 6).

To calculate enrichment in the most recent blood traits GWAS [34], we used the fGWAS [39] platform. We downloaded the multi-ancestry MR-MEGA summary statistics for RBC, HCT, MCV, RDW, Neutro, Mono, Baso, Eosin, PLT and MPV phenotypes from on 08/13/21. We filtered these results to remove variants at the same genomic coordinates, keeping the most common variant and calculated nearest distance to an FDR < 0.01 variant in the corresponding TSABL model (RBC: RBC, HCT, MCV, RDW; WBC: Neutro, Mono, Baso, Eosin; platelets: PLT, MPV). We ran fGWAS using distance intervals of 25, 50 and 500 kb (ranging from most to least stringent) and found 25 kb to have the best enrichment for all phenotypes (Supplementary Fig. 5).

We used the Genomic Regions Enrichment of Annotations Tool (GREAT) [35] to interpret these scores using version 4.0.4 found at SNPs with FDR < 0.01 as described above were submitted using the whole genome as background to assess genomic enrichment. We found that the tool did not function if more than 150,000 SNPs were submitted, so for datasets with > 150,000 SNPs (ECG, WBC), a random subset of 150,000 was submitted. To assess if enrichments changed when variants linked to known GWAS variants were removed, we removed any variants within 500 KB of a lead GWAS hit for the trait, again submitting a random 150,000 if the SNP list exceeded (ECG) this threshold. For all analysis, we recorded the GO biological processes and mouse single KO phenotype results. We note that the analyses that returned no results were all assessing < 7,000 SNPs, and expect that they failed due to having too few input regions. Full lists of variants and results can be found in Supplementary Table 7.

Credible sets

We used available summary statistics (Supplementary Table 1) to calculate credible sets for GWAS of interest. We limited our analysis to loci used for modelling where the lead SNP was present in the summary statistic data (this eliminated ECG, for which no summary statistics were available, and lipids, which was missing summary statistics for many lead variants). We identified variants of interest within a 1 MB window surrounding lead SNPs, limiting to variants with R2 ≥ 0.5 to the lead variant in European 1000 genomes phase III [44] data using PLINK 1.90Beta4.5 [45] These variants were assessed using R package corrcov [53] to create 95% confidence interval credible sets (Supplementary Table 8). To calculate percentage of credible sets our models reduce SNP number for, at each FDR threshold we found the number of credible sets where that threshold yielded at least 1, but less than the maximum number of SNPs and divided by the number of credible sets with more than 1 SNP.

LDSC partitioned heritability

Downloaded LD Scores v2.2, plink files, weights, and other necessary files from Added 6 annotations corresponding to 0–1 normalized SNP scores, one per trait model (autoimmune, ECG, lipids, platelets, RBC, WBC) to V2.2 annot files. Removed 11,608 SNPs on chromosome 12 containing ssIDs instead of rsIDs from both annot and plink files. Recalculated LD Scores [29] using new annot files as per Partitioned heritability for 20 relevant GWAS (Supplementary Table 4) as per For models with some GWAS summary statistics (autoimmune, lipids, platelets, RBC, WBC), we also removed SNPs in LD R2 > 0.7 with lead SNPs (Supplementary Table 45) from both annot and plink files, then recalculated LD Scores and repartitioned heritability to test how much heritability is driven by lead SNPs (Supplementary Table 5).

Availability of data and materials

All data generated or analysed during this study are included in this published article and its supplementary information files.

Refseq genes:

Roadmap Epigenomics consolidated epigenomes files:

Uniform DNase files:


100way phastCons:

CADD scores:

GWAVA scores:

1000 genomes phase III:

Blood trait meta analysis:

LDSC resources:


  1. Astle WJ, et al. The Allelic Landscape of Human Blood Cell Trait Variation and Links to Common Complex Disease. Cell. 2016;167:1415-1429.e19.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  2. Bipolar Disorder and Schizophrenia Working Group of the Psychiatric Genomics Consortium, Electronic address:, Bipolar Disorder and Schizophrenia Working Group of the Psychiatric Genomics Consortium. Genomic dissection of bipolar disorder and schizophrenia, including 28 subphenotypes. Cell. 2018;173(7):1705-1715.e16.

    Article  PubMed Central  CAS  Google Scholar 

  3. Tsoi LC, et al. Large scale meta-analysis characterizes genetic architecture for common psoriasis associated variants. Nat Commun. 2017;8:1–8.

    Article  CAS  Google Scholar 

  4. van der Harst P, Verweij N. Identification of 64 novel genetic loci provides an expanded view on the genetic architecture of coronary artery disease. Circ Res. 2018;122:433–43.

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  5. van Setten J, et al. PR interval genome-wide association meta-analysis identifies 50 loci associated with atrial and atrioventricular electrical activity. Nat Commun. 2018;9:1–11.

    CAS  Google Scholar 

  6. Warrington NM, et al. Maternal and fetal genetic effects on birth weight and their relevance to cardio-metabolic risk factors. Nat Genet. 2019;51:804–14.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  7. Willer CJ, et al. Discovery and refinement of loci associated with lipid levels. Nat Genet. 2013;45:1274–83.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  8. Yengo L, et al. Meta-analysis of genome-wide association studies for height and body mass index in 700000 individuals of European ancestry. Hum Mol Genet. 2018;27:3641–9.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  9. Day FR, et al. Genomic analyses identify hundreds of variants associated with age at menarche and support a role for puberty timing in cancer risk. Nat Genet. 2017;49:834–41.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  10. De Lange KM, et al. Genome-wide association study implicates immune activation of multiple integrin genes in inflammatory bowel disease. Nat Genet. 2017;49:256–61.

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  11. Giri A, et al. Trans-ethnic association study of blood pressure determinants in over 750,000 individuals. Nat Genet. 2019;51:51–62.

    Article  CAS  PubMed  Google Scholar 

  12. Klarin D, et al. Genetics of blood lipids among ~300,000 multi-ethnic participants of the Million Veteran Program. Nat Genet. 2018;50:1514–23.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  13. Mahajan A, et al. Fine-mapping of an expanded set of type 2 diabetes loci to single-variant resolution using high-density imputation and islet-specific epigenome maps Individual study design and principal investigators Europe PMC Funders Group. Nat Genet. 2018;50:1505–13.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  14. Michailidou K, et al. Association analysis identifies 65 new breast cancer risk loci. Nature. 2017;551:92–4.

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  15. Morris JA, et al. An atlas of genetic influences on osteoporosis in humans and mice. Nat Genet. 2019;51:258–66.

    Article  CAS  PubMed  Google Scholar 

  16. Pulit SL, et al. Meta-analysis of genome-wide association studies for body fat distribution in 694 649 individuals of European ancestry. Hum Mol Genet. 2019;28:166–74.

    Article  CAS  PubMed  Google Scholar 

  17. Maller JB, et al. Bayesian refinement of association signals for 14 loci in 3 common diseases. Nat Genet. 2012;44:1294–301.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  18. Visscher PM, Brown MA, McCarthy MI, Yang J. Five years of GWAS discovery. Am J Hum Genet. 2012;90(1):7–24.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  19. Claussnitzer M, et al. A brief history of human disease genetics. Nature. 2020;577:179–89.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  20. Mägi R, et al. Trans-ethnic meta-regression of genome-wide association studies accounting for ancestry increases power for discovery and improves fine-mapping resolution. Hum Mol Genet. 2017;26:3639–50.

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  21. Weissbrod O, et al. Functionally informed fine-mapping and polygenic localization of complex trait heritability. Nat Genet. 2020;52:1355–63.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  22. Strong A, Patel K, Rader DJ. Sortilin and lipoprotein metabolism: Making sense out of complexity. Curr Opin Lipidol. 2014;25:350–7.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  23. George AL, et al. NOS1AP is a genetic modifier of the long-QT syndrome. Circulation. 2009;120:1657–63.

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  24. Kircher M, et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat Genet. 2014;46:310–5.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  25. Ritchie GRS, Dunham I, Zeggini E, Flicek P. Functional annotation of noncoding sequence variants. Nat Methods. 2014;11:294–6.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  26. Zhou J, Troyanskaya OG. Predicting effects of noncoding variants with deep learning–based sequence model. Nat Methods. 2015;12:931–4.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  27. Thom CS, et al. Tropomyosin 1 genetically constrains in vitro hematopoiesis. BMC Biol. 2020;18:1–16.

    Article  CAS  Google Scholar 

  28. Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 2010;33:1–22.

    Article  PubMed  PubMed Central  Google Scholar 

  29. Finucane HK, et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat Genet. 2015;47:1228–35.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  30. Waterham HR, et al. Mutations in the 3β-hydroxysterol Δ24-reductase gene cause desmosterolosis, an autosomal recessive disorder of cholesterol biosynthesis. Am J Hum Genet. 2001;69:685–94.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  31. Heid HW, Moll R, Schwetlick I, Rackwitz H-R, Keenan TW. Adipophilin is a specific marker of lipid accumulation in diverse cell types and diseases. Cell Tissue Res. 1998;294:309–21.

    Article  CAS  PubMed  Google Scholar 

  32. Pasini D, et al. JARID2 regulates binding of the Polycomb repressive complex 2 to target genes in ES cells. Nature. 2010;464:306–10.

    Article  CAS  PubMed  Google Scholar 

  33. Ostergaard HL, Lou O, Arendt CW, Berg NN. Paxillin phosphorylation and association with Lck and Pyk2 in anti-CD3- or anti-CD45-stimulated T cells. J Biol Chem. 1998;273:5692–6.

    Article  CAS  PubMed  Google Scholar 

  34. Vuckovic D, et al. The polygenic and monogenic basis of blood traits and diseases. Cell. 2020;182:1214-1231.e11.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  35. McLean CY, et al. GREAT improves functional interpretation of cis-regulatory regions. Nat Biotechnol. 2010;28:495–501.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  36. Prasad R, Groop L. Genetics of Type 2 Diabetes—Pitfalls and Possibilities. Genes (Basel). 2015;6:87–123.

    Article  CAS  Google Scholar 

  37. Udler MS, et al. Type 2 diabetes genetic loci informed by multi-trait associations point to disease mechanisms and subtypes: A soft clustering analysis. PLoS Med. 2018;15: e1002654.

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  38. Fernández-Tajes J, et al. Developing a network view of type 2 diabetes risk pathways through integration of genetic, genomic and functional data. Genome Med. 2019;11:1–14.

    Article  Google Scholar 

  39. Pickrell JK. Joint analysis of functional genomic data and genome-wide association studies of 18 human traits. Am J Hum Genet. 2014;94:559–73.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  40. Yates A, et al. Ensembl 2016. Nucleic Acids Res. 2016;44:D710–6.

    Article  CAS  PubMed  Google Scholar 

  41. Wang K, Li M, Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 2010;38:e164–e164.

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  42. Boyle AP, et al. Annotation of functional variation in personal genomes using RegulomeDB. Genome Res. 2012;22:1790–7.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  43. Farh KKH, et al. Genetic and epigenetic fine mapping of causal autoimmune disease variants. Nature. 2015;518:337–43.

    Article  CAS  PubMed  Google Scholar 

  44. Auton A, et al. A global reference for human genetic variation. Nature. 2015;526:68–74.

    Article  PubMed  CAS  Google Scholar 

  45. Purcell S, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81:559–75.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  46. Kundaje A, et al. Integrative analysis of 111 reference human epigenomes. Nature. 2015;518:317–30.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  47. Dunham I, et al. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489:57–74.

    Article  CAS  Google Scholar 

  48. Quinlan AR, Hall IM. BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26:841–2.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  49. Finucane HK, et al. Heritability enrichment of specifically expressed genes identifies disease-relevant tissues and cell types. Nat Genet. 2018;50:621–9.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  50. Siepel A. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 2005;15:1034–50.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  51. Kent WJ, Zweig AS, Barber G, Hinrichs AS, Karolchik D. BigWig and BigBed: enabling browsing of large distributed datasets. Bioinformatics. 2010;26:2204–7.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  52. Sing T, Sander O, Beerenwinkel N, Lengauer TROCR. Visualizing classifier performance in R. Bioinformatics. 2005;21:3940–1.

    Article  CAS  PubMed  Google Scholar 

  53. Hutchinson A, Watson H, Wallace C. Improving the coverage of credible sets in Bayesian genetic fine-mapping. PLOS Comput Biol. 2020;16:e1007829.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

Download references


Not applicable.


NIH NIDDK (DK101478).

Author information

Authors and Affiliations



KL and BFV designed the study and wrote the manuscript; KL, CST and SA performed the analyses; all authors read and approved the final manuscript.

Corresponding author

Correspondence to Benjamin F. Voight.

Ethics declarations

Ethics approval and consent to participate

All human data used in this study was publically available.

Consent for publication

Not applicable.

Competing interests

The authors declare they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1: Table S1.

Trait Groups. Phenotypes and how we grouped them into traits. Columns are Trait, Trait abbreviation, phenotype, GWAS citation, Number of Lead SNPs included from Citation, and Summary Statistic availability.

Additional file 2: Table S2.

List of Features. All 2305 features used to build models. Columns are Feature Name, Source, Type, Sample, and Tissue.

Additional file 3: Table S3.

Full Coefficient_Names. Features returned by TSABL models. Columns are Identifier, Type, Mark, Name, autoimmune, ECG, lipids, platelets, RBC, WBC; the last 6 columns contain the coefficient values for the named model.

Additional file 4: Table S4.

LDSC. Results of LDSC partitioned heritability analysis; each tab is named as trait_phenotype where trait indicates the trait group modeled and phenotype the specific GWAS phenotype.

Additional file 5: Table S5.

LDSC no lead r2. Results of LDSC partitioned heritability analysis where all SNPs with LD R2 > 0.7 have been removed from the analysis; each tab is named as trait_phenotype where trait indicates the trait group modeled and phenotype the specific GWAS phenotype.

Additional file 6: Table S6.

Locus Discovery. High scoring SNPs from discover analysis. First tab is summary of SNP numbers by trait, remaining tabs are individual lists per trait.

Additional file 7: Table S7.

GREAT Analysis. Results of GREAT analysis. Tab A contains all SNPs submitted, separated by trait. Tab B contains the same SNP lists, but with SNPs adjacent to GWAS loci removed. Tab C is a summary of the analysis results. Tabs D-AA are individual analysis results as labeled.

Additional file 8: Table S8.

Credible Sets. Credible sets and credible set reductions calculated using TSABL models. Each tab is a separate phenotype. Columns include SNP reductions at 0.5, 0.75, 0.9, 0.95, and 0.99 thresholds.

Additional file 9: Table S9.

Credible Sets Summary. A summary of results presented in Table S8, including total SNP counts across all tested phenotypes and totals grouped by trait.

Additional file 10: Table S10.

All SNPs. All SNPs used in any model building. The first three columns contain SNP information: Group, rsID, and hg19 chromosome and position. Remaining columns indicate trait model. For each SNP-model combination, table notes if SNP was Not Used, used as training negative, used as training positive, used as holdout negative, or used as holdout positive. 

Additional file 11: Table S11.

Autoimmune testing.

Additional file 12: Table S12.

Autoimmune training.

Additional file 13: Table S13.

BC testing.

Additional file 14: Table S14.

BC training.

Additional file 15: Table S15.

Birthweight testing.

Additional file 16: Table S16.

Birthweight training.

Additional file 17: Table S17.

BMD testing.

Additional file 18: Table S18.

BMD training.

Additional file 19: Table S19.

 BMI testing.

Additional file 20: Table S20.

BMI training.

Additional file 21: Table S21.

BP testing.

Additional file 22: Table S22.

BP training.

Additional file 23: Table S23.

CAD testing.

Additional file 24: Table S24.

CAD training.

Additional file 25: Table S25.

ECG testing.

Additional file 26: Table S26.

ECG training.

Additional file 27: Table S27.

Height testing.

Additional file 28: Table S28.

Height training.

Additional file 29: Table S29.

Lipids testing.

Additional file 30: Table S30.

Lipids training.

Additional file 31: Table S31.

Menarche testing.

Additional file 32: Table S32.

Menarche training.

Additional file 33: Table S33.

Platelets testing.

Additional file 34: Table S34.

Platelets training.

Additional file 35: Table S35.

RBC testing.

Additional file 36: Table S36.

RBC training.

Additional file 37: Table S37.

S_BD testing.

Additional file 38: Table S38.

S_BD training.

Additional file 39: Table S39.

T2D testing.

Additional file 40: Table S40.

T2D training.

Additional file 41: Table S41.

WBC testing.

Additional file 42: Table S42.

WBC training.

Additional file 43: Table S43.

WHRadjBMI testing.

Additional file 44: Table S44.

WHRadjBMI training.

Additional file 45: Table S45.

LDSC SNPs removed. Lists of SNPs removed between Table S4 and Table S5.

Additional file 46: Supplementary Figure 1.

Model Building Flow Chart. To build models using GWAS loci, we grouped lead variants by LD, then split into training and holdout sets semi-randomly, enforcing an even distribution of positive loci and maintaining the distribution of variant counts between the sets. For modelling, we included only positive loci which had a sufficient number of negative loci with similar locus variant count, which was enforced in both training and holdout sets. After models were built, they were assessed by building ROC plots in the holdout sets. Supplementary Figure 2. Holdout AUCs vs # Selected Features and # Groups used for Training. Model AUC in holdout set plotted against A) number of features selected by the model B) number of positives groups used for training C) % total groups containing a coding variant and D) % training groups containing a coding variant. Panels C and D include a linear model best fit line as well as significance. In all panels, models which passed selection criteria (AUC ≥ 0.7, AUC change between training and holdout ≤ 0.02) are green while models which were not selected are purple. Supplementary Figure 3. Model AUCs on same holdout sets. Each panel shows the named holdout set: A) Autoimmune, B) Lipids, C) Red Blood Cells and D) White Blood Cells. AUCs for all six selected models are shown on all plots, with the left (All) plot having all loci in that holdout and the right (No Overlap) having only those loci not positive in multiple models. Supplementary Figure 4. Selected Model Feature Coefficients. A) Autoimmune, B) Red Blood Cells, and C) White Blood Cells model coefficients, sorted by tissue type and value. GE – Gene Expression. Supplementary Figure 5. Enrichment of FDR < 0.01 SNPs within 25kb of GWAS variants. GWAS tested for platelets model: mean platelet volume (MPV) and platelet count (PLT); for RBC model: hematocrit (HCT), mean corpuscular volume (MCV), red blood cell count (RBC), and red cell distribution width (RDW); for WBC model: basophil count (BAS), eosinophil count (EOS), monocyte count (MON) and neutrophil count (NEU).

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Lorenz, K., Thom, C.S., Adurty, S. et al. TSABL: Trait Specific Annotation Based Locus predictor. BMC Genomics 23, 444 (2022).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: