Study populations
In the discovery stage, our analysis included up to 64,784 participants of self-identified AA (n=16,201), HL (n=21,347), or EA (n=27,236) race/ethnicity from four cohort studies and one biobank (Table 1): the Atherosclerosis Risk in Communities Study (ARIC), the Coronary Artery Risk Development in Young Adults Study (CARDIA), the Hispanic Community Health Study/Study of Latinos (HCHS/SOL), the Women’s Health Initiative (WHI), and the BioMe™ Biobank (BioMe) (Supplemental Table 1). All participants provided informed consent and each study was approved by the Institutional Review Board (Supplemental Methods).
Phenotype measurement and quality control
We studied eight hematological traits as defined in the standard clinical complete blood count (CBC) analysis, measuring properties of white blood cells (WBC, BAS, EOS, LYM, MON, and NEU) and platelets (PLT and MPV). Counts of white blood cells and the five subtype cells as well as platelets were measured using automated hematology cell counters and following standardized laboratory protocols from blood draws at the earliest available visit. Each count was reported in trillions of cells per liter (109/L).
QC of the measured traits were performed before analysis (Supplemental Table 1). When available, participants were excluded if they had ever been diagnosed with HIV or leukemia, were currently pregnant or receiving chemotherapy, or had a severe hereditary anemia (primarily sickle-cell disease, determined by genotype) at time of blood draw. To remove sources of technical and non-genetic biological variation, and thus increase our power to detect genetic associations, we removed outliers exceeding four standard deviations from the mean of each trait in the overall study population. Due to the small proportion of basophils and eosinophils in whole blood, the counts for these two traits are often below the detection limit and were then recorded as zero. Therefore, we randomly imputed a phenotype value from a uniform distribution ranging from 0 to a study-specific lower detection limit (ranging from 0.0067 in the HCHS/SOL to 0.1 in BioMe) for those with a complete blood count measurement that was below the detection limit and used this value in analysis. The assignment of low, but non-zero counts, allows these subjects to be included in the analysis, as it is known that these values are in fact (very) low.
Genotyping, imputation and quality control
Among all included participants, 8831 AA and 19,484 HL participants were genotyped on the MEGA arra y[29], yielding a total of 1,705,969 genetic variants. Standard QC filters were applied at the individual level as well as the variant level (Supplemental Methods). After QC, variants were further imputed to 1000 Genomes Phase 3 data using SHAPEIT2 and IMPUTE (version 2.3.2), resulting in 39,723,562 imputed SNPs with IMPUTE info score ≥0.4. An additional 36,469 participants (7370 AA, 1863 HL, and 27,236 EA participants) previously genotyped using other Illumina or Affymetrix arrays were also included in the analysis, again using standard QC procedures (Supplemental Methods). Variants that passed QC were imputed to the 1000 Genomes Phase 3 reference panel in each study. We further excluded variants on a study-specific basis which had poor imputation quality (info score< 0.4) or an effective sample size < 35 (calculated as 2×MAF×(1-MAF)×N×info score, where MAF is minor allele frequency and N is sample size). In the ancestry-combined, AA-specific, HL-specific, and EA-specific samples, 61, 53, 57, and 64% of variants had allele frequencies below 1%, respectively.
Statistical analysis
In the discovery stage, we performed both univariate GWAS analysis for each of the eight traits and aSPU simulation-based method which jointly tested all eight traits [40]. For WBC and the five subtypes (BAS, EOS, LYM, MON, and NEU), values were log10 transformed before association analysis. For PLT and MPV, raw values were used. For samples genotyped on the MEGA array, residual values for each trait were calculated from linear regression models after adjustment for age, age2, sex (when applicable), center (when applicable), and the first 10 principal components (PC). For samples previously genotyped on either other Illumina or Affymetrix arrays, residual values for each trait were calculated from linear regression models after adjustment for age, age2, sex (when applicable), center (when applicable), and the first 10 PCs calculated from an LD-pruned set of genotypes in each individual study. In the univariate GWAS analysis, we tested the association of each genetic variant with the rank-based inverse-normally transformed residual values in MEGA samples and in each individual study, respectively. All MEGA samples were pooled together for testing while association testing was performed by study and ancestral group in non-MEGA samples. These association analyses were performed using SUGEN, which is based on generalized estimating equations (GEE) allowing correlated errors for first or second-degree relatives and independent error distributions by self-reported race/ethnic group [41]. Association results from these studies were then combined through fixed-effect inverse-variance-weighted meta-analysis in METAL for each trait [42]. Both ancestry-combined and ancestry-specific meta-analyses were performed. Complete summary level results are available through dbGaP (phs000356).
To identify additional novel loci and evaluate evidence for shared genetic effects across all eight traits, we combined the trans-ethnic meta-analysis results from each univariate trait analysis using aSPU to generate a joint P value for each variant [40]. The aSPU approach uses a simulated reference distribution (based on Monte Carlo simulations [40] ) to evaluate whether the most powerful combination of univariate summary z-scores implies an association between each variant and one or more of the tested traits. In comparison with other available multi-trait methods, we chose aSPU because it exhibited low type 1 error rate in simulations, accommodated direction of effect, and showed computational efficiency enabling the test of millions of variants. We implemented aSPU using Julia 1.0 to optimize efficiency (https://github.com/kaskarn/aspu_julia).
Genome-wide and suggestive significant cutoffs were set as P<2E-9 and P<5E-8, respectively [28, 30]. Using guidelines for frequency-based thresholds [30] we set genome-wide significance at 2E-9 as new discovery in this study was likely to be rare/low-frequency. We additionally used a suggestive cutoff of 5E-8 for relatively common variants with MAF> 5%. Novel loci were defined as those that: (1) reached the genome-wide or suggestive significance threshold; (2) were located more than 1 Mb away from any reported variants associated with any of the eight traits; (3) were available in the pooled MEGA result or were available in at least two non-MEGA studies when not available in the pooled MEGA result. Novel associations were defined as those that: (1) reached the genome-wide or suggestive significance threshold; (2) were located more than 1 Mb away from any reported variants associated with the examined trait but located within 1 Mb from variants previously reported for any of the other hematological traits; (3) were available in the pooled MEGA result or were available in at least two non-MEGA studies when not available in the pooled MEGA result. All novel loci and novel associations exhibiting genome-wide or suggestive significance were moved forward to the replication stage. These novel findings were examined in the publicly available summary statistics from the BCX (http://www.mhi-humangenetics.org/en/resources). There are two studies, BioMe and WHI, that were included in our discovery analysis and the BCX results as well. The fixed effect meta-analysis provided by BCX is a combination of the results of overlapping samples in WHI and BioMe and other remaining samples in BCX. As we know the association results for the overlapping samples in WHI and BioMe we can “invert” the fixed effect meta-analysis reported on the BCX website to obtain the results of all non-overlapping samples in BCX.
To identify distinct association signals at previously reported loci, we performed stepwise conditional analysis in each study, followed by meta-analysis. At each known locus that reached genome-wide significance in the ancestry-combined meta-analysis (P<2E-9), we identified the lead variant with the lowest P value and defined a 2 Mb region centered on the lead variant. At the DARC locus, the examined region was extended to ~ 6.5 Mb due to the extensive LD. We included the genotype dosage for the lead variant in each region as an additional covariate in the regression model. We did not stop the conditional analysis until all variants in each region showed P>2E-9.
Phenotypic variance explained by each genetic variant was calculated using the equation [43]
Explained phenotypic variance = \( \frac{2{\beta}^2 MAF\left(1- MAF\right)}{2{\beta}^2 MAF\left(1- MAF\right)+S{E}^2\ 2N\ MAF\left(1- MAF\right)} \),
where β denotes the effect size of the variant on the associated trait and SE denotes standard error of the effect size.
PrediXcan analysis
PrediXcan is a multi-omics approach for identifying genes associated with a trait of interest [44], which uses a reference database of derived genotype weights to impute unobserved gene expression levels into a set of genotyped samples. Gamazon et al. provided imputation models for gene expression in 48 different human tissues with Genotype-Tissue Expression (GTEx) V7 data, using elastic net regression with all cis-variants (defined as within 1 Mb of the gene) with MAF> 5% [44]. We performed a PrediXcan analysis to identify associations of WBC and PLT levels with these imputed values, representing GREx [44], in five disease-relevant tissues and cell types: whole blood, liver, spleen, thyroid, and Epstein-Barr virus (EBV) transformed lymphocytes. First, GREx in over 28,518 PAGE minority ancestry participants (AA, HL, Asian, and Hawaiian ancestry) genotyped on the MEGA array were imputed. Associations between the GREx and the traits (which were only available for WBC and PLT in AA and HL populations) were then estimated using SUGEN, both in an ancestry-combined and ancestry-specific manner. Genes with FDR < 0.05 in the ancestry-combined analysis were considered significant. Novel genes were defined as those exhibiting FDR < 0.05 and located more than 1 Mb away from any reported variants.
Functional annotation
Functional annotation of the novel findings listed in Table 2 was performed using a comprehensive annotation database constructed from the WGSA [45], gene-centric function (GTEx) and genome-wide functional prediction scores (DANN and Eigen PC)] and a custom UCSC analysis hub visualizing various important regions [enhancer and repressor activities, DNase I hypersensitive sites (DHS) and transcribed regions], which facilitated the prioritization of potential functional genes and variants. Variants with DANN rank score ≥ 0.9 were coded as deleterious [46], and variants with Eigen PC phred score ≥ 17 were coded as functional [47]. Independent signals identified in the stepwise conditional analysis were also examined using this functional database. Custom UCSC bed tracks included the most significant variant of each novel finding and the proxy variants that are in LD (r2≥0.4) with the most significant variant within ±1 Mb region. The LD proxies of the six novel findings were generated using either ancestry-specific or ancestry-combined data (sample size-weighted LD using MEGA AA and HL samples for INSIG1, MED13L, and HADHB, sample size-weighted LD using MEGA AA and HL samples plus EA samples from WHI and ARIC for TG and IGF1, and European-specific LD using EA samples from WHI and ARIC for PPP1R16B). Primary mononuclear cells, monocytes, neutrophils, natural killer cells, T and B cells from peripheral blood and primary hemotapoietic stem cells, the seven most relevant cells to the eight studied white blood cell and platelet traits, were selected to examine chromatin immunoprecipitation-sequencing (ChiP-seq) signals associated with enhancers (H3K27ac and H3K4m1), repressors (H3K27me3), and transcribed regions (H3K36me3) [48].
In addition, we examined our novel findings and their LD proxies, as well as the independent signals at previously reported loci, in two comprehensive data sets combining chromatin accessibility data derived from ATAC-seq in hematopoesis-related cell types and GWAS results for various blood cell traits in the UKBB (EA participants only, https://molpath.shinyapps.io/ShinyHeme/) [34] and the BCX Consortium (ancestrally diverse populations) [23] . These two data sets included variants selected from the fine-mapping analyses (PP < 0.01 in UKBB and < 0.001 in BCX, respectively), which also showed enrichment based on chromatin accessibility of various hematopoietic populations using gchromVAR [34].