Multi-trait GWAS for diverse ancestries: mapping the knowledge gap

Background Approximately 95% of samples analyzed in univariate genome-wide association studies (GWAS) are of European ancestry. This bias toward European ancestry populations in association screening extends to other analyses and methods, which are often developed and tested on European ancestry data only. However, existing data in non-European populations, which are often of modest sample size, could benefit from innovative approaches, as recently illustrated in the context of polygenic risk scores.

Methods Here, we extend our multi-trait GWAS pipeline, JASS (Joint Analysis of Summary Statistics), to the analysis of non-European ancestries and assess its potential limitations and gains. To this end, we conducted the joint GWAS of 19 hematological traits and glycemic traits across five ancestries (European (EUR), admixed American (AMR), African (AFR), East Asian (EAS), and South-East Asian (SAS)).

Results We detected 367 new genome-wide significant associations in non-European populations (15 in admixed American (AMR), 72 in African (AFR) and 280 in East Asian (EAS)). New associations detected represent 5%, 17% and 13% of associations in the AFR, AMR and EAS populations, respectively. Overall, multi-trait testing increases the replication of European associated loci in non-European ancestries by 15%. Pleiotropic effects were highly similar at significant loci across ancestries (e.g. the mean correlation between multi-trait genetic effects of EUR and EAS ancestries was 0.88). For hematological traits, strong discrepancies in multi-trait genetic effects are tied to known evolutionary divergences, such as the ACKR1 locus, which is adaptive against P. vivax-induced malaria.

Conclusions Multi-trait GWAS can be a valuable tool to narrow the genetic knowledge gap between European and non-European populations.

Supplementary Information The online version contains supplementary material available at 10.1186/s12864-024-10293-3.


Supplementary Note 3: Highly correlated Z-scores under the null coupled to imputation lead to an inflation of the multi-trait test
For the SAS ancestry specifically, the number of significant hits for the multi-trait (joint) test increased after imputation to almost twice the number observed before imputation (see table below), an increase inconsistent with the proportion of imputed SNPs. Imputation performance and sanity checks for this ancestry were comparable to those of the other ancestries; in particular, imputation did not increase the number of significant hits for the univariate test. We therefore hypothesized that the inflation was a consequence of a specific interaction between the imputation and the joint test.

To test this hypothesis, we generated Z-score vectors under the null (following a multivariate normal distribution with mean 0 and covariance matrix given by the intercepts of the LD score regression) and introduced 20% of imputed SNPs by adding a random imputation error consistent with the one observed in our data: normally distributed noise with a standard deviation of 0.1 (comparable to the median imputation error reported in Table S3). We report the distribution of the omnibus test p-values for these Z-scores across the five ancestries (see Figure S5) and note a large inflation specific to the SAS ancestry.
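The simulation described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the equicorrelated covariance matrices, the number of traits (6), the number of SNPs and the independence of the noise across traits are assumptions made for the example, whereas the 20% imputed fraction and the noise standard deviation of 0.1 come from the text. The omnibus statistic is taken here to be the quadratic form z' Ω⁻¹ z, whose expectation under the null equals the number of traits.

```python
import numpy as np

def omnibus_stat(z, omega):
    """Quadratic form z' Omega^{-1} z, computed for each row (SNP) of z."""
    omega_inv = np.linalg.inv(omega)
    return np.einsum("ij,jk,ik->i", z, omega_inv, z)

def simulate_null(omega, n_snps=20000, imputed_frac=0.2, noise_sd=0.1, seed=0):
    """Draw null Z-scores, perturb a fraction of SNPs, return omnibus statistics."""
    rng = np.random.default_rng(seed)
    k = omega.shape[0]
    z = rng.multivariate_normal(np.zeros(k), omega, size=n_snps)
    # Mark a random subset of SNPs as "imputed" and add independent Gaussian noise
    # to each of their Z-scores (assumption: noise is independent across traits).
    imputed = rng.random(n_snps) < imputed_frac
    z[imputed] += rng.normal(0.0, noise_sd, size=(imputed.sum(), k))
    return omnibus_stat(z, omega)

def equicorrelated(k, rho):
    """Hypothetical stand-in for an LD score regression intercept matrix."""
    return np.full((k, k), rho) + (1.0 - rho) * np.eye(k)

k = 6
# Ratio of the mean statistic to its null expectation (k); 1 means no inflation.
ratio_low = simulate_null(equicorrelated(k, 0.2)).mean() / k
ratio_high = simulate_null(equicorrelated(k, 0.999)).mean() / k
print(f"moderate correlation: {ratio_low:.2f}, near-collinear: {ratio_high:.2f}")
```

Under these assumptions the expected statistic increases by imputed_frac × noise_sd² × trace(Ω⁻¹), so the same small noise that is harmless for a well-conditioned matrix can produce a large inflation when Ω is close to singular.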
We investigated the level of collinearity of the covariance matrices of Z-scores under the null ($\Omega$). As derived in [4], for a pair of traits $i$ and $j$ the expected covariance under the null is equal to $\Omega_{ij} = \rho_{ij} N_{s,ij} / \sqrt{N_i N_j}$, where $\rho_{ij}$ is the total covariance between traits $i$ and $j$, $N_{s,ij}$ the number of samples shared between studies $i$ and $j$, and $N_i$ and $N_j$ the sample sizes of studies $i$ and $j$. Hence, the covariance is impacted by one factor that is relatively constant across ancestries (the observed covariance between the two traits) and by another factor that varies strongly with the composition of the meta-analysis: the sample overlap between studies. For the SAS ancestry, all samples for hematological traits come from the UK Biobank, which implies a total sample overlap.
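To make the role of sample overlap concrete, the relation above can be written for a pair of traits i and j as ρ_ij · N_s,ij / sqrt(N_i · N_j), with ρ_ij the total covariance between the traits, N_s,ij the number of shared samples and N_i, N_j the study sample sizes. The sketch below evaluates it for illustrative values; the function name, the symbols and the numbers are assumptions chosen for the example and are not taken from the study.

```python
import numpy as np

def null_covariance(rho, n_shared, n):
    """Expected covariance of Z-scores under the null for each pair of traits.

    rho      : (k, k) total covariance (correlation) between traits
    n_shared : (k, k) number of samples shared between studies i and j
    n        : (k,)   sample size of each study
    """
    rho = np.asarray(rho, dtype=float)
    n_shared = np.asarray(n_shared, dtype=float)
    n = np.asarray(n, dtype=float)
    omega = rho * n_shared / np.sqrt(np.outer(n, n))
    # A study fully overlaps with itself, so the diagonal is 1 by construction.
    np.fill_diagonal(omega, 1.0)
    return omega

rho = np.array([[1.0, 0.9], [0.9, 1.0]])   # illustrative trait correlation
n = np.array([5000, 5000])                 # illustrative sample sizes

# Total overlap: both studies use the same 5000 individuals.
omega_full = null_covariance(rho, np.full((2, 2), 5000), n)
# Partial overlap: only half of the individuals are shared.
omega_half = null_covariance(rho, np.array([[5000, 2500], [2500, 5000]]), n)
print(omega_full[0, 1], omega_half[0, 1])
```

With a total overlap and equal sample sizes the off-diagonal term equals the trait correlation itself, whereas any reduction in overlap shrinks it toward zero, which is why a single-cohort ancestry is the least favorable case for collinearity.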
To assess the impact of sample overlap on collinearity, we computed the condition number of this matrix in each ancestry (see table above). The condition number is defined here as the square root of the ratio of the largest eigenvalue to the smallest eigenvalue of the covariance matrix. We observed the highest condition number for the SAS ancestry, confirming the higher collinearity of the covariance under the null for this ancestry. In this specific setting (highly correlated phenotypes coupled to a total sample overlap), the omnibus test lacks robustness. Hence, we decided to exclude the SAS ancestry from subsequent analyses.
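The condition number as defined above can be computed in a few lines. This is a sketch on hypothetical equicorrelated matrices (6 traits, correlations of 0.3 and 0.95), not on the matrices estimated in the study; for such a matrix the result has the closed form sqrt((1 + (k − 1)ρ) / (1 − ρ)).

```python
import numpy as np

def condition_number(omega):
    """Square root of the ratio of the largest to the smallest eigenvalue."""
    # eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
    eigenvalues = np.linalg.eigvalsh(omega)
    return float(np.sqrt(eigenvalues[-1] / eigenvalues[0]))

def equicorrelated(k, rho):
    """Hypothetical covariance matrix with a common pairwise correlation rho."""
    return np.full((k, k), rho) + (1.0 - rho) * np.eye(k)

cond_moderate = condition_number(equicorrelated(6, 0.3))
cond_strong = condition_number(equicorrelated(6, 0.95))
print(f"rho = 0.30: {cond_moderate:.2f}, rho = 0.95: {cond_strong:.2f}")
```

A value close to 1 indicates nearly independent Z-scores, and the value grows without bound as the matrix approaches singularity.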

Figure S4. Comparison of direct and LD score regression estimates of the Pearson correlations between hematological traits. Matrices in the first row depict direct estimates of the Pearson correlation between hematological traits in the UK Biobank for individuals from the Indian, Nigerian and Caribbean ancestries. The second row displays the same correlations estimated from summary statistics derived in the same individuals; LD score regression was applied to these summary statistics to derive the correlation estimates. The third row displays the covariance under the null for the SAS and AFR ancestries used in this study, and a scatter plot of LD score regression correlation estimates against direct estimates of the correlation.

Figure S5. Distribution of p-values under the null hypothesis for the five ancestries initially considered (SAS, EUR, AMR, AFR, EAS), with data simulated to mimic imputation error.

Figure S6. Manhattan plots of the multi-trait GWAS for hematological traits in each population. The dashed line represents the 5 × 10⁻⁶ threshold and the solid line represents the 5 × 10⁻⁸ threshold.

Figure S7. Manhattan plots of the multi-trait GWAS for glycemic traits in each population. The dashed line represents the 5 × 10⁻⁶ threshold and the solid line represents the 5 × 10⁻⁸ threshold.

Figure S9. Quadrant plots of the glycemic trait GWAS for each ancestry. The -log10(p-value) of the most significant SNP per region for the omnibus test (y-axis) is plotted against the -log10(p-value) of the most significant SNP per region across all univariate GWAS. Complete results are presented in the left sub-panel, and a zoom around the genome-wide significance threshold is presented in the right sub-panel.

Figure S10. Quadrant plots of the hematological trait GWAS for the TRANS ancestry analysis. The -log10(p-value) of the most significant SNP per region for the omnibus test (y-axis) is plotted against the -log10(p-value) of the most significant SNP per region across all univariate GWAS. Complete results are presented in the left sub-panel, and a zoom around the genome-wide significance threshold is presented in the right sub-panel.

Figure S11. QQ plot of the p-values of the joint analysis of hematological traits for each population.

Figure S12. QQ plot of the p-values of the joint analysis of glycemic traits for each population.

Figure S13. QQ plot of the p-values of the joint analysis of hematological traits for the TRANS ancestry analysis.

Figure S16. Fraction of significant European loci that remain specific (i.e., found only in the European ancestry) after simulating a diminished sample size for the European GWASs. We compared the simulated European multi-trait GWAS on hematological traits to the corresponding analysis in the AMR, EAS and AFR ancestries at their full sample size. Each line type corresponds to one non-European ancestry. The y-axis represents the fraction of loci detected in the simulated European GWASs that were absent in the ancestry of comparison. The x-axis is the sample size of the simulated European GWAS relative to that of the GWAS of comparison.

Figure S17. Manhattan plot and corresponding multi-trait signal heatmap for the TRANS and all ancestry multi-trait GWASs, with p-values from the univariate test. For ancestry-specific GWAS, dot color represents the LD with the lead SNP (smallest p-value) in each panel. Under each Manhattan plot, the normalized genetic effects of the SNPs (Z-scores) are reported as a heatmap. Hematological trait order: LYMPH, NEUT, MCV, EO, MONO, RBC, HGB, MCH, HCT, WBC, BASO, MCHC, PLT, MPV, RDW.

Figure S18. Differentially expressed genes in four populations. Differentially expressed gene (DEG) sets are based on the 30 general tissue types of GTEx v8. Significantly enriched DEG sets (Pbon < 0.05) are highlighted in red. Tissues are ordered by up-regulated DEG p-value.