Polygenic risk score model for renal cell carcinoma in the Korean population and relationship with lifestyle-associated factors

Background The polygenic risk score (PRS) is used to predict the risk of developing common complex diseases or cancers using genetic markers. Although PRS is used in clinical practice to predict breast cancer risk, it is more accurate for Europeans than for non-Europeans because of the sample size of training genome-wide association studies (GWAS). To address this disparity, we constructed a PRS model for predicting the risk of renal cell carcinoma (RCC) in the Korean population. Results Using GWAS analysis, we identified 43 Korean-specific variants and calculated the PRS. Subsequent to plotting receiver operating characteristic (ROC) curves, we selected the 31 best-performing variants to construct an optimal PRS model. The resultant PRS model with 31 variants demonstrated a prediction rate of 77.4%. The pathway analysis indicated that the identified non-coding variants are involved in regulating the expression of genes related to cancer initiation and progression. Notably, favorable lifestyle habits, such as avoiding tobacco and alcohol, mitigated the risk of RCC across PRS strata expressing genetic risk. Conclusion A Korean-specific PRS model was established to predict the risk of RCC in the underrepresented Korean population. Our findings suggest that lifestyle-associated factors influencing RCC risk are associated with acquired risk factors indirectly through epigenetic modification, even among individuals in the higher PRS category. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-024-09974-w.


Background
Renal cell carcinoma (RCC) accounts for 90% of kidney cancers and ranks as the seventh most common cancer in the western world; it constitutes approximately 3% of all cancer diagnoses worldwide [1,2]. In Asia, the incidence of RCC has increased due to the adoption of western lifestyles [3]. Well-known risk factors for RCC include smoking, excessive weight, and hypertension [4,5]. Additionally, heritability plays a role in certain rare syndromes with predisposed germline mutations in genes such as VHL, FH, and MET [6,7].
When diagnosed at an early stage, RCC is usually asymptomatic and detected incidentally. Early detection through screening is crucial for reducing the morbidity and mortality associated with RCC [8,9]. Several prediction models based on clinical, biochemical, historical, and lifestyle markers have been developed and validated to predict the diagnosis, grade, stage, and progression of several cancers, including RCC [10]. Similarly, polygenic risk score (PRS) models that use genetic markers to predict the risk of cancers have demonstrated sufficient predictive power, thereby enabling individualized risk management [11,12].
Genomic architecture and predisposed allele frequencies vary among different ancestries [13]. PRS models utilizing genetic factors predict individual risk more accurately in Europeans than in non-Europeans, primarily because the majority of genetic discoveries are made within European populations [14]. Europeans represent the largest ethnicity in training genome-wide association studies (GWAS) globally, accounting for 91% of the data, followed by East Asians at 4.9% [15]. Consequently, the accuracy of Asian-specific PRS models is affected by the relatively smaller sample size of genetic studies conducted in Asian populations, which lowers precision when estimating the relative risk for each individual [16]. To address this issue, we conducted a GWAS for RCC using genomic data from 992 cases and 3,431 controls in the Korean population.
Favorable lifestyle factors, such as avoiding tobacco and alcohol, following a healthy diet, and engaging in moderate physical activity, serve as an optimal approach to prevent and manage cancers or complex diseases [17]. Numerous studies have revealed that favorable lifestyle factors can mitigate the risk of cancer among individuals with high genetic risk [18-20]. The aim of this study is to identify RCC-susceptible germline variants specific to Koreans, construct a Korean PRS model to assess the risk of developing RCC based on these variants, and evaluate the performance of the PRS model. Furthermore, this study examined whether lifestyle-associated factors interact with the genetic risk expressed as PRS.

Study participants
This study involved 4,991 Korean individuals. We included the cases of 1,120 patients with RCC who were registered in the Seoul National University Prospectively Enrolled Registry for RCC-Nephrectomy (SUPER-RCC-Nx) and had their blood stored in the human biobank [21]. The control group consisted of 3,871 participants from the Ansan/Ansung study of the Korean Genome and Epidemiology Study (KoGES), a population-based prospective cohort study [22]. The baseline survey for the KoGES was conducted in 2001-2002, and a follow-up survey was carried out biennially for 14 years. The participants were selected based on specific criteria, excluding participants diagnosed with any cancer during the baseline survey and those diagnosed with kidney diseases during the follow-up survey. Genotyping was performed using the KoreanChip array, the same array used by the Korean National Institute of Health to genotype KoGES samples.

Korea biobank array (KoreanChip)
KoreanChip comprises more than 833,000 markers, among which 208,000 are functional markers that have been directly genotyped. These data were collected from an extensive dataset of 22 million variants identified in 2,576 sequenced Korean samples. The dataset encompasses 397 whole-genome sequences from the Korean Reference Genome, along with 2,179 whole-exome sequences sourced from various places, such as the T2D-GENES consortium, the Ansung and Ansan study, a cardiovascular disease sequencing study, and the Korean Children and Adolescents Obesity Cohort study [23].

Quality control (QC)
QC was performed on both samples and variants. Individuals with sex inconsistencies were excluded; the genotype-inferred sex of an individual was considered inconclusive when the X-chromosome homozygosity rate was greater than 0.2 but less than 0.8. Samples with a call rate < 95%, excessive heterogeneity, or genetic relatedness were removed. Single nucleotide polymorphisms (SNPs) with a call rate < 95%, minor allele frequency (MAF) < 5%, or Hardy-Weinberg Equilibrium (HWE) p-value < 1.0 × 10^-6 were also excluded. Batch effect corrections were conducted for cases [24]. The subsequent step involved correcting the batch effects that arose between cases and controls. Importantly, regulations state that results obtained with KoreanChip must be normalized with 5,000 samples registered in the Korean consortium. Consequently, even though cases and controls underwent separate genotyping in different laboratories, they were effectively normalized to each other according to this regulation, which eliminated batch effects. To assess the effect of population substructure, principal component analysis (PCA) was performed before and after merging the datasets of the cases and controls. QC was completed using a combination of R v4.2, Plink v1.9, and bcftools git version 1.17-10 [25].
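As an illustration of the variant-level filters described above, the following minimal Python sketch applies the three stated thresholds (call rate, MAF, and HWE p-value) to a single biallelic SNP. It is not the pipeline used in this study, which relied on Plink and bcftools. The function name `variant_qc` and its parameters are hypothetical, and the HWE p-value is computed with a chi-square approximation (1 degree of freedom) rather than an exact test, which is an assumption made for brevity.

```python
import math


def variant_qc(genotypes, min_call_rate=0.95, min_maf=0.05, min_hwe_p=1e-6):
    """Apply call-rate, MAF, and HWE filters to one biallelic SNP.

    genotypes: alternate-allele counts (0, 1, 2) per sample, None if missing.
    Returns (passes, stats). Hypothetical helper for illustration only.
    """
    called = [g for g in genotypes if g is not None]
    n = len(called)
    call_rate = n / len(genotypes) if genotypes else 0.0
    if n == 0:
        return False, {"call_rate": call_rate, "maf": 0.0, "hwe_p": 1.0}

    # Observed genotype counts: homozygous ref, heterozygous, homozygous alt.
    observed = (called.count(0), called.count(1), called.count(2))
    q = (observed[1] + 2 * observed[2]) / (2 * n)  # alternate-allele frequency
    p = 1.0 - q
    maf = min(p, q)

    # Chi-square goodness-of-fit against HWE expectations (approximation).
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    hwe_p = math.erfc(math.sqrt(chi2 / 2))  # chi-square survival function, 1 df

    passes = call_rate >= min_call_rate and maf >= min_maf and hwe_p >= min_hwe_p
    return passes, {"call_rate": call_rate, "maf": maf, "hwe_p": hwe_p}
```

A SNP with genotype counts in exact Hardy-Weinberg proportions (for example 25/50/25) passes all three filters, whereas a SNP with no heterozygotes at the same allele frequency is rejected by the HWE filter.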

Imputation for missing values
Variants that were not directly genotyped or were excluded during QC were imputed using Minimac4. Phasing was performed using Eagle v2.4. The ancestry was limited to East Asians, with the 1000 Genomes Project phase 3 as the reference genome panel. We retained imputed variants with a genotype quality R² > 0.8 [26]. Post-imputation QC was conducted by applying the exclusion criteria of an MAF < 5% and an HWE p-value < 1.0 × 10^-6. The percentage of imputed data after the post-QC step was 92.72%.

Statistical analysis for SNP selection
The samples were divided into two datasets: discovery and validation. Of the 4,915 samples remaining after QC, 492 (approximately 10%) were randomly extracted as the validation dataset, and the remaining 4,423 samples were retained as the discovery dataset. Association testing with RCC was conducted for the discovery dataset. Logistic regression was performed for the GWAS with covariates, including age, sex, body mass index (BMI), hypertension, and smoking. The associated SNPs were filtered using a p-value threshold of 1.0 × 10^-5 and a false discovery rate (FDR) of 0.05. LD pruning and fine mapping methods were used to identify causal SNPs for predicting RCC risk [27]. Hail 0.2 was used for statistical analysis.
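The two-step filtering of association results can be sketched as follows. This is an illustrative example only: the text does not state which FDR procedure was applied, so the Benjamini-Hochberg procedure is assumed here, and the function names `benjamini_hochberg` and `select_snps` as well as the SNP identifiers in the usage note are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


def select_snps(results, p_threshold=1e-5, fdr=0.05):
    """Keep SNPs whose p-value is below the threshold and whose q-value <= fdr.

    results: dict mapping SNP identifier -> GWAS p-value.
    """
    snp_ids = list(results)
    q_values = benjamini_hochberg([results[s] for s in snp_ids])
    return [s for s, q in zip(snp_ids, q_values)
            if results[s] < p_threshold and q <= fdr]
```

For example, `select_snps({"rsA": 1e-7, "rsB": 2e-6, "rsC": 0.2, "rsD": 5e-5})` keeps only `rsA` and `rsB`, because `rsD` passes the FDR criterion but not the p-value threshold.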

PRS calculation and optimal performance
The PRS model was constructed with the validation dataset, using causal SNPs selected from the GWAS results. The PRS was calculated as follows:

$$\mathrm{PRS}_j = \sum_{i=1}^{N} \beta_i \times \mathrm{dosage}_{ij}$$

where PRS_j is the risk score for individual j, dosage_ij is the number of risk alleles for the i-th variant, β_i is the natural logarithm of the odds ratio [ln(OR)] (or effect size, beta) of the i-th variant, and N is the number of SNPs in the score [28].
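The score defined by these quantities can be illustrated with a short Python sketch. It assumes the score is the sum over variants of ln(OR) multiplied by the risk-allele dosage, which is consistent with the definitions above. The function name `polygenic_risk_score` and the example odds ratios are hypothetical and are not taken from the study results.

```python
import math


def polygenic_risk_score(dosages, odds_ratios):
    """Weighted sum of risk-allele dosages for one individual.

    dosages: risk-allele counts (0, 1, or 2; imputed values may be fractional).
    odds_ratios: per-variant odds ratios, in the same order as dosages.
    """
    if len(dosages) != len(odds_ratios):
        raise ValueError("dosages and odds_ratios must have the same length")
    # beta_i = ln(OR_i); the score is the sum of beta_i * dosage_ij over variants.
    return sum(math.log(or_i) * d for d, or_i in zip(dosages, odds_ratios))


# Hypothetical individual carrying 2, 1, and 0 risk alleles at three variants.
example_score = polygenic_risk_score([2, 1, 0], [1.5, 2.0, 1.2])
```

Because the weights are log odds ratios, the score is additive on the log-odds scale: in the example, the result equals ln(1.5² × 2.0).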
To compare the performance of the PRS models, we removed one SNP at a time, starting from the SNP with the highest p-value; for each resulting set of SNPs, a receiver operating characteristic (ROC) curve was plotted and the area under the curve (AUC) was calculated. The optimal PRS cut-off value was selected at the point of the maximal Youden's index (sensitivity + specificity − 1). These analyses were performed using Plink v1.9 and the pROC package in R.
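The selection procedure above can be sketched in Python as follows. This is a simplified illustration rather than the Plink/pROC workflow used in the study. The AUC is computed through its rank-based (Mann-Whitney) equivalence, the cut-off rule classifies a score at or above the cut-off as a predicted case, and all function names are hypothetical.

```python
def auc(scores, labels):
    """AUC as the probability that a case outscores a control (ties count 0.5)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


def youden_cutoff(scores, labels):
    """Return (cut-off, J) maximizing J = sensitivity + specificity - 1."""
    n_cases = sum(labels)
    n_controls = len(labels) - n_cases
    best_cut, best_j = None, -1.0
    for cut in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cut)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < cut)
        j = tp / n_cases + tn / n_controls - 1
        if j > best_j:
            best_cut, best_j = cut, j
    return best_cut, best_j


def backward_elimination(dosage_rows, labels, betas, p_values):
    """AUC for each SNP set obtained by dropping the highest p-value SNP first.

    dosage_rows: one list of dosages per individual, ordered like betas.
    Returns a list of (number_of_snps, auc) pairs.
    """
    order = sorted(range(len(betas)), key=lambda i: p_values[i])
    results = []
    for k in range(len(order), 0, -1):
        keep = order[:k]  # the k most significant SNPs
        scores = [sum(betas[i] * row[i] for i in keep) for row in dosage_rows]
        results.append((k, auc(scores, labels)))
    return results
```

The SNP set with the largest AUC in the returned list would correspond to the optimal model, and `youden_cutoff` can then be applied to the scores of that model.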

Association of PRS and lifestyle-associated factors with RCC risk
We selected BMI, smoking status, alcohol intake, and history of hypertension as lifestyle-associated factors related to RCC risk. Although a favorable lifestyle score is commonly calculated by considering obesity, tobacco use, alcohol intake, diet, and physical activity as lifestyle-associated factors, we replaced diet and physical activity with history of hypertension considering our present data and previous studies related to RCC risk [29,30]. A favorable lifestyle was indicated by BMI < 30 kg/m², no smoking, moderate alcohol intake, and no history of hypertension (see Additional File 1: Table S1). We assigned one point to each favorable lifestyle-associated factor. We categorized combined lifestyle scores into Ideal (favorable lifestyle score of 3 or 4), Intermediate (favorable lifestyle score of 2), and Poor (favorable lifestyle score of 0 or 1). PRS distributions were categorized into Low (0-40%), Intermediate (40-90%), and High (> 90%). We explored the association of favorable lifestyle-associated factors and PRS with RCC risk and further investigated the relationship between lifestyle-associated factors and RCC risk across the strata of PRS using a Cox proportional hazard model.
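The scoring and stratification rules above can be expressed as a brief Python sketch. The function and parameter names are hypothetical. The exact definition of moderate alcohol intake is given in Table S1 and is represented here only as a boolean flag, and the handling of values falling exactly on the 40% and 90% boundaries is an assumption, since the text does not specify it.

```python
def lifestyle_category(bmi, smoker, moderate_alcohol, hypertension):
    """One point per favorable factor; 3-4 Ideal, 2 Intermediate, 0-1 Poor."""
    score = (
        int(bmi < 30)                 # BMI < 30 kg/m^2
        + int(not smoker)             # no smoking
        + int(bool(moderate_alcohol)) # moderate alcohol intake (per Table S1)
        + int(not hypertension)       # no history of hypertension
    )
    if score >= 3:
        return score, "Ideal"
    if score == 2:
        return score, "Intermediate"
    return score, "Poor"


def prs_stratum(score, cohort_scores):
    """Assign Low (0-40%), Intermediate (40-90%), or High (>90%) by percentile.

    The percentile is the share of cohort scores at or below the given score;
    boundary values are assigned to the lower stratum (an assumption).
    """
    percentile = 100 * sum(1 for s in cohort_scores if s <= score) / len(cohort_scores)
    if percentile <= 40:
        return "Low"
    if percentile <= 90:
        return "Intermediate"
    return "High"
```

For instance, a non-smoking individual with a BMI of 24, moderate alcohol intake, and a history of hypertension receives a score of 3 and falls in the Ideal category.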

Discovery phase findings
This study included 4,915 Koreans who were divided into two groups to identify risk variants and construct the PRS model. The discovery dataset comprised 992 cases and 3,431 controls, whereas the validation dataset comprised 112 cases and 380 controls. Although RCC can occur at any age, this study focused only on participants aged ≥ 40 years to examine the common effects of these factors on RCC risk (Table 1).
Batch effect correction was performed to address the technical variations or non-biological differences between measurements in different sample groups. Substantial correction of the case dataset was performed. Additionally, to assess the effect of the population substructure, PCAs were performed before and after merging the cases and controls. No specific population substructure was observed (see Additional File 1: Figure S1).

Korean PRS construction for RCC risk and biological process of 31 variants
The Korean-specific PRS model was constructed using 43 SNPs on 492 Korean participants. The maximal AUC value for the PRS model was 77.4% when 31 variants out of 43 were selected (Fig. 2). Although the individual effect sizes were not large, the aggregate of the weighted effect sizes of the 31 SNPs showed a high prediction rate. Of the 31 variants in the PRS model, 15 variants were in the intronic region, 15 in the intergenic region, and 1 downstream (Table 2; see Additional File 1: Figure S3). We annotated these variants with the genes they regulated to investigate whether they were associated with RCC risk. Functions and pathways of the genes regulated by the 15 variants in the intronic region are listed in Table 3.

Relevance of lifestyle-associated factors to RCC risk across PRS strata
We categorized the combined lifestyle score as Ideal, Intermediate, and Poor and the PRS as Low, Intermediate, and High for 492 individuals. In the Cox proportional hazard model with combined lifestyle scores and RCC risk, the Poor lifestyle category (HR = 3.81, 95% CI: 2.33-6.22) carried a risk approximately 3.8 times that of the Ideal lifestyle category. A high genetic risk (PRS) was significantly associated with the RCC risk (HR = 10.22, 95% CI: 5.11-20.45). When lifestyle factors associated with the risk of RCC were stratified by PRS in the Cox proportional hazard model, the risk of RCC was higher in the Poor lifestyle score category across PRS strata (Fig. 3).
overlap with our Korean-specific variants [6]. To the best of our knowledge, this study is the first to construct a PRS model to predict the risk of RCC in the underrepresented Korean population.

Non-coding DNA variants and biological mechanisms
Fifteen of the 31 Korean-specific variants identified in this study are intronic variants that indirectly contribute to cancer initiation and progression. These intronic variants regulate genes by acting as enhancers, repressors, or promoters, and are involved in biological functions and pathways associated with the development of cancers by exerting oncogenic or tumor-suppressive effects in multiple organs [31]. Well-annotated pathways were related to the genes affected by the variants implicated in RCC. For example, the RPTOR gene, located in the 17q25.3 region, codes for a subunit of the mTORC1 complex, which is crucial for regulating various cellular processes, such as assembly, localization, and substrate binding of mTORC1. The PI3K/AKT/mTOR signaling pathway is an intracellular pathway that plays a vital role in cell cycle regulation, including the G0 phase and cell proliferation. PI3K, a lipid kinase, produces phosphatidylinositol-3,4,5-trisphosphate, a key second messenger that facilitates AKT translocation to the plasma membrane. AKT activation is central to fundamental cellular functions, such as cell proliferation and survival, as it phosphorylates various substrates. Dysregulation of this pathway is frequently observed in human cancers, particularly in RCC, and has been linked to aggressive tumor development and reduced survival rates [32-34]. The SUSD5 protein, encoded by the SUSD5 gene in the 3p22.3 region, is expected to have hyaluronic acid-binding activity and play a role in the Notch signaling pathway. Notch signaling is crucial in regulating cell fate, proliferation, and death during development. It operates mainly between adjacent cells, as its ligands are transmembrane proteins. Despite its simplicity in intracellular signaling with no secondary messengers, the Notch pathway is part of various developmental processes, and its dysfunction is implicated in many cancers, including RCC [35,36].

Relationship between lifestyle-associated factors and genetic risk expressed as PRS
Both lifestyle-associated factors and PRS were significantly associated with RCC risk, and lifestyle-associated factors affected RCC risk across PRS strata. However, Cox proportional hazard analysis showed no evidence that lifestyle-associated factors and PRS directly interacted with each other. Numerous studies have recently reported the relationship between epigenetic markers and lifestyle-associated factors, such as stress, smoking, alcohol use, and diet [37]. Various environmental factors epigenetically remodel the genome without altering its DNA sequence. Epigenetic markers influence the modulation of gene expression and thus play a critical role in health status and prevention of cancers and complex diseases [38].
The PKA-stimulated degradation of GRIP1 leads to changes in the expression of a subset of genes regulated by estrogen receptor-α in MCF-7 breast cancer cells [54].
The TBC domain family is implicated in various cellular events contributing to initiation and development of different cancers [55].
The remaining 15 of the 31 Korean-specific variants identified in this study were intergenic variants. Many intergenic variants can affect gene regulation through epigenetic modifications, such as chromatin remodeling or histone modifications, including methylation or acetylation. Modulated expression of oncogenes and tumor suppressor genes affects cancer development [39]. In the present The open chromatin region is accessible and has a less condensed chromatin structure, facilitating the binding of transcription factors and other regulatory proteins to the DNA. The SEMA3C gene, in closest proximity to rs73149350, contributes to the promotion of cancer cell growth [40]. Therefore, rs73149350 may regulate SEMA3C expression through processes such as chromatin remodeling or histone modification. This regulatory effect could have implications for RCC risk. However, further studies are needed to fully understand the biological mechanisms underlying the regulation of genes by these intergenic variants. These findings suggest that lifestyle-associated factors may indirectly affect acquired risk factors through epigenetic modulation [41].

Limitations and future directions
This study has certain limitations. First, we did not perform additional pathway or biological mechanism analysis of the intergenic variants. Without these analyses, the biological relevance of these variants in the context of RCC risk may remain unclear. Second, epigenetic association studies should be conducted to draw more accurate inferences. We must investigate the specific epigenetic mechanisms through which lifestyle-associated factors, such as stress, smoking, alcohol use, and diet, influence gene expression and how these modifications are related to RCC risk. This investigation could involve detailed epigenome-wide association studies to identify specific epigenetic changes associated with lifestyle factors. Further in-depth studies are required to explore the relationship between lifestyle-associated factors and genetic risk. These studies should consider incorporating such analyses to gain a deeper understanding of the underlying biology and potentially develop clinical applications.

Conclusion
The aim of the present study was to construct a Korean-specific PRS model that predicts the risk of RCC development and to explore the association of lifestyle-associated factors with the genetic factor influencing RCC risk. To mitigate the impact of ethnicity, GWAS analysis was exclusively performed on the underrepresented Korean population, leading to the identification of Korean-specific variants associated with RCC risk. The Korean-specific PRS model was constructed with 31 identified variants and demonstrated a robust prediction rate of 77.4%. Among the 31 variants, 15 intronic variants indirectly contributed to cancer initiation and progression through their involvement in key biological functions and pathways such as the PI3K/AKT/mTOR or Notch signaling pathway. The 15 intergenic variants potentially impact gene regulation through epigenetic modifications such as methylation or histone modification. Epigenetic modification is known to be influenced by environmental factors, including lifestyle-associated factors. Furthermore, we investigated the association between lifestyle-associated factors, namely BMI, smoking status, alcohol intake, and history of hypertension, and the risk of RCC development. Our results suggest that lifestyle-associated factors may indirectly influence acquired risk factors through epigenetic modification. However, further studies that delve deeper into these complex interactions and facilitate a comprehensive understanding of the interplay between genetic factors and lifestyle-associated factors in relation to RCC risk are warranted.

PRS model for
Fig. 2 PRS distribution of 31 Korean-specific SNPs and evaluation of PRS performance. The PRS was constructed based on 31 specific SNPs in the Korean population. (a) Density plot showing the different distribution of the PRS in cases and controls. (b) ROC curve for evaluating PRS performance. SNP, single nucleotide polymorphism; PRS, polygenic risk score; RCC, renal cell carcinoma; ROC, receiver operating characteristic

Fig. 3 Risk of RCC according to genetic and lifestyle-associated factors. The risk of RCC was affected by genetic and lifestyle-associated factors. (a) Association of genetic factor with RCC risk. (b) Association of lifestyle-associated factors with RCC risk. (c) Association of lifestyle-associated factors with the risk of RCC across strata of PRS. HR, hazard ratio; CI, confidence interval; N, number; RCC, renal cell carcinoma; PRS, polygenic risk score; p, p-value

Table 3 Intronic variants and biological processes (n = 14)
This gene involves the Notch signaling pathway. Notch is the receptor in a highly conserved signaling pathway that is crucial in development and implicated in malignant transformation [42].