A phenomics-based approach for the detection and interpretation of shared genetic influences on 29 biochemical indices in southern Chinese men

Background: Phenomics provides new technologies and platforms as a systematic phenome-genome approach. However, few studies have reported on the systematic mining of shared genetics among clinical biochemical indices based on phenomics methods, especially in China. This study aimed to apply phenomics to systematically explore shared genetics among 29 biochemical indices based on the Fangchenggang Area Male Health and Examination Survey (FAMHES) cohort. Results: A total of 1,999 subjects with 29 biochemical indices and 709,211 single nucleotide polymorphisms (SNPs) were subjected to phenomics analysis. Three bioinformatics methods, namely, the Pearson test, the Jaccard index, and linkage disequilibrium score regression, were used. The results showed that the 29 biochemical indices formed a network. IgA, IgG, IgE, IgM, HCY, AFP and B12 were in the central community of the 29 biochemical indices. Key genes and loci associated with metabolism traits were further identified; shared-genetics analysis showed that 29 SNPs (P < 10⁻⁴) were associated with three or more traits. After integrating the SNPs related to two or more traits with the GWAS catalog, 31 SNPs were found to be associated with several diseases (P < 10⁻⁸). Taking ALDH2 as an example to preliminarily explore its biological function, we also confirmed that the rs671 (ALDH2) polymorphism affected multiple traits of osteogenic and adipogenic differentiation in 3T3-L1 preadipocytes. Conclusion: All these findings indicate a network of shared genetics and 29 biochemical indices, which will help in understanding the genetics involved in biochemical metabolism.

Background
Complex traits are the product of various biological signals and some intermediate traits that may be affected either directly or indirectly by these signals [1]. A phenome is the sum of many phenotypic characteristics (phenomics traits) that signifies the expression of the whole genome, proteome and metabolome under a specific environmental influence [2,3]. The study of phenomes (called phenomics) provides a suite of new technologies and platforms that have enabled a transition from focused phenotype-genotype studies to a systematic phenome-genome approach [4]. Many recent studies have found that, compared with considering only the binary contrast of patients vs. healthy controls, mapping intermediate steps in disease processes, such as various disease-related clinical quantitative traits or gene expression, is more informative [5,6].
Pleiotropy, in which a single DNA variant or mutation affects multiple traits, is a common phenomenon in genetics [7]. For example, Joseph Pickrell and colleagues [8] performed genome-wide association studies (GWAS) of 42 traits or diseases to compare the genetic variants associated with multiple phenotypes, and identified 341 loci associated with multiple traits. Heid IM et al [9] performed a GWAS of fasting insulin, high-density lipoprotein cholesterol (HDL-C) and triglyceride (TG) levels and identified 53 loci associated with a limited capacity to store fat in a healthy way; this multi-trait approach could increase the power to gain insights into an otherwise difficult-to-grasp phenotype.
Furthermore, much evidence indicates that diseases or clinical quantitative traits can be interconnected. For example, an increase in circulating fatty acids (FAs) could lead to the development of obesity-associated metabolic complications, such as insulin resistance [10]. Goh et al [11] found that essential human genes tended to encode hub proteins and were widely expressed in multiple tissues. Many shared genetic variants have been identified in linkage disequilibrium with variants associated with other human traits or diseases, and these pleiotropic connections link human traits together [8,12]. Therefore, understanding the complex relationships among human traits and diseases is important for learning about the molecular function of hub genes. Although the effects of genetic factors and gene polymorphisms on biochemical indices have been reported, the network among the biochemical indices themselves, and between the indices and genotypes, remains unclear. With rapid advances in bioinformatics techniques, clarifying the network of biochemical indices and genotypes has become feasible.
The aim of this study was to identify the shared genetics responsible for 29 biochemical indices in the FAMHES cohort using a phenomics approach, and to shed light on the relationships between these 29 biochemical indices, including their shared genetic basis and genetic risk loci.

Genetic and trait-based characteristics of 1,999 samples
A total of 1,999 subjects with 29 biochemical indices that passed the QC call rate of 95% were analyzed, and a total of 709,211 SNPs in these subjects were subjected to the subsequent genetic analysis. The average GWAS inflation factor for all 29 biochemical indices was 1.029 (range: 0.975-1.060), suggesting that the correction for population stratification worked well (Table S1). The heatmaps based on the Pearson correlation coefficient showed that 106 correlated pairs were found among these 29 traits (correlation coefficient over 0.3 or less than -0.3 and P value less than 0.01) (Figure 1). In addition, cluster analysis with hclust in R classified these 29 biochemical indices into 2 groups, one group including blood urea nitrogen (

Correlation analysis based on network medicine
For each trait, we performed a GWAS using a linear mixed model with fixed-effect estimates, adjusted for PC1 and PC2 of population stratification and for age. A total of 86,556 SNPs (P < 1×10⁻³) associated with all 29 biochemical indices were obtained and then annotated using the SNP Function database with default parameters and the south Asian population option [18]. A total of 12,521 genes were obtained, and protein-protein interactions were determined using the BioGRID database [19]. A total of 5,313 genes with known proteins were obtained, and the interaction network was built with Cytoscape 3 [20]. The topological coefficient, clustering coefficient and degree distribution are important indices for evaluating network nodes. Details of these three factors for the 5,313 genes are shown in Figure S2 (A, B, C, D).
The Jaccard correlation matrix heatmaps showed that 63 of the 435 pairwise combinations among these traits were correlated, with an MCI over 0.6 (Figure 2). Among these pairs, HCY, IgG, SHBG, B12, IgA and C4 were each closely related to more than six other traits. However, because the information on gene/protein interactions in public databases is limited, interaction information for most of the genes/proteins in this study could not be obtained, and the Jaccard index was computed based on a small number of genes/proteins.

Correlation analysis based on linkage disequilibrium score regression (LDSC)
Genetics can help to elucidate cause and effect. However, single variants tend to have minor effects, and reverse causation involves an even smaller list of confounding factors.
Therefore, interrogating genetic overlap via GWAS that focuses on genome-wide significant SNPs is expected to be an effective means of mining the correlations between different phenotypes. The GWAS effect size estimate for a given SNP will capture information about nearby SNPs in linkage disequilibrium with it [21]. The correlations based on GWAS of the 29 quantitative clinical traits were estimated using cross-trait LDSC. Genetic correlations were estimated for all 435 pairwise combinations among these traits.
After removing the outlier values, 68 significantly correlated pairs (P < 0.05) were found (Figure 3). The details for these 68 selected pairs of traits are shown in Table S2.

Integration and interpretation of important pairs identified by these three methods
To integrate the correlated pairs identified by the three methods, we retained trait pairs meeting at least one of the following criteria: a Pearson coefficient greater than 0.3 or less than -0.3 with a P value less than 0.01, a Jaccard coefficient greater than 0.6, or an LDSC P value less than 0.05. In total, 208 correlated pairs among the biochemical indices were found; 106, 63 and 68 pairs were identified by the Pearson coefficient, Jaccard coefficient and LDSC, respectively. Only 1 pair was identified by all three methods; 10 pairs were identified by both the Pearson coefficient and LDSC, 15 by both the Pearson and Jaccard coefficients, and 5 by both the Jaccard coefficient and LDSC (Figure S3, A). Six traits (IgA, IgG, HCY, AFP, IgE and B12) were the top-ranked factors in the network of these 29 indices, each related to more than 20 traits. Additionally, IgM, CRP, C4, BUN, TG, Creatinine and FSH were second-tier factors, each connected with 15-20 traits, and OSTEOC, Estradiol, Glucose, FOL, TE, SHBG, FERR, BMI, ALT and HDL were third-tier traits, each correlated with more than 10 traits (Figure S3, B).
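The selection rule and the overlap counts above can be illustrated with a minimal Python sketch. The function names, the dictionary layout and the toy values are assumptions made for illustration and are not the authors' code. The second function applies the inclusion-exclusion principle; the reported counts are consistent with it if each two-method overlap is read as including the single pair found by all three methods.

```python
def select_pairs(pearson, jaccard, ldsc):
    """Return the sets of trait pairs passing each criterion.

    pearson: dict mapping a trait pair to (r, p_value)
    jaccard: dict mapping a trait pair to its Jaccard coefficient (MCI)
    ldsc:    dict mapping a trait pair to the LDSC p-value
    (hypothetical input layout, chosen for this sketch)
    """
    by_pearson = {k for k, (r, p) in pearson.items() if abs(r) > 0.3 and p < 0.01}
    by_jaccard = {k for k, mci in jaccard.items() if mci > 0.6}
    by_ldsc = {k for k, p in ldsc.items() if p < 0.05}
    return by_pearson, by_jaccard, by_ldsc


def union_size(n_a, n_b, n_c, n_ab, n_ac, n_bc, n_abc):
    """Size of the union of three sets by inclusion-exclusion.

    The pairwise overlaps (n_ab, n_ac, n_bc) must include the triple overlap.
    """
    return n_a + n_b + n_c - n_ab - n_ac - n_bc + n_abc


# Toy data with made-up trait labels and values
pearson = {("T1", "T2"): (0.45, 0.001), ("T1", "T3"): (0.10, 0.40)}
jaccard = {("T1", "T2"): 0.70, ("T2", "T3"): 0.20}
ldsc = {("T1", "T3"): 0.01}

p_set, j_set, l_set = select_pairs(pearson, jaccard, ldsc)
all_pairs = p_set | j_set | l_set  # pairs meeting at least one criterion

# Reported counts: Pearson 106, Jaccard 63, LDSC 68;
# Pearson+Jaccard 15, Pearson+LDSC 10, Jaccard+LDSC 5; all three 1
total = union_size(106, 63, 68, 15, 10, 5, 1)
```

With the reported counts, `total` evaluates to 208, which matches the number of correlated pairs stated in the text.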

Genes and SNPs that are potentially important across multiple traits
We selected SNPs with P < 10⁻³ for each trait, resulting in a total of 60,644 SNPs for all 27 traits. Essential genes tend to be expressed in multiple tissues, and are topologically and functionally central [12]. After integrating all 5,313 genes and removing the free nodes in the total network among the 29 biochemical indices, 427 genes (each with at least one SNP at P < 10⁻³) were correlated with more than 5 traits. After filtering for genes with SNPs at P < 10⁻⁴, 71 genes were correlated with 3 or more traits, notably aldehyde dehydrogenase 2 family member (ALDH2), BRCA1 associated protein (BRAP), cadherin 13 (CDH13) and CUB and Sushi multiple domains 1 (CSMD1), each of which was related to more than 5 traits. Of these 71 genes, 38 were connected to more than 5 other genes in the interaction network annotated from the BioGRID database [19] (Table S3). This suggests that essential genes related to multiple traits tend to be located centrally in the gene interaction network.
Among all the genome-wide SNPs, 481 (P < 1×10⁻³) were associated with three or more clinical biochemical quantitative traits, and 13 of these 481 SNPs were related to more than 5 traits. Among these SNPs, rs12229654 (near cut like homeobox 2 (CUX2)), rs2188380 (located in CUX2), rs3809297 (located in CUX2) and rs3782886 (located in BRAP) were related to more than 10 traits. Six SNPs in CUX2 were correlated with more than 5 traits, which suggests that CUX2 may play an important role in this network. In addition, among all the SNPs with P < 1×10⁻⁴, 29 SNPs were related to three or more biochemical indices (Figure 4). After annotating these 29 SNPs using the HaploReg database [22], we found that almost all of them were related to enhancer histone binding, promoter DNase binding and transcript binding, which affected protein binding or the presence of eQTL (Table S4).
After integrating the SNPs associated with more than 2 traits (P < 1×10⁻⁴) with the GWAS catalog [23], we found that 31 SNPs in 18 genes were present in the GWAS catalog (Table S5).
Among those SNPs, five SNPs (rs579459, rs649129, rs507666, rs495828, and rs651007) in ABO were associated with more than 10 quantitative traits and diseases. One SNP (rs671) in ALDH2 was related to 21 traits, and six SNPs (rs10519302, rs16964211, rs2305707, rs2414095, rs6493487 and rs727479) in or near CYP19A1 were associated mainly with hormone measurements. This finding supports the idea that shared genetics can produce correlations among traits.
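As an illustration of how SNPs shared across traits can be tallied from per-trait GWAS hit lists, the following minimal Python sketch counts, for each SNP, the number of traits in which it passes the P-value threshold. The function name, the input layout and the placeholder SNP and trait labels are assumptions made for this example, not the pipeline used in the study.

```python
from collections import Counter


def snps_shared_across_traits(hits_by_trait, min_traits=3):
    """Return {snp: number_of_traits} for SNPs hit in at least min_traits traits.

    hits_by_trait: dict mapping a trait name to an iterable of SNP IDs that
    already passed the chosen threshold (for example P < 1e-4) for that trait.
    """
    counts = Counter()
    for snps in hits_by_trait.values():
        counts.update(set(snps))  # set() so a SNP counts once per trait
    return {snp: n for snp, n in counts.items() if n >= min_traits}


# Placeholder labels only; these are not results from the study
hits_by_trait = {
    "trait_1": ["snp_a", "snp_b", "snp_c"],
    "trait_2": ["snp_a", "snp_b"],
    "trait_3": ["snp_a", "snp_c", "snp_c"],
    "trait_4": ["snp_a"],
}

shared = snps_shared_across_traits(hits_by_trait, min_traits=3)
```

In this toy input only `snp_a` reaches the three-trait cutoff; lowering `min_traits` to 2 would also return `snp_b` and `snp_c`.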

The rs671 polymorphism in ALDH2 affects osteogenic and adipogenic differentiation of 3T3-L1 preadipocytes
In this study, the SNP rs671 in ALDH2 was found to be related to 13 traits. Relations between rs671 and lipid metabolism or osteocalcin have been reported in the literature [24,25]; however, the underlying function remains to be investigated. rs671 is a nonsynonymous (ns) SNP (Glu504Lys, written G504L here) in the ALDH2 gene, which is located on chromosome 12.
To evaluate the effects of the rs671 polymorphism on osteogenic and adipogenic differentiation of 3T3-L1 preadipocytes, a lentivirus vector was used to overexpress ALDH2-WT or ALDH2-G504L-mut in 3T3-L1 preadipocytes ( Figure S4). The cell growth curve of ALDH2-G504L-mut showed no obvious change compared with that of the control, but expression of ALDH2-WT induced a significant increase in cell proliferation ( Figure 5A).
The cell apoptosis results were consistent with this finding; overexpression of ALDH2-WT (peroxisome proliferator-activated receptor), were much higher in ALDH2-WT cells than in ALDH2-G504L-mut or control cells ( Figure 5I). Taken together, these results suggest that ALDH2-G504L-mut affected the osteogenic and adipogenic differentiation of 3T3-L1 preadipocytes.

Discussion
Immunoglobulins are produced by plasma cells and lymphocytes, are characteristic of these cell types, and play an essential role in the body's immune system. as polycystic ovarian syndrome [47], coronary heart disease [48], and its correlation with coronary artery disease (CAD). The ALDH2 gene, located on 12q24.12, encodes aldehyde dehydrogenase, the second enzyme of the major oxidative pathway of alcohol metabolism. rs671 is a nonsynonymous mutation located in exon 12. The rs671 mutation was found to be associated with several traits (BMI, osteocalcin, renal function-related traits [49], response to alcohol consumption [50,51]

Measurements of 29 Biochemical indices
Overnight (≥8 h) fasting venous blood specimens were obtained between 7:00 am and

Jaccard coefficient
Phenotypes are linked if they share genetic alterations. The pathobiology of human diseases might be understood by creating molecular and phenotypic networks [55,56]. We used the SNP Function tool [18] (https://snpinfo.niehs.nih.gov/) to identify the genes containing all of the SNPs for which the GWAS P value was less than 1×10⁻³. The human interactome was obtained from protein-protein interaction (PPI) information in the BioGRID database [19].
We built the correlations among the 29 clinical phenomes based on the common genes/proteins between two traits. To minimize the bias in estimating the correlation between two given traits, we calculated the molecular comorbidity index (MCI) by adapting the formula from Grosdidier S [57] to further consider the different coefficients of distance between the two diseases. The MCI was defined as follows:
$$\mathrm{MCI} = \frac{\left| (P_1 \cap P_2) \cup P_{1 \to 2} \cup P_{2 \to 1} \right|}{\left| P_1 \cup P_2 \right|}$$
where $P_1$ and $P_2$ are the proteins related to clinical traits 1 and 2, respectively, and $P_{1 \to 2}$ are those proteins related to trait 1 that interact with the proteins associated with trait 2 (and vice versa for $P_{2 \to 1}$). The two operators ∩ and ∪ denote the intersection and union between the two sets of elements ($P_1$ and $P_2$), respectively.
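A minimal Python sketch of this index is given below. It assumes the MCI takes the form described by Grosdidier et al. [57]: the proteins shared by two traits, plus the proteins of each trait that directly interact with a protein of the other trait, divided by the union of both protein sets. The function name, the argument layout and the toy proteins are hypothetical, and any additional distance weighting used in the study is not modelled here.

```python
def molecular_comorbidity_index(proteins_1, proteins_2, interactions):
    """Jaccard-like index extended with direct protein-protein interactions.

    proteins_1, proteins_2: iterables of protein/gene IDs for traits 1 and 2
    interactions: iterable of (protein_a, protein_b) undirected PPI pairs
    """
    p1, p2 = set(proteins_1), set(proteins_2)
    edges = {frozenset(pair) for pair in interactions}

    def interacting(source, target):
        # Members of `source` with at least one PPI partner in `target`
        return {a for a in source
                if any(a != b and frozenset((a, b)) in edges for b in target)}

    p1_to_2 = interacting(p1, p2)
    p2_to_1 = interacting(p2, p1)

    numerator = (p1 & p2) | p1_to_2 | p2_to_1
    denominator = p1 | p2
    return len(numerator) / len(denominator) if denominator else 0.0


# Toy example: one shared protein (C) and one cross-trait interaction (A-D)
mci = molecular_comorbidity_index(
    proteins_1={"A", "B", "C"},
    proteins_2={"C", "D", "E"},
    interactions=[("A", "D")],
)
```

In the toy example the numerator is {A, C, D} and the union has five proteins, so the index is 3/5, which sits exactly at the 0.6 cutoff and would therefore not pass a strict "greater than 0.6" rule.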

Correlation analysis by LDSC
Genetic correlations derived from summary statistics are estimated from the GWAS effect size of a given SNP, which integrates the effects of all SNPs in linkage disequilibrium (LD) with that SNP. LDSC (which targets genetic correlation) uses variants across the whole genome and is a symmetrical (i.e., nondirectional) analysis for the risk factor and the outcomes [21]. In short, LDSC relies on the fact that, for polygenic traits, the association statistic of a SNP also captures information about nearby SNPs in LD with it. This relationship between LD and the association signal can also be used to test the relationship between two traits across all SNPs in the genome. To further elucidate the correlations of these 29 biochemical indices in FAMHES from the perspective of genetic architecture, we applied LDSC to estimate the genetic correlations among these 29 traits. Table S6.
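For reference, the regression underlying cross-trait LDSC can be written as below. This is the standard form from the LD score regression literature rather than an equation taken from this manuscript, and the symbols are introduced here for illustration: $z_{1j}$ and $z_{2j}$ are the GWAS z-scores of SNP $j$ for traits 1 and 2, $\ell_j$ is its LD score, $N_1$ and $N_2$ are the sample sizes, $N_s$ is the number of overlapping samples, $M$ is the number of SNPs, $\rho_g$ is the genetic covariance, $\rho$ is the phenotypic correlation in the overlapping samples, and $h_1^2$, $h_2^2$ are the SNP heritabilities.

```latex
\mathbb{E}\left[ z_{1j} z_{2j} \right]
  = \frac{\sqrt{N_1 N_2}\,\rho_g}{M}\,\ell_j
  + \frac{\rho\, N_s}{\sqrt{N_1 N_2}},
\qquad
r_g = \frac{\rho_g}{\sqrt{h_1^2\, h_2^2}}
```

The slope of the regression of $z_{1j} z_{2j}$ on $\ell_j$ estimates the genetic covariance, while sample overlap affects only the intercept. When both traits are measured in the same cohort, as is the case for a single-cohort study, $N_s$ equals the full sample size.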

Statistical analysis
The correlations among the 29 biochemical indices were computed by the CORR procedure

Ethics approval and consent to participate
The study was approved by the Ethical Committee of Guangxi Medical University.

Consent for publication
All the authors are aware of and approve the manuscript as submitted to BMC Genomics.

Availability of data and materials
All data generated or analyzed during this study are included in this published article and its supplementary information files.

Supplementary Files
This is a list of supplementary files associated with the primary manuscript: Table S1.pdf, Table S2.pdf, Table S3.pdf, Table S4.pdf, Table S6.pdf