Data
Screening of DEGs
The expression profile by high throughput sequencing GSE116250 was downloaded from the GEO (https://www.ncbi.nlm.nih.gov/geo/), which contained 14 normal samples, 13 ICM samples, and 37 DCM samples [41]. All expressed data used in this study were processed using the following process. (1) The probes or genes with more than 50% missing values were deleted, and the remaining missing values were filled with the k-Nearest Neighbor method using the knnImputation function in R package “DMwR”. Specifically, for each missing value, its k nearest expression values were searched based on the Euclidean Distance, and the weighted average of these values was used to fill in the missing value. (2) Probes corresponding to multiple genes were deleted. (3) For multiple probes corresponding to the same gene, the average expression value of these probes was used as the expression value of the gene.
The SAM algorithm was used to find the DEGs between ICM and normal samples, between DCM and normal samples, and between ICM and DCM samples, through the R package “samr”. Finally, 1802 DEGs between DCM and normal (indicated as DCM_NF in this paper) samples (DCM_NF DEGs), 3253 DEGs between ICM and normal (indicated as ICM_NF) samples (ICM_NF DEGs), and 358 DEGs between DCM and ICM (indicated as DCM_ICM) samples (DCM_ICM DEGs) with |log2(FC)| > 1 and FDR adjusted p-value < 0.05 were obtained.
Metabolic network reconstruction
On the basis of metabolic responses extracted from Recon 3 of the Virtual Metabolic Human Database (https://www.vmh.life) [42], a metabolic network composed of protein-coding genes (nodes) and their interactions (edges) was reconstructed. Recon 3 was created by expanding Recon 2 through the addition of new publicly available metabolomics data. The metabolic network was reconstructed by the following process. The two enzymes were thought to interact if the product of the reaction catalyzed by one enzyme was the substrate of the reaction catalyzed by the other enzyme. The genes encoding the proteins that make up the two enzymes were connected in the reconstructed network. Ubiquitous metabolites such as H2O, CO2 and ADP were excluded to avoid bias due to their extreme connections. The reconstructed metabolic network contained 3105 nodes (containing 257 ICM_NF DEGs, 141 DCM_NF DEGs, and 34 DCM_ICM DEGs) and 85,880 edges.
Mining the initial modules
Subsequently, the metabolic network was visualized with the help of Cytoscape software (version 3.7.0). In addition, the MCODE (version 1.6.1) plug-in in Cytoscape software was used to explore important modules in the metabolic network [43]. The advanced options were set to degree cutoff = 2, K-Core = 3, and node score cutoff = 0.2. The modules containing DEGs were screened as initial modules.
Detection of candidate modules
Candidate modules with significant differences were detected from initial modules using two steps.
W was evaluated by the difference between the average of absolute values of Pearson correlation coefficients H and H′ for samples of different states (DCM_NF, ICM_NF, or DCM_ICM).
$$W\left(M\right)=\left|H-H'\right|$$
where H and H′ were calculated according to expression values for all gene pairs.
$$H=\frac{1}{C_g^2}\sum_{j=1}^{C_g^2}\left| Pearson\left({X}_j,{Y}_j\right)\right|=\frac{1}{C_g^2}\sum_{j=1}^{C_g^2}\left|\frac{\sum_{i=1}^n\left({X}_{ji}-{\overline{X}}_j\right)\left({Y}_{ji}-{\overline{Y}}_j\right)}{\sqrt{\sum_{i=1}^n{\left({X}_{ji}-{\overline{X}}_j\right)}^2}\sqrt{\sum_{i=1}^n{\left({Y}_{ji}-{\overline{Y}}_j\right)}^2}}\right|$$
$${H}^{\prime }=\frac{1}{C_g^2}\sum_{j=1}^{C_g^2}\left| Pearson\left({X}_j^{\prime },{Y}_j^{\prime}\right)\right|=\frac{1}{C_g^2}\sum_{j=1}^{C_g^2}\left|\frac{\sum_{i=1}^m\left({X}_{ji}^{\prime }-{\overline{X}}_j^{\prime}\right)\left({Y}_{ji}^{\prime }-{\overline{Y}}_j^{\prime}\right)}{\sqrt{\sum_{i=1}^m{\left({X}_{ji}^{\prime }-{\overline{X}}_j^{\prime}\right)}^2}\sqrt{\sum_{i=1}^m{\left({Y}_{ji}^{\prime }-{\overline{Y}}_j^{\prime}\right)}^2}}\right|$$
\({C}_g^2\) is the number of all gene pairs. n and m are the number of samples of different states (DCM_NF, ICM_NF, or DCM_ICM), and Xj, Yj and \({X}_j^{\prime },{Y}_j^{\prime }\) are the expression values of the j-th gene pair in two different states, \({X}_{ji},{X}_{ji}^{\prime }\) and \({Y}_{ji},{Y}_{ji}^{\prime }\) are the expression values of the j-th gene pair in the i-th sample, and \({\overline{X}}_j,{\overline{X}}_j^{\prime }\) and \({\overline{Y}}_j,{\overline{Y}}_j^{\prime }\) are their average expression value, respectively.
Second, permutation tests were performed on the DCM, ICM and D_I initial modules to screen the modules with significant differences. The null hypothesis was that the initial modules had no difference between different states (DCM_NF, ICM_NF, or DCM_ICM). From the reconstructed metabolic network, 1000 degree-conserved random modules (the same degree of nodes as the initial module) and 1000 size-conserved random modules (the same number of nodes as the initial module) were constructed for each initial module. The Pearson difference scores of every random module (degree-conserved and size-conserved) were calculated and compared with the score of the corresponding initial module, respectively. For each way of randomization, the p values of initial modules were defined as follows.
where t is the number of random modules whose Pearson difference score is less than that of the initial module.
For initial modules with p value < 0.05, the null hypothesis should be rejected, so they were significantly different between different states. The initial modules that were significant in both cases (degree-conserved and size-conserved) were retained as candidate modules (both p values < 0.05).
The performance of candidate modules in classifying samples of different states can further reveal their relationship with cardiomyopathy. A SVM classifier was constructed to classify samples of different states (DCM_NF, ICM_NF, DCM_ICM) with the expression values of all genes in each module as features. The kernel function of SVM was set to “radial”. The performance was evaluated using leave-one-out cross validation (LOOCV). LOOCV draws one sample at a time as the test set, and the rest as the training set. Then the receiver operating characteristic (ROC) curve was drawn, and the AUC was calculated to measure the classification performance according to the classification results of the test sets.
Identification of cardiomyopathy risk modules
Markov random field (MRF) refers to the random field with Markov characteristics, which is often used to build mathematical models to identify protein interaction subnetworks [12, 13]. In our study, an MRF model was used to evaluate the expression difference considering both DEGs and non-DEGs in a candidate module by module score based on MRF (MRFms).
For a candidate module M with g genes, it was assumed that the expression difference E = (E1, …, Eg) between samples of different states formed an MRF. According to the properties of Markov random fields, the expression difference of gene g depends on the difference value of its one-step neighbor genes. Gibbs distribution was employed to specify the joint probability of E:
$$P(E)=\frac{1}{K}{e}^{-\frac{1}{T}F(E)}$$
where K is a constant that guarantees the probability sum to be 1, T is a temperature parameter controlling the distribution sharpness, and
$$F(E)=-\frac{1}{\sqrt{g}}\sum_{i\in {G}_1}{E}_i+\frac{1}{b}\sum_{i,j\in {G}_2}{\left(\frac{E_i}{\sqrt{d_i}}-\frac{E_j}{\sqrt{d_j}}\right)}^2 MI\left(i,j\right)$$
Based on our previous study [44] and the calculation process in [45], the MRFms for module M incorporating Mutual Information (MI) was defined as
$$MRFms(M)=\frac{1}{\sqrt{g}}\sum_{u\in {G}_1}{E}_u-\frac{1}{b}\sum_{v,z\in {G}_2}{\left(\frac{E_v}{\sqrt{d_v}}-\frac{E_z}{\sqrt{d_z}}\right)}^2 MI\left(v,z\right)$$
where b is the number of edges, G1 and G2 are the set of DEGs and non-DEGs in the module; Eu, Ev and Ez are the expression differences of genes u, v and z between different states (DCM_NF, ICM_NF, or DCM_ICM), and dv and dz are the degrees of genes v and z in the network, respectively. MI(v, z) is the mutual information of genes v and z.
Then the same permutation test as in the previous step was used to screen out cardiomyopathy risk modules. Finally, the modules that were significant in both cases (degree-conserved and size-conserved random modules) were identified as cardiomyopathy risk modules (both p values < 0.05).
Identification of core genes
According to the connection between genes and known genes in the network, core genes were further screened in the cardiomyopathy risk modules. From the Online Mendelian Inheritance in Man database [46], 43 known genes of cardiomyopathy were extracted, and 7 of them (PPCS, RAF1, TNNI3K, ABCC9, EYA4, SDHA, and TTN) were in the metabolic network. Then, the shortest paths between known gene pairs were searched, and genes in cardiomyopathy risk modules that appeared on these shortest paths were selected as candidate genes. The number of known gene pairs linked by gene x via these shortest paths B(x) was counted as follows.
$$B(x)=\sum_{s\ne x\ne t}{\sigma}_{st}(x)$$
$${\sigma}_{st}(x)=\left\{\begin{array}{c}1\\ {}0\end{array}\genfrac{}{}{0pt}{}{\ if\ G\left(s,t\right)=G\left(s,x\right)+G\left(x,t\right)}{otherwise}\right.$$
where G (s, t) is the length of the shortest path between two nodes s and t. s and t are known genes for cardiomyopathy in the metabolic network. σst(x) is a variable that indicates whether any shortest path between nodes s and t passes through node x. If so, it is 1, otherwise it is 0.
Genes linked more known gene pairs (top upper quartile) via shortest paths were identified as core genes.
$$C=\left\{x\left|{\mathit{\operatorname{rank}}}_x\in \left[1,\left\lfloor 0.25n\right\rfloor \right],{\mathit{\operatorname{rank}}}_x=\mathit{\operatorname{rank}}\left(B(x), in\ D\right)\right.\right\}$$
where D is the set of B(x) for all x, rankx is the rank of B(x) when ranking all B(x)s in set D in descending order, n is the number of candidate genes.
Evaluation of core genes
In order to reflect the association of core genes with cardiomyopathy, they were analyzed from three aspects: literature verification, enrichment analysis and classification performance. Literature verification was conducted by searching literature showing the relationship between core genes and cardiomyopathy in the PubMed database (https://www.ncbi.nlm.nih.gov/pubmed). Enrichr was used for GO functional annotation and KEGG pathway enrichment of core genes [47]. The PubMed database was also used to validate the association of significantly enriched functional classes and pathways (FDR adjusted p < 0.05) with the disease. The expression values of the core genes were further used as features to classify samples of different states (DCM_NF, ICM_NF, or DCM_ICM) in the expression profile and independent microarray datasets. The classification performance of the core genes and of random gene sets was compared to further evaluate the classification performance of the core genes. Random gene sets were composed of randomly selected differentially and non-differentially expressed genes from the cardiomyopathy risk modules with the same number as the core genes.