Assisted clustering of gene expression data using regulatory data from partially overlapping sets of individuals

Jiang, Wenqing; Joehanes, Roby; Levy, Daniel; O’Connor, George T; Dupuis, Josée

doi:10.1186/s12864-022-09026-1

BMC Genomics

Table 8 Clustering accuracy under various simulation coefficient settings with fixed \(p=500\)

Parameters				Median Accuracy Measure \({M}_{accuracy}\)
\(n\)	\(q\)	\(K\)	\(pcnt\)	ANCut	ANCut.subset	ANCut.overlap	ANCut.silh	ANCut.elbow	K-means
300	200	3	1/3	52.0%	41.7%	43.5%	19% (K = 2)	18.7% (K = 2)	23.9%
1500	200	3	1/3	69.6%	61.8%	63.0%	63% (K = 3)	62.6% (K = 3)	36.3%
300	500	3	1/3	55.0%	38.4%	41.1%	39.9% (K = 3)	41.1% (K = 3)	30.3%
1500	500	3	1/3	64.6%	57.1%	58.3%	58.3% (K = 3)	58.6% (K = 3)	25.3%
300	1000	3	1/3	62.6%	42.8%	44.7%	43.7% (K = 3)	45.3% (K = 3)	44.1%
1500	1000	3	1/3	71.5%	64.7%	67.2%	67.1% (K = 3)	67.6% (K = 3)	28.6%
300	200	5	1/3	66.0%	48.5%	51.8%	10.9% (K = 2)	54.6% (K = 5)	45.0%
1500	200	5	1/3	73.7%	70.3%	72.3%	72.2% (K = 5)	70.5% (K = 6)	31.3%
300	500	5	1/3	61.5%	51.7%	52.6%	46% (K = 4)	33.6% (K = 3)	46.9%
1500	500	5	1/3	67.3%	63.3%	67.9%	66.9% (K = 5)	67.7% (K = 5)	22.6%
300	1000	5	1/3	70.2%	51.2%	55.0%	10.9% (K = 2)	10.5% (K = 2)	53.6%
1500	1000	5	1/3	76.1%	68.8%	74.9%	73.7% (K = 5)	71.4% (K = 5)	38.4%
300	500	3	1/5	59.0%	28.4%	30.9%	17.4% (K = 2)	16.8% (K = 2)	34.6%
1500	500	3	1/5	66.7%	56.3%	58.3%	58.3% (K = 3)	58.3% (K = 3)	24.4%
300	1000	5	1/5	67.1%	41.2%	47.6%	6.2% (K = 2)	6.7% (K = 2)	35.6%
1500	1000	5	1/5	75.2%	60.1%	66.6%	66.8% (K = 5)	66.9% (K = 5)	23.7%
300	500	3	1/9	60.9%	20.4%	26.3%	22.7% (K = 3)	25% (K = 3)	45.1%
1500	500	3	1/9	65.9%	49.6%	53.2%	53.2% (K = 3)	53.3% (K = 3)	34.1%
300	1000	5	1/9	66.9%	37.7%	40.9%	4% (K = 2)	5.2% (K = 2)	40.8%
1500	1000	5	1/9	76.9%	55.9%	56.4%	56.2% (K = 5)	55.9% (K = 5)	25.5%

Column definitions:
ANCut uses Hidalgo’s assisted clustering approach to cluster gene expression (GE), with no missing data [12]. It serves as the “gold standard” when comparing clustering approaches because it uses the largest amount of data: the entire gene expression data matrix and methylation data matrix in Fig. 1. The true number of clusters \(K\) is assumed to be known.
ANCut.subset also uses Hidalgo’s assisted clustering approach to cluster gene expression, but uses only the overlapping individuals who have both GE and methylation data. The true number of clusters \(K\) is assumed to be known.
ANCut.overlap uses the proposed approach, assuming only a subset of the data is available, i.e. the \(X\) and \(Y\) matrices in Fig. 1. With the \(\left(pcnt*n\right)\) overlapping individuals, we can construct a regression model between GE (\({Y}^{O}\)) and the methylation regulators (\({X}^{O}\)) to improve GE clustering. The true number of clusters \(K\) is assumed to be known.
ANCut.silh uses the proposed approach (ANCut.overlap) with the Silhouette method to select the optimal number of clusters.
ANCut.elbow uses the proposed approach (ANCut.overlap) with the Elbow method to select the optimal number of clusters.
K-means uses the K-means method to cluster GE, using only the \(Y\) matrix (with missing data) in Fig. 1. The true number of clusters \(K\) is assumed to be known.
Clustering accuracy under additional simulation coefficient settings is summarized in the supplementary materials.
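
As an illustration of how a Silhouette-based choice of \(K\), as used by ANCut.silh, works in general, the following minimal Python sketch computes the mean silhouette width for several candidate labelings and keeps the \(K\) with the largest value. It is not the authors' implementation: the toy points, the hand-written candidate labelings, the Euclidean distance, and the function names `mean_silhouette` and `select_k` are all assumptions made for this example. In the paper the labelings would come from ANCut.overlap rather than being supplied by hand.

```python
from math import dist  # Euclidean distance, Python 3.8+


def mean_silhouette(points, labels):
    """Mean silhouette width over all points (Euclidean distance)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # usual convention: a singleton contributes s(i) = 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


def select_k(points, candidate_labelings):
    """Return (best K, scores) where best K maximizes the mean silhouette."""
    scores = {
        k: mean_silhouette(points, labels)
        for k, labels in candidate_labelings.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical toy data: three well-separated groups of three points each.
points = [
    (0, 0), (0, 1), (1, 0),
    (10, 10), (10, 11), (11, 10),
    (20, 0), (20, 1), (21, 0),
]
# Hypothetical candidate labelings for K = 2, 3, 4 (hand-written here;
# in practice each would be the output of a clustering run with that K).
candidates = {
    2: [0, 0, 0, 1, 1, 1, 1, 1, 1],
    3: [0, 0, 0, 1, 1, 1, 2, 2, 2],
    4: [0, 0, 0, 1, 1, 1, 2, 2, 3],
}

best_k, scores = select_k(points, candidates)
print(best_k, scores)
```

On this toy data the three-group labeling scores highest, because merging two groups inflates the within-cluster distances and splitting a group shrinks the distance to the nearest other cluster. The rows of the table where ANCut.silh selects \(K=2\) show that, on harder data, the maximum silhouette need not coincide with the true \(K\).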


ISSN: 1471-2164
