
Table 8 Clustering accuracy under various simulation coefficient settings with fixed \(p=500\)

From: Assisted clustering of gene expression data using regulatory data from partially overlapping sets of individuals

The first four columns are the simulation parameters; the remaining columns report the median accuracy measure \({M}_{accuracy}\) for each method.

| \(n\) | \(q\) | \(K\) | \(pcnt\) | ANCut | ANCut.subset | ANCut.overlap | ANCut.silh | ANCut.elbow | K-means |
|---|---|---|---|---|---|---|---|---|---|
| 300 | 200 | 3 | 1/3 | 52.0% | 41.7% | 43.5% | 19% (K = 2) | 18.7% (K = 2) | 23.9% |
| 1500 | 200 | 3 | 1/3 | 69.6% | 61.8% | 63.0% | 63% (K = 3) | 62.6% (K = 3) | 36.3% |
| 300 | 500 | 3 | 1/3 | 55.0% | 38.4% | 41.1% | 39.9% (K = 3) | 41.1% (K = 3) | 30.3% |
| 1500 | 500 | 3 | 1/3 | 64.6% | 57.1% | 58.3% | 58.3% (K = 3) | 58.6% (K = 3) | 25.3% |
| 300 | 1000 | 3 | 1/3 | 62.6% | 42.8% | 44.7% | 43.7% (K = 3) | 45.3% (K = 3) | 44.1% |
| 1500 | 1000 | 3 | 1/3 | 71.5% | 64.7% | 67.2% | 67.1% (K = 3) | 67.6% (K = 3) | 28.6% |
| 300 | 200 | 5 | 1/3 | 66.0% | 48.5% | 51.8% | 10.9% (K = 2) | 54.6% (K = 5) | 45.0% |
| 1500 | 200 | 5 | 1/3 | 73.7% | 70.3% | 72.3% | 72.2% (K = 5) | 70.5% (K = 6) | 31.3% |
| 300 | 500 | 5 | 1/3 | 61.5% | 51.7% | 52.6% | 46% (K = 4) | 33.6% (K = 3) | 46.9% |
| 1500 | 500 | 5 | 1/3 | 67.3% | 63.3% | 67.9% | 66.9% (K = 5) | 67.7% (K = 5) | 22.6% |
| 300 | 1000 | 5 | 1/3 | 70.2% | 51.2% | 55.0% | 10.9% (K = 2) | 10.5% (K = 2) | 53.6% |
| 1500 | 1000 | 5 | 1/3 | 76.1% | 68.8% | 74.9% | 73.7% (K = 5) | 71.4% (K = 5) | 38.4% |
| 300 | 500 | 3 | 1/5 | 59.0% | 28.4% | 30.9% | 17.4% (K = 2) | 16.8% (K = 2) | 34.6% |
| 1500 | 500 | 3 | 1/5 | 66.7% | 56.3% | 58.3% | 58.3% (K = 3) | 58.3% (K = 3) | 24.4% |
| 300 | 1000 | 5 | 1/5 | 67.1% | 41.2% | 47.6% | 6.2% (K = 2) | 6.7% (K = 2) | 35.6% |
| 1500 | 1000 | 5 | 1/5 | 75.2% | 60.1% | 66.6% | 66.8% (K = 5) | 66.9% (K = 5) | 23.7% |
| 300 | 500 | 3 | 1/9 | 60.9% | 20.4% | 26.3% | 22.7% (K = 3) | 25% (K = 3) | 45.1% |
| 1500 | 500 | 3 | 1/9 | 65.9% | 49.6% | 53.2% | 53.2% (K = 3) | 53.3% (K = 3) | 34.1% |
| 300 | 1000 | 5 | 1/9 | 66.9% | 37.7% | 40.9% | 4% (K = 2) | 5.2% (K = 2) | 40.8% |
| 1500 | 1000 | 5 | 1/9 | 76.9% | 55.9% | 56.4% | 56.2% (K = 5) | 55.9% (K = 5) | 25.5% |

Column definitions:

  1. ANCut uses Hidalgo’s assisted clustering approach to cluster gene expression (GE), with no missing data [12]. This serves as the “gold standard” against which the other clustering approaches are compared, because it uses the most data: the entire gene expression data matrix and methylation data matrix in Fig. 1. The true number of clusters \(K\) is assumed to be known.
  2. ANCut.subset also uses Hidalgo’s assisted clustering approach to cluster GE, but uses only the overlapping individuals that have both GE and methylation data. The true number of clusters \(K\) is assumed to be known.
  3. ANCut.overlap uses the proposed approach, assuming only a subset of the data is available, i.e. the \(X\) and \(Y\) matrices in Fig. 1. With the \(pcnt \times n\) overlapping individuals, we can construct a regression model between GE (\({Y}^{O}\)) and the methylation regulators (\({X}^{O}\)) to improve GE clustering. The true number of clusters \(K\) is assumed to be known.
  4. ANCut.silh uses the proposed approach (ANCut.overlap) with the Silhouette method to select the optimal number of clusters. The selected \(K\) is shown in parentheses.
  5. ANCut.elbow uses the proposed approach (ANCut.overlap) with the Elbow method to select the optimal number of clusters. The selected \(K\) is shown in parentheses.
  6. K-means uses the K-means method to cluster GE, using only the \(Y\) matrix (with missing data) in Fig. 1. The true number of clusters \(K\) is assumed to be known.

Clustering accuracy under additional simulation coefficient settings is summarized in the supplementary materials.
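
The Silhouette method named in the ANCut.silh definition can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `mean_silhouette` and `select_k`, the use of Euclidean distance, and the toy data are all assumptions made for illustration. The sketch takes one candidate labelling per value of \(K\) (produced by any clustering method), scores each with the standard mean silhouette width, and picks the \(K\) with the highest score.

```python
import math
from collections import defaultdict


def mean_silhouette(points, labels):
    """Mean silhouette width of a labelling (Euclidean distance is an assumption)."""
    clusters = defaultdict(list)
    for idx, lab in enumerate(labels):
        clusters[lab].append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes s(i) = 0 by convention
        # a: mean distance from point i to the other members of its own cluster
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance from point i to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


def select_k(points, labelings_by_k):
    """Return (best K, scores per K) given one candidate labelling per K (hypothetical helper)."""
    scores = {k: mean_silhouette(points, labels) for k, labels in labelings_by_k.items()}
    return max(scores, key=scores.get), scores


# Toy data (made up for illustration): three well-separated groups of three points.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
toy_labelings = {
    2: [0, 0, 0, 0, 0, 0, 1, 1, 1],  # first two groups merged
    3: [0, 0, 0, 1, 1, 1, 2, 2, 2],  # one cluster per group
    4: [0, 0, 3, 1, 1, 1, 2, 2, 2],  # first group split in two
}
best_k, scores = select_k(toy_points, toy_labelings)
print(best_k, scores)
```

Taking precomputed labellings as input keeps the criterion independent of the clustering step, so the same scoring could in principle be applied to labellings from an ANCut-style method or from K-means. The space in which the paper computes silhouette distances is not stated here, so the choice of input points is left to the caller.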