Fig. 1 | BMC Genomics

Fig. 1

From: Leveraging global gene expression patterns to predict expression of unmeasured genes

Gene selection (a) and expression prediction (b) workflows. A.1) The workflow starts with the TCGA HGSC Affymetrix gene expression data, which are filtered to remove genes with no or low expression. A.2) A symmetric gene-by-gene correlation matrix is created by calculating pairwise Pearson’s correlation coefficients (rP). A.3) A user-determined threshold is applied to the absolute value of the Pearson’s correlation coefficients (|rP|) to generate a binary adjacency matrix. Here, black indicates that |rP| between two genes does not exceed the threshold, and white indicates that it does. A.4) A greedy geneset selection algorithm iteratively builds a set of genes to measure directly (DM set; red) and a set of genes that are predictable from the DM set (predictable set; blue). The predictable set is defined as those eligible but unmeasured genes that are strongly correlated with at least n DM genes, where n is the chosen redundancy. B.1) The expression prediction workflow starts by splitting the TCGA HGSC data into training and testing partitions. The training partition is used to build a regression model for each gene in the predictable set. Only the DM genes whose correlation with the specific predictable gene exceeds the |rP| threshold are used as predictors. If there are multiple predictors, a regression forest is trained. Otherwise, a polynomial regression of degree 2 is trained. B.2) The regression models are then applied to the testing partition of the TCGA HGSC data to predict expression. The true and predicted values are compared using the Spearman rank correlation (rS). B.3) The accuracy of the regression models is assessed across populations and platforms using four independent HGSC datasets. In each dataset, the regression models are used to predict expression and rS is calculated.
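
Steps A.2 to A.4 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the caption does not state which greedy criterion is used, so the rule here (at each step, add the gene that is strongly correlated with the most genes still lacking n DM neighbours) is an assumption. The function names `adjacency_from_expression` and `greedy_select` and the parameters `threshold`, `redundancy`, and `max_dm` are hypothetical names chosen for illustration.

```python
import numpy as np


def adjacency_from_expression(expr, threshold):
    """Steps A.2 and A.3: binary adjacency matrix from |Pearson r|.

    expr is a samples-by-genes array; threshold is the user-chosen |rP| cutoff.
    """
    corr = np.corrcoef(expr, rowvar=False)  # gene-by-gene Pearson correlations
    adj = np.abs(corr) >= threshold
    np.fill_diagonal(adj, False)  # a gene is not counted as its own neighbour
    return adj


def greedy_select(adj, redundancy=1, max_dm=None):
    """Step A.4 (assumed criterion): greedily build the DM and predictable sets."""
    n_genes = adj.shape[0]
    limit = n_genes if max_dm is None else max_dm
    dm = []
    is_dm = np.zeros(n_genes, dtype=bool)
    hits = np.zeros(n_genes, dtype=int)  # number of DM neighbours per gene

    while len(dm) < limit:
        # Unmeasured genes that still have fewer than `redundancy` DM neighbours.
        needy = (~is_dm) & (hits < redundancy)
        # For each candidate, count how many needy genes it would help cover.
        gain = (adj & needy[None, :]).sum(axis=1)
        gain[is_dm] = -1  # never pick a gene twice
        best = int(np.argmax(gain))
        if gain[best] <= 0:
            break  # no remaining candidate helps any needy gene
        dm.append(best)
        is_dm[best] = True
        hits += adj[best]

    predictable = np.flatnonzero((~is_dm) & (hits >= redundancy)).tolist()
    return dm, predictable


# Toy data: genes 0, 1, and 2 are perfectly (anti)correlated; gene 3 is unrelated.
x = np.array([1.0, 2.0, 3.0, 4.0])
toy_expr = np.column_stack([x, 2 * x + 1, -x, [1.0, -1.0, -1.0, 1.0]])
toy_adj = adjacency_from_expression(toy_expr, threshold=0.9)
toy_dm, toy_predictable = greedy_select(toy_adj, redundancy=1)
```

On this toy input, a single DM gene is enough to make the other two correlated genes predictable at redundancy 1, while the unrelated gene ends up in neither set. Raising `redundancy` forces additional DM genes to be chosen before a gene counts as predictable, which mirrors the "at least n DM genes" condition in the caption.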
