Rank condition of the expression profile data matrix for co-expressed genes
Given a set of n (≥2) genes and their expression levels over N (≥3) samples, the expression profiles can be expressed using a matrix
$$ G=\left[\begin{array}{ccc}\hfill {g}_{11}\hfill & \hfill \cdots \hfill & \hfill {g}_{1N}\hfill \\ {}\hfill \vdots \hfill & \hfill \ddots \hfill & \hfill \vdots \hfill \\ {}\hfill {g}_{n1}\hfill & \hfill \cdots \hfill & \hfill {g}_{nN}\hfill \end{array}\right]=\left[\begin{array}{c}\hfill {\boldsymbol{g}}_1\hfill \\ {}\hfill \vdots \hfill \\ {}\hfill {\boldsymbol{g}}_{\boldsymbol{n}}\hfill \end{array}\right]\in {\mathrm{\Re}}^{n\times N}, $$
(1)
with N-dimensional row vector g
i
= [g
i1, g
i2, …, g
iN
] be the expression profile for the i-th gene across the samples (i = 1, 2, …, n). If two genes i and j are perfectly co-expressed, ie, |ρ(g
i
, g
j
)| = 1 where ρ(⋅,⋅) is the Pearson correlation coefficient between two vectors, then given the linear relationship between the two vectors, we have
$$ {g}_{ik}={\alpha}_{ij}\cdot {g}_{jk}+{\beta}_{ij}\ \left(k=1,2,\dots, N\right) $$
(2)
for some constants α
ij
and β
ij
and
$$ {\boldsymbol{g}}_{\boldsymbol{i}}={\alpha}_{ij}\cdot {\boldsymbol{g}}_{\boldsymbol{j}}+{\beta}_{ij}\cdot {1}_{\boldsymbol{N}}^{\boldsymbol{T}}, $$
(3)
where 1
N
= [1, 1, …, 1]T ∈ ℜN is N-dimensional. Therefore the matrix G can be re-written as
$$ G=\left[\begin{array}{c}\hfill {\boldsymbol{g}}_1\hfill \\ {}\hfill \begin{array}{c}\hfill {\boldsymbol{g}}_2\hfill \\ {}\hfill \vdots \hfill \end{array}\hfill \\ {}\hfill {\boldsymbol{g}}_{\boldsymbol{n}}\hfill \end{array}\right]=\left[\begin{array}{c}\hfill {\boldsymbol{g}}_1\hfill \\ {}\hfill \begin{array}{c}\hfill {\alpha}_{12}{\boldsymbol{g}}_1+{\beta}_{12}\cdot {1}_{\boldsymbol{N}}^{\boldsymbol{T}}\hfill \\ {}\hfill \vdots \hfill \end{array}\hfill \\ {}\hfill {\alpha}_{1n}{\boldsymbol{g}}_1+{\beta}_{1n}\cdot {1}_{\boldsymbol{N}}^{\boldsymbol{T}}\hfill \end{array}\right]=\left[\begin{array}{c}\hfill {\boldsymbol{g}}_1\hfill \\ {}\hfill \begin{array}{c}\hfill {\alpha}_{12}{\boldsymbol{g}}_1\hfill \\ {}\hfill \vdots \hfill \end{array}\hfill \\ {}\hfill {\alpha}_{1n}{\boldsymbol{g}}_1\hfill \end{array}\right]+\left[\begin{array}{c}\hfill 0\hfill \\ {}\hfill \begin{array}{c}\hfill {\beta}_{12}\cdot {1}_{\boldsymbol{N}}^{\boldsymbol{T}}\hfill \\ {}\hfill \vdots \hfill \end{array}\hfill \\ {}\hfill {\beta}_{1n}\cdot {1}_{\boldsymbol{N}}^{\boldsymbol{T}}\hfill \end{array}\right]=\left[\begin{array}{c}\hfill 1\hfill \\ {}\hfill \begin{array}{c}\hfill {\alpha}_{12}\hfill \\ {}\hfill \vdots \hfill \end{array}\hfill \\ {}\hfill {\alpha}_{1n}\hfill \end{array}\right]\cdot {\boldsymbol{g}}_1+\left[\begin{array}{c}\hfill 0\hfill \\ {}\hfill \begin{array}{c}\hfill {\beta}_{12}\hfill \\ {}\hfill \vdots \hfill \end{array}\hfill \\ {}\hfill {\beta}_{1n}\hfill \end{array}\right]\cdot {1}_{\boldsymbol{N}}^{\boldsymbol{T}}. $$
(4)
Thus the matrix G can be decomposed as the sum of two matrices each has rank 1. Since it has been well established in linear algebra that given two matrices A and B of the same size, rank(A + B) ≤ rank(A) + rank(B) [21], we have the following proposition:
Proposition 1
If the absolute value of the Pearson correlation coefficient (PCC) between every pair of rows of a matrix G (G ∈ ℜn × N, n ≥ 2, N ≥ 3) is 1, then rank(G) ≤ 2.
Furthermore, if any of the rows of G is shifted or scaled (e.g., g
'
ik
= λ ⋅ g
ik
+ ε), the PCC value between them will still have absolute value 1 and thus the Proposition 1 still holds.
SVD based methods for estimating the rank of G
Given the matrix G, its singular value decomposition (SVD) is G = USV
T where U ∈ ℜn × n, V ∈ ℜN × N are both orthonormal matrices and S is a diagonal matrix with all the elements being zero except for the ones on the diagonal line which are non-negative and sorted in descending order (ie, S
11 ≥ S
22 ≥ … ≥ S
KK
≥ 0, where K = min(n, N)). In addition, let ‖G‖ be the Frobenius norm of G such that \( {\left\Vert G\right\Vert}^2={\displaystyle \sum_{i=1}^n}{\displaystyle \sum_{j=1}^N}{g}_{ij}^2 \), then it is well known that
$$ {\left\Vert G\right\Vert}^2={\displaystyle {\sum}_{i=1}^n{\displaystyle {\sum}_{j=1}^N{g}_{ij}^2=}}{\displaystyle {\sum}_{i=1}^K{S}_{KK}^2,}\kern0.75em K= \min \left(n,\ N\right). $$
(5)
If G satisfies the condition of Proposition 1, then the rank of G is 2 which implies S
33 = S
44 = … = S
KK
= 0. Thus
$$ {R}_{12}=\frac{S_{11}^2+{S}_{22}^2}{{\left\Vert G\right\Vert}^2}=1. $$
(6)
In reality, given the expression profile matrix of a set of co-expressed genes, the perfect PCC value can never be reached and thus G is never really rank 2, but instead it can be approximated with a rank 2 matrix. Thus in theory, given an expression profile matrix G, we can examine if the genes (row vectors) are co-expressed by testing if the ratio R
12 defined in (6) is close to 1. We refer R
12 as the concordance index.
Data transformation and centralized concordance index
While the concordance index R
12 can be used as a potential indicator for the concordance of the rows of G and thus for evaluating co-expressed modules, it is difficult to set a hard threshold for it. This is even more challenging for real data due to noise, batch effects, and background signals that may skew the PCC values. In addition, since SVD is based on the L
2 -norm, it can be biased by any specifically large outlier or just a few genes with high expression levels. Thus the data needs to be transformed before processing. The transformation of the data we proposed involves two steps: centralization and standardization. First, each row of G needs to be centralized by subtracting its average such that
$$ \overline{G}=\left[\begin{array}{ccc}\hfill {g}_{11}-\overline{g_1}\hfill & \hfill \cdots \hfill & \hfill {g}_{1N}-\overline{g_1}\hfill \\ {}\hfill \vdots \hfill & \hfill \ddots \hfill & \hfill \vdots \hfill \\ {}\hfill {g}_{n1}-\overline{g_n}\hfill & \hfill \cdots \hfill & \hfill {g}_{nN}-\overline{g_n}\hfill \end{array}\right]=\left[\begin{array}{c}\hfill {\boldsymbol{g}}_1^{\boldsymbol{c}}\hfill \\ {}\hfill \vdots \hfill \\ {}\hfill {\boldsymbol{g}}_{\boldsymbol{n}}^{\boldsymbol{c}}\hfill \end{array}\right],\ where\ \overline{g_i}=\frac{{\displaystyle {\sum}_{k=1}^N}{g}_{ik}}{N}\ for\ i=1,2,\dots, n. $$
(7)
Next, each row of \( \overline{G} \) is standardized to have norm 1, ie
$$ \widehat{G}=\left[\begin{array}{c}\hfill \raisebox{1ex}{${\boldsymbol{g}}_1^{\boldsymbol{c}}$}\!\left/ \!\raisebox{-1ex}{$\left\Vert {\boldsymbol{g}}_1^{\boldsymbol{c}}\right\Vert $}\right.\hfill \\ {}\hfill \vdots \hfill \\ {}\hfill \raisebox{1ex}{${\boldsymbol{g}}_{\boldsymbol{n}}^{\boldsymbol{c}}$}\!\left/ \!\raisebox{-1ex}{$\left\Vert {\boldsymbol{g}}_{\boldsymbol{n}}^{\boldsymbol{c}}\right\Vert $}\right.\hfill \end{array}\right]=\left[\begin{array}{c}\hfill {\widehat{\boldsymbol{g}}}_{\mathbf{1}}\hfill \\ {}\hfill \vdots \hfill \\ {}\hfill {\widehat{\boldsymbol{g}}}_{\boldsymbol{n}}\hfill \end{array}\right],\ where\ {\widehat{\boldsymbol{g}}}_{\boldsymbol{k}}=\raisebox{1ex}{${\boldsymbol{g}}_{\boldsymbol{k}}^{\boldsymbol{c}}$}\!\left/ \!\raisebox{-1ex}{$\left\Vert {\boldsymbol{g}}_{\boldsymbol{k}}^{\boldsymbol{c}}\right\Vert $}\right.,\ k=1,2,\dots, n. $$
(8)
The centralization step aims to mitigate the background signal while the standardization step avoids bias towards any particularly highly expressed genes. Interestingly, since the Pearson correlation coefficient between g
i
and g
j
is defined as
$$ \rho \left({g}_i,\ {g}_j\right)=\frac{{\displaystyle {\sum}_{k=1}^N}\left({g}_{ik}-{\overline{g}}_i\right)\cdot \left({g}_{jk}-{\overline{g}}_j\right)}{\sqrt{{\displaystyle {\sum}_{k=1}^N}{\left({g}_{ik}-{\overline{g}}_i\right)}^2}\cdot \sqrt{{\displaystyle {\sum}_{k=1}^N}{\left({g}_{jk}-{\overline{g}}_i\right)}^2}}=\frac{{\boldsymbol{g}}_{\boldsymbol{i}}^{\boldsymbol{c}}\cdot {\left({\boldsymbol{g}}_{\boldsymbol{j}}^{\boldsymbol{c}}\right)}^{\boldsymbol{T}}}{\left\Vert {\boldsymbol{g}}_{\boldsymbol{i}}^{\boldsymbol{c}}\right\Vert \cdot \left\Vert {\boldsymbol{g}}_{\boldsymbol{j}}^{\boldsymbol{c}}\right\Vert }={\widehat{\boldsymbol{g}}}_{\boldsymbol{i}}\cdot {\left({\widehat{\boldsymbol{g}}}_{\boldsymbol{j}}\right)}^{\boldsymbol{T}} $$
(9)
and ‖ĝ
k
‖ = 1 (k = 1, 2, …, n), therefore |ρ(g
i
, g
j
)| = 1 implies ĝ
k
= ĝ
1 or ĝ
k
= − ĝ
1. In other words,
$$ \widehat{G}=\left[\begin{array}{c}\hfill {\alpha}_1\hfill \\ {}\hfill \vdots \hfill \\ {}\hfill {\alpha}_n\hfill \end{array}\right]{\hat{\boldsymbol{g}}}_1,\ where\ {\alpha}_i\in \left\{+1, - 1\right\}\ for\ i=1,2,\dots, n. $$
Therefore we have the following proposition:
Proposition 2
If the absolute value of the Pearson correlation coefficient (PCC) between every pair of rows of a matrix G (G ∈ ℜn × N, n ≥ 2, N ≥ 3) is 1 and Ĝ is the centralized matrix of G with each row standardized as in (
8
), then rank(Ĝ) = 1. In addition, the inner product between every pair of rows equals the Pearson correlation coefficient of the two rows.
Thus the singular value decomposition (SVD) for Ĝ is \( \widehat{G}=\widehat{U}\widehat{S}{\widehat{V}}^T \) where \( \widehat{U}\in {\mathrm{\Re}}^{n\times n},\ \widehat{V}\in {\mathrm{\Re}}^{N\times N} \) are both orthonormal matrices and Ŝ is a diagonal matrix with all the elements being zero except for the ones on the diagonal line which are non-negative and sorted in descending order (ie, Ŝ
11 ≥ Ŝ
22 ≥ … ≥ Ŝ
KK
≥ 0, where K = min(n, N)). In fact, given that Ĝ is rank 1 and Ĝ = n, we have
$$ {\widehat{S}}_{11}^2=n\ \mathrm{and}\ {\widehat{S}}_{22}=\dots ={\widehat{S}}_{KK}=0. $$
(11)
We therefore define a centralized concordance index (CCI) as
$$ CCI=\frac{{\widehat{S}}_{11}^2}{n}. $$
Estimate the significance of the CCI
We examine two approaches for determining if the observed CCI is significantly large to reflect co-expression relationship among the n genes over the entire whole genome dataset. First, we randomly permute the entries of every row of Ĝ and calculate CCI
p
. This process is repeated M times (usually we choose M = 1000). Then we set
$$ {p}_{permute}=\#\left(CC{I}_p\ge CCI\right)/M. $$
Conceptually this gives a measurement on how significant is the observed concordance index in the background of the same data distribution.
Next, we randomly choose n genes from the whole genome gene expression data and calculate the CCI
r
. This process is repeated M times. We then calculate the z-score Z
CCI
for the CCI based on the random sampling such that \( {Z}_{CCI}=\frac{CCI- mean\left(CC{I}_r\right)}{std\left(CC{I}_r\right)} \). The significance is then estimated from the z-score. This gives a measurement on how significant is the observed CCI for the tested gene module in the entire genome. We choose z-score instead of the percentile of the CCI due to three reasons: 1) simulation and tests on real data shows that CCI
r
follow a bell-shaped distribution which can be reasonably approximated by a Gaussian distribution as shown in Figs. 1 and 2) even with 1000 times sampling, it is still relatively small comparing to all the possible combinations, thus sometimes although CCI is larger than all CCI
r
, it is not reasonable to assume that the p-value (significance) is extremely small, instead z-score gives a reasonable estimate on the deviation of the observed CCI from the random samples; and 3) last but not the least, one of our goals is to use the metric to compare results from different conditions, z-scores are standardized with the same distribution and thus allows us to compare with different conditions.
Simulation
To evaluate the performance of the concordance index, we generate a matrix of 50 × 100 with absolute value of PCC between every pair of rows being 1. The base vector is generated as a 100-dimensional row vector using uniform distribution from 0 to 1. The scaling factors (α) and shifts (β) are also generated using uniform distribution from 0 to 1. The matrix G is calculated using Eq.(4). Then Gaussian noises with zero mean at different levels (σ = 0.01, 0.02, 0.05, 0.07, 0.1, 0.15, 0.2, 0.3, 0.5, 1) are added to the matrix and corresponding concordance indices are calculated. This process is repeated 1000 times for each noise level. In addition, for each test the centralized matrix Ĝ
r
was compared with the original Ĝ using ratio \( {R}_F=\frac{\left\Vert {\widehat{G}}_r-{\widehat{G}}_F\right\Vert }{\left\Vert \widehat{G}\right\Vert F} \), where ‖ ⋅ ‖
F
is the Frobenius norm of the matrix.
Comparison with density metric
Since a focus of mining co-expression network is to identify densely connected gene modules, the metric density defined for network mining is often used. For a module with n in weighted network, its density is defined as
$$ d=\frac{{\displaystyle {\sum}_{i=1}^{n-1}}{\displaystyle {\sum}_{j=i+1}^n}{w}_{ij}}{n\left(n-1\right)/2}. $$
For co-expression networks, the weight w
ij
is often defined as the correlation coefficient |ρ(g
i
, g
j
)| or its transformation. In order to examine the relationship between CCI and density, we compare CCI with the density defined using |ρ(g
i
, g
j
)| as weights. Specifically, we consider two scenarios. The first is to test the responses of the metrics to outliers. We first generate the simulated matrix G as described above. Then outlier (independently generated vectors) will be added. We calculate both metrics under different number of outliers and different noise levels for G. The second scenario is to consider the possibility that two modules sometimes can be erroneously linked together. To test this, we generate two gene modules and test the responses of the two metrics with respect to different sizes and noise levels of the modules. Each test is repeated 100 times.
Datasets
We test the concordance index and its significance using a large gene expression dataset. The dataset was downloaded from NCBI Gene Expression Omnibus (GEO). The dataset is GSE18842 containing gene expression microarray data from 46 non-small cell lung cancer (adenocarcinoma) tumor samples and 45 non-cancer control tissue samples [22]. The GSE18842 dataset was generated using Affymetrix HU133 2.0Plus GeneChip. The normalization of the dataset was verified by inspecting the boxplot and data distributions. We also tested some of the findings using TCGA non-small cell lung cancer adenocarcinoma [23] and squamous cell tumor data [24] from cBioPortal (http://www.cbioportal.org/).
Weighted co-expression network analysis
While the R package WGCNA developed by the Horvath’s group is a widely adopted co-expression gene module discovery tool, it has some limitation as it is based on hierarchical clustering algorithm that does not allow overlap between modules and does not control the density of the detected modules [11]. In this paper, we apply a recently developed algorithm called Normalized lmQCM [15]. This algorithm takes a network mining approach allowing overlaps between modules and also is guaranteed to have a lower bound on the density of the detected modules.
Using CCI to detect condition specific modules
The concordance index and its significance evaluation provide us a means to evaluate if a co-expressed gene module (CGM) in one condition is also strongly co-expressed in another condition. We first apply the Normalized lmQCM to each conditions (normal and disease) in both datasets using selected parameters (to be discussed in the Results section). For each gene module, we then calculated two CCIs, one using the data from the condition it was generated and one using the data from the opposite condition. For instance, if the module was generated from the Parkinson’s disease patients, CCIs for the same gene module is calculated for both Parkinson’s disease samples and the control sample. Then the p
permute
and Z
CCI
are calculated for both conditions too. Gene modules that are significant (Z
CCI
≤ τ) in one condition but not significant (Z
CCI
> τ) in the other condition are reported for further analysis. The threshold τ is determined based on the significance requirement. For instance, τ is often chosen such that the one-tail p-value for the Z
CCI
is less than 0.05 for single gene module or 0.05/m if m gene modules are being tested (for multiple test compensation).
Enrichment analysis for gene modules
For the reported modules, we carry out enrichment analysis using TOPPGene (https://toppgene.cchmc.org/enrichment.jsp). We specifically pay attention to significantly enriched Gene Ontology (GO) terms, cytobands, transcription factor binding sites, and human/mouse phenotypes.