In this section, we first briefly introduce the concepts related to disease modules on the incomplete interactome, in particular a quantity SAB, called module separation and defined in [4], which measures the relationship between two disease modules A and B. We then explain in detail our method for finding missing common genes for a given pair of diseases, formulated as an optimization problem that minimizes SAB.
Disease module on Interactome and module separation
The interactome contains all protein-protein interactions in the cell and can be conveniently represented as a graph (or network), in which proteins are represented as nodes and an interaction between two proteins is represented as an edge connecting the two corresponding nodes. Reconstructing the interactome is a central task in systems biology, which studies the cell as a system in a holistic way rather than as a simple ensemble of isolated items. Due to the limitations of current technology, the interactome for most organisms, even model organisms, is incomplete, with missing nodes and edges. Nonetheless, the incomplete interactome can already provide valuable insights into many biological processes that cannot be obtained otherwise. In [4], it is shown how to uncover disease-disease relationships through the incomplete interactome. Diseases with genetic causes have been studied widely, often with a focus on identifying a single culprit gene, only to find that in many cases the cause cannot be attributed to a single gene; instead, it is very common that multiple genes involved in multiple cellular processes are at play. Without putting these pieces in a bigger context, it is difficult to fully understand the pathological mechanisms. The work in [4] presents a systematic study that uncovers disease-disease relationships by mapping the associated genes onto the interactome.
As described in [4], given a pair of diseases A and B, the genes known to be associated with them are put into two separate sets GA and GB, respectively. Let graph G be the interactome, with node set V and edge set E. We map the genes in GA and GB onto G with two different colors: nodes in G corresponding to genes in GA are colored red, and nodes in G corresponding to genes in GB are colored blue. For any shared gene, i.e., a gene known to be associated with both disease A and disease B, the corresponding node is colored half red and half blue. Although all the red nodes are genes associated with disease A, indicating relatedness among them, they may not form a single connected component (or subgraph) of G; often they form several connected components. This may be due to incompleteness of the interactome (i.e., missing edges), to unknown associated genes, or to a combination of both. However, if the connected components are too fragmented, say not significantly different from what would be formed by randomly mapped genes, then it is difficult to reliably infer useful relationships. So, in [4], the size of the largest connected component, as a percentage of the total number of genes associated with a disease, must exceed a threshold, which is set based on percolation theory and the data used in the study. The largest connected component that meets this size requirement is then called the module and serves as the representative for the disease. For example, multiple sclerosis (MS) has 69 known associated genes, and its largest connected component, which qualifies as a module, has a size of 11; rheumatoid arthritis (RA) has 51 associated genes, and its largest connected component, which qualifies as a module, has a size of 9.
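The module extraction described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in [4]: the adjacency-dictionary representation, the function names `largest_component` and `lcc_fraction`, and the toy network are all assumptions made for the example, and the threshold itself is left to the caller because its value is not stated here.

```python
from collections import deque


def largest_component(adj, genes):
    """Largest connected component of the subgraph induced by `genes`.

    `adj` maps each node of the interactome to the set of its neighbours
    (an assumed representation, chosen for this sketch).
    """
    genes = set(genes) & set(adj)  # ignore genes absent from the network
    seen, best = set(), set()
    for start in genes:
        if start in seen:
            continue
        comp, queue = {start}, deque([start])
        while queue:  # breadth-first search restricted to disease genes
            u = queue.popleft()
            for v in adj[u]:
                if v in genes and v not in comp:
                    comp.add(v)
                    queue.append(v)
        seen |= comp
        if len(comp) > len(best):
            best = comp
    return best


def lcc_fraction(adj, genes):
    """Size of the largest component relative to all genes of the disease."""
    return len(largest_component(adj, genes)) / len(set(genes))


# Hypothetical toy interactome: a path a-b-c-d-e plus an isolated node f.
adj = {
    "a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"},
    "d": {"c", "e"}, "e": {"d"}, "f": set(),
}
disease_genes = {"a", "b", "c", "e", "f"}
print(largest_component(adj, disease_genes), lcc_fraction(adj, disease_genes))
```

For the MS example in the text, the corresponding fraction would be 11/69; whether that passes depends on the threshold chosen in [4].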
To uncover disease-disease relationships, a quantity called module separation, SAB, is introduced as follows.
$$ \mathrm{S}_{\mathrm{AB}} \equiv \langle \mathrm{d}_{\mathrm{AB}} \rangle - \frac{\langle \mathrm{d}_{\mathrm{AA}} \rangle + \langle \mathrm{d}_{\mathrm{BB}} \rangle}{2} $$
(1)
where <dAB> is the average, taken over the genes of both diseases, of the shortest distance from each gene of disease A to a gene of disease B and vice versa; <dAA> is the average of the shortest distance from every gene of disease A to another gene of disease A; and <dBB> is the corresponding average for the genes of disease B. Figure 1 shows how SAB is computed for a toy example. More comprehensive results in [4] demonstrate that this network-based measure of disease module separation is more indicative of the pathological manifestations of disease pairs than simply measuring the overlap between the associated gene sets, for example with the Jaccard index:
$$ \mathrm{J} = \frac{\left| \mathrm{G}_{\mathrm{A}} \cap \mathrm{G}_{\mathrm{B}} \right|}{\left| \mathrm{G}_{\mathrm{A}} \cup \mathrm{G}_{\mathrm{B}} \right|} $$
(2)
It is reported in [4] that, when the disease history of 30 million individuals aged 65 and older is used to determine the relative risk RR of disease comorbidity for each disease pair, the relative risk drops from RR ≥ 10 for SAB < 0 to the random expectation of RR ≈ 1 for SAB > 0.
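To make Eqs. (1) and (2) concrete, the following sketch computes both quantities on a toy network. It is a minimal illustration under stated assumptions, not the code of [4]: the graph is an adjacency dictionary, distances are unweighted hop counts found by breadth-first search, a shared gene contributes a distance of 0 to <dAB>, and a gene with no reachable partner is simply skipped. The helper names `bfs`, `nearest`, `separation` and `jaccard` are hypothetical.

```python
from collections import deque


def bfs(adj, src):
    """Hop distances from `src` to every node reachable from it."""
    dist, queue = {src: 0}, deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def nearest(adj, sources, targets, allow_self):
    """Distance from each source gene to its nearest reachable target gene."""
    out = []
    for s in sources:
        dist = bfs(adj, s)
        hits = [dist[t] for t in targets if t in dist and (allow_self or t != s)]
        if hits:  # assumption: genes with no reachable partner are skipped
            out.append(min(hits))
    return out


def separation(adj, ga, gb):
    """Module separation of Eq. (1); each set needs a reachable partner."""
    def mean(xs):
        return sum(xs) / len(xs)

    d_ab = nearest(adj, ga, gb, True) + nearest(adj, gb, ga, True)
    d_aa = nearest(adj, ga, ga, False)  # a gene may not pair with itself
    d_bb = nearest(adj, gb, gb, False)
    return mean(d_ab) - (mean(d_aa) + mean(d_bb)) / 2


def jaccard(ga, gb):
    """Jaccard index of Eq. (2)."""
    return len(ga & gb) / len(ga | gb)


# Hypothetical toy interactome: a path 1-2-3-4-5-6.
adj = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4, 6}, 6: {5}}
print(separation(adj, {1, 2}, {5, 6}), jaccard({1, 2}, {5, 6}))
print(separation(adj, {2, 3, 4}, {3, 4, 5}), jaccard({2, 3, 4}, {3, 4, 5}))
```

In this toy case the two well-separated gene sets give a positive separation and the overlapping sets give a negative one, which mirrors the sign convention used in the text.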
Detection of missing shared genes
To further explore the predictive power of the disease module separation, we use it to tackle the incompleteness of the data. Specifically, for disease pairs that are known to have high comorbidity, and are therefore expected to have a small, preferably negative, module separation, but that instead have a large positive SAB, we hypothesize that the discrepancy is due to missing information, such as a missing shared gene, which, if recovered, should bring the two disease modules closer, i.e., decrease SAB. We therefore formulate the detection of missing common genes between two disease modules as the following optimization problem.
$$ {\mathrm{x}}^{\ast }=\mathrm{argmin}\ {\mathrm{S}}_{\mathrm{AB}}\left[+\mathrm{x}\right] $$
(3)
$$ \mathrm{x}\in \left({\mathrm{G}}_{\mathrm{A}}\cup {\mathrm{G}}_{\mathrm{B}}\right)-\left({\mathrm{G}}_{\mathrm{A}}\cap {\mathrm{G}}_{\mathrm{B}}\right) $$
where x ranges over the genes associated with exactly one of disease A and disease B, SAB[+x] is the module separation when x is added as a shared gene between diseases A and B, and x* is the predicted missing shared gene, i.e., the one that minimizes the module separation. The minimization can be carried out by exhaustive search when the sets GA and GB are not very large, or by heuristics when the search space becomes huge. Note that, although Eq. (3) is formulated for finding a single (most probable) missing common gene, in practice it can be applied sequentially to recover multiple missing common genes. It is also worth noting that the set of missing common genes recovered by applying Eq. (3) iteratively, one gene at a time, may differ from the set obtained when the candidates are evaluated jointly, possibly because of the topology of the interactome and where these genes are located in it. So, if the number of missing common genes k is known, an alternative formulation of the optimization problem can be defined as follows.
$$ {\mathrm{X}}^{\ast }=\mathrm{argmin}\ {\mathrm{S}}_{\mathrm{AB}}\left[+\mathrm{X}\right] $$
(4)
$$ \mathrm{X}\subseteq \left({\mathrm{G}}_{\mathrm{A}}\cup {\mathrm{G}}_{\mathrm{B}}\right)-\left({\mathrm{G}}_{\mathrm{A}}\cap {\mathrm{G}}_{\mathrm{B}}\right),\kern0.5em \left|\mathrm{X}\right|=\mathrm{k} $$
where X* is the optimal set of missing common genes, and X is any subset of size k of the genes that are associated with exactly one of disease A and disease B. This formulation, while theoretically sound and appealing, has two practical issues: a) the number of missing common genes k is not known a priori; and b) the computational cost grows combinatorially, since k genes are selected out of n, where n = |GA ∪ GB| - |GA ∩ GB|. Because of these issues, we only tested Eq. (4) for k = 2 and k = 3, while the results reported in the next section are mainly based on Eq. (3).
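The searches in Eqs. (3) and (4) can be sketched as follows. This is a hedged illustration rather than the implementation behind the reported results: it assumes an adjacency-dictionary graph, unweighted breadth-first-search distances, gene identifiers that can be sorted (used only to break ties deterministically), and k no larger than the number of candidates. All function names are hypothetical. "Adding x as a shared gene" is taken to mean inserting x into both GA and GB.

```python
from collections import deque
from itertools import combinations
from math import comb


def bfs(adj, src):
    """Hop distances from `src` to every node reachable from it."""
    dist, queue = {src: 0}, deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def nearest(adj, sources, targets, allow_self):
    """Distance from each source gene to its nearest reachable target gene."""
    out = []
    for s in sources:
        dist = bfs(adj, s)
        hits = [dist[t] for t in targets if t in dist and (allow_self or t != s)]
        if hits:  # assumption: genes with no reachable partner are skipped
            out.append(min(hits))
    return out


def separation(adj, ga, gb):
    """Module separation of Eq. (1)."""
    def mean(xs):
        return sum(xs) / len(xs)

    d_ab = nearest(adj, ga, gb, True) + nearest(adj, gb, ga, True)
    return mean(d_ab) - (mean(nearest(adj, ga, ga, False))
                         + mean(nearest(adj, gb, gb, False))) / 2


def candidates(ga, gb):
    """Genes associated with exactly one of the two diseases, sorted."""
    return sorted((ga | gb) - (ga & gb))


def best_shared_gene(adj, ga, gb):
    """Eq. (3): exhaustive search for the single gene minimising S_AB[+x]."""
    scored = [(separation(adj, ga | {x}, gb | {x}), x) for x in candidates(ga, gb)]
    score, x_star = min(scored)  # ties fall to the smallest identifier
    return x_star, score


def sequential_shared_genes(adj, ga, gb, steps):
    """Apply Eq. (3) repeatedly, fixing one predicted gene per step."""
    ga, gb, picked = set(ga), set(gb), []
    for _ in range(steps):
        if not candidates(ga, gb):
            break
        x, _ = best_shared_gene(adj, ga, gb)
        ga.add(x)
        gb.add(x)
        picked.append(x)
    return picked


def best_shared_set(adj, ga, gb, k):
    """Eq. (4): exhaustive search over all candidate subsets of size k."""
    scored = [(separation(adj, ga | set(xs), gb | set(xs)), xs)
              for xs in combinations(candidates(ga, gb), k)]
    score, x_star = min(scored)
    return set(x_star), score


# Hypothetical toy interactome: a path 1-2-3-4-5-6.
adj = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4, 6}, 6: {5}}
ga, gb = {1, 2, 3}, {4, 5, 6}
print(best_shared_gene(adj, ga, gb))
print(sequential_shared_genes(adj, ga, gb, 2))
print(best_shared_set(adj, ga, gb, 2))
print(comb(100, 2), comb(100, 3))  # subsets to score for a hypothetical n = 100
```

The last line illustrates issue b): for a hypothetical n = 100 candidates, Eq. (3) needs 100 evaluations of the separation, whereas Eq. (4) needs 4950 for k = 2 and 161700 for k = 3, each involving shortest-path computations over the gene sets.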