NET-GE: a novel NETwork-based Gene Enrichment for detecting biological processes associated to Mendelian diseases

Background
Enrichment analysis is a widely applied procedure for shedding light on the molecular mechanisms and functions at the basis of phenotypes, for enlarging the dataset of possibly related genes/proteins and for helping the interpretation and prioritization of newly determined variations. Several standard and network-based enrichment methods are available. Both approaches rely on the annotations that characterize the genes/proteins included in the input set; network-based ones also include, in different ways, physical and functional relationships among different genes or proteins that can be extracted from the available biological networks of interactions.

Results
Here we describe a novel procedure based on the extraction from the STRING interactome of sub-networks connecting proteins that share the same Gene Ontology (GO) terms for Biological Process (BP). Enrichment analysis is performed by mapping the protein set to be analyzed on the sub-networks, and then by collecting the corresponding annotations. We test the ability of our enrichment method to find annotation terms disregarded by other available enrichment methods. We benchmarked 244 sets of proteins associated with different Mendelian diseases, according to the OMIM web resource. In 143 cases (58%), the network-based procedure extracts GO terms neglected by the standard method, and in 86 cases (35%), some of the newly enriched GO terms are not included in the set of annotations characterizing the input proteins. We present in detail six cases where our network-based enrichment provides an insight into the biological basis of the diseases, outperforming other freely available network-based methods.

Conclusions
Considering a set of proteins in the context of their interaction network can help in better defining their functions. Our novel method exploits the information contained in the STRING database for building the minimal connecting network containing all the proteins annotated with the same GO term. The enrichment procedure is performed considering the GO-specific network modules and, when tested on the OMIM-derived benchmark sets, it is able to extract enrichment terms neglected by other methods. Our procedure is effective even when the size of the input protein set is small, requiring at least two input proteins.

Algorithm 1 Enrichment pipeline
1: procedure Enrichment-Pipeline(P, List_GOA, List_STRING, th)
2:   P is a set of UniProtAC identifiers.
3:   List_GOA is a collection of sets of proteins. Each set is related to the same GO term.
4:   List_STRING is a collection of sets of proteins. Each set is related to the same GO term.
5:   th is the P-value threshold.
6:   E_GOA ← Enrichment(P, List_GOA, th)
7:   E_STRING ← Enrichment(P, List_STRING, th)
8:   for each (L, GO, Pvalue) ∈ E_GOA do
9:     Print-Report(P, L, GO, Pvalue)
10:  end for
11:  for each (L, GO, Pvalue) ∈ E_STRING do
12:    if GO is not in E_GOA then
13:      Print-Report(P, L, GO, Pvalue)
14:    end if
15:  end for
16: end procedure

Algorithm 2 Enrichment analysis
1: function Enrichment(P, List, th)
2:   P is a set of UniProtAC identifiers.
3:   List = {L_1, ..., L_N} is a collection of sets of proteins. Each set is related to the same GO term.
4:   th is the P-value threshold.
5:   E ← ∅ ▷ List of enriched terms.
6:   m ← 0 ▷ Total number of distinct proteins in List.
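The two-stage logic of Alg. 1 can be illustrated with a short Python sketch. The statistical test inside Enrichment is not visible in the listing above, so the one-sided hypergeometric test used here is an assumption, as are the function names and the dictionary-based data layout (GO term mapped to a set of protein identifiers).

```python
from math import comb

def hypergeom_pvalue(k, n, K, m):
    """P(X >= k) when n proteins are drawn from m, K of which carry the term."""
    upper = min(n, K)
    return sum(comb(K, i) * comb(m - K, n - i) for i in range(k, upper + 1)) / comb(m, n)

def enrichment(proteins, term_sets, th):
    """Return {go_term: (overlap, p_value)} for terms with p_value <= th.

    Assumption: the background is the set of distinct proteins in term_sets.
    """
    background = set().union(*term_sets.values())
    m = len(background)
    query = set(proteins) & background
    enriched = {}
    for go, members in term_sets.items():
        overlap = query & members
        if not overlap:
            continue
        p = hypergeom_pvalue(len(overlap), len(query), len(members), m)
        if p <= th:
            enriched[go] = (overlap, p)
    return enriched

def enrichment_pipeline(proteins, list_goa, list_string, th):
    """Standard enrichment first, then network-based terms not already found."""
    e_goa = enrichment(proteins, list_goa, th)
    e_string = enrichment(proteins, list_string, th)
    report = dict(e_goa)
    for go, hit in e_string.items():
        if go not in e_goa:  # keep only the newly enriched terms
            report[go] = hit
    return report
```

In this sketch a term reported by both stages keeps the result of the standard (GOA-based) stage, mirroring the condition at line 12 of Alg. 1.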

Modules extraction
We extract modules for 8,098 out of 12,621 GO BP terms represented in the STRING network. For each reference GO BP term, all the proteins in the network that are directly annotated with the same term are collected in a seed set. Each seed set is then extended into a function-specific module, i.e. a compact and connected subgraph of the STRING network. The function-specific module is built in three steps: extraction of the shortest path network, reduction to the minimal network, and quality filtering, as detailed below.

Extraction of the shortest path network
We extract the sub-network of STRING consisting of all the shortest paths between the proteins in the seed set (see Alg. 3). Recall that, given a GO term t, we define a seed set as the set of proteins that are annotated in GOA with term t. Seed proteins not appearing in STRING are kept as isolated nodes in the shortest path network.

Algorithm 3 Shortest Path Network
    S is the set of seed nodes.
4:  Extract the subgraph G' = (V', E') of all the shortest paths between u, v in G.
8:  end for
12: V' ← V' ∪ S ▷ Add to V' seed nodes not appearing in G.
13: return G' = (V', E')
14: end function

For the shortest path computation (SP procedure in Alg. 3), we do not make use of the edge scores provided in STRING, i.e. we treat STRING as an undirected and unweighted graph, without self-loops. The size of the shortest path networks extracted from STRING is usually large, even for relatively small input protein sets. On average, the shortest path networks extracted for the different BP GO terms contain 15 times more proteins than their seed sets.
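As an illustration of the extraction step, the following Python sketch collects every node and edge lying on at least one shortest path between two seeds of an undirected, unweighted graph. It is not the SP procedure of Alg. 3: the adjacency-dictionary representation, the function names and the edge test d(u, a) + 1 + d(b, v) = d(u, v) are choices made for this example.

```python
from collections import deque
from itertools import combinations

def bfs_distances(adj, source):
    """Unweighted shortest distances from source (breadth-first search)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def shortest_path_network(adj, seeds):
    """Return (nodes, edges) of the union of all seed-to-seed shortest paths."""
    dist = {s: bfs_distances(adj, s) for s in seeds if s in adj}
    nodes, edges = set(seeds), set()  # seeds missing from the graph stay isolated
    for u, v in combinations(sorted(dist), 2):
        d_uv = dist[u].get(v)
        if d_uv is None:  # u and v are not connected
            continue
        for a in dist[u]:
            for b in adj.get(a, ()):
                # edge (a, b) lies on a shortest u-v path iff the lengths add up
                if dist[u][a] + 1 + dist[v].get(b, float("inf")) == d_uv:
                    nodes.update((a, b))
                    edges.add(frozenset((a, b)))
    return nodes, edges
```

Running one breadth-first search per seed keeps the cost proportional to the number of seeds times the graph size, which matters on a network as large as STRING.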

Minimal connecting network
Due to the large number of retrieved connecting nodes, a minimization is applied to the shortest path network in order to simplify its topology, and thus highlight its main structure. In particular, the computational goal of the minimization procedure is to extract from the shortest path network the smallest distance-preserving network, i.e. the smallest subgraph that preserves the shortest distances between the seed proteins. We call such a subgraph the minimal connecting network.
We can formalize the definition of minimal connecting network as follows:
Definition 2.1 (Minimal connecting network) Let G = (V, E) be a shortest path network and let S ⊆ V be the set of seed nodes. A subgraph G' = (V', E') of G is a distance-preserving subnetwork if
(1) S ⊆ V', i.e. G' contains all the seed nodes;
(2) $d_{G'}(u, v) = d_G(u, v)$ for all u, v ∈ S, i.e. the shortest path length between all pairs of seed nodes is the same in both G and G',
where $d_G(u, v)$ denotes the shortest path length between u and v in G.
We say that a distance-preserving subnetwork G' is a minimal connecting network if V' is the smallest possible set that is consistent with properties (1) and (2).
The minimization procedure of a shortest path network is the most computationally expensive step of the module construction, as it closely resembles the Steiner tree problem [1]. Furthermore, the optimal solution is usually not unique. Our implementation makes use of the following heuristic approach (see Alg. 4):
i) The nodes in the network are split into two disjoint groups: seed nodes (i.e. the nodes related to the seed proteins) and connecting nodes (i.e. the remaining nodes in the shortest path network) (Line 4 in Alg. 4).
ii) The connecting nodes are ranked according to three predefined relevance criteria (Line 6 in Alg. 4), which are described in the Ranking scores section.
iii) The ranked list is iteratively processed starting from the least important node (Lines 7-13 in Alg. 4).
iv) The currently evaluated node is removed from the shortest path network only if its deletion does not increase the shortest distance between any pair of seed nodes (Lines 10-12 in Alg. 4).

Algorithm 4 Minimal connecting network
3:  S ⊆ V is the set of seed nodes in G.
4:  C ← V \ S ▷ C is the set of connecting nodes in G.
5:  G' ← G ▷ Make a copy of G.
6:  Sort(C) ▷ Sort C wrt some node-ranking criteria.
7:  for each w ∈ C do
8:    G'' ← remove w and all its edges from G'
9:    Check whether the SP distances between seed nodes are preserved in G''.
10:   if the distances are preserved then
11:     G' ← G''
12:   end if
13: end for
14: return G'
15: end function

As for the shortest path network, seed proteins not appearing in STRING are kept as isolated nodes in the minimal networks. Unlike the shortest path networks, the minimal connecting networks are quite compact. On average, they contain only 1.5 times more proteins than their seed sets.
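The greedy pruning described above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the graph is an adjacency dictionary, and `rank` is a hypothetical caller-supplied key function in which lower values mean less important nodes.

```python
from collections import deque

def seed_distances(adj, seeds):
    """Shortest distance for every unordered seed pair (None if disconnected)."""
    out = {}
    for s in seeds:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for t in seeds:
            if s < t:
                out[(s, t)] = dist.get(t)
    return out

def minimal_connecting_network(adj, seeds, rank):
    """Drop connecting nodes, least important first, while seed distances hold."""
    current = {u: set(vs) for u, vs in adj.items()}  # work on a copy
    target = seed_distances(current, seeds)
    connecting = [w for w in current if w not in seeds]
    for w in sorted(connecting, key=rank):
        # tentatively remove w together with all its edges
        trial = {u: vs - {w} for u, vs in current.items() if u != w}
        if seed_distances(trial, seeds) == target:
            current = trial  # removal did not change any seed-to-seed distance
    return current
```

Because each removal is tested against the original seed-to-seed distances, the result is distance-preserving by construction. Being a greedy heuristic, it is not guaranteed to reach the smallest possible node set.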

Ranking scores
In the current version, the ranking of a connecting node is obtained by applying three scores (sc, ss, bc), which are used as primary, secondary and tertiary sort keys, respectively.
i) Seed centrality (sc). We say that a node connects two seed nodes if it appears in some shortest path connecting them. Thus, the seed centrality measure simply counts the number of distinct seed pairs connected by a node.
Definition 2.2 (Seed centrality) Let G = (V, E) be a shortest path network. Let S ⊆ V and C = V \ S be the sets of seed and connecting nodes in G, respectively. The seed centrality of a connecting node w ∈ C is defined by
\[ sc(w) = \left| \{ \{u, v\} \subseteq S : u \neq v,\ sp_w(u, v) \neq \emptyset \} \right| \]
where $sp_w(u, v)$ is the set of shortest paths between u and v in G passing through node w.
Note that, if |S| = n, we have a total of n · (n − 1)/2 distinct (unordered) pairs in S. Then ∀w ∈ C, 0 ≤ sc(w) ≤ n · (n − 1)/2. The seed centrality property implicitly assumes that the higher sc(w), the higher the probability that node w appears in a minimal connecting network.
ii) Maximum semantic similarity with the reference GO term (ss). The semantic similarity measures to what extent the annotation terms of each connecting node are related to the reference GO term: a connecting node with a high semantic similarity score is more likely to be functionally related to the seed nodes. The semantic similarity is computed with Lin's information-theoretic metric [4].
The information-theoretic semantic similarity measures rely on the information content of individual terms t in the GO hierarchy:
\[ IC(t) = -\log Pr(t) \]
where Pr(t) is the relative frequency of GO term t with respect to some background distribution. The background for the information content measure used here is given by the entire set of UniProt-GOA annotations for human proteins [2]. Resnik's similarity [3] between two terms $t_1$ and $t_2$ is defined as the maximum information content among the common ancestors of $t_1$ and $t_2$:
\[ sim_{Res}(t_1, t_2) = \max_{t \in A(t_1) \cap A(t_2)} IC(t) \]
where A(t) denotes the set of all the ancestors of term t, recursively propagated up to the root of the GO hierarchy. Lin's similarity [4] between two terms $t_1$ and $t_2$ is the normalized version of Resnik's similarity:
\[ sim_{Lin}(t_1, t_2) = \frac{2 \cdot sim_{Res}(t_1, t_2)}{IC(t_1) + IC(t_2)} \]
We use Lin's similarity to evaluate how well a protein is related to a reference GO term. In detail, we define the maximum semantic similarity of a connecting node with respect to the reference GO term as the highest Lin's score between the GO terms associated with the connecting node/protein and the reference GO term:
Definition 2.3 (Semantic similarity with the reference GO term) Let G = (V, E) be a shortest path network built with respect to the reference GO term t. For each connecting node w ∈ C, we define the semantic similarity with respect to the reference GO term t by
\[ ss(w, t) = \max \{ sim_{Lin}(t', t) \mid t' \text{ is a GO term associated with } w \}. \]
The maximum semantic similarity property explicitly gives more importance to connecting proteins whose annotations are more closely related to the reference GO term.
iii) Betweenness centrality (bc). The betweenness centrality is a measure of centrality of a node in a network [5]. Unlike the standard definition of betweenness centrality, here we compute this measure by considering only the shortest paths connecting seed nodes.
Definition 2.4 (Betweenness centrality) Let G = (V, E) be a shortest path network, let S ⊆ V be the set of seed nodes and C = V \ S the set of connecting nodes. The betweenness centrality of a connecting node w ∈ C is defined by
\[ bc(w) = \sum_{\{u, v\} \subseteq S,\ u \neq v} \frac{|sp_w(u, v)|}{|sp(u, v)|} \]
where $sp_w(u, v)$ is the set of shortest paths between u and v in G passing through node w, and sp(u, v) is the set of shortest paths between u and v in G. Since each term of the sum lies between 0 and 1, we have that 0 ≤ bc(w) ≤ n · (n − 1)/2. This property is mainly used to assess a local ranking for those connecting nodes that have exactly the same ranking with respect to the previous two properties. In large shortest path networks, this happens quite often, due to the limited range of values of the previous two properties.
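The two topological scores can be computed together, since both only need distances and shortest-path counts from each seed. The Python sketch below is illustrative: the function names and the adjacency-dictionary input are assumptions, and the semantic similarity ss is taken as a precomputed dictionary.

```python
from collections import deque
from itertools import combinations

def bfs_counts(adj, source):
    """Distances and numbers of shortest paths from source (unweighted graph)."""
    dist, sigma = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:
                sigma[v] += sigma[u]  # every shortest path to u extends to v
    return dist, sigma

def seed_scores(adj, seeds):
    """Return {w: (sc, bc)} for every connecting node w."""
    info = {s: bfs_counts(adj, s) for s in seeds}
    scores = {w: [0, 0.0] for w in adj if w not in seeds}
    for u, v in combinations(sorted(seeds), 2):
        dist_u, sigma_u = info[u]
        dist_v, sigma_v = info[v]
        if v not in dist_u:  # disconnected seed pair
            continue
        for w in scores:
            on_path = w in dist_u and w in dist_v and dist_u[w] + dist_v[w] == dist_u[v]
            if on_path:
                scores[w][0] += 1  # w connects the pair (u, v)
                scores[w][1] += sigma_u[w] * sigma_v[w] / sigma_u[v]
    return {w: tuple(s) for w, s in scores.items()}

def rank_least_important_first(scores, ss):
    """Sort connecting nodes by (sc, ss, bc), least important node first."""
    return sorted(scores, key=lambda w: (scores[w][0], ss.get(w, 0.0), scores[w][1]))
```

The number of shortest u-v paths through w is the product of the path counts from u to w and from v to w, which is why one breadth-first search per seed is sufficient.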

Quality filtering
A quality filtering procedure is applied to the minimal connecting networks built in the previous step (see Alg. 5). The idea is to filter out those networks for which the GO annotations of the connecting nodes are weakly related to the reference GO term. In particular, rare BP terms (i.e. BP terms with few related proteins) tend to produce minimal networks consisting only of long paths. In most of these cases, the annotations of the connecting proteins are unrelated to the reference GO term, and the resulting minimal network is therefore unlikely to include many proteins related to the reference GO term. Such network modules are discarded and not considered for the enrichment. The quality filtering procedure makes use of the maximum semantic similarity measure, as defined above (Definition 2.3). In particular, a minimal network is retained if, with respect to the reference GO term, the average maximum similarity of the connecting nodes is significantly higher than the average maximum similarity of all the nodes in STRING, as assessed by a Student's t-test with significance set to 5% (Lines 12-23 in Alg. 5). The quality test filters out 1,205 networks out of 12,621, with sizes ranging from 3 to 137 nodes, with an average of 13. In this step, we also filter out minimal networks that do not contain any connecting node (Lines 9-11 in Alg. 5). Such networks are uninformative for a network-based enrichment analysis, since they do not contain more knowledge than their seed sets. The number of BP GO terms for which we extract a non-trivial network is then 8,098.
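The maximum semantic similarity used by the filter (Definition 2.3) can be sketched in Python on a toy ontology. The `parents` and `prob` dictionaries are hypothetical inputs standing in for the GO hierarchy and the UniProt-GOA term frequencies. One further assumption is that the ancestor set of a term includes the term itself, so that a term compared with itself scores 1.

```python
from math import log

def ancestors(term, parents):
    """The term plus all its ancestors, propagated up to the root."""
    seen, stack = {term}, [term]
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def ic(term, prob):
    """Information content: IC(t) = -log Pr(t)."""
    return -log(prob[term])

def sim_resnik(t1, t2, parents, prob):
    """Maximum information content among the common ancestors."""
    common = ancestors(t1, parents) & ancestors(t2, parents)
    return max((ic(t, prob) for t in common), default=0.0)

def sim_lin(t1, t2, parents, prob):
    """Resnik's similarity normalized by the information content of both terms."""
    denom = ic(t1, prob) + ic(t2, prob)
    return 2 * sim_resnik(t1, t2, parents, prob) / denom if denom > 0 else 0.0

def max_semantic_similarity(node_terms, ref, parents, prob):
    """ss(w, t): best Lin score between the node's terms and the reference term."""
    return max((sim_lin(t, ref, parents, prob) for t in node_terms), default=0.0)
```

A node without any annotation gets a score of 0 in this sketch, which places it among the least related connecting nodes.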

Algorithm 5 Quality filtering
3: S ⊆ V is the set of seed nodes in G.
4: t is the reference GO term for the minimal connecting network G.
5: µ_t is the mean semantic similarity in STRING wrt GO term t.
6: σ_t is the semantic similarity variance in STRING wrt GO term t.
7: n is the number of nodes in STRING.
8: C ← V \ S ▷ C is the set of connecting nodes in G.
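The decision rule of the filter can be sketched in Python as follows. Only the inputs of Alg. 5 are listed above, so the details here are assumptions: the statistic is a Welch-style comparison built from the summary values µ_t, σ_t and n, and the one-sided p-value uses a normal approximation to Student's t distribution so that the example needs only the standard library. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def passes_quality_filter(connecting_scores, mu_t, var_t, n, alpha=0.05):
    """Keep a network if its connecting nodes score significantly above STRING.

    connecting_scores: maximum semantic similarity of each connecting node.
    mu_t, var_t, n: mean, variance and size of the STRING-wide score distribution.
    """
    k = len(connecting_scores)
    if k == 0:
        return False  # no connecting node: the network adds nothing to the seeds
    m = mean(connecting_scores)
    s2 = variance(connecting_scores) if k > 1 else 0.0
    se = sqrt(s2 / k + var_t / n)
    if se == 0:
        return m > mu_t
    t_stat = (m - mu_t) / se
    # Assumption: normal approximation instead of the exact t distribution.
    p_value = 1.0 - NormalDist().cdf(t_stat)
    return p_value < alpha
```

For networks with very few connecting nodes the normal approximation is anti-conservative, so an exact t distribution would be preferable there.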