- Research article
- Open Access
Topology and weights in a protein domain interaction network – a novel way to predict protein interactions
- Stefan Wuchty^{1}
https://doi.org/10.1186/1471-2164-7-122
© Wuchty; licensee BioMed Central Ltd. 2006
Received: 31 January 2006
Accepted: 23 May 2006
Published: 23 May 2006
Abstract
Background
While the analysis of unweighted biological webs as diverse as genetic, protein and metabolic networks has allowed spectacular insights into the inner workings of a cell, biological networks are not determined by their static grid of links alone. In fact, we expect that the heterogeneity in the utilization of connections has a major impact on the organization of cellular activities as well.
Results
We consider a web of interactions between protein domains of the Protein Family database (PFAM), which are weighted by a probability score. We apply metrics that combine the static layout and the weights of the underlying interactions. We observe that unweighted measures and their weighted counterparts largely share the same trends in the underlying domain interaction network. However, we find only weak signals that weights and the static grid of interactions are connected entities. Therefore, assuming that a protein interaction is governed by a single domain interaction, we observe strong and significant correlations between the highest scoring domain interaction and the confidence of protein interactions in the underlying interaction sets of yeast and fly.
Modeling an interaction between two proteins whenever we find a high scoring protein domain interaction, we obtain 1,428 protein interactions among 361 proteins in the human malaria parasite Plasmodium falciparum. Assessing their quality by a logistic regression method, we observe that increasing confidence of predicted interactions is accompanied by high scoring domain interactions and elevated levels of functional similarity and evolutionary conservation.
Conclusion
Our results indicate that probability scores are randomly distributed, allowing us to treat the static grid and the weights of domain interactions as separate entities. In particular, this finding confirms earlier observations that a protein interaction is a matter of a single interaction event on the domain level. As an immediate application, we show a simple way to predict potential protein interactions by utilizing expectation scores of single domain interactions.
Background
The depiction of interactions between genes, proteins and metabolites as networks has uncovered unexpected similarities in the organization of various biological networks, indicating that generic principles and mechanics give rise to their structure. Although such networks vary extensively in their complexity, corroborative evidence points to a series of simple organizing principles that characterize all complex networks. The most dramatic is the scale-free nature of these networks, a remarkable inhomogeneity that highlights a small number of highly connected nodes (hubs) which secure the network's integrity [1]. The special role such proteins play for the stability of protein interaction networks is further indicated by their significant propensity to be simultaneously essential and evolutionarily conserved [2]. Reflecting their inherent cohesive nature, complex networks are characterized by the accumulation of discernible modules. Such clusters of densely interconnected nodes combine in an overlapping manner, share well defined functions and use hubs as the modules' connectors [1, 3, 4]. Similarly to hubs, cohesively bound motifs of protein networks are frequently conserved as a whole, suggesting their role as evolutionarily relevant units [5]. While these findings allowed spectacular insights into the inner workings of a cell, biological networks are generally not determined by their layout of links alone. In fact, we expect that the heterogeneity in the utilization of connections has a major impact on the organization of cellular activities as well. Recently, attention turned to weighted scientific collaboration and air transportation networks [6], allowing a first insight into the intricate interplay between links and their weights. In conclusion, the analysis of these real world networks indicates that the static grid of links and their weights cannot be regarded as separate entities.
Here, we present a first statistical analysis of a weighted biological network by considering a web of PFAM domain interactions. Each link between domains is weighted by an expectation score, reflecting the probability that a particular domain interaction indeed gives rise to observed protein interactions. Applying metrics that combine the static layout of interactions and their weights, we observe that the patterns of correlations are similar for weighted and unweighted network parameters. In contrast to other real world networks, we find only weak signals, which do not support an entanglement of the static grid and the weights of domain interactions, allowing us to confirm that protein interactions are largely governed by single domain interactions.
Assuming that pairs of interacting proteins in S. cerevisiae and D. melanogaster are indeed dominated by the highest scoring domain interaction their domain architectures suggest, we find that the confidence score of a protein interaction correlates well with its highest scoring domain interaction. As an application, this observation indicates a simple method to model interactions between proteins of the human malaria parasite P. falciparum. Assuming an interaction between two proteins if we find at least one high scoring domain interaction, we predict 1,428 novel protein interactions among 361 proteins. The quality of each predicted interaction is assessed by a logistic regression model, allowing us to uncover reliable interactions between proteins that share similar functions and are preferably conserved in evolution.
Results
As a source of high quality interaction data of protein domains we utilized the results of a recent study by Riley et al. [7]. In this statistical approach, called domain pair exclusion analysis (DPEA), a likelihood ratio test is applied to assess the contribution of each potential PFAM-A and PFAM-B domain [8] interaction to the likelihood of a set of observed protein interactions taken from DIP [9]. Applying a statistical framework which evaluates the confidence that domains i and j indeed interact, the authors obtain a network of 1,566 domains that are embedded in a web of 2,767 interactions. Weighting each interaction by its probability score (the expectation value E_{ ij } [7]), we are primarily interested in the interplay between topology and the reliability of the underlying interactions.
To investigate whether the topology of the underlying domain interaction network and its weights are indeed independent of each other, we combine topology and weights in a series of measures that enable a more meaningful assessment of the impact of weights [6]. In an unweighted domain interaction network, a domain's degree is defined as k_{ i } = ∑_{ j } a_{ ij }, where a_{ ij } = 1 if there exists a link between domains i and j and a_{ ij } = 0 otherwise. Extending this definition, the strength of a domain i is defined as
${s}_{i}={\displaystyle \sum _{j}{a}_{ij}{E}_{ij},}\left(1\right)$
Statistics of single domains. Domains in the underlying interaction network are characterized by their degree k and their strength s, defined as the sum of the weights of all interactions in which the domain is involved. Here, we show the 10 most connected and the 10 strongest PFAM domains.
PFAM domain | description | degree k | PFAM domain | description | strength s |
---|---|---|---|---|---|
PF01423 | LSM | 72 | PF01423 | LSM | 777.7 |
PF00071 | ras | 50 | PF00118 | TCP-1/cpn60 | 294.5 |
PF00022 | actin | 50 | PF00022 | actin | 291.5 |
PF00069 | pkinase | 49 | PF00069 | pkinase | 289.0 |
PF00076 | rrm1 | 45 | PF00071 | ras | 263.5 |
PF00118 | TCP-1/cpn60 | 43 | PF00076 | rrm1 | 253.4 |
PF00096 | zf-C2H2 | 39 | PB075870 | – | 248.8 |
PB075780 | – | 39 | PF00227 | proteasome | 237.1 |
PF00036 | efhand | 36 | PF01008 | IF-2B | 226.5 |
PF01008 | IF-2B | 35 | PF00001 | 7tm-1 | 226.0 |
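As a minimal sketch of how the two columns of the table relate to Equation (1), the following Python snippet computes degree and strength from a weighted edge list. The function name `degree_and_strength`, the toy domain labels and the scores are illustrative assumptions, not data from the study; counting a self-interaction once is also an assumption that follows from reading k_{ i } = ∑_{ j } a_{ ij } literally.

```python
from collections import defaultdict


def degree_and_strength(weighted_edges):
    """Return (k, s): degree and strength per domain.

    weighted_edges: iterable of (domain_i, domain_j, E_ij) tuples,
    with every undirected interaction listed exactly once.
    """
    k = defaultdict(int)    # k_i = sum_j a_ij
    s = defaultdict(float)  # s_i = sum_j a_ij * E_ij
    for i, j, e in weighted_edges:
        # {i, j} collapses to one node for a self-interaction,
        # so a_ii contributes once (assumption, see lead-in).
        for node in {i, j}:
            k[node] += 1
            s[node] += e
    return dict(k), dict(s)


# Toy network with made-up expectation scores.
edges = [("A", "B", 4.0), ("A", "C", 10.0), ("B", "C", 3.5), ("C", "C", 20.0)]
k, s = degree_and_strength(edges)
print(k)
print(s)
```

Sorting the domains by `k` or by `s` would then give rankings of the kind shown in the two halves of the table.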
Investigating the local cohesiveness of network areas, the unweighted clustering coefficient C_{ i } measures the degree of cohesiveness around a particular domain i [12]. The dependence of the average clustering coefficient C on the domain's degree k reveals further information about the structure of the underlying network. In most real world networks C(k) exhibits a highly nontrivial behavior, exemplified by a power-law decay with increasing degree k. Averaging over the clustering coefficients of domains with a certain degree k, we find this particular signature, indicating the presence of a nested hierarchy of modules [1] (Figure 1c). Accounting for weights, Barrat et al. [6] extended the initial definition of the clustering coefficient to combine topological information with the weights of network links. Considering the expectation value E_{ ij } of each domain interaction as the weight of the link, we define the weighted clustering coefficient as
${C}_{i}^{w}=\frac{1}{{s}_{i}({k}_{i}-1)}{\displaystyle \sum _{j,h}\frac{{E}_{ij}+{E}_{ih}}{2}}{a}_{ij}{a}_{ih}{a}_{jh}.\left(2\right)$
Since the structure essentially follows the concept of the original clustering coefficient, we expect that ${C}_{i}^{w}$ retains its dependence on the degree k. Indeed, we find a power-law dependence in both networks (Figure 1c). Considering the mean weighted clustering coefficient of the whole network as the arithmetic mean over all N domains, $\langle {C}^{w}\rangle =\frac{1}{N}{\displaystyle {\sum}_{i=1}^{N}{C}_{i}^{w}}$, we obtain 0.097. Comparing this result to the value of the mean unweighted clustering coefficient of 0.093, we find that ⟨C^{ w }⟩/⟨C⟩ ≈ 1.0. Since the weighted clustering coefficient reflects whether a domain's neighbors are connected to domains of similar strength, this result indicates that local clustering predominantly occurs at the level of comparable strength.
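The following Python sketch evaluates Equation (2) and its unweighted counterpart on a small toy network. It is an illustration under stated assumptions: the adjacency is held as a symmetric dict of dicts with `adj[u][v] = E_uv`, self-interactions are excluded, and the function name `clustering` and the toy weights are hypothetical.

```python
def clustering(adj, i, weighted=True):
    """Clustering coefficient of node i.

    adj: symmetric dict of dicts, adj[u][v] = E_uv, no self-loops.
    weighted=True  -> C_i^w as in Equation (2)
    weighted=False -> classical C_i (fraction of connected neighbor pairs)
    """
    nbrs = list(adj[i])
    k_i = len(nbrs)
    if k_i < 2:
        return 0.0  # no neighbor pair can form a triangle
    s_i = sum(adj[i].values())
    total = 0.0
    # Sum over ordered neighbor pairs (j, h) that are themselves linked.
    for j in nbrs:
        for h in nbrs:
            if j != h and h in adj[j]:
                total += (adj[i][j] + adj[i][h]) / 2 if weighted else 1.0
    norm = s_i * (k_i - 1) if weighted else k_i * (k_i - 1)
    return total / norm


# Toy network: triangle A-B-C plus a pendant node D attached to A.
adj = {
    "A": {"B": 2.0, "C": 4.0, "D": 6.0},
    "B": {"A": 2.0, "C": 1.0},
    "C": {"A": 4.0, "B": 1.0},
    "D": {"A": 6.0},
}
print(clustering(adj, "A"), clustering(adj, "A", weighted=False))
```

In this toy case the weighted value for A (0.25) is below the unweighted one (1/3) because the heaviest link of A leads to D, which takes part in no triangle. If all weights are equal, the two definitions coincide.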
Another measure that allows insights into the relationship between network layout and weights is the degree-degree correlation. Similarly to C^{ w }, we define the average weighted nearest-neighbor degree as [6]
${k}_{nn,i}^{w}=\frac{1}{{s}_{i}}{\displaystyle \sum _{j}{a}_{ij}{E}_{ij}{k}_{j}}.\left(3\right)$
In an unweighted network the definition of k_{nn,i} recovers the average nearest neighbor degree of a node, ${k}_{nn,i}=\frac{1}{{k}_{i}}{\displaystyle {\sum}_{j}{a}_{ij}{k}_{j}}$. In the presence of correlations with connectivity k, the behavior of k_{ nn }(k) identifies two classes of networks. If k_{ nn }(k) is an increasing function of k, vertices with higher degree have an increased probability of being connected to large-degree vertices, a feature that is known as assortative mixing. If k_{ nn }(k) decreases with k, the underlying network is disassortative, indicating that high degree vertices are predominantly connected to sparsely linked ones. Similarly to other biological networks [13], we find a weak albeit significant trend toward disassortativity in both the unweighted and weighted domain interaction networks (Figure 1d). Considering the weighted nearest neighbor degree of the whole network as the arithmetic mean over all N nodes, $\langle {k}_{nn}^{w}\rangle =\frac{1}{N}{\displaystyle {\sum}_{i=1}^{N}{k}_{nn,i}^{w}}$, we obtain 12.81. Comparing this result to the value of the mean unweighted nearest neighbor degree of 12.84, we find that ⟨k_{ nn }^{ w }⟩/⟨k_{ nn }⟩ ≈ 1.0, confirming that the disassortative behavior prevails in both the weighted and the unweighted representation.
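A small Python sketch may help to make the two nearest-neighbor measures concrete. It assumes the weighted definition of Barrat et al. [6], in which each neighbor's degree is weighted by E_{ ij } and the sum is normalized by the strength s_{ i }. The function name, the dict-of-dicts layout and the toy weights are illustrative assumptions.

```python
def nearest_neighbor_degree(adj, i, weighted=True):
    """Average (weighted) nearest-neighbor degree of node i.

    adj: symmetric dict of dicts, adj[u][v] = E_uv, no self-loops.
    weighted=True  -> (1 / s_i) * sum_j E_ij * k_j
    weighted=False -> (1 / k_i) * sum_j k_j
    """
    if weighted:
        s_i = sum(adj[i].values())
        return sum(e_ij * len(adj[j]) for j, e_ij in adj[i].items()) / s_i
    return sum(len(adj[j]) for j in adj[i]) / len(adj[i])


# Toy network: triangle A-B-C plus a pendant node D attached to A.
adj = {
    "A": {"B": 2.0, "C": 4.0, "D": 6.0},
    "B": {"A": 2.0, "C": 1.0},
    "C": {"A": 4.0, "B": 1.0},
    "D": {"A": 6.0},
}
knn_w = nearest_neighbor_degree(adj, "A")
knn = nearest_neighbor_degree(adj, "A", weighted=False)
print(knn_w, knn)
```

Here the weighted value for A (1.5) is smaller than the unweighted one (5/3) because the strongest link of A points to the low-degree node D. A ratio close to 1, as reported for the domain network, means that heavy links show no such preference.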
${Y}_{2}(i)={\displaystyle \sum _{j\in \Gamma (i)}\frac{{E}_{ij}^{2}}{{s}_{i}^{2}}}\left(4\right)$
where Γ(i) is the set of neighbors of domain i. In Figure 2b we observe a clear power law in the dependence of the disparity value Y_{2} on the degree k, Y_{2}(k) ~ k^{-0.9}. Similarly to the dependence of the strength on the degree (Figure 2a), an exponent close to 1 suggests that the expectation values of domain interactions are distributed in an uncorrelated manner [6, 14].
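To illustrate why Y_{2}(k) ~ 1/k signals evenly spread weights, the sketch below evaluates Equation (4) for two hypothetical domains. The function name `disparity` and all weights are made up for illustration.

```python
def disparity(weights):
    """Disparity Y_2 of a node, given the weights E_ij of its links."""
    s_i = sum(weights)
    return sum((e / s_i) ** 2 for e in weights)


# Weights of comparable size: Y_2 equals 1/k exactly when all are equal.
even = [5.0, 5.0, 5.0, 5.0]
# One dominant link: Y_2 approaches 1 regardless of the degree.
skewed = [100.0, 1.0, 1.0, 1.0]

print(disparity(even), disparity(skewed))
```

With equal weights a domain of degree 4 gives Y_{2} = 0.25 = 1/k, whereas a single dominant link pushes Y_{2} toward 1. An observed exponent near -1 is therefore what one would expect if no small subset of links dominates a domain's strength.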
The absence of any correlations between the structure of the web of domain interactions and their confidence suggests that domain interactions hardly interfere with each other. As a consequence, protein interactions are primarily governed by a single domain interaction. Indeed, a recent survey found that 94% of protein interactions are determined by a single pairwise domain interaction [15], while protein interactions that involve interactions between two or more domain pairs are hardly found. A high E_{ ij } reflects the probability that the domains in question indeed interact, while a low E_{ ij } suggests that other potential domain interactions are roughly as good at explaining the observed protein interactions [7]. Therefore, we assume that a protein interaction is governed by the domain interaction with the highest expectation value. In order to uncover a potential correlation between the quality of a particular protein interaction and the highest scoring domain interaction, we utilize two well curated sets of protein interactions in S. cerevisiae [16] and D. melanogaster [17], where each interaction is evaluated by a confidence score. Utilizing information about the domain composition of proteins from the Integr8 database, we screen each domain pair that is suggested by the domain architectures of the underlying proteins. Provided these pairs indeed map to high scoring domain interactions, each protein interaction is assumed to be governed by the domain interaction with the highest expectation score. Applied to the evaluated protein interaction sets of S. cerevisiae and D. melanogaster, we observe a strong and significant correlation between an interaction's confidence and the expectation value of the underlying highest scoring domain interaction (Figure 4a).
In turn, we can use the previous conclusion, namely that the absence of correlations between interactions and their probability indicates the dominance of single domain interactions, as a means to infer protein interactions. As a test organism, we chose the human malaria parasite P. falciparum. Utilizing domain information from the Integr8 database, we annotate Plasmodium proteins with their corresponding PFAM domains. In order to avoid interactions between proteins that appear in different compartments, we additionally annotate each protein with its cellular component terms from the GO Slim database [18]. Considering all protein pairs of Plasmodium, we select those that share at least one GO Slim term. The domain architectures of candidate protein pairs are screened for domain pairs that have at least one high scoring domain interaction. If we find more than one high scoring domain interaction, we choose the highest scoring one, following the statistical argument that domain interactions with a higher expectation score have a better chance of explaining the underlying protein interaction. In Figure 3a, we give a schematic overview of the procedure. Applying this method to the proteome of P. falciparum, we find 1,428 interactions between 361 proteins [see Additional file 1]. In order to evaluate each of these potential protein interactions, we characterize each link by measures that reflect biological significance. In particular, we are interested in parameters that are independent of the initial assumption that the highest scoring domain interaction can be used to predict protein interactions. As such, we choose co-expression correlation values of interacting proteins, since similar expression profiles tend to indicate interacting proteins. For P. falciparum, we utilized gene expression data over 48 time points.
Compiling gene expression data derived from micro-array analysis [19–21], we determine Pearson's correlation coefficient r_{ P } of each interaction (see Methods). In addition, we calculated the hypergeometric clustering coefficient C_{ vw } for each interaction, a topological measure that reflects local cohesiveness around a certain link and strongly correlates with the quality of the underlying protein interaction [22] (see Methods). Combining these measures, we utilized a logistic regression method (see Methods) trained on carefully selected sets of 213 true positive and 173 negative interactions, allowing us to assess the quality of each interaction by a confidence score between 0 and 1 (Figure 3b). As a quality measure of the utilized training sets, we performed a leave-one-out analysis, which yielded 95% accuracy.
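The candidate selection step described above (shared compartment, then highest scoring domain pair) can be sketched in Python as follows. This is an illustration rather than the original pipeline: the function name `predict_interactions`, the protein identifiers `P1` to `P3`, the compartment labels and the use of the network cutoff E ≥ 3 as the "high scoring" threshold are all assumptions. Only the two domain pair scores are taken from the table of frequent domain interactions.

```python
from itertools import combinations_with_replacement


def predict_interactions(domains, compartments, ddi_scores, threshold=3.0):
    """Predict protein pairs from their best scoring domain interaction.

    domains:      protein -> set of PFAM domain ids
    compartments: protein -> set of GO Slim cellular component terms
    ddi_scores:   frozenset({d1, d2}) -> expectation value E
    Returns {(p, q): (E, domain_of_p, domain_of_q)}.
    """
    predictions = {}
    # with_replacement keeps self interactions (p, p) as candidates
    for p, q in combinations_with_replacement(sorted(domains), 2):
        # keep only pairs that share at least one compartment term
        if not compartments.get(p, set()) & compartments.get(q, set()):
            continue
        best = None
        for d1 in sorted(domains[p]):
            for d2 in sorted(domains[q]):
                e = ddi_scores.get(frozenset((d1, d2)))
                if e is not None and e >= threshold:
                    if best is None or e > best[0]:
                        best = (e, d1, d2)
        if best is not None:
            predictions[(p, q)] = best
    return predictions


# Hypothetical proteins; scores for the two domain pairs are from the table.
domains = {"P1": {"PF00076"}, "P2": {"PF00076", "PF01423"}, "P3": {"PF01423"}}
compartments = {"P1": {"nucleus"}, "P2": {"nucleus"}, "P3": {"cytoplasm"}}
ddi_scores = {
    frozenset(("PF00076", "PF01423")): 14.5,  # rrm1 / LSM
    frozenset(("PF01423",)): 387.1,           # LSM / LSM
}
preds = predict_interactions(domains, compartments, ddi_scores)
for pair, hit in sorted(preds.items()):
    print(pair, hit)
```

In this toy run P1 and P3 are never paired because they share no compartment term, and the self interaction of P2 is attributed to the LSM/LSM pair because it outscores the rrm1/LSM alternative.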
Domain interactions in predictions of protein interactions in Plasmodium. Predicting protein interactions by their highest scoring domain interaction in P. falciparum, we find the following 20 most frequent domain interactions. N refers to the domain interaction's occurrence in the predicted set, %_{ sl } is the percentage of self protein interactions, and E is the expectation value of the underlying domain interaction.
domain | description | domain | description | N | %_{ sl } | E |
---|---|---|---|---|---|---|
PF00076 | rrm1 | PF01423 | LSM | 137 | - | 14.5 |
PF00227 | proteasome | PF00227 | proteasome | 120 | 12.5 | 103.1 |
PF01423 | LSM | PF01423 | LSM | 120 | 12.5 | 387.1 |
PF00005 | ABC transporter | PF00005 | ABC transporter | 83 | 16.5 | 4.9 |
PF00097 | zf-C3HC4 | PF00240 | ubiquitin | 74 | - | 5.7 |
PF00076 | rrm1 | PF00076 | rrm1 | 56 | 28.7 | 14.5 |
PF00022 | actin | PF00022 | actin | 55 | 18.1 | 8.5 |
PF00125 | histone | PF00125 | histone | 36 | 22.1 | 11.6 |
PF01423 | LSM | PF06220 | zf-U1 | 30 | - | 20.3 |
PF02953 | Tim10/DDP zinc finger | PF00153 | mitochondrial carrier | 30 | - | 6.6 |
PF00097 | zf-C3HC4 | PF01283 | Ribosomal protein S | 24 | - | 3.0 |
PF00097 | zf-C3HC4 | PF01775 | Ribosomal L18ae | 24 | - | 7.3 |
PF00097 | zf-C3HC4 | PF00833 | Ribosomal S17 | 24 | - | 3.6 |
PF00097 | zf-C3HC4 | PF00827 | Ribosomal L15 | 24 | - | 3.2 |
PF00118 | TCP-1/cpn60 chaperonin | PF00118 | TCP-1/cpn60 chaperonin | 23 | 34.8 | 17.9 |
PF00928 | Adaptor complexes | PF01217 | Clathrin adaptor | 20 | - | 21.7 |
PF00076 | rrm1 | PF01974 | tRNA intron endonuclease | 18 | - | 3.0 |
PF00076 | rrm1 | PF06220 | zf-U1 | 18 | - | 6.2 |
PF01602 | Adaptin N terminal region | PF01217 | Clathrin adaptor complex | 16 | - | 9.2 |
PF00125 | histone | PF00956 | Nucleosome assembly protein | 16 | - | 14.5 |
Discussion & conclusion
Assessing the statistical characteristics of a weighted domain interaction network, we show that the confidence, as exemplified by the expectation value of domain interactions, is far from being evenly distributed. Characterizing the underlying weighted domain interaction network, we observe that weighted and unweighted measures of topology follow the same trends. Despite these observations, we do not find any significant evidence that topology and weights in the domain interaction network depend on each other. In fact, the correlations between strength and connectivity as well as the disparity suggest that weights, as exemplified by the expectation score of each domain interaction, are randomly distributed, allowing us to (i) treat the static layout of links and their weights as separate entities and (ii) conclude that protein interactions are indeed governed by a single protein domain interaction [15].
The presence of highly reliable domain interactions offers potential new ways for the prediction and evaluation of protein interactions. In particular, we observe a correlation between an elevated confidence level of a protein interaction in yeast and fly and an increase in the reliability of the underlying domain interactions. As an application, we propose a novel method for the inference of potential protein interactions. While this method can be applied to the prediction of protein interactions in any organism for which PFAM annotation of the organism's proteome is available, we chose the human malaria parasite P. falciparum. Screening all pairs of proteins that provide at least one high scoring domain interaction, we sample potential candidates. Here, we stress that the determination of a high scoring domain interaction has been used as a preselection step for potential protein interaction candidates. In order to evaluate each interaction, we resort to interaction specific parameters that are independent of the underlying domain interactions. We find interactions between proteins that not only show an elevated degree of functional similarity and evolutionary conservation, but also validate our assumption that high scoring domain interactions indeed give rise to reliable interactions. Predominantly, we find an enrichment of protein interactions caused by domain interactions that represent functions in the ribosome, proteasome and spliceosome. As reported for protein complexes in other eukaryotes, these functions involve a considerable number of self interactions, which we also find in our predictions.
Comparing our predictions with existing experimental data sets, we find only a minimal overlap, caused by the fact that many proteins of P. falciparum are currently not annotated with PFAM domains. On the other hand, the experimental determination of protein interactions in P. falciparum is in its starting phase, covering about a quarter of known proteins. As such, our predictions can help focus experimental studies on specific interactions unique to this pathogen.
Methods
Domain-domain interactions
As a source of high quality interaction data of protein domains we utilized the results of a recent study by Riley et al. [7]. In this statistical approach, called domain pair exclusion analysis (DPEA), a likelihood ratio test is applied to assess the contribution of each potential PFAM-A and PFAM-B domain [8] interaction to the likelihood of a set of observed protein interactions. DPEA consists of three steps: (i) Utilizing protein interaction data from DIP [9], the frequency S_{ ij } of an interaction between domains i and j in relation to their abundance in the data is computed. (ii) Using S_{ ij } as an initial guess, an expectation maximization (EM) algorithm is applied to obtain a maximum likelihood estimate of Θ_{ ij }, which stands for the probability of domain interaction ij among all the possible domain interactions that are suggested by the domain architectures of the interacting protein pairs where domains i and j co-occur. (iii) All possible interactions of domains i and j are excluded from the mixture of competing hypotheses for the presence of the corresponding protein interactions, EM is rerun, and the change in likelihood is expressed as a log odds score, E_{ ij }, reflecting the confidence that domains i and j indeed interact. As such, a high value of E_{ ij } indicates that there is extensive evidence in the protein interaction data that domains i and j interact, while a low E_{ ij } suggests that other potential domain interactions are roughly as good at explaining the observed protein interactions [7]. As a proof of concept, domain pairs inferred to interact with high E_{ ij } are significantly enriched among domain pairs known to interact in the Protein Data Bank (PDB). The domain interaction network thus obtained comprises 1,566 domains, which are embedded in 2,767 interactions that score E_{ ij } ≥ 3.
Protein interactions
We utilized a large scale compilation of yeast protein interactions. In particular, this data set combines 47,783 experimentally obtained protein interactions among 4,175 proteins in S. cerevisiae [16], obtained from sources as diverse as mRNA expression studies and yeast two-hybrid screens. Each interaction was characterized by a confidence score obtained by the application of a logistic regression model. Analogously, the quality of experimentally determined protein interactions in D. melanogaster was assessed, covering 6,222 proteins and 16,914 links [17]. As for direct experimental observations of protein interactions in P. falciparum, we utilized a set of 2,475 interactions among 1,304 proteins that have been obtained by a modified yeast two-hybrid method [27]. Additionally, we utilized a large-scale compilation of human interactions totaling 89,572 interactions among 9,018 proteins [23, 24].
Protein domain data
The advent of fully sequenced genomes of various organisms has facilitated the investigation of proteomes. The Integr8 database has been set up to provide comprehensive statistical and comparative analyses of complete proteomes of fully sequenced organisms. The initial version of the application contained data for genomes and proteomes of 182 sequenced organisms (including 19 archaea, 150 bacteria and 13 eukaryotes) and proteome analyses derived through the integration of UniProt [31], InterPro [32], CluSTr [33], GO/GOA [34], EMSD, Genome Reviews and IPI [35]. In particular, we utilized IPI (International Protein Index) files to elucidate the domain architecture of the corresponding proteins. For our analysis, we focused on domain data retrieved from the PFAM database, a reliable collection of multiple sequence alignments of protein families and profile hidden Markov models [36].
Orthologous protein data
The InParanoid database [25] provides putative orthologous sequence information for the complete proteomes of pairs of the organisms S. cerevisiae, D. melanogaster, H. sapiens and P. falciparum. The algorithm for detecting orthologous relationships is based on pairwise similarity scores, which are by default calculated with the BLASTP program. InParanoid detects mutual best hits between sequences from two different species. These two sequences are the main orthologs that form an orthologous group. Other sequences are added to this group if they are closely related to one of the main orthologs. These additional members of the orthologous group are called in-paralogs. A confidence value provided by a standard bootstrap procedure for each in-paralog shows how closely related it is to the main ortholog. In our study, we only selected the main sequence pairs of each orthologous group, allowing us to obtain 2,319 proteins in yeast, 1,351 in D. melanogaster and 1,525 in H. sapiens with putative orthologs in P. falciparum.
Co-expression data
Genes with similar expression profiles are likely to encode interacting proteins. For P. falciparum, we utilized gene expression data covering 5,156 genes over 48 time points from Winzeler et al. [19, 21] and 4,318 genes over 48 time points from Bozdech et al. [37]. As a gene similarity metric we calculated Pearson's correlation coefficient for every protein interaction over m time points, defined as
${r}_{P}=\frac{\frac{1}{m}{\displaystyle {\sum}_{i=1}^{m}{x}_{i}{y}_{i}}-\langle x\rangle \langle y\rangle }{{\sigma}_{x}{\sigma}_{y}}\left(5\right)$
where x_{ i } and y_{ i } are the expression values of the two genes at time point i, ⟨x⟩ and ⟨y⟩ are the sample means of the two expression profiles, and σ_{ x } and σ_{ y } are their standard deviations.
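For concreteness, Equation (5) can be written as a short Python function. The sketch uses the population form of the standard deviation (division by m) so that it matches the 1/m prefactor of the formula; the function name `pearson` and the toy profiles are illustrative assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (>= 2)")
    m = len(x)
    mean_x = sum(x) / m
    mean_y = sum(y) / m
    # numerator of Equation (5): (1/m) * sum x_i y_i - <x><y>
    cov = sum(a * b for a, b in zip(x, y)) / m - mean_x * mean_y
    # population standard deviations, consistent with the 1/m prefactor
    sd_x = math.sqrt(sum(a * a for a in x) / m - mean_x ** 2)
    sd_y = math.sqrt(sum(b * b for b in y) / m - mean_y ** 2)
    if sd_x == 0.0 or sd_y == 0.0:
        raise ValueError("constant profile: correlation undefined")
    return cov / (sd_x * sd_y)


# Toy profiles over four time points (made-up values).
up = [1.0, 2.0, 3.0, 4.0]
also_up = [2.0, 4.0, 6.0, 8.0]
down = [4.0, 3.0, 2.0, 1.0]
print(pearson(up, also_up), pearson(up, down))
```

Two profiles that rise together give r_{ P } close to 1, opposite trends give a value close to -1.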
Logistic regression
In order to get an estimate of an interaction's reliability, we employed a logistic regression model. According to the logistic regression, the probability of a true interaction T_{ vw } given the two input variables X = (x_{1}, x_{2}), the hypergeometric clustering coefficient x_{1} = C_{ vw } and the co-expression correlation coefficient x_{2} = r_{ P }, is
$Pr({T}_{vw}|X)=\frac{exp({\beta}_{0}+{\beta}_{1}{x}_{1}+{\beta}_{2}{x}_{2})}{1+exp({\beta}_{0}+{\beta}_{1}{x}_{1}+{\beta}_{2}{x}_{2})}\left(6\right)$
where β_{ n } are the parameters of the distribution. Given training data, we optimized the distribution parameters by maximizing the likelihood of the data. Here, we applied the corresponding routines of the Biopython package [38]. As a training set of true positives we chose 213 high scoring protein interactions in yeast [16] that are fully conserved in Plasmodium. In the same way, we selected 173 low scoring interactions as a true negative training set. To determine the prediction accuracy, we applied a leave-one-out analysis, in which the model is recalculated from the training data after removing the interaction to be predicted, allowing us to obtain the correct result in 95% of cases.
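The following self-contained Python sketch illustrates Equation (6) together with a leave-one-out evaluation. It deliberately does not reproduce the Biopython routines used in the study: the parameters are fitted here by plain gradient ascent on the log-likelihood, and the function names, learning rate, number of steps and the six toy training points are all assumptions made for illustration.

```python
import math


def predict(beta, x):
    """Pr(T | X) of Equation (6) for beta = (b0, b1, b2) and x = (x1, x2)."""
    z = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    if z >= 0:  # numerically stable form of exp(z) / (1 + exp(z))
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit(xs, ys, lr=0.1, steps=2000):
    """Maximize the log-likelihood by gradient ascent (illustrative choice)."""
    beta = [0.0] * (len(xs[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for x, y in zip(xs, ys):
            err = y - predict(beta, x)
            grad[0] += err
            for n, xi in enumerate(x, start=1):
                grad[n] += err * xi
        beta = [b + lr * g / len(xs) for b, g in zip(beta, grad)]
    return beta


def leave_one_out_accuracy(xs, ys):
    """Refit without each example in turn and test on the held-out one."""
    hits = 0
    for n in range(len(xs)):
        beta = fit(xs[:n] + xs[n + 1:], ys[:n] + ys[n + 1:])
        hits += (predict(beta, xs[n]) >= 0.5) == bool(ys[n])
    return hits / len(xs)


# Made-up training points (C_vw, r_P); 1 = true interaction, 0 = negative.
xs = [(3.0, 0.8), (2.5, 0.6), (2.8, 0.7), (0.2, -0.1), (0.5, 0.0), (0.1, 0.2)]
ys = [1, 1, 1, 0, 0, 0]
print(leave_one_out_accuracy(xs, ys))
```

On real data one would rather rely on an established fitting routine; the point of the sketch is only the structure of the leave-one-out loop, in which the model is refitted once per held-out interaction.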
Hypergeometric clustering coefficient
Recently, a network topology based approach uncovered a remarkable correlation between enhanced quality of protein interactions and the degree of clustering of their immediate network neighborhood [22]. Considering a network with N nodes, we define the hypergeometric clustering coefficient as
${C}_{vw}=-\mathrm{log}{\displaystyle \sum _{i=|N(v)\cap N(w)|}^{\mathrm{min}(|N(v)|,|N(w)|)}\frac{\left(\begin{array}{c}|N(v)|\\ i\end{array}\right)\left(\begin{array}{c}N-|N(v)|\\ |N(w)|-i\end{array}\right)}{\left(\begin{array}{c}N\\ |N(w)|\end{array}\right)}\left(7\right)}$
where N(x) represents the neighborhood of a vertex x. Given fixed neighborhood sizes |N(v)| and |N(w)| of nodes v and w, the hypergeometric clustering coefficient increases with elevated overlap between the nodes' neighborhoods. Provided that the neighborhoods are independent, the summation can be interpreted as a p value, reflecting the probability of obtaining a number of mutual neighbors between nodes v and w at or above the observed number by chance.
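Equation (7) translates directly into a few lines of Python using exact binomial coefficients. In this sketch the function name `hypergeometric_cc` and the toy neighborhoods are hypothetical, and the natural logarithm is assumed because the base of the logarithm is not stated.

```python
import math


def hypergeometric_cc(nbrs_v, nbrs_w, n_total):
    """Hypergeometric clustering coefficient C_vw of Equation (7).

    nbrs_v, nbrs_w: sets of neighbors of nodes v and w
    n_total:        number of nodes N in the network
    """
    a, b = len(nbrs_v), len(nbrs_w)
    overlap = len(nbrs_v & nbrs_w)
    # p value: probability of at least `overlap` shared neighbors by chance
    tail = sum(
        math.comb(a, i) * math.comb(n_total - a, b - i)
        for i in range(overlap, min(a, b) + 1)
    )
    p_value = tail / math.comb(n_total, b)
    return -math.log(p_value)  # natural log (assumption)


# Toy example in a network of N = 10 nodes.
c_two_shared = hypergeometric_cc({1, 2, 3}, {2, 3, 4}, 10)
c_all_shared = hypergeometric_cc({1, 2, 3}, {1, 2, 3}, 10)
c_none_shared = hypergeometric_cc({1}, {2}, 10)
print(c_two_shared, c_all_shared, c_none_shared)
```

With two of three neighbors shared the p value is 22/120, so C_{ vw } ≈ 1.70. Complete overlap gives a larger value, and no overlap gives a p value of 1 and hence C_{ vw } = 0.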
GO annotation data and functional homogeneity
Similarly to the hypergeometric clustering coefficient, we define the functional homogeneity of a protein pair vw as
$f{h}_{vw}=-\mathrm{log}{\displaystyle \sum _{i=|GO(v)\cap GO(w)|}^{\mathrm{min}(|GO(v)|,|GO(w)|)}\frac{\left(\begin{array}{c}|GO(v)|\\ i\end{array}\right)\left(\begin{array}{c}T-|GO(v)|\\ |GO(w)|-i\end{array}\right)}{\left(\begin{array}{c}T\\ |GO(w)|\end{array}\right)}\left(8\right)}$
where GO(v) is the set of GO terms of protein v, and T is the total number of different GO terms [18]. In analogy, the summation can be interpreted as a p value, reflecting the probability that a protein pair shares a certain number of GO terms at or above the observed number by chance.
Declarations
Acknowledgements
This project was entirely funded by the Northwestern Institute on Complexity (NICO).
References
- Barabási A, Oltvai Z: Network Biology: Understanding the Cell's Functional Organization. Nat Rev Genet. 2004, 5: 101-113. 10.1038/nrg1272.
- Wuchty S: Topology and Evolution in Yeast Interaction Networks. Genome Res. 2004, 14: 1310-1314. 10.1101/gr.2300204.
- Han J, Bertin N, Hao T, Goldberg DS, Berriz G, Zhang L, Dupuy D, Walhout A, Cusick M, Roth F, Vidal M: Evidence for dynamically organized modularity in the yeast protein-protein interaction network. Nature. 2004, 430: 88-93. 10.1038/nature02555.
- Guimera R, Amaral L: Functional cartography of complex metabolic networks. Nature. 2005, 433: 895-900. 10.1038/nature03288.
- Wuchty S, Oltvai Z, Barabási AL: Evolutionary conservation of motif constituents within the yeast protein interaction network. Nature Genetics. 2003, 35: 176-179. 10.1038/ng1242.
- Barrat A, Barthélemy M, Pastor-Satorras R, Vespignani A: The architecture of complex weighted networks. Proc Natl Acad Sci USA. 2004, 101 (11): 3747-3752. 10.1073/pnas.0400087101.
- Riley R, Lee C, Sabatti C, Eisenberg D: Inferring protein domain interactions from databases of interacting proteins. Genome Biol. 2005, 6 (10): R89. 10.1186/gb-2005-6-10-r89.
- Bateman A, Coin L, Durbin R, Finn R, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, Sonnhammer E, Studholme D, Yeats C, Eddy S: The Pfam protein families database. Nucl Acids Res. 2004, 32: D138-D141. 10.1093/nar/gkh121.
- Xenarios I, Salwinski L, Duan X, Higney P, Kim SM, Eisenberg D: DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucl Acids Res. 2002, 30: 303-305. 10.1093/nar/30.1.303.
- Park J, Lappe M, Teichmann S: Mapping Protein Family Interactions: Intramolecular and Intermolecular Protein Family Interaction Repertoires in the PDB and Yeast. J Mol Biol. 2001, 307: 929-938. 10.1006/jmbi.2001.4526.
- Albert R, Barabási AL: Statistical mechanics of complex networks. Rev Mod Phys. 2002, 74: 47. 10.1103/RevModPhys.74.47.
- Watts D, Strogatz S: Collective dynamics of 'small-world' networks. Nature. 1998, 393: 440-442. 10.1038/30918.
- Newman M: Assortative mixing in networks. Phys Rev Lett. 2002, 89: 208701. 10.1103/PhysRevLett.89.208701.
- Barthelemy M, Gondran B, Guichard E: Spatial structure of the internet traffic. Physica A. 2003, 319: 633-642. 10.1016/S0378-4371(02)01382-1.
- Aloy P, Böttcher B, Ceulemans H, Leutwein C, Mellwig C, Fischer S, Gavin AC, Bork P, Superti-Furga G, Serrano L, Russell R: Structure-Based Assembly of Protein Complexes in Yeast. Science. 2004, 303: 2026-2029. 10.1126/science.1092645.
- Bader JS, Chaudhuri A, Rothberg JM, Chant J: Gaining confidence in high-throughput protein interaction networks. Nat Biotechnol. 2004, 22: 78-85. 10.1038/nbt924.
- Giot L, Bader J, Brouwer C, Chaudhuri A, Kuang B, Li Y, Hao Y, Ooi C, Godwin B, Vitols E, Vijayadamodar G, Pochart P, Machineni H, Welsh M, Kong Y, Zerhusen B, Malcolm R, Varrone Z, Collis A, Minto M, Burgess S, McDaniel L, Stimpson E, Spriggs F, Williams J, Neurath K, Ioime N, Agee M, Voss E, Furtak K, Renzulli R, Aanensen N, Carrolla S, Bickelhaupt E, Lazovatsky Y, DaSilva A, Zhong J, Stanyon C, Finley R, White K, Braverman M, Jarvie T, Gold S, Leach M, Knight J, Shimkets R, McKenna M, Chant J, Rothberg J: A Protein Interaction Map of Drosophila melanogaster. Science. 2003, 302: 1727-1736. 10.1126/science.1090289.
- Consortium GO: The Gene Ontology (GO) database and informatics resource. Nucl Acids Res. 2004, 32: D258-D261. 10.1093/nar/gkh036.
- Le Roch K, Zhou Y, Blair P, Grainger M, Moch J, Haynes J, De la Vega P, Holder A, Batalov S, Carucci D, Winzeler E: Discovery of Gene Function by Expression Profiling of the Malaria Parasite Life Cycle. Science. 2003, 301: 1503-1508. 10.1126/science.1087025.PubMedView ArticleGoogle Scholar
- Johnson KLJ, Florens L, Zhou Y, Santrosyan A, Grainger M, Yan S, Williamson K, Holder A, Carucci D, Yates III J, Winzeler E: Global analysis of transcript and protein levels across the Plasmodium falciparum life cycle. Genome Res. 2004, 14: 2308-2318. 10.1101/gr.2523904.PubMedPubMed CentralView ArticleGoogle Scholar
- Winzeler E: Applied systems biology and malaria. Nat Rev Microbiol. 2006, 4: 145-151. 10.1038/nrmicro1327.PubMedView ArticleGoogle Scholar
- Goldberg D, Roth F: Assessing experimentally derived interactions in a small world. Proc Natl Acad Sci USA. 2003, 100: 4372-4376. 10.1073/pnas.0735871100.PubMedPubMed CentralView ArticleGoogle Scholar
- Lehner B, Fraser A: A first-draft human protein-interaction map. Genome Biol. 2004, 5 (9): R63-10.1186/gb-2004-5-9-r63.PubMedPubMed CentralView ArticleGoogle Scholar
- Ramani A, Bunescu R, Mooney R, Marcotte E: Consolidating the set of known human protein-protein interactions in preparation for large-scale mapping of the human interactome. Genome Biol. 2005, 6 (5): R40-10.1186/gb-2005-6-5-r40.PubMedPubMed CentralView ArticleGoogle Scholar
- Remm M, Storm C, Sonnhammer E: Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J Mol Biol. 2001, 314: 1041-1052. 10.1006/jmbi.2000.5197.PubMedView ArticleGoogle Scholar
- Ge H, Liu Z, Church G, Vidal M: Correlation between transcriptome and interactome mapping data from Saccharomyces cerevisiae. Nature Genetics. 2001, 29: 482-486. 10.1038/ng776.PubMedView ArticleGoogle Scholar
- LaCount D, Vignali M, Chettier R, Phansalkar A, Bell R, Hesselberth J, Schoenfeld L, I Ota SS, Kurschner C, Fields S, Hughes R: A protein interaction network of the malaria parasite Plasmodium falciparum. Nature. 2005, 438: 103-107. 10.1038/nature04104.PubMedView ArticleGoogle Scholar
- Bochtler M, Ditzel L, Groll M, Hartmann C, Huber R: The Proteasome. Annu Rev Biophys Biomol Struct. 1999, 28: 295-317. 10.1146/annurev.biophys.28.1.295.PubMedView ArticleGoogle Scholar
- Matadeen R, Patwardhan A, Gowen B, Orlova E, Pape T, Cuf M, Mueller F, Brimacombe R, van Heel M: The Escherichia coli large ribosomal subunit at 7.5A resolution. Structure Fold Des. 1999, 7: 1575-1583. 10.1016/S0969-2126(00)88348-3.PubMedView ArticleGoogle Scholar
- Mura C, Cascio D, Sawaya M, Eisenberg D: The crystal structure of a heptameric archaeal SM protein: implications for the eukaryotic snRNP core. Proc Natl Acad Sci USA. 2001, 98: 5532-5537. 10.1073/pnas.091102298.PubMedPubMed CentralView ArticleGoogle Scholar
- Apweiler R, Bairoch A, Wu C, Barker W, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin M, Natale D, O'Donovan C, Redaschi N, Yeh L: Uniprot: the universal protein knowledgebase. Nucl Acids Res. 2004, 32: D115-D119. 10.1093/nar/gkh131.PubMedPubMed CentralView ArticleGoogle Scholar
- Mulder N, Apweiler R, Attwood T, Bairoch A, Barrell D, Bateman A, Binns D, Biswas M, Bradley P, Bork P, Bucher P, Copley R, Courcelle E, Das U, Durbin R, Falquet L, Fleischmann W, Griffths-Jones S, Haft D, Harte N, Hulo N, Kahn D, Kanapin A, Krestyaninova M, Lopez R, Letunic I, Lonsdale D, Silventoinen V, Orchard S, Pagni M, Peyruc D, Ponting C, Selengut J, Servant F, Sigrist C, Vaughan R, Zdobnov E: The InterPro Database, 2003 brings increased coverage and new features. Nucl Acids Res. 2003, 31: 315-318. 10.1093/nar/gkg046.PubMedPubMed CentralView ArticleGoogle Scholar
- Kriventseva E, Fleischmann W, Zdobnov E, Apweiler R: CluSTr: a database of clusters of SWISS-PROT+TrEMBL proteins. Nucl Acids Res. 2001, 29: 33-36. 10.1093/nar/29.1.33.PubMedPubMed CentralView ArticleGoogle Scholar
- Consortium GO: The gene ontology (go) database and information resource. Nucl Acids Res. 2004, 32: D258-D261. 10.1093/nar/gkh036.View ArticleGoogle Scholar
- Kersey P, Duarte J, Williams A, Apweiler R, Karavidopoulou Y, Birney E: The international protein index: An integrated database for proteomics experiments. Proteomics. 2004, 4 (7): 1985-1988. 10.1002/pmic.200300721.PubMedView ArticleGoogle Scholar
- Bateman A, Coin L, Durbin R, Finn R, Hollich V, Griffths-Jones S, Khanna A, Marshall M, Moxon S, Sonnhammer E, Studholme D, Yeats C, Eddy S: The Pfam protein families database. Nucl Acids Res. 2004, 32: D138-D141. 10.1093/nar/gkh121.PubMedPubMed CentralView ArticleGoogle Scholar
- Bozdech Z, Llinas M, Pulliam B, Wong E, Zhu J, DeRisi J: The Transcriptome of the Intraerythrocytic Developmental Cycle of Plasmodium falciparum. PLoS Biol. 2003, 1: 1-16. 10.1371/journal.pbio.0000005.View ArticleGoogle Scholar
- The Biopython package. [http://www.biopython.org]
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.