Effect of dataset selection on the topological interpretation of protein interaction networks
© Hakes et al; licensee BioMed Central Ltd. 2005
Received: 12 July 2005
Accepted: 20 September 2005
Published: 20 September 2005
Studies of the yeast protein interaction network have revealed distinct correlations between the connectivity of individual proteins within the network and the average connectivity of their neighbours. Although a number of biological mechanisms have been proposed to account for these findings, the significance and influence of the specific datasets included in these studies have not been adequately appreciated.
We show how the use of different interaction data sets, such as those resulting from high-throughput or small-scale studies, and different modelling methodologies for the derivation of pair-wise protein interactions, can dramatically change the topology of these networks. Furthermore, we show that some of the previously reported features identified in these networks may simply be the result of experimental or methodological errors and biases.
When performing network-based studies, it is essential to define what is meant by the term "interaction" and this must be taken into account when interpreting the topologies of the networks generated. Consideration must be given to the type of data included, and appropriate controls that take into account the idiosyncrasies of the data must be selected.
In recent years, there has been an unprecedented growth in both the volume and the type of experimental data available to researchers interested in elucidating the biological networks that underpin the functions of living cells. To date, the majority of available eukaryotic data comes from the yeast Saccharomyces cerevisiae, where a variety of different networks have been subject to investigation, including gene regulatory, metabolic [2–4] and protein interaction networks. As the majority of cellular processes are mediated by protein-protein interactions, much attention has been focused on their study in the hope that their investigation on a "global" scale will help us to understand how a dynamically interconnected system manages to perform multiple functionally related tasks while maintaining stability against deleterious perturbations.
The recent deluge of protein interaction data generated from large-scale high-throughput systematic screens [6–9] has presented us with an opportunity to create networks consisting of thousands of interacting proteins. Analysis of the resulting networks has shown that, in common with other naturally occurring and artificial networks, protein-interaction networks display a scale-free topology [10, 11] and exhibit "small-world" properties. The scale-free property of these networks is thought to be of particular biological significance as it confers robustness to random node loss, allowing the network to maintain its overall integrity even when a significant number of nodes are removed. The concept of network-mediated robustness appears to be reinforced by the presence of a correlation between the connectivity of neighbouring nodes within the network (a feature not observed in random networks). In the yeast protein-interaction network, the observed negative correlation between the connectivity of a protein and the average connectivity of its binding partners has been seen as a possible adaptation which allows the network to be resilient to the propagation of deleterious perturbations. Recently, Pereira-Leal and co-workers showed that this correlation is valid only for the yeast protein-interaction network as a whole, and that the network formed by the proteins essential for yeast growth has its own unique topological properties, including a very high degree of connectivity (97% of the proteins form a single distinct sub-network), which they postulate may have some implications for our understanding of the network's evolution.
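The correlation between a protein's connectivity and the average connectivity of its binding partners can be illustrated with a minimal sketch. The function names and the toy network below are hypothetical, and the fit is made on raw (untransformed) values, which is an assumption: the text does not state whether the original analysis used a logarithmic scale.

```python
from collections import defaultdict
from math import sqrt


def build_adjacency(edges):
    """Undirected adjacency sets; self-interactions and duplicate pairs are dropped."""
    adj = defaultdict(set)
    for a, b in edges:
        if a != b:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def degree_vs_neighbour_degree(adj):
    """Return (k, mean degree of neighbours) for every node that has an edge."""
    points = []
    for node, neighbours in adj.items():
        k = len(neighbours)
        k_nn = sum(len(adj[n]) for n in neighbours) / k
        points.append((k, k_nn))
    return points


def pearson_and_slope(points):
    """Pearson's r and the least-squares slope of y on x for (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sqrt(sxx * syy), sxy / sxx


if __name__ == "__main__":
    # Toy star network (hypothetical): one hub bound to four low-degree partners.
    toy_edges = [("H", "A"), ("H", "B"), ("H", "C"), ("H", "D")]
    r, slope = pearson_and_slope(degree_vs_neighbour_degree(build_adjacency(toy_edges)))
    print(r, slope)  # both negative: the hub's partners are poorly connected
```

A star-shaped network is the extreme case of a negative correlation: the single highly connected node has only weakly connected neighbours, and every weakly connected node has a highly connected neighbour.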
Protein interaction networks are generally described using a graph theoretical approach, in which proteins within the graph (nodes) are connected by undirected links (edges) if they are found to interact. While creating a representation of the network is relatively straightforward, deciding what should be represented is often more difficult. Typically, networks are generated using interactions derived from a variety of experimental types, which may include protein interactions identified in both individual small-scale studies and larger systematic genome-scale screens, such as those from yeast two-hybrid (Y2H) and affinity-purification experiments. Too often, insufficient thought is given to how the interactions derived from these different systems have been, or should be, combined, and to the implications that different methodologies for combining them might have for the outcome of analyses.
The issue of data handling is of particular importance in the study of protein interactions derived from purified protein complexes. For any given purified complex that results from a FLAG or TAP tag-based experiment, it is very unlikely that every "prey" protein identified within the complex interacts directly with the "bait" protein. Other proteins or molecules (such as RNA) present within the mixture may act as scaffolds or bridges between the protein constituents. Consequently, we are unable to determine the true topology of the complex. In order to integrate this type of data with those from other experimental sources, we must first derive a set of hypothetical pair-wise protein interactions using either a "spoke" or a "matrix" model. The spoke model assumes that the bait protein physically interacts with each of the prey proteins in the complex but does not acknowledge any type of association between the preys. In contrast, the matrix model assumes that any two proteins within the "complex" are connected.
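The difference between the two models can be made concrete with a short sketch. The function names and the protein identifiers are hypothetical placeholders, not taken from any dataset used in this study.

```python
from itertools import combinations


def spoke_pairs(bait, preys):
    """Spoke model: the bait is paired with every prey; preys are not paired with each other."""
    return {frozenset((bait, prey)) for prey in preys if prey != bait}


def matrix_pairs(bait, preys):
    """Matrix model: every pair of proteins in the purified complex is connected."""
    members = set(preys) | {bait}
    return {frozenset(pair) for pair in combinations(sorted(members), 2)}


if __name__ == "__main__":
    # Hypothetical purification: one bait that pulls down three preys.
    spoke = spoke_pairs("BAIT", ["P1", "P2", "P3"])
    matrix = matrix_pairs("BAIT", ["P1", "P2", "P3"])
    print(len(spoke), len(matrix))  # 3 6
```

For a bait with n distinct preys, the spoke model yields n pairs whereas the matrix model yields n(n + 1)/2, so large complexes contribute disproportionately many edges, and many high-degree nodes, under the matrix model.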
Here, we investigate the effect that the choice of datasets, and modelling methodology (matrix or spoke), has on the topological properties of the yeast protein interaction network and discuss our results with respect to the notion of a negative correlation between nodes within the network (in some studies, this is referred to as an "anticorrelation"). We go on to investigate the notion of a highly connected essential sub-network and, finally, we discuss the nature of the term "interaction" and how the interpretation of that term might affect research within the field.
Variation in modelling methodology
A second prominent finding of earlier work investigating the yeast protein interaction network is that the essential sub-network is very highly connected, with ≈ 97% of all proteins within it being connected in a single giant component. The significance of this result was previously highlighted using a standard randomisation strategy, in which a number of nodes equivalent to that in the essential network were randomly selected from the global network and the connectivity of the resultant sub-network determined. To assess the validity of this finding, a "biased" randomisation strategy was employed that took into account the connectivity of the proteins within the essential sub-network. By mimicking the degree distribution of nodes within the essential network in the generated random networks, the average number of nodes encompassed within the largest connected component increased from 33% (using a standard randomisation strategy) to 88% over 1000 iterations. Although connectivity levels equal to that of the essential sub-network were not observed, levels as high as 92% connectivity were achieved.
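The connectivity measure used in this comparison, the fraction of selected nodes that fall within the largest connected component of the sub-network they induce, can be sketched as follows together with the standard (unbiased) randomisation control. This is an illustrative sketch rather than the software used in the study; the function names, the seed handling and the toy edge list are assumptions.

```python
import random
from collections import defaultdict


def largest_component_fraction(edges, selected):
    """Fraction of the selected nodes lying in the largest connected component
    of the sub-network formed by edges whose two ends are both selected."""
    selected = set(selected)
    if not selected:
        return 0.0
    adj = defaultdict(set)
    for a, b in edges:
        if a != b and a in selected and b in selected:
            adj[a].add(b)
            adj[b].add(a)
    seen, best = set(), 0
    for start in selected:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            size += 1
            for neighbour in adj.get(node, ()):
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        best = max(best, size)
    return best / len(selected)


def standard_randomisation(edges, pool, sample_size, iterations=1000, seed=0):
    """Mean largest-component fraction over uniformly random node samples."""
    rng = random.Random(seed)
    pool = sorted(pool)
    total = 0.0
    for _ in range(iterations):
        total += largest_component_fraction(edges, rng.sample(pool, sample_size))
    return total / iterations


if __name__ == "__main__":
    toy_edges = [("a", "b"), ("b", "c"), ("d", "e")]  # hypothetical network
    print(largest_component_fraction(toy_edges, {"a", "b", "c", "d", "e"}))
```

The biased control differs only in how the node sample is drawn: it replaces the uniform draw with a draw that reproduces the degree distribution of the essential proteins.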
In this study, we have shown how the choice of dataset and modelling methodology can profoundly affect the outcome of investigations into the topology of the yeast protein interaction network. We show that, while these variables have little effect on the apparent power-law degree distribution of nodes within the network, they can dramatically alter the correlation between the connectivities of neighbouring nodes. These results raise the question of what data should be included in these studies and, in the case of protein complex data, which of the two proposed modelling methods is the more appropriate for its incorporation. In a recent study, Bader and Hogue showed that pairs identified using the spoke model were more likely to be correct (i.e. in agreement with published literature) than interactions derived using the matrix model. However, Cornell and co-workers showed that there is little difference between the two modelling methods when the annotations of protein pairs found using each model were compared. This indicates that pairs derived using the matrix model are as meaningful (in terms of their functional annotation) as those derived using the spoke model, suggesting that either method provides a valid approach to modelling interactions. In fact, if we wish to include the "classical" hand-annotated MIPS complexes within our analyses, the matrix model becomes our only viable option, as it is the only method that allows us to define a set of pair-wise interactions for a protein complex whose topology is completely unknown.
In addition to the observations made about the correlations between neighbouring nodes, we have also shown the importance of using the correct control when selecting nodes for randomisation studies involving network connectivity. We found that, by simply matching the degree distribution of the randomly selected sample (composed entirely of non-essential genes) to that of the nodes within the essential network, we were able to achieve very similar levels of network connectivity. This result suggests that the highly connected nature of the sub-network of essential proteins previously reported by Pereira-Leal and co-workers is primarily a consequence of the high-degree bias of its nodes, rather than a manifestation of some specific evolutionary process.
We conclude that, before embarking on these network-based analyses, we must first be clear as to what we mean when we use the term "interaction". Interactions derived from direct physical studies, such as Y2H experiments, are very different from those found in synthetic genetic screens, which (in turn) are different again from "associations" between the proteins found within protein complexes. However, in several recent studies, many of these different interaction types have been lumped together as though they were equivalent and directly comparable. For instance, both Y2H data and synthetic lethal gene pairs count as 'interactions' in the GRID database, although the protein products of the latter rarely interact physically.
While graph theoretical analysis approaches have been successfully applied to a number of man-made and naturally occurring networks, these networks differ from the biological systems investigated in that every link between pairs of nodes within the network is of the same type and is generally independent of other factors. For example, analysis of the HTML pages that make up the content of the World Wide Web is relatively simple. In this network, both the nature of the relationships (hyperlinks) between nodes (pages) and the nodes themselves are usually homogeneous and well-defined. Therefore, meaningful and representative visualizations and quantifications of the structure of the network and its properties are possible. However, the "biological networks" we construct are not representative of the underlying system. Biological systems essentially comprise protein "machines" and biological function is mediated through associations between proteins, either directly through physical contact, or indirectly within protein complexes, or as part of the same biological pathway. Although it is technically possible to create an abstract representation of these associations, in reality the heterogeneity of the links, and the spatial and temporal restrictions imposed upon them, mean that the resulting topology and parameters of the network need not convey biologically meaningful information.
Network analysis was performed by extracting all machine-readable, yeast-derived protein interactions from the DIP database (release 20050605). Node connectivities and network topology were investigated using custom software written in the Perl programming language. Random networks were generated from a pool of non-essential proteins only. Construction of the random network continued until the appropriate number of proteins had been selected, such that the degree distribution within the sample was similar to that observed in the essential network. This was done using an algorithm that created a sample of nodes that, at each level of connectivity, matched as closely as possible (data permitting) the observed node numbers in the essential network. In instances where an exact match was not possible, another node with a degree within the same range as that of the desired node was selected. The essential sub-network is defined by taking into consideration only interactions between essential genes, as defined by the Saccharomyces Gene Deletion Project. Correlations between variables were determined by computing Pearson's correlation coefficient, r. We also report the slope, α, of a linear fit to the data.
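The degree-matched selection of non-essential proteins described above might be sketched as follows. This is not the original Perl implementation. The function and argument names are hypothetical, and two details are assumptions, since the text does not define them: the fallback picks the unused node with the nearest available degree (ties resolved towards the lower degree), and degrees are supplied by the caller rather than computed from a particular network.

```python
import random
from collections import defaultdict


def degree_matched_sample(target_degrees, pool_degrees, seed=0):
    """Pick one unused pool node for each target degree.

    target_degrees: degrees of the nodes to be mimicked (e.g. essential proteins).
    pool_degrees:   mapping of candidate node -> degree (e.g. non-essential proteins).
    An exact degree match is used when one is still available; otherwise the
    closest available degree is used (an assumed reading of "within the same range").
    """
    rng = random.Random(seed)
    by_degree = defaultdict(list)
    for node in sorted(pool_degrees):
        by_degree[pool_degrees[node]].append(node)
    for nodes in by_degree.values():
        rng.shuffle(nodes)  # random choice among nodes of equal degree

    sample = []
    # Rare high-degree targets are placed first so they get the best matches.
    for k in sorted(target_degrees, reverse=True):
        available = [d for d, nodes in by_degree.items() if nodes]
        if not available:
            raise ValueError("candidate pool is smaller than the target set")
        closest = min(available, key=lambda d: (abs(d - k), d))
        sample.append(by_degree[closest].pop())
    return sample


if __name__ == "__main__":
    essential_degrees = [5, 2, 2, 1]  # hypothetical values
    non_essential = {"n1": 5, "n2": 2, "n3": 1, "n4": 1, "n5": 3, "n6": 7}
    print(sorted(non_essential[n] for n in degree_matched_sample(essential_degrees, non_essential)))
```

Repeating the draw with different seeds, and measuring the largest connected component of each sample, gives the biased control against which the connectivity of the essential sub-network is compared.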
We thank Dr. Simon Lovell and for insights and discussions. LH is supported by a CASE Studentship from the Biotechnology & Biological Sciences Research Council (BBSRC) and AstraZeneca. Work on protein interactions is supported by grants from the BBSRC to SGO and DR, and from the Beacon initiative of the UK Department of Trade & Industry to SGO.
- Amoutzias GD, Robertson DL, Oliver SG, Bornberg-Bauer E: Convergent evolution of gene networks by single-gene duplications in higher eukaryotes. EMBO Rep. 2004, 5 (3): 274-279. 10.1038/sj.embor.7400096.
- Jeong H, Tombor B, Albert R, Oltvai ZN, Barabasi AL: The large-scale organization of metabolic networks. Nature. 2000, 407 (6804): 651-654. 10.1038/35036627.
- Ma H, Zeng AP: Reconstruction of metabolic networks from genome data and analysis of their global structure for various organisms. Bioinformatics. 2003, 19 (2): 270-277. 10.1093/bioinformatics/19.2.270.
- Wagner A, Fell DA: The small world inside large metabolic networks. Proc R Soc Lond B Biol Sci. 2001, 268 (1478): 1803-1810. 10.1098/rspb.2001.1711.
- Wuchty S: Evolution and topology in the yeast protein interaction network. Genome Res. 2004, 14 (7): 1310-1314. 10.1101/gr.2300204.
- Gavin AC, Bosche M, Krause R, Grandi P, Marzioch M, Bauer A, Schultz J, Rick JM, Michon AM, Cruciat CM, Remor M, Hofert C, Schelder M, Brajenovic M, Ruffner H, Merino A, Klein K, Hudak M, Dickson D, Rudi T, Gnau V, Bauch A, Bastuck S, Huhse B, Leutwein C, Heurtier MA, Copley RR, Edelmann A, Querfurth E, Rybin V, Drewes G, Raida M, Bouwmeester T, Bork P, Seraphin B, Kuster B, Neubauer G, Superti-Furga G: Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature. 2002, 415 (6868): 141-147. 10.1038/415141a.
- Ho Y, Gruhler A, Heilbut A, Bader GD, Moore L, Adams SL, Millar A, Taylor P, Bennett K, Boutilier K, Yang L, Wolting C, Donaldson I, Schandorff S, Shewnarane J, Vo M, Taggart J, Goudreault M, Muskat B, Alfarano C, Dewar D, Lin Z, Michalickova K, Willems AR, Sassi H, Nielsen PA, Rasmussen KJ, Andersen JR, Johansen LE, Hansen LH, Jespersen H, Podtelejnikov A, Nielsen E, Crawford J, Poulsen V, Sorensen BD, Matthiesen J, Hendrickson RC, Gleeson F, Pawson T, Moran MF, Durocher D, Mann M, Hogue CW, Figeys D, Tyers M: Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature. 2002, 415 (6868): 180-183. 10.1038/415180a.
- Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y: A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci U S A. 2001, 98 (8): 4569-4574. 10.1073/pnas.061034498.
- Uetz P, Giot L, Cagney G, Mansfield TA, Judson RS, Knight JR, Lockshon D, Narayan V, Srinivasan M, Pochart P, Qureshi-Emili A, Li Y, Godwin B, Conover D, Kalbfleisch T, Vijayadamodar G, Yang M, Johnston M, Fields S, Rothberg JM: A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature. 2000, 403 (6770): 623-627. 10.1038/35001009.
- Jeong H, Mason SP, Barabasi AL, Oltvai ZN: Lethality and centrality in protein networks. Nature. 2001, 411 (6833): 41-42. 10.1038/35075138.
- Wagner A: The yeast protein interaction network evolves rapidly and contains few redundant duplicate genes. Mol Biol Evol. 2001, 18 (7): 1283-1292.
- Wuchty S: Interaction and domain networks of yeast. Proteomics. 2002, 2 (12): 1715-1723. 10.1002/1615-9861(200212)2:12<1715::AID-PROT1715>3.0.CO;2-O.
- Albert R, Jeong H, Barabasi AL: Error and attack tolerance of complex networks. Nature. 2000, 406 (6794): 378-382. 10.1038/35019019.
- Maslov S, Sneppen K: Specificity and stability in topology of protein networks. Science. 2002, 296 (5569): 910-913. 10.1126/science.1065103.
- Pereira-Leal JB, Audit B, Peregrin-Alvarez JM, Ouzounis CA: An exponential core in the heart of the yeast protein interaction network. Mol Biol Evol. 2005, 22 (3): 421-425. 10.1093/molbev/msi024.
- Bader GD, Hogue CW: Analyzing yeast protein-protein interaction data obtained from different sources. Nat Biotechnol. 2002, 20 (10): 991-997. 10.1038/nbt1002-991.
- Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D: The Database of Interacting Proteins: 2004 update. Nucleic Acids Res. 2004, 32 (Database issue): D449-51. 10.1093/nar/gkh086.
- Deane CM, Salwinski L, Xenarios I, Eisenberg D: Protein interactions: two methods for assessment of the reliability of high throughput observations. Mol Cell Proteomics. 2002, 1 (5): 349-356. 10.1074/mcp.M100037-MCP200.
- Cornell M, Paton NW, Oliver SG: A critical and integrated view of the yeast interactome. Comparative and Functional Genomics. 2004, 382-402. 10.1002/cfg.412.
- Vidalain PO, Boxem M, Ge H, Li S, Vidal M: Increasing specificity in high-throughput yeast two-hybrid experiments. Methods. 2004, 32 (4): 363-370. 10.1016/j.ymeth.2003.10.001.
- Tong AH, Lesage G, Bader GD, Ding H, Xu H, Xin X, Young J, Berriz GF, Brost RL, Chang M, Chen Y, Cheng X, Chua G, Friesen H, Goldberg DS, Haynes J, Humphries C, He G, Hussein S, Ke L, Krogan N, Li Z, Levinson JN, Lu H, Menard P, Munyana C, Parsons AB, Ryan O, Tonikian R, Roberts T, Sdicu AM, Shapiro J, Sheikh B, Suter B, Wong SL, Zhang LV, Zhu H, Burd CG, Munro S, Sander C, Rine J, Greenblatt J, Peter M, Bretscher A, Bell G, Roth FP, Brown GW, Andrews B, Bussey H, Boone C: Global mapping of the yeast genetic interaction network. Science. 2004, 303 (5659): 808-813. 10.1126/science.1091317.
- Barabasi AL, Albert R: Emergence of scaling in random networks. Science. 1999, 286 (5439): 509-512. 10.1126/science.286.5439.509.
- Spirin V, Mirny LA: Protein complexes and functional modules in molecular networks. Proc Natl Acad Sci U S A. 2003, 100 (21): 12123-12128. 10.1073/pnas.2032324100.
- Saccharomyces Gene Deletion Project. [http://www-sequence.stanford.edu/group/yeast_deletion_project/Essential_ORFs.txt]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.