Predicting protein-protein interactions in Arabidopsis thaliana through integration of orthology, gene ontology and co-expression
© De Bodt et al; licensee BioMed Central Ltd. 2009
Received: 11 February 2009
Accepted: 29 June 2009
Published: 29 June 2009
Large-scale identification of the interrelationships between different components of the cell, such as the interactions between proteins, has recently gained great interest. However, unraveling large-scale protein-protein interaction maps is laborious and expensive. Moreover, assessing the reliability of the interactions can be cumbersome.
In this study, we have developed a computational method that exploits the existing knowledge on protein-protein interactions in diverse species through orthologous relations on the one hand, and functional association data on the other hand to predict and filter protein-protein interactions in Arabidopsis thaliana. A highly reliable set of protein-protein interactions is predicted through this integrative approach making use of existing protein-protein interaction data from yeast, human, C. elegans and D. melanogaster. Localization, biological process, and co-expression data are used as powerful indicators for protein-protein interactions. The functional repertoire of the identified interactome reveals interactions between proteins functioning in well-conserved as well as plant-specific biological processes. We observe that although common mechanisms (e.g. actin polymerization) and components (e.g. ARPs, actin-related proteins) exist between different lineages, they are active in specific processes such as growth, cancer metastasis and trichome development in yeast, human and Arabidopsis, respectively.
We conclude that the integration of orthology with functional association data is adequate to predict protein-protein interactions. Through this approach, a high number of novel protein-protein interactions with diverse biological roles is discovered. Overall, we have predicted a reliable set of protein-protein interactions suitable for further computational as well as experimental analyses.
The complex regulation of diverse biological processes acting in eukaryotic organisms is only possible through interactions between different components in the cell. Proteins, for instance, can be part of extensive complexes, such as transcription factor complexes for the combinatorial control of their target genes or proteasome complexes for protein degradation. It is the specific timing and location of the activity of these protein complexes that defines their role in the cell. Several attempts have been made to infer protein-protein interaction maps in diverse model organisms through large-scale experimental methods [1, 2]. In yeast [3–8], human [9, 10], Drosophila and C. elegans, genome-wide Y2H screens and large-scale affinity purification/mass spectrometry studies have been performed. Nevertheless, the reliability of the results of these studies is thought to be relatively poor because, in general, only a small overlap between datasets of experimentally identified interactions is observed. However, there is growing evidence that this observation is due to the complementarity of different methods (e.g. sensitivity of Y2H versus TAP, or different experimental conditions) rather than to a high number of false interactions. Yu et al. (2008) conclude that both Y2H and affinity-purification followed by mass spectrometry (AP/MS) data are of equally high quality but of a fundamentally different and complementary nature. These authors show that, compared to interaction maps based on complex purification and identification, the binary interaction map of yeast proteins is enriched for transient signaling interactions and inter-complex connections. In any case, assessment of the data quality is necessary, not only to design future experiments but also to construct high confidence datasets (or gold standard datasets) used for the training and evaluation of computational methods [13–16].
Several efforts have been made to centralize protein-protein interaction data through the construction of databases such as DIP, MINT, BioGRID and IntAct.
To make full use of the currently available interaction data, computational methods are being developed to assess the quality of experimentally generated protein-protein interactions and to predict new interactions [2, 21, 22]. Whereas earlier analyses focused on the relation between gene expression and protein-protein interaction only [12, 23–26], the integration of several lines of evidence (further referred to as genomic features) in the prediction or validation of protein-protein interactions is highly valued in recent studies, as it increases the performance as well as the coverage of the method [15, 27–29]. Typically used genomic features encompass (1) functional features such as Gene Ontology (GO) annotation of the proteins, co-expression of the encoding genes, coordinated protein abundance and co-essentiality, (2) structural features such as co-occurrence of protein domains and overrepresented sequence motifs, (3) comparative genomics-based features such as orthology and, primarily exploited in prokaryotes, phylogenetic profiles, gene neighborhood, co-evolution, and Rosetta Stone (gene split or fusion), and (4) network topology-based features, such as connectivity [30, 31, 21, 27]. In principle, two approaches can be discerned in the prediction of protein-protein interactions. The first approach starts from protein pairs that are identified to be orthologous to known interacting proteins in other species (interolog detection). The interolog detection strategy was initially developed to transfer information on protein-protein interactions from yeast to higher organisms [32–34]. This method assumes that protein-protein interactions are conserved between organisms and that pairs of proteins whose orthologs are known to interact in other species probably interact in the species of interest as well. Although some shortcomings can be identified, protein complexes do show evolutionary conservation [35, 36]. 
Numerous studies have been published in which mainly human protein-protein interactions are predicted based on interolog detection [37, 38]. Furthermore, predictions are made through integrative approaches in a probabilistic framework [39–41]. Other studies start from all possible protein pairs, often incorporating interolog detection as a genomic feature, or do not include interolog detection at all [27, 42]. The latter approach has the advantage that interactions do not need to be conserved over long evolutionary distances, but it often identifies associations between genes rather than protein-protein interactions [28, 29, 43, 44].
For the model plant Arabidopsis thaliana, attempts to construct large-scale protein-protein interaction maps as well as the application and critical assessment of computational methods have been rather limited [45, 46]. In this study, we aim to predict a reliable set of protein-protein interactions suitable for experimental validation as well as further computational analyses. First of all, we investigate whether the necessary assumptions taken in our approach are valid in the model plant Arabidopsis thaliana: namely, (1) (some) protein-protein interactions in yeast and animals (source organisms) are conserved in Arabidopsis thaliana (target organism), (2) interacting proteins co-localize, (3) interacting proteins function in the same biological process, and (4) genes encoding interacting proteins show similar expression patterns. Hereby, the relative contribution of these features to the prediction of protein-protein interactions in Arabidopsis thaliana is assessed. The prediction of Arabidopsis protein-protein interactions is performed, exploiting the conservation of these interactions between species on the one hand, and utilizing functional association data on the other hand. Finally, protein complexes are delineated from the predicted protein-protein interaction network and the function and evolutionary conservation of these protein complexes is studied.
Integration of orthology, GO annotation and gene expression
Table 1. Experimentally identified proteins and protein-protein interactions
Number of proteins
Number of proteins with OG
Number of proteins with a BP GO annotation (depth = 5)
Number of proteins with a CC GO annotation (depth = 5)
Number of proteins with gene expression data
Number of interactions
Number of interactions with OG
Number of interactions between proteins with sufficient information on different genomic features
Number of interactions with a BP GO annotation (depth = 5, depth = 8)
Number of interactions with BP score ≥ 5, ≥ 8
Number of interactions with a CC GO annotation (depth = 5)
Number of interactions with CC score ≥ 5
Number of interactions with gene expression data
Number of interactions with PCC ≥ 0.3
Number of interactions with a BP GO annotation (depth = 5) and expression data
Number of interactions with BP score ≥ 5 and PCC ≥ 0.3
Total number of filtered interactions (CC score ≥ 5, or BP score ≥ 5 and PCC ≥ 0.3, or BP score ≥ 8)
To decide if two proteins co-localize and/or function in the same biological process, both the GO cellular component (CC) and the GO biological process (BP) annotation of the interacting proteins were evaluated. To measure the similarity between the GO annotations of two proteins, we calculated the maximum depth of the common ancestor of all pairs of GO terms assigned to both proteins (see Methods; Additional file 2; Fig. 2). Similarly, the co-expression in positive and negative datasets was investigated by calculating the Pearson correlation coefficient, describing the global similarity between expression profiles for interacting and non-interacting proteins (see Fig. 2). When inspecting the distributions, a smaller difference between positive and negative datasets can be observed for expression correlation than for GO annotations. Therefore, no threshold on the Pearson correlation coefficient alone identifies a considerable number of true positives while admitting few false positives (Fig. 2). Different genomic features need to be combined to maximize the coverage of the prediction (see Additional file 3). Through this combination of genomic features, thresholds may be decreased while maintaining an acceptable number of false positives (see Additional file 1 for an estimation of the false positive rates). A threshold of 5 for the GO biological process similarity score in combination with a threshold of 0.3 for the Pearson correlation coefficient is acceptable (see Additional file 3; Additional file 1). Nevertheless, one should keep in mind that the co-expression measure takes into account expression data from 86 microarray experiments. It can be expected that the number and type of experiments influence the overall Pearson correlation coefficients for all pairs of genes.
Networks of experimentally identified protein-protein interactions in yeast and human, although containing false positive interactions as well, show an average Pearson correlation coefficient of 0.3 and 0.241, respectively, supporting our choice of the Pearson correlation coefficient threshold. In addition, the GO biological process similarity score (threshold score of 8) and GO cellular component similarity score (threshold score of 5) were also used independently. These relatively high thresholds result in inferences based on very detailed GO annotations, which are mostly not inferred from computational analyses (ISS). The depth of GO terms such as ubiquitin-dependent protein catabolic process (BP ≥ 8) or peroxisome (CC ≥ 5) is high, pointing to specific processes or localizations that are mostly assigned based on experimental evidence (e.g. inferred from direct assay). Systematic assessment of the evidence codes of the GO terms used for the interaction filtering (BP8 and CC5) shows that a minority of terms are inferred from sequence or structural similarity (ISS) (see Additional file 4). GO terms used in the BP5+PCC0.3 filter are more often inferred from sequence or structural similarity. However, as in this case the BP score is combined with the PCC, we believe that these ISS terms do not affect the quality of our predictions considerably. Therefore, we decided to include ISS-based GO annotations throughout our analysis and only removed IPI, ND, NR, NAS and IEA-based annotations (see Methods). Finally, we added up the different sets of predicted protein-protein interactions resulting from the application of different thresholds and different source organisms. We opted to add up the different sets rather than to take the intersection. On the one hand, our randomization studies show that the individual filters result in acceptable numbers of false positives. On the other hand, taking the union accommodates the relatively high amount of missing functional data in cases where, for instance, BP annotation but no CC annotation is available. However, using this approach it is possible that a protein-protein interaction passes the BP8 filter while it does not pass the CC5 filter (contradictory CC information, with annotation available for both BP and CC). Our systematic assessment shows that this occurs for only a minority of interactions for BP8, and more often for BP5, which is used in combination with PCC0.3 (see Additional file 5). Moreover, only half of these contradictory CC annotations are inferred from experimental evidence (see Additional file 5).
In summary, 52.6% of the experimentally identified protein-protein interactions (767 out of the 1457 interactions) meet the conditions of the genomic features (GO biological process, GO cellular component, co-expression) (see Table 1, Fig. 2 and Additional file 3), without taking into account orthology (see next paragraph).
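The combined filter described above can be written as a single predicate. The following is a minimal sketch, not the original implementation: the function name `passes_filter` and the use of `None` for missing annotation or expression data are assumptions made for illustration. Only the thresholds (CC ≥ 5, BP ≥ 8, BP ≥ 5 together with PCC ≥ 0.3) and the union of the filters come from the text.

```python
from typing import Optional


def passes_filter(cc: Optional[int], bp: Optional[int], pcc: Optional[float]) -> bool:
    """Return True if a protein pair passes at least one genomic feature filter.

    cc, bp: GO similarity scores (cellular component, biological process).
    pcc: Pearson correlation coefficient of the two expression profiles.
    None marks a missing annotation or missing expression data (assumption).
    """
    cc5 = cc is not None and cc >= 5
    bp8 = bp is not None and bp >= 8
    bp5_pcc = bp is not None and pcc is not None and bp >= 5 and pcc >= 0.3
    # The individual filters are added up (union), not intersected.
    return cc5 or bp8 or bp5_pcc


# A pair with detailed BP annotation but a low CC score still passes (BP8),
# which is the "contradictory CC information" case discussed in the text.
contradictory_case = passes_filter(cc=2, bp=8, pcc=None)
```

Because the filters are combined as a union, a pair with only one available feature can still be retained, which is the behavior motivated by the missing functional data.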
Prediction of protein-protein interactions in Arabidopsis thaliana
Predicted protein-protein interactions
Yeast BP5+PCC 0.3
Yeast total filtered
Human BP5+PCC 0.3
Human total filtered
C. elegans CC5
C. elegans BP5+PCC 0.3
C. elegans BP8
C. elegans total filtered
Drosophila total filtered
All species – Interologs with CC annotation (CC ≥ 5)
All species – Interologs with BP annotation (BP ≥ 5, BP ≥ 8)
3087 (2111, 416)
79475 (25037, 1890)
All species – Interologs with gene expression data (PCC ≥ 0.3)
All species – Filtered
All species – Predicted
All species – interologs (without filtering)
Accessibility of the interactome
We have developed an easy-to-use query and visualization system to represent the inferred interactome. The discussed protein clusters as well as the complete predicted interactome can be observed through a web-start version of Cytoscape that can be found at http://bioinformatics.psb.ugent.be/supplementary_data/stbod/athPPI/. A node and edge attribute system is employed to represent the different types of information. The color of the edge represents the degree of co-expression calculated as the Pearson correlation coefficient (green: correlation, purple: anticorrelation), while the line width of the edge represents the GO biological process similarity score (thick: similar biological process) and the line style of the edge represents the GO cellular component similarity score (solid: similar localization). The color of the nodes corresponds to the protein cluster the protein belongs to (see further). Proteins belonging to small or no clusters are colored in grey. TAIR functional descriptions are shown as node labels. Subsets of the interactome that are of interest to the researcher can be visualized easily by querying the interactome for (a) protein(s) or a functional description.
Delineation of protein clusters in the predicted interactome
In an attempt to reveal the functional repertoire of the predicted interactome, we have delineated highly interconnected regions in the protein-protein interaction network (hereafter called protein clusters). In addition, we tried to assign a function to the identified protein clusters. Finally, we investigated the evolutionary conservation of the protein clusters.
Identification of protein clusters is performed using the CAST clustering algorithm (see Methods). This clustering procedure employs the connectivity of the proteins. Overall, 1802 proteins taking part in 16,498 interactions could be identified in protein clusters, accounting for the majority of originally identified interactions (see Additional file 9). The biological roles of the identified protein clusters were studied through identification of overrepresented GO categories (biological process, molecular function and cellular component) (see Methods). To judge the validity of the protein clusters, we inspected clusters involved in particular biological processes together. Using this approach of clustering and subsequent GO enrichment analysis, we can elegantly pinpoint protein complexes, the relationships between them and the encompassing biological processes they are involved in (see Supplementary Data site and more details below).
More detailed analysis of the 'response to' interaction network identified many proteins functioning in, for instance, response to DNA damage or response to oxidative stress caused by the accumulation of reactive oxygen species (see Supplementary data site). Proteins active in these stress responses are DNAJ heat shock proteins, calcium-dependent and MAP kinases (CAST cluster 1), DNA repair proteins (RAD1, RAD5, RAD50, RAD51, RAD54 – CAST 18), superoxide dismutases (CAST 49), glutathione peroxidases (CAST 74) and glutathione S-transferases (CAST 15). These proteins are involved in very diverse biological processes, such as response to heat, response to toxins and response to chemical stimulus, which is reflected in the dashed edges in the network (see Supplementary data). In order to verify that these predicted interactions also reflect actual conserved biological stress responses, we compared the expression patterns of all 500 Arabidopsis genes in these 'response to' protein clusters using a recently compiled comparative stress response matrix. Whereas 11% of all stress-induced Arabidopsis gene families show a conserved stress response in human or yeast (44/390 families in the complete matrix), the 'response to' set (representing 137 gene families) shows a more than five-fold enrichment for conserved stress response (16/25 gene families with responsive Arabidopsis genes are also responsive in human or yeast). These findings confirm that several components as well as protein-protein interactions in Arabidopsis indeed function in diverse responses and that this response is evolutionarily conserved in eukaryotes.
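The fold enrichment quoted above can be checked with a few lines of arithmetic. This is a sketch that assumes the enrichment is simply the ratio of the two proportions given in the text (16/25 against 44/390); the function name `fold_enrichment` is hypothetical.

```python
def fold_enrichment(hits_subset, size_subset, hits_background, size_background):
    """Ratio of the hit fraction in a subset to the hit fraction in the background."""
    return (hits_subset / size_subset) / (hits_background / size_background)


# Counts taken from the text: 16 of 25 families in the 'response to' set,
# 44 of 390 families in the complete matrix.
fold = fold_enrichment(16, 25, 44, 390)
```

Under this assumption the ratio is roughly 5.7, consistent with the "more than five-fold" statement.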
Besides interactions functioning in well-conserved biological processes, we could also identify the recruitment of well-conserved proteins and protein-protein interactions in plant-specific processes. In the following examples, the regulatory mechanisms such as chromatin remodeling and actin polymerization are common to different lineages. However, these mechanisms are put into play in highly lineage-specific biological processes, such as seed and trichome development.
In a second example, we have predicted interactions between different actin-related proteins (ARPs), constituting the ARP2/3 complex involved in leaf development (see Fig. 5B and Supplementary data). ARP2 or WURM, ARP3 or DISTORTED TRICHOMES 1, ARPC1, ARPC1A, ARPC2A or DISTORTED TRICHOMES 2, ARPC3 and ARPC5 or CROOKED, as well as a number of other non-actin-related proteins, were predicted to interact. The Arp2/3 complex has been shown to be composed of two actin-related proteins (Arps), a seven-bladed beta propeller (ARPC1), and four other subunits (ARPC2-5). Only ARPC4 is missing from our predictions, because ARPC4 is not present in the input dataset of orthologous relationships. In addition to the ARP2/3 complex, we could identify a complex of three proteins (KLUNKER, GNARLED and ARAC4), involved in leaf development as well. At least two of these proteins of the SCAR complex are thought to regulate the ARP2/3 complex. The ARP2/3 complex functions as a central player in the precise regulation of both the initiation of actin polymerization and the organization of the resulting filaments, common to all organisms used in this study. However, disruption of the ARP2/3 complex has distinct effects in different organisms, ranging from growth defects in yeast, through defects in eye and axon development in worm and fly and altered migration of cancer cells in human, to defects in epidermal cell shape determination in Arabidopsis (Goley and Welch 2006). For the ARP2/3 protein complex, the power of the co-expression measure was limited. Nevertheless, similarity in localization and biological process proved to be very valuable. This observation points to the complementarity of the different genomic features employed in this study.
In this study, we have predicted protein-protein interactions in the model plant Arabidopsis thaliana through interolog detection with yeast, worm, fly and human as source organisms and using genomic features (expression correlation, localization and biological process) as filters to increase the confidence in our predictions. As such, a set of highly reliable interactions could be delineated that can be further employed in both computational and experimental studies. In contrast to previous efforts to predict protein-protein interactions in Arabidopsis thaliana, we do not only provide a list of all interologs and their associated genomic feature values, but rather focus on the subset of protein-protein interactions that is supported by the different genomic features (or the so-called filtered interactome). The extensive study of the behavior of the genomic features for experimentally identified protein-protein interactions compared to random protein pairs allowed us to validate the protein-protein interactions adequately.
The setup of this study allowed us to rigorously investigate the functional repertoire and evolutionary conservation of the identified protein-protein interactions. We conclude that although this type of protein-protein interaction prediction is highly dependent on the degree of conservation of protein-protein interactions between Arabidopsis and yeasts or animals, we were able to predict interactions with roles in diverse biological processes. Interestingly, these cover interactions in both evolutionarily conserved and plant-specific processes.
Future improvements to the prediction of protein-protein interactions are manifold. For instance, it has been shown that not all protein-protein interactions are equally stable. Some interactions are permanent while others are transient, meaning that proteins only come together at certain time points or locations, possibly resulting in different expression profiles of the encoding genes (just-in-time assembly). Although we could identify at least some transient interactions (e.g. interactions with transmembrane activity), the use of global expression correlation measures such as the Pearson correlation coefficient might be replaced by a measure of "partial" expression similarity that is able to capture even less stable protein-protein interactions. Recent proteomics profiling will allow the consideration of protein activity rather than transcript activity. On the other hand, with the large-scale experimental identification of protein-protein interactions in many species, a gold standard positive set of interactions can be built more rigorously. This gold standard positive set will increase the strength of machine learning methods in protein-protein interaction detection. Nevertheless, a gold standard negative set of interactions remains problematic.
In conclusion, this study showed that the integration of orthology with functional association data, such as localization, biological process and co-expression, is adequate to predict protein-protein interactions. In particular, for organisms with limited existing knowledge on protein-protein interactions, such as Arabidopsis, our approach is very valuable. In contrast, sophisticated machine learning approaches perform poorly for such organisms because of the lack of gold standard sets of interactions. We could predict a high number of new protein-protein interactions, and analysis of the functional repertoire of identified protein clusters supports the significance of these putative interactions. The approach described here can easily be adapted for estimating the reliability of experimentally identified interactions. Finally, with the growing availability of expression and gene ontology information, this approach can be applied to the detection of protein-protein interactions in agronomically and economically interesting plants, such as rice, corn and poplar.
Although numerous efforts have been made to obtain uniformity in interaction databases, interaction datasets for yeast, animals and Arabidopsis are not readily available. Protein interaction data sets were compiled from DIP, BioGRID, MINT and IntAct for the source organisms Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster and Homo sapiens and the target organism Arabidopsis thaliana (for Arabidopsis, see Table 1), containing most of the large-scale interaction studies. Binary interaction data were extracted. For each interaction, the method of detection, PMID number and the bait/prey information were downloaded. As such, identical entries in the different databases were identified. In addition to the combined interaction datasets from DIP, BioGRID, MINT and IntAct, two manually curated datasets of literature-derived interactions were employed. For yeast, a set of 15,456 interactions involving 4554 proteins available at MIPS (Munich Information Center for Protein Sequences) was used. Although this curated set is half as large as the interaction dataset downloaded from the four databases, a considerable number of proteins is covered and the quality of the data is believed to be considerably higher. For human, a set of 37,072 interactions involving 9565 proteins is provided by the HPRD (Human Protein Reference Database). For Arabidopsis, the experimentally identified protein-protein interactions available at TAIR (http://www.arabidopsis.org) were included. Altogether, 89,537 interactions among 6515 proteins could be found in public databases of Saccharomyces cerevisiae, 8167 interactions among 4126 proteins of Caenorhabditis elegans, 56,088 interactions among 14,112 proteins of Drosophila melanogaster, 60,775 interactions among 15,126 proteins of Homo sapiens, and 3587 interactions among 1722 proteins of Arabidopsis thaliana (see Table 1).
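The identification of identical entries across databases can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used: the record layout `(protein_a, protein_b, pmid, method)`, the function name `merge_interactions`, and the decision to ignore bait/prey orientation when comparing entries are all hypothetical.

```python
def merge_interactions(databases):
    """Collapse identical entries reported by several databases.

    databases: dict mapping a database name to a list of
               (protein_a, protein_b, pmid, method) records (assumed layout).
    Returns a dict mapping (unordered pair, pmid, method) to the set of
    database names that report that entry.
    """
    merged = {}
    for db_name, records in databases.items():
        for protein_a, protein_b, pmid, method in records:
            # frozenset makes A-B and B-A the same key (orientation ignored).
            key = (frozenset((protein_a, protein_b)), pmid, method)
            merged.setdefault(key, set()).add(db_name)
    return merged


# Toy records with made-up identifiers.
example = merge_interactions({
    "DIP": [("A", "B", "1", "Y2H")],
    "MINT": [("B", "A", "1", "Y2H"), ("A", "C", "2", "TAP")],
})
```

Keeping the PMID and detection method in the key means the same protein pair reported by two different experiments stays as two separate lines of evidence.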
Negative datasets were built by randomizing protein pairs. To analyze equal sample sizes and to take into account the availability of genomic feature data, the negative as well as the positive datasets contained 1000 protein pairs for the assessment of individual genomic features and 500 protein pairs for the combined genomic features. This approach has the disadvantage that some positive pairs may be included in the dataset. However, the number of positive pairs will be extremely low taking into account the number of possible combinations between all Arabidopsis proteins and the estimated size of interactomes of higher organisms. An alternative approach would be to consider pairs that consist of proteins that are not present in the same cellular compartment. However, this method can be biased as not all proteins that localize in the same cellular compartment interact. Moreover, in this study, we use the cellular component annotation to identify positive interactions. Consequently, we opted for the randomization approach throughout our study.
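A randomized negative set of this kind could be drawn as in the sketch below. The function name `random_negative_pairs`, the fixed seed, and the explicit exclusion of already known interacting pairs are assumptions for illustration; the text only states that protein pairs were randomized.

```python
import random


def random_negative_pairs(proteins, known_pairs, n_pairs, seed=0):
    """Draw n_pairs distinct random protein pairs that are not known interactions.

    proteins: iterable of protein identifiers.
    known_pairs: iterable of (protein_a, protein_b) known interactions.
    Returns a set of unordered pairs (frozensets).
    """
    rng = random.Random(seed)  # seeded for reproducibility (assumption)
    pool = sorted(set(proteins))
    known = {frozenset(pair) for pair in known_pairs}
    negatives = set()
    attempts = 0
    while len(negatives) < n_pairs:
        attempts += 1
        if attempts > 100 * n_pairs + 1000:
            raise ValueError("could not sample enough distinct pairs")
        pair = frozenset(rng.sample(pool, 2))  # two different proteins
        if pair not in known:
            negatives.add(pair)
    return negatives


# Toy example with made-up identifiers.
negatives = random_negative_pairs(
    ["P1", "P2", "P3", "P4", "P5", "P6"], [("P1", "P2")], n_pairs=5
)
```

Undiscovered true interactions can still end up in such a set, which is the disadvantage noted in the text.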
Positive and negative datasets were compared to choose appropriate thresholds for the genomic features (similarity in expression, biological process and cellular localization, see further). We have compared balanced datasets (equal number of positive and negative interactions) to estimate the reliability of our genomic feature filtering. The thresholds chosen in this study would correspond to a positive predictive value (PPV; number of true positives/(number of true positives + number of false positives)) of 95% in a one to one ratio. However, in reality, although difficult to estimate, the positive and negative protein pairs do not occur in a one to one ratio. Therefore, the positive predictive value is actually smaller. In a one to ten or one to 100 ratio, the PPV drops to ~50% or ~30%, respectively. However, these positive predictive values are probably not robust and should be considered with caution due to the small sample size (sparse distributions of genomic features of the positive dataset compared to the negative dataset). The number of positive interactions for which the genomic features pass the thresholds is extremely low, probably because so few protein-protein interactions have been experimentally identified and/or possess sufficient gene ontology information. Even more importantly, the calculation of these positive predictive values does not take into account the fact that, through the interolog detection, our initial predictions (before application of genomic feature thresholds) are already enriched for true interactions. Through this step, the PPV increases to 88% and only the most likely interactions remain.
Identification of interologs
Whereas earlier methods used BLAST to identify orthologous genes, nowadays more dedicated tools such as OrthoMCL and INPARANOID, which take into account in-paralogs (duplicates arisen after speciation), are applied [59–61]. In this study, orthologous relationships were identified based on the OrthoMCL database containing orthologous groups for 87 species. Like INPARANOID, the OrthoMCL software takes into account in-paralogs (genes duplicated in one species after speciation with another species). We extracted data from five organisms, namely the source organisms Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster and Homo sapiens, and the target organism Arabidopsis thaliana. For a certain pair of interacting proteins in the source organism, all combinations between the (co-)orthologous proteins in the target organism were made (see Fig. 1). From all combinations, only reliable interactions are identified using the genomic feature filters described below. As the genomic feature filters cannot be used to assess the reliability of self-interactions (homodimers), these interactions are excluded from the filtered interactome.
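The transfer of interactions through (co-)orthologs can be illustrated with a short sketch. It assumes a simple dictionary from source proteins to their target orthologs; the function name `predict_interologs` and the toy identifiers are hypothetical, and self-pairs are dropped at this stage for simplicity, whereas the text excludes them from the filtered interactome.

```python
from itertools import product


def predict_interologs(source_interactions, orthologs):
    """Map source-organism interactions onto target-organism protein pairs.

    source_interactions: iterable of (protein_a, protein_b) in the source organism.
    orthologs: dict mapping a source protein to a set of target (co-)orthologs.
    Returns a set of unordered target pairs (frozensets).
    """
    predicted = set()
    for protein_a, protein_b in source_interactions:
        targets_a = orthologs.get(protein_a, ())
        targets_b = orthologs.get(protein_b, ())
        # All combinations between the (co-)orthologs of both partners.
        for target_a, target_b in product(targets_a, targets_b):
            if target_a != target_b:  # skip self-interactions (homodimers)
                predicted.add(frozenset((target_a, target_b)))
    return predicted


# Toy example: one yeast interaction, two co-orthologs per partner.
interologs = predict_interologs(
    [("Y1", "Y2")],
    {"Y1": {"A1", "A2"}, "Y2": {"A2", "A3"}},
)
```

Because every combination of co-orthologs is generated, large gene families inflate the number of candidate pairs, which is why the genomic feature filters are applied afterwards.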
Gene ontology information
The Gene Ontology (GO) consortium provides a structured standard vocabulary for describing the function of gene products. It is divided into three ontologies: biological process, molecular function and cellular component, represented by directed acyclic graphs in which nodes correspond to GO terms and edges to their relationships. For each protein, GO terms (GO cellular component and biological process annotation) were extracted from the Gene Ontology database and annotations for Arabidopsis proteins were downloaded from TAIR. These GO terms were used to assess the relatedness of interacting proteins by calculating a GO similarity score (see Additional file 2). We test if interacting proteins are localized in the same cellular compartment and if interacting proteins function in the same biological process. For each protein pair, all GO terms of both proteins are compared to each other. For each pair of GO terms, the depth of the common ancestor of the terms, which is the shortest path of the common ancestor to the root (GO:0003673), is calculated. Subsequently, the maximum value of the calculated depths is taken as the GO similarity score for a certain protein pair. Although this disregards how far away the GO terms are from their common ancestor, this approach has proven valuable for the aims put forward in this study, namely distinguishing between actual protein-protein interactions and random protein-protein pairs (see Results).
For both cellular component and biological process annotation, GO term assignments based on physical interactions (IPI; see http://www.geneontology.org/GO.evidence.shtml for details on evidence codes) and electronically or otherwise less reliably assigned GO terms (evidence codes ND, NR, NAS and IEA) were removed. Although the number of proteins with a GO annotation decreases considerably, the reliability of the GO similarity scores increases through this procedure (data not shown). ISS-based annotations, in contrast, were retained. We rigorously assessed the possibility of including annotations based on ISS, as these account for a considerable number of annotations, and concluded that the low reliability of these annotations does not pose a problem to our approach (see Additional file 4; Additional file 5; see Results).
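As a small illustration of this filtering step, the Python sketch below removes annotations carrying the excluded evidence codes and keeps all others, including ISS. The tuple layout and the example records are hypothetical.

```python
# Evidence codes removed in this study; ISS is deliberately not in this set.
EXCLUDED_EVIDENCE = {"IPI", "ND", "NR", "NAS", "IEA"}

def filter_annotations(annotations):
    """Keep (protein, go_term, evidence_code) records with accepted evidence."""
    return [record for record in annotations
            if record[2] not in EXCLUDED_EVIDENCE]

# Hypothetical records, for illustration only.
annotations = [
    ("P1", "GO:term1", "IDA"),
    ("P1", "GO:term2", "IEA"),
    ("P2", "GO:term1", "ISS"),
    ("P2", "GO:term3", "IPI"),
]
kept = filter_annotations(annotations)
```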
Gene expression data
A heterogeneous set of microarray expression data containing, amongst others, growth, stress, and mutation experiments (86) was compiled from NASC (Nottingham Arabidopsis Stock Centre) to detect co-expression between Arabidopsis genes. Microarray experiments with at least 2 replicates were taken. Expression values were processed using RMA (robust multichip average) [67, 68]. Co-expression was identified through the calculation of Pearson correlation coefficients (PCC) between the expression profiles of genes possibly encoding interacting proteins.
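For reference, the PCC between two expression profiles can be computed as in the following Python sketch. The function name and the example profiles are illustrative, and raising an error for constant profiles is a choice made here, not a detail taken from the study.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equally long profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("PCC is undefined for a constant profile")
    return sxy / sqrt(sxx * syy)

# Hypothetical expression profiles across three experiments.
r_up = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
r_down = pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

The first pair of profiles is perfectly correlated and the second perfectly anti-correlated.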
Clustering of protein-protein interactions
The predicted interactome can be represented as a graph of nodes, corresponding to the proteins, and connecting edges, corresponding to the interactions. Protein complexes were delineated from this graph making use of the Cluster Affinity Search Technique (CAST) algorithm. This algorithm was originally designed to identify clusters of co-expressed genes. For this purpose, a measure for co-expression of two genes, e.g. the Pearson correlation coefficient, is used as the weight of an edge. However, in this study, all edges were treated equally, avoiding a bias towards protein-protein interactions for which the encoding genes have highly similar expression profiles. A cluster is initiated by choosing the protein with the maximum number of neighbors using a heuristic independent from the CAST algorithm. Subsequently, neighbors of that protein are added to the protein cluster if the neighbor is connected to more than 25% of the proteins already present in the cluster. Although the connectivity of the protein clusters depends on the functional role of the cluster, we could conclude that a connectivity of 25% (compared to 0%, 50%, 75% and 100%) yielded the most robust and functionally relevant protein clusters (data not shown).
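The seeding and growth rule described above can be sketched in Python as follows. This is a simplified greedy variant written for illustration, not the CAST implementation used in the study: it omits the removal step of the original algorithm, its result depends on the order in which candidates are visited, and all names are hypothetical.

```python
def cluster_graph(adjacency, threshold=0.25):
    """Greedy clustering of an unweighted interaction graph.

    adjacency: dict mapping each protein to the set of its interaction partners.
    A candidate joins a cluster if it is connected to more than `threshold`
    of the proteins already present in that cluster.
    """
    unassigned = set(adjacency)
    clusters = []
    while unassigned:
        # Seed: the unassigned protein with the most unassigned neighbors
        # (ties are broken alphabetically to keep the result deterministic).
        seed = max(sorted(unassigned),
                   key=lambda p: len(adjacency[p] & unassigned))
        cluster = {seed}
        changed = True
        while changed:
            changed = False
            neighbors = set().union(*(adjacency[m] for m in cluster))
            candidates = sorted((neighbors & unassigned) - cluster)
            for candidate in candidates:
                links = len(adjacency[candidate] & cluster)
                if links > threshold * len(cluster):
                    cluster.add(candidate)
                    changed = True
        clusters.append(cluster)
        unassigned -= cluster
    return clusters

# Hypothetical graph: a 4-clique (A-D), E linked to A only, and a pair F-G.
adjacency = {
    "A": {"B", "C", "D", "E"}, "B": {"A", "C", "D"}, "C": {"A", "B", "D"},
    "D": {"A", "B", "C"}, "E": {"A"}, "F": {"G"}, "G": {"F"},
}
clusters = cluster_graph(adjacency)
```

In this example, E is not added to the clique because a single link does not exceed 25% of a four-member cluster.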
GO overrepresentation analysis
The identified protein complexes were subjected to functional analysis. The assignments of genes to the original GO categories were extended to include parental terms (that is, a gene assigned to a given category was automatically assigned to all the parent categories as well). All GO categories containing fewer than 20 genes were discarded from further analysis. Enrichment values were calculated as the ratio of the relative occurrence in a set of genes to the relative occurrence in the genome. Overrepresentation of GO categories (biological process, molecular function and cellular component) was tested using the Fisher exact test. P values were adjusted using the Bonferroni correction for multiple hypothesis testing. GO categories are assumed to be significantly overrepresented when the corrected P value is smaller than 0.01.
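The enrichment value, the test and the correction can be sketched in Python using only the standard library. The sketch assumes a one-sided Fisher exact test (the hypergeometric upper tail), which is a common choice for overrepresentation but is not stated explicitly in the text. Function names and the example counts are hypothetical.

```python
from math import comb

def enrichment(k, n, K, N):
    """Relative occurrence in the gene set (k/n) over that in the genome (K/N)."""
    return (k / n) / (K / N)

def fisher_overrepresentation_p(k, n, K, N):
    """One-sided Fisher exact P value, P(X >= k), for overrepresentation.

    k: genes in the set that belong to the GO category
    n: size of the gene set
    K: genes in the genome that belong to the GO category
    N: number of genes in the genome
    """
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1))
    return tail / total

def bonferroni(p_values):
    """Multiply each P value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Hypothetical counts: all 5 genes of a cluster fall in a 5-gene category
# within a 10-gene genome.
fold = enrichment(5, 5, 5, 10)
p_value = fisher_overrepresentation_p(5, 5, 5, 10)
adjusted = bonferroni([p_value, 0.5])
```

A category would then be reported as overrepresented when its adjusted P value is below 0.01.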
S.D.B. and S.P. are indebted to the Institute for the Promotion of Innovation by Science and Technology in Flanders (IWT). S.D.B. and K.V. are postdoctoral fellows of the Fund for Scientific Research Flanders (FWO). This work is also supported by the Belgian Federal Science Policy Office: IUAP P6/25 (BioMaGNet).
- Shoemaker BA, Panchenko AR: Deciphering Protein-Protein Interactions. Part I. Experimental Techniques and Databases. PLoS Comput Biol. 2007, 3 (3): e42. 10.1371/journal.pcbi.0030042.
- Bork P, Jensen LJ, von Mering C, Ramani AK, Lee I, Marcotte EM: Protein interaction networks from yeast to human. Curr Opin Struct Biol. 2004, 14 (3): 292-299. 10.1016/j.sbi.2004.05.003.
- Ito T, Tashiro K, Muta S, Ozawa R, Chiba T, Nishizawa M, Yamamoto K, Kuhara S, Sakaki Y: Toward a protein-protein interaction map of the budding yeast: A comprehensive system to examine two-hybrid interactions in all possible combinations between the yeast proteins. Proc Natl Acad Sci USA. 2000, 97 (3): 1143-1147. 10.1073/pnas.97.3.1143.
- Gavin AC, Bosche M, Krause R, Grandi P, Marzioch M, Bauer A, Schultz J, Rick JM, Michon AM, Cruciat CM, et al: Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature. 2002, 415 (6868): 141-147. 10.1038/415141a.
- Ho Y, Gruhler A, Heilbut A, Bader GD, Moore L, Adams SL, Millar A, Taylor P, Bennett K, Boutilier K, et al: Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature. 2002, 415 (6868): 180-183. 10.1038/415180a.
- Uetz P, Giot L, Cagney G, Mansfield TA, Judson RS, Knight JR, Lockshon D, Narayan V, Srinivasan M, Pochart P, et al: A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature. 2000, 403 (6770): 623-627. 10.1038/35001009.
- Krogan NJ, Cagney G, Yu H, Zhong G, Guo X, Ignatchenko A, Li J, Pu S, Datta N, Tikuisis AP, et al: Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature. 2006, 440 (7084): 637-643. 10.1038/nature04670.
- Yu H, Braun P, Yildirim MA, Lemmens I, Venkatesan K, Sahalie J, Hirozane-Kishikawa T, Gebreab F, Li N, Simonis N, et al: High-Quality Binary Protein Interaction Map of the Yeast Interactome Network. Science. 2008, 322 (5898): 104-110. 10.1126/science.1158684.
- Rual JF, Venkatesan K, Hao T, Hirozane-Kishikawa T, Dricot A, Li N, Berriz GF, Gibbons FD, Dreze M, Ayivi-Guedehoussou N, et al: Towards a proteome-scale map of the human protein-protein interaction network. Nature. 2005, 437 (7062): 1173-1178. 10.1038/nature04209.
- Stelzl U, Worm U, Lalowski M, Haenig C, Brembeck FH, Goehler H, Stroedicke M, Zenkner M, Schoenherr A, Koeppen S, et al: A human protein-protein interaction network: a resource for annotating the proteome. Cell. 2005, 122 (6): 957-968. 10.1016/j.cell.2005.08.029.
- Formstecher E, Aresta S, Collura V, Hamburger A, Meil A, Trehin A, Reverdy C, Betin V, Maire S, Brun C, et al: Protein interaction mapping: a Drosophila case study. Genome Res. 2005, 15 (3): 376-384. 10.1101/gr.2659105.
- Li S, Armstrong CM, Bertin N, Ge H, Milstein S, Boxem M, Vidalain PO, Han JD, Chesneau A, Hao T, et al: A map of the interactome network of the metazoan C. elegans. Science. 2004, 303 (5657): 540-543. 10.1126/science.1091403.
- von Mering C, Krause R, Snel B, Cornell M, Oliver SG, Fields S, Bork P: Comparative assessment of large-scale data sets of protein-protein interactions. Nature. 2002, 417 (6887): 399-403. 10.1038/nature750.
- Sprinzak E, Sattath S, Margalit H: How reliable are experimental protein-protein interaction data?. J Mol Biol. 2003, 327 (5): 919-923. 10.1016/S0022-2836(03)00239-0.
- Ramirez F, Schlicker A, Assenov Y, Lengauer T, Albrecht M: Computational analysis of human protein interaction networks. Proteomics. 2007, 7 (15): 2541-2552. 10.1002/pmic.200600924.
- Hart GT, Ramani AK, Marcotte EM: How complete are current yeast and human protein-interaction networks?. Genome Biol. 2006, 7 (11): 120. 10.1186/gb-2006-7-11-120.
- Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D: The Database of Interacting Proteins: 2004 update. Nucleic Acids Res. 2004, 32 (Database issue): D449-D451. 10.1093/nar/gkh086.
- Chatr-aryamontri A, Ceol A, Palazzi LM, Nardelli G, Schneider MV, Castagnoli L, Cesareni G: MINT: the Molecular INTeraction database. Nucleic Acids Res. 2007, 35 (Database issue): D572-574. 10.1093/nar/gkl950.
- Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M: a general repository for interaction datasets. Nucleic Acids Res. 2006, 34 (Database issue): D535-539. 10.1093/nar/gkj109.
- Hermjakob H, Montecchi-Palazzi L, Lewington C, Mudali S, Kerrien S, Orchard S, Vingron M, Roechert B, Roepstorff P, Valencia A: IntAct: an open source molecular interaction database. Nucleic Acids Res. 2004, 32 (Database issue): D452-455. 10.1093/nar/gkh052.
- Shoemaker BA, Panchenko AR: Deciphering protein-protein interactions. Part II. Computational methods to predict protein and domain interaction partners. PLoS Comput Biol. 2007, 3 (4): e43. 10.1371/journal.pcbi.0030043.
- Salwinski L, Eisenberg D: Computational methods of analysis of protein-protein interactions. Curr Opin Struct Biol. 2003, 13 (3): 377-382. 10.1016/S0959-440X(03)00070-8.
- Ge H, Liu Z, Church GM, Vidal M: Correlation between transcriptome and interactome mapping data from Saccharomyces cerevisiae. Nat Genet. 2001, 29 (4): 482-486. 10.1038/ng776.
- Kemmeren P, van Berkum NL, Vilo J, Bijma T, Donders R, Brazma A, Holstege FC: Protein interaction verification and functional annotation by integrated analysis of genome-scale data. Mol Cell. 2002, 9 (5): 1133-1143. 10.1016/S1097-2765(02)00531-2.
- Jansen R, Greenbaum D, Gerstein M: Relating whole-genome expression data with protein-protein interactions. Genome Res. 2002, 12 (1): 37-46. 10.1101/gr.205602.
- Hahn A, Rahnenfuhrer J, Talwar P, Lengauer T: Confirmation of human protein interaction data by human expression data. BMC Bioinformatics. 2005, 6: 112. 10.1186/1471-2105-6-112.
- Lu LJ, Xia Y, Paccanaro A, Yu H, Gerstein M: Assessing the limits of genomic data integration for predicting protein networks. Genome Res. 2005, 15 (7): 945-953. 10.1101/gr.3610305.
- Patil A, Nakamura H: Filtering high-throughput protein-protein interaction data using a combination of genomic features. BMC Bioinformatics. 2005, 6: 100. 10.1186/1471-2105-6-100.
- Lin N, Wu B, Jansen R, Gerstein M, Zhao H: Information assessment on predicting protein-protein interactions. BMC Bioinformatics. 2004, 5: 154. 10.1186/1471-2105-5-154.
- Goldberg DS, Roth FP: Assessing experimentally derived interactions in a small world. Proc Natl Acad Sci USA. 2003, 100 (8): 4372-4376. 10.1073/pnas.0735871100.
- Saito R, Suzuki H, Hayashizaki Y: Interaction generality, a measurement to assess the reliability of a protein-protein interaction. Nucleic Acids Res. 2002, 30 (5): 1163-1168. 10.1093/nar/30.5.1163.
- Walhout AJ, Sordella R, Lu X, Hartley JL, Temple GF, Brasch MA, Thierry-Mieg N, Vidal M: Protein interaction mapping in C. elegans using proteins involved in vulval development. Science. 2000, 287 (5450): 116-122. 10.1126/science.287.5450.116.
- Yu H, Luscombe NM, Lu HX, Zhu X, Xia Y, Han JD, Bertin N, Chung S, Vidal M, Gerstein M: Annotation transfer between genomes: protein-protein interologs and protein-DNA regulogs. Genome Res. 2004, 14 (6): 1107-1118. 10.1101/gr.1774904.
- Matthews LR, Vaglio P, Reboul J, Ge H, Davis BP, Garrels J, Vincent S, Vidal M: Identification of potential interaction networks using sequence-based searches for conserved protein-protein interactions or "interologs". Genome Res. 2001, 11 (12): 2120-2126. 10.1101/gr.205301.
- Mika S, Rost B: Protein-protein interactions more conserved within species than across species. PLoS Comput Biol. 2006, 2 (7): e79. 10.1371/journal.pcbi.0020079.
- Brown KR, Jurisica I: Unequal evolutionary conservation of human protein interactions in interologous networks. Genome Biol. 2007, 8 (5): R95. 10.1186/gb-2007-8-5-r95.
- Tirosh I, Barkai N: Computational verification of protein-protein interactions by orthologous co-expression. BMC Bioinformatics. 2005, 6: 40. 10.1186/1471-2105-6-40.
- Lehner B, Fraser AG: A first-draft human protein-interaction map. Genome Biol. 2004, 5 (9): R63. 10.1186/gb-2004-5-9-r63.
- Rhodes DR, Tomlins SA, Varambally S, Mahavisno V, Barrette T, Kalyana-Sundaram S, Ghosh D, Pandey A, Chinnaiyan AM: Probabilistic model of the human protein-protein interaction network. Nat Biotechnol. 2005, 23 (8): 951-959. 10.1038/nbt1103.
- Scott MS, Barton GJ: Probabilistic prediction and ranking of human protein-protein interactions. BMC Bioinformatics. 2007, 8: 239. 10.1186/1471-2105-8-239.
- Xia K, Dong D, Han JD: IntNetDB v1.0: an integrated protein-protein interaction network database generated by a probabilistic model. BMC Bioinformatics. 2006, 7: 508. 10.1186/1471-2105-7-508.
- Huang TW, Lin CY, Kao CY: Reconstruction of human protein interolog network using evolutionary conserved network. BMC Bioinformatics. 2007, 8: 152. 10.1186/1471-2105-8-152.
- Sprinzak E, Altuvia Y, Margalit H: Characterization and prediction of protein-protein interactions within and between complexes. Proc Natl Acad Sci USA. 2006, 103 (40): 14718-14723. 10.1073/pnas.0603352103.
- Ben-Hur A, Noble WS: Kernel methods for predicting protein-protein interactions. Bioinformatics. 2005, 21 (Suppl 1): i38-46. 10.1093/bioinformatics/bti1016.
- Geisler-Lee J, O'Toole N, Ammar R, Provart NJ, Millar AH, Geisler M: A predicted interactome for Arabidopsis. Plant Physiol. 2007, 145 (2): 317-329. 10.1104/pp.107.103465.
- Cui J, Li P, Li G, Xu F, Zhao C, Li Y, Yang Z, Wang G, Yu Q, Li Y: AtPID: Arabidopsis thaliana protein interactome database – an integrative platform for plant systems biology. Nucleic Acids Res. 2008, 36 (Database issue): D999-1008.
- Heazlewood JL, Verboom RE, Tonti-Filippini J, Small I, Millar AH: SUBA: the Arabidopsis Subcellular Database. Nucleic Acids Res. 2007, 35 (Database issue): D213-218. 10.1093/nar/gkl863.
- Vandepoele K, Vlieghe K, Florquin K, Hennig L, Beemster GT, Gruissem W, Peer Van de Y, Inze D, De Veylder L: Genome-wide identification of potential plant E2F target genes. Plant Physiol. 2005, 139 (1): 316-328. 10.1104/pp.105.066290.
- Vandenbroucke K, Robbens S, Vandepoele K, Inze D, Peer Van de Y, Van Breusegem F: Hydrogen peroxide-induced gene expression across kingdoms: a comparative analysis. Mol Biol Evol. 2008, 25 (3): 507-516. 10.1093/molbev/msm276.
- Makarevich G, Leroy O, Akinci U, Schubert D, Clarenz O, Goodrich J, Grossniklaus U, Kohler C: Different Polycomb group complexes regulate common target genes in Arabidopsis. EMBO Rep. 2006, 7 (9): 947-952. 10.1038/sj.embor.7400760.
- Brzeski J, Podstolski W, Olczak K, Jerzmanowski A: Identification and analysis of the Arabidopsis thaliana BSH gene, a member of the SNF5 gene family. Nucleic Acids Res. 1999, 27 (11): 2393-2399. 10.1093/nar/27.11.2393.
- Bezhani S, Winter C, Hershman S, Wagner JD, Kennedy JF, Kwon CS, Pfluger J, Su Y, Wagner D: Unique, shared, and redundant roles for the Arabidopsis SWI/SNF chromatin remodeling ATPases BRAHMA and SPLAYED. Plant Cell. 2007, 19 (2): 403-416. 10.1105/tpc.106.048272.
- Noh YS, Amasino RM: PIE1, an ISWI family gene, is required for FLC activation and floral repression in Arabidopsis. Plant Cell. 2003, 15 (7): 1671-1682. 10.1105/tpc.012161.
- Jullien PE, Mosquna A, Ingouff M, Sakata T, Ohad N, Berger F: Retinoblastoma and its binding partner MSI1 control imprinting in Arabidopsis. PLoS Biol. 2008, 6 (8): e194. 10.1371/journal.pbio.0060194.
- Jensen LJ, Jensen TS, de Lichtenberg U, Brunak S, Bork P: Co-evolution of transcriptional and post-translational cell-cycle regulation. Nature. 2006, 443 (7111): 594-597.
- Baerenfaller K, Grossmann J, Grobei MA, Hull R, Hirsch-Hoffmann M, Yalovsky S, Zimmermann P, Grossniklaus U, Gruissem W, Baginsky S: Genome-scale proteomics reveals Arabidopsis thaliana gene models and proteome dynamics. Science. 2008, 320 (5878): 938-941. 10.1126/science.1157956.
- Guldener U, Munsterkotter M, Oesterheld M, Pagel P, Ruepp A, Mewes HW, Stumpflen V: MPact: the MIPS protein interaction resource on yeast. Nucleic Acids Res. 2006, 34 (Database issue): D436-441. 10.1093/nar/gkj003.
- Mishra GR, Suresh M, Kumaran K, Kannabiran N, Suresh S, Bala P, Shivakumar K, Anuradha N, Reddy R, Raghavan TM: Human protein reference database – 2006 update. Nucleic Acids Res. 2006, 34 (Database issue): D411-414. 10.1093/nar/gkj141.
- O'Brien KP, Remm M, Sonnhammer EL: Inparanoid: a comprehensive database of eukaryotic orthologs. Nucleic Acids Res. 2005, 33 (Database issue): D476-480.
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215 (3): 403-410.
- Chen F, Mackey AJ, Stoeckert CJ, Roos DS: querying a comprehensive multi-species collection of ortholog groups. Nucleic Acids Res. 2006, 34 (Database issue): D363-368. 10.1093/nar/gkj123.
- Li L, Stoeckert CJ, Roos DS: OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 2003, 13 (9): 2178-2189. 10.1101/gr.1224503.
- Harris MA, Clark J, Ireland A, Lomax J, Ashburner M, Foulger R, Eilbeck K, Lewis S, Marshall B, Mungall C: The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res. 2004, 32 (Database issue): D258-261.
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000, 25 (1): 25-29. 10.1038/75556.
- Rhee SY, Beavis W, Berardini TZ, Chen G, Dixon D, Doyle A, Garcia-Hernandez M, Huala E, Lander G, Montoya M, et al: The Arabidopsis Information Resource (TAIR): a model organism database providing a centralized, curated gateway to Arabidopsis biology, research materials and community. Nucleic Acids Res. 2003, 31 (1): 224-228. 10.1093/nar/gkg076.
- Craigon DJ, James N, Okyere J, Higgins J, Jotham J, May S: NASCArrays: a repository for microarray data generated by NASC's transcriptomics service. Nucleic Acids Res. 2004, 32 (Database issue): D575-577. 10.1093/nar/gkh133.
- Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP: Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res. 2003, 31 (4): e15. 10.1093/nar/gng015.
- Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003, 4 (2): 249-264. 10.1093/biostatistics/4.2.249.
- Ben-Dor A, Shamir R, Yakhini Z: Clustering gene expression patterns. J Comput Biol. 1999, 6 (3–4): 281-297. 10.1089/106652799318274.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.