HAPPI: an online database of comprehensive human annotated and predicted protein interactions
© Chen et al. 2009
Published: 7 July 2009
Human protein-protein interaction (PPI) data are the foundation for understanding molecular signalling networks and the functional roles of biomolecules. Several human PPI databases have become available; however, comparisons of these datasets have suggested limited data coverage and poor data quality. Ongoing collection and integration of human PPIs from different sources, both experimental and computational, can enable disease-specific network biology modelling in translational bioinformatics studies.
We developed a new web-based resource, the Human Annotated and Predicted Protein Interaction (HAPPI) database, located at http://bio.informatics.iupui.edu/HAPPI/. The HAPPI database was created by extracting and integrating publicly available protein interaction databases, including HPRD, BIND, MINT, STRING, and OPHID, using database integration techniques. We designed a unified entity-relationship data model to resolve semantic level differences of diverse concepts involved in PPI data integration. We applied a unified scoring model to give each PPI a measure of its reliability that can place each PPI at one of the five star rank levels from 1 to 5. We assessed the quality of PPIs contained in the new HAPPI database, using evolutionarily conserved co-expression pairs called "MetaGene" pairs to measure the extent of MetaGene pair and PPI pair overlaps. While the overall quality of the HAPPI database across all star ranks is comparable to the overall qualities of HPRD or IntNetDB, the subset of the HAPPI database with star ranks between 3 and 5 has a much higher average quality than all other human PPI databases. As of summer 2008, the database contains 142,956 non-redundant, medium to high-confidence level human protein interaction pairs among 10,592 human proteins. The HAPPI database web application also provides hyperlinked information of genes, pathways, protein domains, protein structure displays, and sequence feature maps for interactive exploration of PPI data in the database.
HAPPI is by far the most comprehensive public compilation of human protein interaction information. It enables its users to fully explore PPI data with quality measures and annotated information necessary for emerging network biology studies.
Protein-protein interactions (PPIs) are an important foundation for understanding how biological processes take place in cells, how cellular signals are modulated, and how molecules act in concert in response to external environmental stimuli. High-throughput projects that map protein-protein interactions in model organisms were first initiated less than a decade ago, including those for Saccharomyces cerevisiae (which resulted in the detection of 957 putative interactions involving 1,004 proteins), Drosophila melanogaster (20,405 interactions from 7,048 proteins), Caenorhabditis elegans (~5,500 interactions), and Mus musculus [3–5]. In 2003, Chen et al. first reported the generation of 13,656 high-throughput human protein interactions in homogenized human brain using a random yeast two-hybrid platform; in 2005, Stelzl et al. identified 3,186 mostly novel interactions among 1,705 human proteins; then, Rual et al. reported the mapping of ~2,800 proteins in a human protein-protein interaction network; in 2007, Ewing et al. reported a large-scale study of protein-protein interactions in human cells using a mass spectrometry-based approach, producing a data set of 6,463 interactions among 2,235 distinct human proteins.
These high-throughput experimental determinations of PPIs have led to an influx of PPI experimental data. By early 2008, BioGrid reported a comprehensive collection of 198,000 protein and genetic interactions from major organisms, including S. cerevisiae, S. pombe, D. melanogaster, C. elegans, M. musculus, and H. sapiens. However, the coverage of data directly captured from experimental platforms in human is still quite poor. In the most recent release 7 of the Human Protein Reference Database (HPRD), there are only 38,167 protein interactions reported, an average of only 1.5 interactions for each of the 25,661 human proteins included in HPRD.
While it remains an open question how many measurable human protein interactions there are, the use of PPI data in building disease-relevant molecular interaction network models has already emerged as a major theme for "translational bioinformatics", studies that aim to facilitate the transformation of bioinformatics discoveries from "Omics" experiments into biomedical applications via bi-directional information exchange [12, 13]. Recent research studies have shown that, by building comprehensive disease-relevant PPI sub-networks, researchers can generate and validate biological hypotheses that could lead to novel biomarkers or therapeutic developments for many complex diseases such as Huntington's disease, Alzheimer's disease, breast cancer, Fanconi anemia, and ovarian cancer [14–18]. These studies, however, were primarily based on available human PPIs in existing PPI database repositories with limited coverage and/or uncertain quality. It is expected that new comprehensive database collections of human PPIs, with expanded data coverage and quantifiable reliability measures, could significantly enhance the impact of future network modeling research.
Several human PPI databases have begun to expand experimental human PPI data coverage, which is bottlenecked by experimental data throughput and cost. There are four common approaches to PPI data expansion: 1) manual curation from the biomedical literature by experts; 2) automated PPI data extraction from biomedical literature with text mining methods; 3) computational inference based on interacting protein domains or co-regulation relationships, often derived from data in model organisms; and 4) data integration from various experimental or computational sources. Partly due to the difficulty of evaluating the quality of PPI data, a majority of widely-used PPI databases, including DIP, BIND, MINT, HPRD, and IntAct [11, 19–22], take a "conservative approach" to PPI data expansion by adding only manually curated interactions. Therefore, the coverage of the protein interactome developed using this approach is poor. In the second approach, literature mining, computer software replaces database curators to extract protein interaction (or association) data from large volumes of biomedical literature. Due to the complexity of the natural language processing techniques involved, however, this approach often generates a large number of false positive protein "associations" that are not truly biologically significant "interactions". The advantages of computational inference are attributable to the various biological models that can be used to expand data coverage. For example, the HPID database was developed from existing structural and experimental data by homology searching; OPHID was also constructed by mapping interacting proteins from model organisms to their human protein orthologs. In an integrative approach, PPI data from different sources are evaluated and combined, thus providing maximal likelihood for quality and coverage.
For example, the STRING database (version 7) has now integrated known and predicted interactions from a variety of sources, and covers all domains of life (prokaryotes to higher eukaryotes). Xia et al. applied a probabilistic model and integrated 27 heterogeneous genomic, proteomic and functional annotation datasets to predict human PPI networks. UniHI and IntNetDB are both based on several major interaction maps derived by computational and experimental methods [27, 28]. The challenge for the integrative approach is how to balance quality with coverage. In particular, different databases may contain much redundant PPI information derived from the same sources, while the overlaps between independently derived PPI data sets are quite low [29, 30].
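The redundancy problem described above can be illustrated with a short Python sketch that collapses records from several sources into non-redundant, undirected pairs. This is only an illustration under stated assumptions: the record layout, the function name `merge_ppi_sources`, and the placeholder protein identifiers are hypothetical and do not describe the actual HAPPI integration pipeline.

```python
from collections import defaultdict


def merge_ppi_sources(records):
    """Collapse (protein_a, protein_b, source) records into non-redundant pairs.

    Interactions are treated as undirected, so (A, B) and (B, A) count as
    the same pair. Returns a dict mapping each sorted pair to the set of
    sources that reported it.
    """
    merged = defaultdict(set)
    for protein_a, protein_b, source in records:
        pair = tuple(sorted((protein_a, protein_b)))
        merged[pair].add(source)
    return dict(merged)


# Hypothetical records; identifiers are placeholders, not real database content.
records = [
    ("PROT_A", "PROT_B", "HPRD"),
    ("PROT_B", "PROT_A", "STRING"),  # same pair, partners listed in reverse
    ("PROT_A", "PROT_B", "HPRD"),    # verbatim duplicate
    ("PROT_C", "PROT_D", "MINT"),
]
merged = merge_ppi_sources(records)
for pair, sources in sorted(merged.items()):
    print(pair, sorted(sources))
```

Keeping the set of reporting sources for each pair, rather than only the pair, makes it possible to see afterwards whether an interaction is supported by independent databases or by a single one.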
In this work, we describe a new PPI web database resource, the Human Annotated and Predicted Protein Interaction (HAPPI) database, located at http://bio.informatics.iupui.edu/HAPPI/. As of early 2008, HAPPI (version 1.1) contains 142,956 non-redundant, medium to high-confidence human protein interaction pairs among 10,592 human proteins identified by UniProt protein names. The HAPPI database aims to become the most comprehensive public compilation of human protein interaction information. The protein interactions are integrated from multiple data sources including both experimental and computationally-derived PPIs. Each protein interaction in HAPPI is assigned a PPI confidence grade of 1, 2, 3, 4, or 5 to help users evaluate the reliability and confidence of reported interactions. Each interaction is computationally annotated with information including biological pathways, gene functions, protein families, protein structures, sequence features, and literature sources. These database capabilities will enable both biomedical researchers and network biology users to evaluate the biological significance of specific protein interactions, from which they can build network models for future translational bioinformatics research.
HAPPI database protein interaction data quality grade and coverage (table body not recoverable; one surviving cell reads "noisy and uncertain interactions").
All interacting proteins in the HAPPI database were annotated with gene function, pathway, protein domain, protein structure, and sequence feature map data. The data were separately imported into the Oracle 10g data warehouse from the UniProt, GenBank, HUGO Nomenclature, Ensembl, PubMed, PDB, Pfam, and KEGG databases. Altogether, we organized inside the data warehouse 70,829 curated human proteins and their descriptions, of which 13,601 proteins contain protein interaction information in the HAPPI database. We kept 361,975 literature abstract IDs where human gene/protein co-occurrence was detected by the STRING database, 52,186 protein domains/families from Pfam, 715 pathways from KEGG, 2,282 protein 3-D structures from PDB, and 76,797 annotated human gene features from GenBank. All the information was linked to the original source databases on the HAPPI web site, so that HAPPI users can navigate to database sources to determine the reliability of queried PPIs.
In this study, we chose to apply evolutionarily conserved co-expression pairs to the assessment and comparison of PPI data quality across different sources, including the HAPPI database. High-quality conserved gene co-expression profiles were used to assess protein interaction quality. Many protein interaction data sets were cross-validated with human gene co-expression profiles. While interacting proteins may share highly similar gene expression profiles, it has often been suggested that the expected correlation between protein interactions and gene expression is quite weak in human and in transient protein interactions. Furthermore, comprehensive expression profiles are difficult to compile for all cellular conditions. To improve the development of a co-expression based confidence measure for interacting proteins, Tirosh and Barkai showed that a method using co-expression of orthologs of interacting partners performed quite well. Their method was based on the assumption that a conserved co-expression relationship preserves true protein interactions that required the presence of both interacting proteins through evolution. Therefore, it is more sensitive overall than using information purely from the organism itself, e.g., simple co-expression, cellular co-localization, and similarity in gene ontology functional annotations. In a similar study, Bhardwaj and Lu also verified that reliable predictions of interactions from heterogeneous data sources could be strengthened by evolutionarily conserved gene co-expression measurements.
Our computational method was based on the degree of overlap between protein interactions and an evolutionarily conserved co-expressed gene data set called MetaGene. MetaGene consists of 22,163 evolutionarily conserved co-expression relationships from humans, flies, worms, and yeast, based on the analysis of over 3,182 published DNA microarray experiments by Stuart et al. It is a comprehensive compilation of evolutionarily conserved gene co-expression pairs from a diverse set of DNA microarray experiments that were obtained from four different organisms: 1,202 DNA microarrays from H. sapiens, 979 from C. elegans, 155 from D. melanogaster, and 643 from S. cerevisiae. The relative quality of each PPI database, including HAPPI, OPHID, IntNetDB, ProNet, UniHI, and HPRD, was estimated as the count of overlaps between protein interactions in the PPI database of interest and MetaGene conserved co-expressed gene pairs. The human subset of MetaGene data involves 6,591 human genes and 22,154 MetaGene co-expression gene pairs. Of the 22,154 human MetaGene co-expression gene pairs, 6,297 can be found in the union (U0 set) of all the known human PPI databases, including HAPPI, OPHID, IntNetDB, ProNet, UniHI, and HPRD; furthermore, 6,145 of the 6,297 MetaGene pairs form a large connected MetaGene co-expression association network that shows the scale-free property commonly observed in most molecular interaction networks. Therefore, we regarded these 6,145 MetaGene pairs (M0 set) as the most relevant high-quality subset of U0, which could be used as a gold standard for evaluating unknown PPIs from large databases. To facilitate comparisons of overlaps for different databases with MetaGene, we also developed an artificially synthesized protein-protein "random interaction" set (R0 set) of 37,000 PPIs (comparable to the size of all PPIs in HPRD), by randomly reconnecting proteins observed in U0.
Therefore, a lower bound on the quality of any protein interaction data set derived from U0 could be given by counting the overlap between R0 and M0. To adapt to the different sizes of PPI databases, we took a random sample of 1,000 PPIs each time from each database in the comparison (including R0), and repeated this random sampling process 1,000 times to obtain a distribution of normalized overlap counts with M0.
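The sampling procedure above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy protein identifiers, and the small sample sizes are assumptions chosen so the example runs quickly, and "normalized" is read here simply as using the same fixed sample size for every database.

```python
import random


def normalize(pair):
    """Return an undirected pair in a canonical (sorted) order."""
    return tuple(sorted(pair))


def random_rewire(ppis, n_pairs, rng):
    """Build a random interaction set by reconnecting proteins seen in ppis."""
    proteins = sorted({protein for pair in ppis for protein in pair})
    max_pairs = len(proteins) * (len(proteins) - 1) // 2
    if n_pairs > max_pairs:
        raise ValueError("not enough proteins to form n_pairs distinct pairs")
    random_pairs = set()
    while len(random_pairs) < n_pairs:
        protein_a, protein_b = rng.sample(proteins, 2)
        random_pairs.add(normalize((protein_a, protein_b)))
    return random_pairs


def overlap_distribution(ppis, gold, sample_size, n_rounds, rng):
    """Repeatedly sample PPIs and count how many fall in the gold-standard set."""
    ppi_list = sorted({normalize(pair) for pair in ppis})
    gold_set = {normalize(pair) for pair in gold}
    counts = []
    for _ in range(n_rounds):
        sample = rng.sample(ppi_list, sample_size)
        counts.append(sum(1 for pair in sample if pair in gold_set))
    return counts


# Toy data standing in for M0 (gold standard) and one PPI database.
proteins = [f"P{i}" for i in range(40)]
gold = {normalize((proteins[i], proteins[i + 1])) for i in range(39)}
database = set(gold)  # an idealized database lying entirely inside the gold set

rng = random.Random(0)
r0 = random_rewire(database, n_pairs=39, rng=rng)
db_counts = overlap_distribution(database, gold, sample_size=10, n_rounds=100, rng=rng)
r0_counts = overlap_distribution(r0, gold, sample_size=10, n_rounds=100, rng=rng)
print("database mean overlap:", sum(db_counts) / len(db_counts))
print("random-set mean overlap:", sum(r0_counts) / len(r0_counts))
```

With the sample size and number of rounds both set to 1,000, as in the text, the two lists of counts would play the roles of the overlap distributions for a database of interest and for the random baseline R0.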
HAPPI was developed as a web-based PPI database application and is freely accessible to the public at http://bio.informatics.iupui.edu/HAPPI/. In the current release, HAPPI contains 13,601 proteins and 1,209,463 PPIs integrated from five databases collected with both experimental and computational methods as described in the previous section. Users of the HAPPI web application software can search for PPIs using common protein identifiers. Typical web query results display all HAPPI PPIs at a default quality grade (star rank 3 and above). Users can drill down to explore annotations of the protein interaction or proteins involved.
While there are several methods for validating PPI data, including those based on interacting domains, gene co-expression profiles, or gene ontology (GO) annotation semantic distances [42, 45–49], we assessed the quality of the new HAPPI database by comparing the extent of overlap between PPIs and MetaGene pairs, using a new computational approach described earlier in the Method section.
Figure 3A shows that the 4-star quality grade HAPPI database subset has the highest MetaGene overlap, at approximately 72 out of 1000, among all databases compared (including UniHI, at approximately 8 overlaps, data not shown). The overall quality of the HAPPI database (at all star grades) is comparable to that of the recently published IntNetDB or HPRD (at approximately 13–15 overlaps overall), and still better than that of the ProNet database (a manually curated data set initially made public as the first database for human protein interactions; at approximately 8 overlaps overall). The overall quality of the HAPPI database at all star grades is not as good as that of BioGrid (at approximately 19 overlaps) or the OPHID database (at approximately 27 overlaps but with a wide spread), primarily because the HAPPI database at the one-star quality grade contains many literature mining based co-citation pairs whose proteins do not physically interact. The result also suggests that the overall quality of the OPHID database exceeds that of the reference curated HPRD database. We believe that this is primarily due to the challenge in identifying false positive interactions inherent in many experimentally-derived high-throughput PPI data, which HPRD also included with minimal additional validation. The OPHID database incorporated functionally conserved sequence and structure information, such as conserved interacting domain pairs, for developing and filtering human PPI data collected from different organisms, and may have therefore enriched its database with these computationally-derived plausible PPIs.
In Figure 3B, we show a sample frequency distribution of MetaGene overlaps among different quality grades of the HAPPI database subsets. The figure shows that while the overall data quality for the entire HAPPI database of 1.2 million PPIs may be relatively unimpressive (at an average MetaGene overlap of 14 out of 1000 in each sample), the remaining 650,000+ HAPPI database PPIs at star quality grades of 2 and above have an overall quality better than that of any of the existing public databases in the comparison, including the OPHID database. The average count of MetaGene overlaps also improves as the quality grade improves, at approximately 31 for 378,300 2-star PPIs, 47 for 142,071 3-star PPIs, 75 for 67,462 4-star PPIs, and 87 for 75,494 5-star PPIs. Because community knowledge of what constitutes "true protein interactions" across all cellular conditions remains poor, it is still challenging to validate the remaining PPIs that MetaGene data do not cover. However, our results show that the HAPPI database, particularly for star grades of 3, 4, and 5, clearly contains a much higher proportion of true positive PPIs than all other known human PPI databases. For that reason, we only report HAPPI database results with star grades of 3 and above in our database's web user interface.
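As a quick check on the counts quoted above, the per-grade figures can be tallied in a few lines of Python. The variable names are illustrative, and the one-star figure is not stated in the text: it is derived here under the assumption that every one of the 1,209,463 PPIs carries exactly one star grade.

```python
# Per-grade PPI counts as quoted in the text (star grades 2 to 5).
star_counts = {2: 378_300, 3: 142_071, 4: 67_462, 5: 75_494}
total_ppis = 1_209_463  # all PPIs in the current HAPPI release

# PPIs at star grade 2 and above (the "650,000+" subset).
two_star_and_above = sum(star_counts.values())

# PPIs at star grade 3 and above (the subset shown in the web interface).
three_star_and_above = sum(n for grade, n in star_counts.items() if grade >= 3)

# Assumption: whatever is left over is the one-star subset.
implied_one_star = total_ppis - two_star_and_above

print(two_star_and_above)    # 663327
print(three_star_and_above)  # 285027
print(implied_one_star)      # 546136
```

The tally gives 663,327 PPIs at two stars and above, consistent with the "650,000+" figure, and 285,027 PPIs at three stars and above.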
HAPPI enables users to retrieve human PPI data through multiple types of protein identifiers, such as UniProt IDs, Swiss-Prot accession numbers, RefSeq IDs, or IPI accession numbers, at its query home page. Query results that contain protein interaction data and quality ranks are shown in a single web page as a data table. The query result is available for download either in the Molecular Interaction (MI) format recommended by the Proteomics Standards Initiative (PSI) or in a Graph Markup Language (GML) format recommended by the International Molecular Exchange Consortium. Additional annotation details of the protein or protein interaction can be queried and retrieved online by selecting the hyperlinks in the protein interaction result page.
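To give a sense of what a graph-format download might look like, the Python sketch below serializes a small query result in the common GML key-value layout (`graph`, `node`, `edge`). It is an illustration only: the function name `ppis_to_gml`, the custom edge attribute `star`, and the placeholder identifiers are assumptions, and the actual files produced by HAPPI may be laid out differently.

```python
def ppis_to_gml(ppis):
    """Serialize (protein_a, protein_b, star_rank) tuples as a GML-style string."""
    ppis = list(ppis)

    # Assign each protein a numeric node id in order of first appearance.
    node_ids = {}
    for protein_a, protein_b, _ in ppis:
        for protein in (protein_a, protein_b):
            node_ids.setdefault(protein, len(node_ids))

    lines = ["graph [", "  directed 0"]
    for protein, node_id in node_ids.items():
        lines.append(f'  node [ id {node_id} label "{protein}" ]')
    for protein_a, protein_b, star_rank in ppis:
        # "star" is a hypothetical custom attribute holding the quality grade.
        lines.append(
            f"  edge [ source {node_ids[protein_a]} "
            f"target {node_ids[protein_b]} star {star_rank} ]"
        )
    lines.append("]")
    return "\n".join(lines)


# Hypothetical query result restricted to star rank 3 and above.
result = [("PROT_A", "PROT_B", 5), ("PROT_A", "PROT_C", 3)]
gml_text = ppis_to_gml(result)
print(gml_text)
```

A file in this layout can be opened by general-purpose network analysis tools that read GML, which is the main practical reason to offer a graph format alongside PSI-MI.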
HAPPI is by far the most comprehensive public compilation of human protein interaction data that come with a unified framework of interaction data reliability scores. In its current release, the HAPPI database contains 13,601 proteins and 1,209,463 PPIs integrated from several databases derived either experimentally or computationally. By comparing the degree of overlap between PPIs of varying quality grades and evolutionarily conserved co-expressed gene pairs, we assessed the quality of HAPPI. While the overall quality of HAPPI is comparable to that of the HPRD database, HAPPI PPIs with 3-5 star rank levels have a higher average quality than all other human PPI databases considered in this study, which include ProNet, UniHI, IntNetDB, OPHID, HPRD, and BioGrid.
For future HAPPI database releases, we have three plans. First, we wish to continue integrating and linking valuable annotation data into the HAPPI database. Protein interaction data from high-precision text mining projects could be used to improve the validation of high-quality protein interactions as "re-discovered" compared to the findings reported in past literature. Gene co-expression and Gene Ontology data are also candidates for the next data import, since both can help define the common functional context in which protein interactions may take place. Second, we plan on applying database customization techniques to improve the user querying experience with HAPPI. For example, we will add control buttons for users to customize interaction data quality filter thresholds, and to select a subset of retrieved protein interactions for downloading into spreadsheet programs. Third, we wish to improve existing PPI data investigation features. For example, we hope to run molecular docking programs and show computationally predicted protein binding constants and binding sites between two proteins. We also plan to improve the interplay between the JMOL and Safmap Java Applets so that sequence segments highlighted in one program are also highlighted in the other. With these improvements, we expect the database to play an essential role in helping biomedical researchers retrieve trustworthy information on plausible human protein interactions and in helping bioinformatics scientists conduct network biology modeling studies.
The HAPPI database was developed in part with research funding from the Research and Sponsored Programs of Indiana University – Purdue University Indianapolis awarded to Dr. Jake Chen. We thank Stephanie Burks of the University Information Technology and Services at Indiana University for providing generous support in Oracle 10g database administration, Jason Sisk from Indiana University School of Informatics for configuring the Web server for the project, Dr. Sudipto Saha from Indiana University School of Informatics for helping improve the web application user interface and the initial draft of the manuscript, and Basil George for assisting in the development of viewing PDB structures in the web interface. We are particularly grateful for the generous and timely help from Michael Grobe of Indiana University in proofreading the manuscript before it goes to press.
This article has been published as part of BMC Genomics Volume 10 Supplement 1, 2009: The 2008 International Conference on Bioinformatics & Computational Biology (BIOCOMP'08). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/10?issue=S1.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.