Building networks and performing analysis with Path2enet
Path2enet is a bioinformatics tool that integrates information from pathways, protein-protein interactions and expression datasets (obtained with microarrays, RNA-Seq or ESTs) from different tissues and cell types. Path2enet uses these datasets to build a network view of biological pathways in an expression-specific context. The tool identifies the genes/proteins that are ON in specific samples by applying the Barcode algorithm, and allows the use of specific experimental expression data to present focused views of the human pathways map as specific biomolecular networks.
In the networks built using Path2enet, the “nodes” correspond to the proteins included in the queried pathway, annotated with the active or inactive state of each protein (derived from the expression data of the cell types or tissues studied in each case). The “edges” of the network correspond to the links or associations between the biomolecular entities (derived from the information included in the pathways). These links can be activation, inhibition, expression, phosphorylation, etc. To facilitate further analysis of the networks, the edges generated by Path2enet are taken as undirected.
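The node and edge model described above can be illustrated with a minimal Python sketch. It is not the Path2enet implementation (which is an R package); the relation list, the ON/OFF calls and the function name `build_undirected` are hypothetical and serve only to show how typed pathway links can be stored as undirected edges between state-annotated nodes.

```python
from collections import defaultdict

# Hypothetical pathway relations: (source, target, relation type).
# Toy values for illustration, not taken from the KEGG chart.
relations = [
    ("DLL1", "NOTCH2", "activation"),
    ("NOTCH2", "DTX1", "binding"),
    ("NOTCH2", "RBPJ", "activation"),
    ("RBPJ", "HES1", "expression"),
]

# Hypothetical ON/OFF calls for one cell type.
state = {"DLL1": "OFF", "NOTCH2": "ON", "DTX1": "ON", "RBPJ": "ON", "HES1": "ON"}


def build_undirected(relations, state):
    """Return (nodes, edges).

    nodes maps each protein to its ON/OFF state ("unknown" if missing);
    edges maps an unordered protein pair to the set of relation types,
    so the direction of the original link is deliberately dropped.
    """
    nodes = {}
    edges = defaultdict(set)
    for source, target, rel_type in relations:
        for protein in (source, target):
            nodes[protein] = state.get(protein, "unknown")
        edges[frozenset((source, target))].add(rel_type)
    return nodes, dict(edges)


nodes, edges = build_undirected(relations, state)
```

Using an unordered pair as the edge key means that a link reported in both directions collapses into one edge that keeps both relation types.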
Considering the coverage over the map of human pathways, Path2enet can generate two different types of networks. The first is the “local” network which strictly includes the nodes of the canonical pathway selected from KEGG. For example, in the case of the NOTCH Signaling Pathway (KEGG ID: hsa04330) (Fig. 1a) the “local” network retrieves the 48 genes/proteins that are included in this pathway for human (Homo sapiens). Thus, Path2enet generates a network where each node is a protein and the edges are colored according to the type of association reported in the pathway (Fig. 1b). The second type of network that Path2enet can build is the “global” pathway-network that includes all the “local” nodes and links from a given KEGG pathway, plus all the extra “external” nodes that such nodes can be linked to in other pathway charts (i.e., it provides the links to other nodes in any biological pathway of the whole human repertoire). In this way, Path2enet is not restricted to predefined pathways since it can create large networks blending multiple layers of biological information.
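The difference between the “local” and the “global” networks can be sketched as follows. This is a simplified illustration, assuming that the “global” network adds every link, found in any other pathway, that touches at least one “local” node; the small relation table, the identifiers `other_pathway_1` and `other_pathway_2`, and the function names are hypothetical.

```python
# Hypothetical relation table keyed by pathway ID (toy data).
pathway_db = {
    "hsa04330": [("NOTCH1", "RBPJ"), ("RBPJ", "HES1")],
    "other_pathway_1": [("HES1", "GENE_X"), ("GENE_X", "GENE_Y")],
    "other_pathway_2": [("GENE_Y", "GENE_Z")],
}


def nodes_of(edges):
    """Collect every node that appears in a set of unordered edges."""
    return set().union(*edges) if edges else set()


def local_network(db, pathway_id):
    """Nodes and undirected edges strictly inside the selected pathway."""
    edges = {frozenset(pair) for pair in db[pathway_id]}
    return nodes_of(edges), edges


def global_network(db, pathway_id):
    """Local network plus links from other pathways that touch a local node."""
    local_nodes, local_edges = local_network(db, pathway_id)
    edges = set(local_edges)
    for other_id, pairs in db.items():
        if other_id == pathway_id:
            continue
        for pair in pairs:
            if local_nodes & set(pair):
                edges.add(frozenset(pair))
    return nodes_of(edges), edges


local_nodes, local_edges = local_network(pathway_db, "hsa04330")
global_nodes, global_edges = global_network(pathway_db, "hsa04330")
external_nodes = global_nodes - local_nodes
```

In this toy example only `GENE_X` becomes an “external” node, because it is the only protein outside the selected pathway that is directly linked to a local node.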
Once a network is built with Path2enet, the topological parameters of the network (such as degree, betweenness, clustering coefficient, eigenvector centrality, etc.) can be calculated, because the tool generates igraph objects [16] that can be studied with graph analysis tools. In this way, Path2enet provides ways to identify hubs and clusters in the network.
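To make two of these parameters concrete, the following sketch computes the degree and the local clustering coefficient on a small undirected graph using only the Python standard library. In Path2enet these values would be obtained from the igraph object; the edge list and helper names here are hypothetical.

```python
from itertools import combinations

# Hypothetical undirected edge list (toy network).
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "D")]


def adjacency(edges):
    """Build a node -> set of neighbours map for an undirected graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def degree(adj):
    """Number of neighbours of each node."""
    return {node: len(neighbours) for node, neighbours in adj.items()}


def local_clustering(adj, node):
    """Fraction of neighbour pairs of `node` that are themselves linked."""
    neighbours = adj[node]
    k = len(neighbours)
    if k < 2:
        return 0.0
    linked = sum(1 for u, v in combinations(neighbours, 2) if v in adj[u])
    return 2.0 * linked / (k * (k - 1))


adj = adjacency(edges)
deg = degree(adj)
hub = max(deg, key=deg.get)  # node with the highest degree
clustering = {node: local_clustering(adj, node) for node in adj}
```

Here node `A` has the highest degree (three neighbours) and would be flagged as the hub, while `B` and `C` have a clustering coefficient of 1.0 because their two neighbours are linked to each other.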
Application of Path2enet to build the NOTCH pathway-network of B and T cells
In the case study presented in this article we used Path2enet to generate expression networks of the NOTCH signaling pathway in three types of human cells: B cells (CD19+) and T cells (CD4+ and CD8+) (Fig. 2). To achieve this, we used a sample dataset of microarray expression (indicated in Methods).
First, we applied the gene expression Barcode analysis to the gene products present in the NOTCH pathway-network (Fig. 1b) to identify which nodes were active in these cell types. The quantitative results of these analyses are presented in Fig. 3. Using a threshold of 0.4 for the normalized expression, the B cell network expressed 34 of the 48 NOTCH pathway proteins. In contrast, the T cell networks expressed 22–24 of the NOTCH proteins. Notably, DLL1/2/3 and JAG1/2 were absent (i.e., they were OFF) in all lymphocytes. These proteins are ligands of the NOTCH receptors of lymphocytes and are presented by the external cells that connect to them; therefore, they are not expected to be present in the lymphocytes themselves. This is clearly shown in the quantitative analysis (Fig. 3), since all these genes were labeled OFF (not expressed) in B cells and in T cells.
We also observed that the only NOTCH paralogs detected in the lymphocytes were NOTCH2 and some NOTCH1. It is well known that NOTCH2 is preferentially expressed in mature naive B cells and interacts with DTX1, thus playing an important role in B cell development [17]. We also saw that the level of DTX1 was much higher in B cells (DTX1 = 1.00) than in CD4+ T cells (0.41) or CD8+ T cells (0.17) (Fig. 3). This result is also in agreement with several studies showing that T cells develop normally in the absence of DTX1 [18].
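The ON/OFF labeling step can be illustrated with the DTX1 values reported above and the threshold of 0.4. This is only a sketch of the thresholding, not of the Barcode algorithm itself; the function name `call_state` is hypothetical, and treating a value equal to the threshold as ON is an assumption.

```python
THRESHOLD = 0.4  # threshold on the normalized expression, as stated in the text

# DTX1 values reported in the text for the three cell types (Fig. 3).
dtx1 = {
    "B cells CD19+": 1.00,
    "T cells CD4+": 0.41,
    "T cells CD8+": 0.17,
}


def call_state(value, threshold=THRESHOLD):
    """Label a normalized expression value as ON or OFF.

    Assumption: a value equal to the threshold counts as ON.
    """
    return "ON" if value >= threshold else "OFF"


calls = {cell: call_state(value) for cell, value in dtx1.items()}
n_on = sum(1 for label in calls.values() if label == "ON")
```

Note that the CD4+ value (0.41) lies just above the threshold, so a binary label alone hides how close it is to an OFF call; the quantitative values remain informative.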
Finally, another protein found differentially expressed, present in B cells but not in T cells, was the transcription factor HES1. The presence and role of this transcription factor in lymphocytes have been shown in several studies [19, 20]. In fact, it has been indicated that in T cells HES1 is dispensable beyond the beta selection checkpoint [21]. This is consistent with our detection of HES1 in CD19+ B cells and its absence in CD4+ and CD8+ T cells.
As a whole, the data presented in Figs. 2 and 3 were consistent with current knowledge of the role of the NOTCH pathway in human B and T lymphocytes, supporting the value of generating well-defined “pathway-expression-networks” for specific cell types, which is the scope of Path2enet.
Path2enet tool for pathways: usability and formats
The KEGG pathways database (http://www.kegg.jp/) provides KGML files for each biological pathway on its website. For example, in the case of the human NOTCH signaling pathway (KEGG ID reference: hsa04330) the KGML file can be downloaded freely as “hsa04330.xml”. The link for this file is: http://www.kegg.jp/kegg-bin/download?entry=hsa04330&format=kgml. In this way, any specific pathway is accessible via its KGML file on the KEGG website, and the Path2enet R package provides functions to download these files and create a MySQL database derived from the KGMLs (as explained in the R vignette included with Path2enet). Moreover, to facilitate the use of the pathway KGML files within Path2enet, we also provide an SQL dump file (“Path2enet_KeggSQL.sql”) generated with all the KGML files of Homo sapiens (this datafile is provided at: http://bioinfow.dep.usal.es/path2enet/). This allows the creation of the necessary SQL database on the user’s computer to query for specific pathways and to use the other functions of Path2enet. This database resource is not just a compendium of KGML files from KEGG, given that it provides some important added value: (i) it includes a mapping of all the gene and protein identifiers (IDs) from KEGG to the IDs of UniProtKB (used as the reference protein database in Path2enet); (ii) it includes a relational SQL structure, based on the data extracted from the pathways, that stores such information in two principal indexed tables: one describing the pair-wise links or relations between protein pairs, and another describing the characteristics of each individual protein.
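The two-table structure described above can be sketched in Python with the standard library. This is not the Path2enet schema: SQLite stands in for MySQL so the example is self-contained, the table and column names are hypothetical, the embedded KGML fragment is an illustrative simplification rather than the actual content of “hsa04330.xml”, and the small KEGG-to-UniProtKB map is supplied by hand for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Illustrative KGML-like fragment (simplified; not the real file content).
KGML = """<pathway name="path:hsa04330" org="hsa" number="04330">
  <entry id="1" name="hsa:4851" type="gene"/>
  <entry id="2" name="hsa:3516" type="gene"/>
  <relation entry1="1" entry2="2" type="PPrel">
    <subtype name="activation" value="--&gt;"/>
  </relation>
</pathway>"""

# Hand-written ID map for illustration (NOTCH1 and RBPJ).
KEGG_TO_UNIPROT = {"hsa:4851": "P46531", "hsa:3516": "Q06330"}


def load_kgml(conn, kgml_text, id_map):
    """Fill a protein table and a relation table from one KGML document."""
    root = ET.fromstring(kgml_text)
    pathway = root.get("name")
    conn.execute("CREATE TABLE IF NOT EXISTS protein "
                 "(kegg_id TEXT, uniprot_id TEXT, pathway TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS relation "
                 "(source TEXT, target TEXT, rel_type TEXT, pathway TEXT)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_protein ON protein (uniprot_id)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_relation ON relation (source, target)")

    # Simplification: each gene entry is assumed to carry a single KEGG ID.
    entry_to_kegg = {}
    for entry in root.iter("entry"):
        if entry.get("type") != "gene":
            continue
        kegg_id = entry.get("name")
        entry_to_kegg[entry.get("id")] = kegg_id
        conn.execute("INSERT INTO protein VALUES (?, ?, ?)",
                     (kegg_id, id_map.get(kegg_id), pathway))

    for rel in root.iter("relation"):
        source = entry_to_kegg.get(rel.get("entry1"))
        target = entry_to_kegg.get(rel.get("entry2"))
        if source is None or target is None:
            continue  # skip relations that involve non-gene entries
        for subtype in rel.iter("subtype"):
            conn.execute("INSERT INTO relation VALUES (?, ?, ?, ?)",
                         (id_map.get(source), id_map.get(target),
                          subtype.get("name"), pathway))
    conn.commit()


conn = sqlite3.connect(":memory:")
load_kgml(conn, KGML, KEGG_TO_UNIPROT)
rows = conn.execute("SELECT source, target, rel_type FROM relation").fetchall()
proteins = conn.execute("SELECT kegg_id, uniprot_id FROM protein ORDER BY kegg_id").fetchall()
```

Storing the relation table with UniProtKB identifiers, rather than KEGG entry numbers, is what allows links from different pathway files to be joined on the same protein.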
With respect to formats other than XML and KGML, Path2enet can also use any database or resource provided in a “network structure” as an igraph object, because the tool includes functions to read and load igraph objects in R. For other standard formats, such as SBML or BioPAX, there are already tools that address this scope. One example is KEGGtranslator [22], an easy-to-use stand-alone application that can visualize and convert KGML-formatted XML files into multiple output formats, and that can extend the information in the translated documents beyond the scope of the KGML document. KEGGtranslator converts KEGG files (KGML-formatted XML files) to SBML, BioPAX, SIF, SBGN, SBML-qual, GML, GraphML and LaTeX. Moreover, in Bioconductor (https://www.bioconductor.org/) there are packages to parse, modify and visualize BioPAX data, like rBiopaxParser [23] or PaxtoolsR [24]. At the moment, we are working on a workflow that uses these packages to create SQL databases, similar to the SQL database described above, but using data from other pathway resources such as Reactome or Pathway Commons. This work is under development, but one of the main problems in the use of these resources is not the use of standard formats, like BioPAX or SBML, but the accurate mapping to standard protein identifiers from UniProtKB.