Functional module detection through integration of single-cell RNA sequencing data with protein–protein interaction networks

Background Recent advances in single-cell RNA sequencing have allowed researchers to explore transcriptional function at a cellular level. In particular, single-cell RNA sequencing reveals that there exist clusters of cells with similar gene expression profiles, representing different transcriptional states. Results In this study, we present scPPIN, a method for integrating single-cell RNA sequencing data with protein–protein interaction networks that detects active modules in cells of different transcriptional states. We achieve this by clustering RNA-sequencing data, identifying differentially expressed genes, constructing node-weighted protein–protein interaction networks, and finding the maximum-weight connected subgraphs with an exact Steiner-tree approach. As case studies, we investigate two RNA-sequencing data sets from human liver spheroids and human adipose tissue, respectively. With scPPIN we expand the output of differential expressed genes analysis with information from protein interactions. We find that different transcriptional states have different subnetworks of the protein–protein interaction networks significantly enriched which represent biological pathways. In these pathways, scPPIN identifies proteins that are not differentially expressed but have a crucial biological function (e.g., as receptors) and therefore reveals biology beyond a standard differential expressed gene analysis. Conclusions The introduced scPPIN method can be used to systematically analyse differentially expressed genes in single-cell RNA sequencing data by integrating it with protein interaction data. The detected modules that characterise each cluster help to identify and hypothesise a biological function associated to those cells. Our analysis suggests the participation of unexpected proteins in these pathways that are undetectable from the single-cell RNA sequencing data alone. The techniques described here are applicable to other organisms and tissues. Supplementary Information The online version contains supplementary material available at (doi:10.1186/s12864-020-07144-2).


Background
Liver metabolism is at the centre of many noncommunicable diseases, such as diabetes and cardiovascular disease [1]. In healthy organisms, the liver is critical for metabolic and immune functions and gene-expression studies have revealed a diverse population of distinct cell types, which include hepatocytes in diverse functional cell states [2]. As diabetes is a complex and heterogenous disease, the study of liver physiology at single-cell resolution helps us to understand the biology [3]. At a single-cell level, however, large-scale protein interaction data is not yet available [4]. The available data are in the form of single cell gene expression levels. Such single cell data provide processing challenges but it is plausible to enhance their analysis through the integration of protein interactions. In this study, we develop SCPPIN a method for the integration of single-cell RNA-sequencing data with complementary PPINs. Our SCPPIN analysis of liver single cell data and PPINs reveals biological pathways in cells of different transcriptional states that hint at inflammatory processes in a subset of hepatocytes.
In recent years, much attention has been given to scRNA-seq techniques as they allow researchers to study and characterise tissues at a single-cell resolution [5][6][7]. Most importantly, scRNA-seq reveals that there exist clusters of cells with similar gene expression profiles, commonly referred to as 'cell states' [8]. Multiple approaches have been created to reveal these cell clusters, driven by the transcriptional profile of each cell [9,10]. Computational tools can identify biomarkers for such cell clusters [11]. Specifically, the analysis of differentially expressed genes (DEGs) between these cell clusters has been shown to reveal different cell types [12], diseased cells [13], and cells that resist drug treatment [14]. Due to technological advances the quality and availability of scRNA-seq data has increased dramatically in the last decade [15]. This makes the development of computational approaches for interpreting scRNA-seq data an active field of research [16] of which one research direction is the identification of gene regulatory networks in scRNA-seq data (e.g., SCENIC [17], PIDC [18]).
These approaches do not make systematic use of available protein-protein interaction data. One can represented such data as PPINs and use PPINs to, for example, identify essential proteins [19][20][21] and to predict disease associations [22,23] or biological functions [24][25][26]. For this, researchers have used tools from network science and machine learning. Many of these methods build on the well-established evidence that in PPINs, proteins with similar biological functions are closely interacting with each other. These groups of proteins with common biological functions are called modules [27,28].
It is understood that gene-expression is context-specific and thus varies between tissues [29], changes over time [30], and differs between healthy and diseased states [31]. It follows therefore that different parts of a PPIN are active under different conditions [32]. Analysing PPINs in an integrated way, together with bulk gene-expression data, provides such biological context, helps to reveal context-specific active functional modules [33,34], and can identify proteins associated with disease [35].
Based on the success of methods where PPINs have been integrated with bulk expression data, we have developed SCPPIN, a novel method to integrate scRNA-seq data with PPINs. It is designed to detect active modules in cells of different transcriptional states. We achieve this by clustering scRNA-seq data, performing a DEGs analysis, constructing node-weighted PPINs, and identifying maximum-weight connected subgraphs with an exact Steiner-tree approach. Our method is applicable to the broad range of organisms for which PPINs are available [36].
The SCPPIN method can be used to analyse mRNA-seq data from any tissue or organ type. As a case study, we investigate scRNA-seq data from human liver spheroids because this tissue is important in many diseases and it is known to have diverse cell types with different cellular metabolic processes. This makes the application of our method particularly relevant, because we expect the identification of very different active modules in different cell clusters -a hypothesis that our investigation partially confirms.
Our method identifies proteins involved in liver metabolism that could not be detected from the scRNAseq data alone. Some of them have been shown to be involved in the liver of other organisms and for others this study is the first indicator of a specific function in liver. Furthermore, we can associate cells in a given transcriptional state with enriched biological pathways. In particular, we find that cell clusters have different biological functions, for example, translational initiation, defence response, and extracellular structure organisation. To test the SCPPIN method on scRNA-seq data from a different tissue, we investigate human adipose tissue in the Supplementary Results 6. We detect other functional modules than in the liver data and also find different biological functions enriched.
This case study demonstrates that SCPPIN provides insights into the context-specific biological function of PPINs. Importantly, these insights would not have been revealed from either data type (PPIN or scRNA-seq) alone. As this technique is, in principle, applicable to a wide range of organism and tissue types, it could reveal functional modules in these, too.

Results
In this paper, we present SCPPIN, a method that detects functional modules in different cell clusters. The method (1) clustering of scRNA-seq data (e.g., with SEURAT [9]). For each cluster, we (2) compute p-values of differential expression and use them to (3) estimate node scores by using an approach presented in [33]. (4) We combine these node scores with a PPIN to construct node-weighted PPINs for each cluster. (5) We compute functional modules as maximum-weight connected subgraphs involves multiple analysis steps (see Fig. 1 for an overview and the "Methods" section for a detailed discussion). First, we preprocess the scRNA-seq profiles. Second, we use an unsupervised clustering technique from SEURAT to identify sets of cells in similar transcriptional states. Third, for each cluster, we identify DEGs using a Wilcoxon rank sum test. Fourth, for each gene in every cluster, we compute additive scores from these p-values (see [33] and Supplementary Note 1). Fifth, for each cluster, we map these gene scores to their corresponding proteins in a PPIN, constructed from publicly available data from BIOGRID [36]. Lastly, we identify functional modules as maximum-weight connected subgraphs in these node-weighted networks. To test whether this integrative analysis of scRNA-seq data with a PPIN has been successful, we compare the detected modules with a 'biological ground truth' in the form of GO-enrichment annotations.
In order to demonstrate SCPPIN, we investigate newly measured scRNA-seq data of liver hepatocytes (see "Methods" section for a description of the experimental setup and preprocessing steps). Using a standard modularity-maximisation algorithm, we obtain ten cell clusters of which seven consist of hepatocytes (see Fig. 2), which make up a majority of the liver tissue. Hepatocytes are known to show a functional diversity and are, for example, involved in the carbohydrate metabolism [2].

Fig. 2 (Left)
Clustering the scRNA-seq data reveals ten clusters of which seven are hepatocyte cells. We visualise the cells in a two-dimensional space as obtained from a t-distributed stochastic neighbour embedding (t-SNE) of their original high-dimensional space [37], which is defined by the expression of the d = 6983 genes. (Middle) For each cluster, we perform DEGs analysis to identify genes that most differentially expressed in a given cluster. Here, we show DEGs for hepatocyte cluster 6 (H6). (Right) We use SCPPIN to integrate the p-values from DEGs analysis with the PPIN and identify a functional module for the H6 cluster. We find genes that are significantly differentially expressed (p < 0.05; disks) and proteins that are not strongly differentially expressed (p ≥ 0.05; squares). Colour indicates p-value from low (white) to high (purple) We focus on hepatocytes because it allows us to study the heterogeneity of cellular function in this single cell type.
We identify DEGs in each of the hepatocyte clusters. In Fig. 2, middle panel, we show the p-values of differential expression for some of the genes in hepatocyte cluster H6. Usually, the top-ranked genes in each of the clusters can be seen as 'marker genes' , i.e., one may use these genes to associate cells with a certain transcriptional state. For H6, for example, protein phosphatase 1 regulatory subunit 3C (PPP1R3C) has the smallest of all p-values, namely 10 −63 . It therefore could serve as a potential biomarker and is a known regulator of liver glycogen metabolism [38]. While such a DEG analysis reveals important genes in certain cell states it is not straightforward to identify the crucial biological pathways. Next, we demonstrate that integrating p-values from a DEGs analysis with PPIN information can reveal a more comprehensive picture of the biological processes. In the right panel of Fig. 2, we show a functional module identified by SCPPIN. We detect a subnetwork consisting of nine proteins. This module consist of seven proteins with small p-values (among them PPP1R3C) that are connected to each other via the amyloid precursor protein (APP) and epidermal growth factor receptor (EGFR), which have p-values ∼ 10 −7 and 10 −2 , respectively. Both proteins are integral membrane proteins and do not show significant differential expression in this cell cluster as they rank 373 and 1800 out of all differentially expressed genes. The EGFR signalling network has been identified as a key player in liver disease [39]. The precise function of APP is unknown but it is involved in Alzheimer's disease and also has been hypothesised to be involved in liver metabolism [40].
These findings demonstrate that SCPPIN can help to automate the further investigation of results from a DEG analysis by identifying parts of the PPIN that correspond to genes that are significantly differentially expressed. Furthermore, it also identifies proteins corresponding to genes that are not significantly differentially expressed in a particular cluster. These genes are candidates of a biological connector function between differentially expressed genes.

Influence of the false discovery rate
We have demonstrated that SCPPIN can reveal functional modules inside a PPIN and associate them with cells of a certain transcriptional state. Now we explore whether there is only one functional module for a given cell state or whether there are functional modules of different sizes. We anticipate in this case that the latter is true, as it has been shown that functional modules may exists at multiple scales [28].
There is only one free parameter in SCPPIN, the false discovery rate (FDR), which is defined as the proportion, out of all genes which are declared significantly differentially expressed, that are false positive and indeed not significantly differently expressed. Given the distribution of p-values of differential expression and a FDR, we can compute node weights that yield the intended FDR (see Supplementary Note 1). Intuitively, increasing the FDR identifies a larger subgraph of the PPIN as an active module. In the following, we explore this systematically, for the hepatocyte cluster H6 that we investigated above.
The size M ∈ [1, N] of the detected modules is nondecreasing with the FDR. While the size M is nondecreasing, our method is non-monotonous, i.e., proteins identified for a certain FDR are not necessarily detected for all larger FDRs. For small FDRs, we detect a module of size M = 1, which is exactly the protein with the smallest p-value 1 . For FDRs close to one, we detect a maximum weight subgraph which is spanning almost the whole network.
In Fig. 3, we show the size M of the optimal subnetworks for cluster H6 as a function of the FDR. As expected, the M(FDR) is non-decreasing. For FDR < 10 −26 , we detect a single node, which represents PPP1R3C, the protein with the smallest p-value (∼ 10 −63 ). For larger FDRs, we detect subnetworks of larger size that contain proteins that are associated with larger p-values and could not have been identified with the gene-expression data alone. For FDR = 10 −25 , for example, we detect the subnetwork of size M = 9 (shown in Fig. 2).
For FDR < 10 −22 , we detect an even larger functional module, which partially overlaps with the one identified for FDR < 10 −23 , as it also includes EGFR as connector between proteins with small p-values. The second connector is ELAV-like protein 1 (ELAVL1) with p ≈ 0.06. The precise function of ELAVL1 is unknown but it is believed to play a role in regulating ferroptosis in liver fibrosis [41]. For even larger FDRs, we identify a module with M = 42 nodes out of which 9 are not identified from the gene-expression data alone. We observe all the before-mentioned connectors, as well as, hepatocellular carcinoma-associated Antigen 88 (ECI2) and S100 calcium-binding protein A4 (S100A4). The latter regulates liver fibrogenesis by activating hepatic stellate cells [42]. Overall, the number of proteins we identify additionally with our method is moderately increasing with the FDR. In Supplementary Note 5, we show these M(FDR) curves for the hepatocyte clusters.

Functional modules for different clusters
In the "Influence of the false discovery rate" section, we investigated the influence of the FDR on the detected modules for a single cell cluster. As we obtained seven hepatocyte clusters (see Fig. 2), we can compute DEGs and The colour of the nodes shows their p-value from low (white) to high (purple). We indicate proteins that would not have been discovered from the gene-expression data alone as squares and give their names in bold red font. (Lower Panel) The size of detected modules (blue disks) depends on the FDR. A large fraction of proteins in these modules have significant p-values (red squares), however, for all FDR> 10 −26 , we also identify additional proteins as for a given FDR, the blue disks are to the right of the red disks thus also active modules for each cluster separately. As each cluster represents a different cell state, different genes are identified by a DEG analysis, which then results in different functional modules. To compare the detected modules, we use SCPPIN with a FDR of 10 −27 for all of them. In Fig. 4, we show the functional modules for six clusters (we omit the largest cluster due to illustration limitation). The detected clusters differ in size, with the largest consisting of 52 nodes (cluster H2) and the smallest consisting only of a single node (Cluster H6). This heterogeneity occurs because the p-values of differential expression are differently distributed for each cluster. Cluster Two has the smallest p-values as its gene expression is most different from those in all other clusters, which indicates a special function of these cells in comparison to the rest. As shown for cluster H6 in Fig. 3, increasing the FDR increases also the size of the detected functional module.
In four out of the six modules, we find proteins that we could not have identified with a DEG analysis alone. For cluster H1, these are APP, ELAVL1, TRIM25, ACTN4, PTEN, KRAS, TFG, and RPL4. For cluster H2, these are VKORC1, APOA1, SNX27, CYCS, ECI2, APP, EGFR, UBE3A, HNRNPL, COPS5, TP53, YWHAE, RCHY1, and TERF2IP. For cluster H3, this is APP. For H5, these are HSPA8 and FN1. We find that APP is identified as part of the active module in three of these clusters, which indicates that this membrane-bound protein may play a role in different biological contexts.
To systematically access these biological contexts, we perform a GO-term enrichment test to assess the hypothesis that the detected modules represent biologically relevant pathways (see Methods). We find that all but the two smallest modules have GO terms enriched (see Table 1). The GO terms hint at distinct biological functions for the different cell clusters. Clusters H1 and H3 are involved in Fig. 4 Detected modules for six hepatocyte clusters for FDR = 10 −27 .We find that the detected modules vary in size, with the smallest consisting of a single protein and the largest consisting of 253 proteins (we omit the largest cluster for visualisation purposes). Colour indicates p-value of associated gene from low (white) to high (purple). We show nodes as squares if they would not have been detected without PPIN information. For the larger modules, we only give the names of these proteins that we would not have detected by a DEG analysis. See SI for an illustration which includes protein names translational initiation, H2 in response to stress, and H5 in the extracellular structure organisation. All of these identified cellular processes represent different hepatocytes functions that have been found in vivo [2,43].
The analysis of different cell states in the scRNA-seq with SCPPIN indicates that genes associated with different parts of the PPIN are active in different transcriptional states. Different biological functions of the cell clusters are reflected by different enriched GO terms. Overall, the integration of scRNA-seq with a PPIN suggests that the cells utilise the underlying PPIN differently to fulfil their diverse biological functions.

Discussion
In this study, we integrated scRNA-seq data with PPINs to construct node-weighted networks. For each cell cluster, detecting a maximum-weight connected subgraph identifies an active module, i.e., proteins that interact with each other and taken together the corresponding genes are significantly differently expressed. Our method SCPPIN builds on advances in DEG analysis, which are standard tools for the interpretation of scRNA-seq data. As a case study, we investigated data from healthy human livers. We find that the seven identified cell clusters have different subnetworks of the PPIN as functional modules in which the corresponding genes exhibiting most significantly changed expression levels. A GO-term enrichment analysis indicates that these are also associated with different biological functions. Furthermore, these subnetworks identify proteins for which the corresponding genes are not differently expressed in a given cluster but do interact with proteins for which the corresponding genes are strongly differentially expressed. These proteins are candidates for regulatory functions in these cells. It is only through our combination of single-cell data with PPIN data that these candidate proteins can be identified. Often, they are integral membrane proteins such as FN1, EGFR, and APP, drivers of cell fate such as P53 and KRAS, or Table 1 For each of the seven clusters we give the three most enriched GO terms and the multiple-testing-corrected p-values that we received from Fisher's exact tests. For these GO terms, we also give the fold-enrichment as the fraction of genes in the module with this annotation over the total number of genes with this annotation in the whole data set proteins of so-far unknown function such as TERF2IP and TFG.
In a more general setting, SCPPIN can be used to systematically analyse DEGs in scRNA-seq data. The identified networks that characterise each cluster help to identify and hypothesise a biological function associated to those cells. For example, we identified the gene S100A4 in the hepatocyte cluster H6. S100A4 has been identified as a key component in the activation of stellated cells in order to promote liver fibrosis [42]. Although previously identified in a population of macrophages [42,44], observing the expression of S100A4 in this cluster of hepatocytes may indicate that a subpopulation of hepatocytes promotes fibrogenesis in paracrine. We also identified the amyloid precursor protein (APP) and interaction partners active in multiple hepatocyte clusters. Although little is known about liver-specific functions of APP, in the central nervous system it is a key driver of Alzheimer's disease, as source of the amyloid-β-peptide (Aβ) [45]. Due to the major role of liver in the clearance of plasma Aβ, it would be interesting to study the contribution of Aβ produced in the liver and the impact in the central nervous system. This systemic view of Alzheimer's disease [46,47] may reveal alternative treatments. A more holistic approach is the comparison of the detected functional modules with disease associations to identify disease modules. We undertake and discuss this approach in Section 6 in the SI.
Despite their success, scRNA-seq techniques have methodological limitations (e.g., zero-inflation [48]). The presented technique might be further improved by considering such specific challenges, e.g., by constructing a different mixture model (see Supplementary Note 1) or implementing an imputation/noise reduction methodology. Furthermore, while we demonstrated exclusively a modularity-based cell clustering, other clustering algorithms might be appropriate, depending on the biological question and the experimental platform [49]. As they may reveal different cell states, SCPPIN may also reveal different functional modules in these.
In this study, we used DEGs as the foundation to identify the active modules in different cell types. We choose this approach because DEG analysis is a common tool for the identification of biomarkers in scRNA-seq experiments [50]. Our scPPIN method, however, could also be used on other statistics derived from scPPIN data, such as, scores indicating the abundance of gene expression.
There is a rich literature of alternative ways to identify active modules in PPINs [51][52][53]. Here, we decided to use a maximum-weight connected subgraph approach because it allows an exact solution and is widely-used for bulk RNA-seq analysis [33]. It is an open question whether other approaches, such as, the maximum clique method [54] or methods integrating gene-coexpression data [55] are also fruitful in the single-cell setting. Active-module detection methods in general could also be explored to identify potential novel protein interactions because there is an association between functional similarity of protein pairs and whether or not they interact [28]. In this study, we used GO terms to computationally test the hypothesis that the detected modules facilitate a joint biological function. Despite its shortcomings, such as annotation bias [56], this is a widely-used approach. In future work experimental knockout studies could be used to test the function of these proteins in vivo in cellular context.
In conclusion, we demonstrate that integrating scRNAseq data with PPINs detects distinct enriched biological pathways and demonstrates a functional heterogeneity of cell clusters in the liver. It suggest the participation of unexpected proteins in these pathways that are undetectable from a gene-expression analysis alone. We provide an R package SCPPIN, so that our method can easily be integrated to current analytical workflows for single cell RNA-seq analysis.

Protein-protein interaction network
We construct a PPIN from the publicly available BIOGRID database [36], version 3.5.166. The obtained network for Homo sapiens has n = 17, 309 nodes and m = 296, 637 undirected, unweighted edges. While the PPIN might be directed and edge-weighted [57] (e.g., considering confidence in an interaction [58]), we consider here exclusively undirected networks without edge weights.

Liver spheroid and bioinformatics
Human primary hepatocytes from a mixture of 10 donors grown in a 3D spheroid, were purchased from InSphero AG (Switzerland) and maintained in the culture medial provided by the company. Single cell libraries were prepared with a 10X Genomics 3' kit and sequenced in an Illumina NextSeq 500. Sequencing data demultiplexing and alignment was carried out with CELLRANGER with default parameters [59]. As a quality control, we only kept cells with between 500 and 6000 genes detected. A total of 2597 cells passed this quality control, of which 2123 are hepatocytes. We identified the cell types by using gene markers as identified in [2,60,61]. The data is publicly available and we make the process available under https:// github.com/floklimm/scPPIN.

Preprocessing
We analyse the scRNA-seq data with the SEURAT R package v2.3.4 [62]. As a preprocessing step, we align the data with a canonical correlation analysis [62] with usage of the first nine dimensions. We identify clusters with the default resolution of one with the function FindClusters.
To identify cell types, we use gene markers expression and in-house reference datasets.
To compute a p-value of differential expression for each obtained cluster, we use the function FindAllMarkers with the argument RETURN.THRESH equal to 1 and LOGFC.THRESHOLD set to 0.0 because we would like to obtain p-values for all genes (significant and nonsignificant ones). For the same reason, we do not employ a threshold for fold-change in gene expression. We exclude genes that are expressed in less than 10 % of a cluster to avoid comparing sparsely expressed genes.

Node-weighted network construction
The SCPPIN pipeline builds on a method for the identification of functional modules as introduced by Dittrich et al. for analysing bulk gene-expression data [33]. Dittrich et al. compute maximum-weight connected subgraphs to find subnetworks that change their expression significantly in a certain disease. Here, we use a similar approach to identify subnetworks that change significantly in different clusters of cells.
Given a network G = {V , E} with node set V and edge set E ⊂ V × V , we construct a node-weighted network G nw = {V , E, W } by assigning each node i ∈ V a realvalued node weight w i , which we represent as a function W : V → R (see Eq. 1). We construct these node-weighted networks from a PPIN and gene-expression information. The former is in the form of a network and the latter are p-values of differential expression. We assume a bijection between genes and proteins, i.e., each protein is expressed by exactly one gene, which is a simplification of the biological processes. We find this bijection by mapping GeneIDs [36].
We delete all nodes from the PPIN for which no geneexpression data is available. In the Supplementary Note 4, we present an alternative approach that can incorporate proteins with missing expression data.
We assign each node a score which is a function of the p-value x and we vary the significance threshold τ to tune the false discovery rate (FDR). We estimate α by fitting a beta-uniform mixture model to the observed p-values (see Supplementary Note 1). This score S(x) is negative for proteins below the significance threshold τ and positive otherwise.

Mathematical optimisation algorithm
Mathematically, the problem of identifying a subnetwork with maximal change of expression is a maximum-weight connected subgraph problem. Algorithmically, it is easier to solve an equivalent prize-collecting Steiner tree (PCST) problem [33]. Steiner trees are generalisations of spanning trees [63] and 'prize-collecting' indicates that the nodes have weights. To find a PCST, we use the dual ascentbased branch-and-bound framework DAPCSTP [64,65]. For all calculations in this paper the algorithm identified an optimal solution in less than 10 s. For details see Supplementary Note 2.