PRANA: an R package for differential co-expression network analysis with the presence of additional covariates
BMC Genomics volume 24, Article number: 687 (2023)
Advances in sequencing technology and cost reduction have enabled an emergence of various statistical methods used in RNA-sequencing data, including the differential co-expression network analysis (or differential network analysis). A key benefit of this method is that it takes into consideration the interactions between or among genes and do not require an established knowledge in biological pathways. As of now, none of existing softwares can incorporate covariates that should be adjusted if they are confounding factors while performing the differential network analysis.
We develop an R package PRANA which a user can easily include multiple covariates. The main R function in this package leverages a novel pseudo-value regression approach for a differential network analysis in RNA-sequencing data. This software is also enclosed with complementary R functions for extracting adjusted p-values and coefficient estimates of all or specific variable for each gene, as well as for identifying the names of genes that are differentially connected (DC, hereafter) between subjects under biologically different conditions from the output.
Herewith, we demonstrate the application of this package in a real data on chronic obstructive pulmonary disease. PRANA is available through the CRAN repositories under the GPL-3 license: https://cran.r-project.org/web/packages/PRANA/index.html.
The RNA-sequencing (RNA-Seq) leverages the rapid breakthroughs of the next-generation sequencing platform for profiling high-quality gene expression. Over the span of years, the RNA-Seq has emerged as an alternative to other gold standard techniques in transcriptomes [1, 2]. In contrast to microarrays, RNA-Seq achieves a higher resolution and lower technical variability [2,3,4] which leads to a higher reproducibility . Another advantage of RNA-Seq relative to previously developed transcriptomic sequencing methods is that it has the ability to track transcriptomic dynamics (or gene expression changes) of tissues during physiological changes [5, 6], which thus allows a comparison of biological samples from patients with or without a specific disease or condition.
In response to these advantages, a vast number of statistical methods have become available to elucidate the genes or biological pathways associated with biological conditions or health outcomes, such as differential expression (DE) analysis [7, 8] and pathway enrichment (PE) analysis [9,10,11] of read counts (or gene expression) of an RNA-Seq data. However, it can be argued that results of DE analysis may provide limited information with the increased evidence that genes work in conjunction each other [12, 13]. The PE analysis appears to be a useful complement to the analysis of DE. The fundamental hypothesis in a PE analysis is that genes are regulated under common biological processes and clustered as a ‘pathway’ [13, 14], which borrows a priori pathway knowledge from the public repositories, namely, Gene Ontology , Kyoto Encyclopedia of Genes and Genomes , or Reactome . To put it another way, PE analysis is primarily restricted to its use in a reference collection of well-studied biological processes only. Thereby, the idea of ‘network’ is introduced to pursue the veiled information that are obscured in those well-defined pathways .
The differential network (DN) analysis provides novel insights for identifying changes in gene-gene interactions under different biological conditions . In theory, such changes are assessed through a comparison in characteristics of a network structure (i.e., network topology) between two or more networks that are perturbed by a specific biological condition such as the development of cancer.
Despite the growing popularity, none of existing methods [20,21,22] fully addresses how to adjust for additional covariates (e.g. patient-age, patient-reported family histories, and other comorbidities) that may be associated with network topology.
Recently, we have adopted a pseudo-value regression  that allows covariate adjustment for the DN analysis while maintaining a high precision and recall values via a Monte Carlo simulation comparing with other methods available in R packages such as DINGO  and dnapath . In addition, the computation time of this approach was shown competitive. To date, this is the first attempt of statistical method for the DN analysis with the inclusion of additional covariates.
In this article, we describe the software built as an R package, namely PRANA (Pseudo-value Regression Approach in Network Analysis). PRANA is tailored to incorporate additional covariates information that may be associated with measures of connectivity of a gene (i.e. centrality) and a binary group indicator. This differs from previous statistical framework (or softwares) in DN analysis such as dnapath and DINGO.
The algorithm below summarizes how the pseudo-value regression approach is embedded in a function named with PRANA. Briefly, the association measures are marginal quantities, such as degree centralities of each gene. Through the use of jackknife pseudo-values , we find the contribution of each individual data point to these quantities. Therefore, we could regress them on additional covariates as shown in studies with multi-state survival data [25, 26]. More details on methodological aspects are fully described elsewhere .
Details of functions in PRANA
The main R function to perform the pseudo-value regression for the DN analysis with additional covariates is PRANA. The PRANA function imports two R scripts for the calculation of (1) total connectivity of an association matrix estimated from an observed expression data (as in thetahats function) and (2) adjusted p-values with the empirical Bayes screening procedure (as in EBS function) . A list of three data.frame objects (coefficient estimates, p-values, and adjusted p-values of each predictor variable included in the regression for each gene) are returned upon the execution of PRANA function.
For user convenience, we provide six additional R functions for extracting adjusted p-values (adjpval, adjpval_specific_var), coefficient estimates (coeff, coeff_specific_var), and genes that are significantly DC (sigDCGnames, sigDCGtab) from the output from PRANA function.
The PRANA package is fully implemented in R statistical programming language. The package depends on the base R packages (parallel, stats) and other R packages from the Comprehensive R Archive Network library (CRAN; dnapath, dplyr, robustbase) and Bioconductor (minet). Of important note, minet package should be directly installed from Bioconductor for a full utilization of PRANA package. This can be done by executing the code below in the R console.
In this section, we illustrate how PRANA can be applied in practice using the sample dataset available from the package. This case study is the same as the one analyzed in our methodology paper .
The PRANA package includes a sample dataset named combinedCOPDdat_RGO with 406 samples. combinedCOPDdat_RGO consists of an RNA-Seq expression data for 28 genes that were spotlighted as associated with the chronic obstructive pulmonary disease (COPD) from a genome-wide association study . It is a subset of the original study stored in the Gene Expression Omnibus (GEO) database with the accession number GSE158699 . In this sample dataset, a phenotype data on six clinical and demographic variables is also available: current smoking status (main grouping variable), smoking pack years, age, gender, race, and FEV1 percent. The user can call the sample data into R by executing the following code:
Alternatively, the user can also assign the data to an object by running the code below:
The PRANA function requires a user to provide each expression and phenotype data separately.
The main predictor variable in this example analysis is the current smoking status. As discussed in the Algorithm subsection, the estimation of association matrices (or networks) and the calculation of jackknife pseudo-values are carried out for each group separately. Hence, we add another step that locates the indices of subjects from each ‘current’ vs. ‘non-current smokers’ group. These indices are used to dichotomize expression dataset into ‘non-current smokers (Group A)’ and ‘current (Group B).’
Apply PRANA function for DN analysis with additional covariates
Once the data processing is complete, a user can perform a DN analysis with additional covariates. PRANA function takes an expression and phenotype data, separately, in which a user specifies each for RNASeqdat and clindat, respectively. To be more specific, the variables included in phenotype data are included in the regression. In addition, the group-specific indices for the main binary indicator variable are provided as groupA and groupB within the function.
The output of the PRANA function is a list containing three data.frame objects for coefficient estimates, p-values, and adjusted p-values of all covariates included in the fitted model for each gene. Results are shown as following:
Some supporting functions
The package offers some auxiliary features. A user can get a table of adjusted p-values and coefficient estimates for all variables with adjptab and coeff functions as following:
Suppose, for instance, we are interested in looking at the adjusted p-values for the current smoking status variable instead of a table with all variables. adjpval_specific_var function is available for that purpose:
Similarly, coeff_specific_var function can be executed to return a coefficient estimate for a specific variable (current smoking status in the example below). A cautionary note is that the user must provide the name of a variable as in varname within each adjpval_specific_var or coeff_specific_var functions.
Additionally, sigDCGtab and sigDCGnames functions take a data.frame object as an input, defined by adjpval function earlier, to output the names of DC genes (i.e. NCBI Entrez gene IDs in the first column) for the main binary grouping variable utilized for the DN analysis, as well as corresponding adjusted p-values. sigDCGnames returns the names of DC genes only. A user may adjust the level of significance (alpha), which is set to 0.05 by default. Please see the following commands below:
As a result, PRANA identified 23 genes that are significantly DC between current and non-current smokers while accounting for additional covariates such as smoking pack years, age, gender, race, and FEV1 percent.
As an additional step, a user can utilize rename_genes function from the dependency package (dnapath) to rename results with Entrez gene IDs into gene symbols. See below for the demonstration in R console. Results are summarized in Table 1.
The R package PRANA has been published in the CRAN (https://cran.r-project.org/web/packages/PRANA/index.html). This package has no operating system dependencies. A vignette is available on this package at https://cran.r-project.org/web/packages/PRANA/vignettes/UserManualPRANA.html or can be accessed by typing in an R console (browseVignettes(package="PRANA")). In this package, the sample dataset is provided with COPD-related genes, as well as clinical and demographic variables. The source code of the package can be found in GitHub: https://github.com/sjahnn/PRANA.
PRANA has some plans for future development. Firstly, although a user may attempt a classical regression-based variable selection such as stepwise selection, we have not yet validated this through a statistical simulation experiment. Secondly, the names of genes provided in the sample dataset are Entrez gene IDs. Further extension will include a function that convert from these gene IDS to gene symbols (i.e. 10370 to CITED2) and vice versa for user convenience.
In conclusion, PRANA is a user-friendly and novel regression-based method that accounts for additional covariates along with the main binary grouping variable for the DN analysis.
The differential network analysis identifies changes in measures of associations between genes under different biological conditions. Although there has been increasing volume of work in this subject, overall covariate adjustment remains underexplored. In this paper, we present PRANA, the first R package that adjusts for additional covariates for the differential network analysis. As a brief note on the usage, PRANA takes RNA-sequencing and phenotype data (metadata) as inputs and in return tables containing DC gene names and their corresponding adjusted p-values are produced for a main binary grouping variable to be adjusted with the presence of additional covariates. This software is easy to install and user-friendly.
Availability and requirements
Project name: PRANA
Project home page: https://cran.r-project.org/web/packages/PRANA/index.html
Operating system(s): Platform independent
Programming language: R
Other requirements: Install dnapath, dplyr, robustbase, and minetR packages
License: GNU GPL-3
Any restrictions to use by non-academics: No restrictions
Availability of data and materials
PRANA is freely available at (https://cran.r-project.org/web/packages/PRANA/index.html). The COPDGene data is available in the GEO database with accession number GSE158699 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE158699). Please reach out to the maintainer (Seungjun Ahn, firstname.lastname@example.org) if you have any further inquiries on the data or code.
Chronic obstructive pulmonary disease
Gene Expression Omnibus
Kukurba KR, Montgomery SB. RNA Sequencing and Analysis. Cold Spring Harb Protoc. 2015;11:951–69.
Zhao S, Fung-Leung WP, Bittner A, Ngo K, Liu X. Comparison of RNA-Seq and microarray in transcriptome profiling of activated T cells. PLoS ONE. 2014;9:78644.
Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y. RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 2008;18(9):1509–17.
Han Y, Gao S, Muegge K, Zhang W, Zhou B. Advanced Applications of RNA Sequencing and Challenges. Bioinform Biol Insights. 2015;9(Suppl 1):29–46.
Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009;10(1):57–63.
Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008;5:621–8.
Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015;43(7):e47.
Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26(1):139–40.
Ulgen E, Ozisik O, Sezerman OU. pathfindR: an R package for comprehensive identification of enriched pathways in omics data through active subnetworks. Front Genet. 2019;10:858.
Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci. 2005;102(43):15545–50.
Yu G, Wang LG, Han Y, He QY. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS. 2012;16(5):284–7.
de la Fuente A. From ‘differential expression’ to ‘differential networking’ - identification of dysfunctional regulatory networks in diseases. Trends Genet. 2010;26(7):326–33.
Khatri P, Sirota M, Butte AJ. Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput Biol. 2012;8(2):1002375.
Stanford BCM, Clake DJ, Morris MRJ, Rogers SM. The power and limitations of gene expression pathway analyses toward predicting population response to environmental stressors. Evol Appl. 2020;13(6):1166–82.
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium Nat Genet. 2000;25(1):25–9.
Kanehisa M, Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;28(1):27–30.
Joshi-Tope G, Gillespie M, Vastrik I, D’Eustachio P, Schmidt E, de Bono B, et al. Reactome: a knowledgebase of biological pathways. Nucleic Acids Res. 2005;33(Database issue):428–32.
Creixell P, Reimand J, Haider S, Wu G, Shibata T, Vazquez M, et al. Pathway and network analysis of cancer genomes. Nat Methods. 2015;12(7):615–21.
Shojaie A. Differential network analysis: a statistical perspective. WIREs Comput Stat. 2021;13:1508.
Ha MJ, Baladandayuthapani V, Do KA. DINGO: differential network analysis in genomics. Bioinformatics. 2015;31(21):3413–20.
McKenzie AT, Katsyv I, Song WM, Wang M, Zhang B. DGCA: a comprehensive R package for differential gene correlation analysis. BMC Syst Biol. 2016;10(1):106.
Grimes T, Potter SS, Datta S. Integrating gene regulatory pathways into differential network analysis of gene expression data. Sci Rep. 2019;9(1):1–12.
Ahn S, Grimes T, Datta S. A Pseudo-Value Regression Approach for Differential Network Analysis of Co-Expression Data. BMC Bioinformatics. 202;24(1):8.
Efron B, Tibshirani RJ. An Introduction to the Bootstrap. Philadelphia: Chapman & Hall/CRC; 1993.
Logan BR, Zhang MJ, Klein JP. Marginal models for clustered time-to-event data with competing risks using pseudovalues. Biometrics. 2011;67:1–7.
Klein JP, Gerster M, Andersen PK, Tarima S, Perme MP. SAS and R functions to compute pseudo-values for censored data regression. Comput Methods Programs Biomed. 2008;89:289–300.
Margolin AA, Nemenman I, Basso K, Wiggins C, Stolovitzky G, Dalla Favera R, et al. ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics. 2006;7(Suppl 1):S7.
Datta S, Datta S. Empirical Bayes screening of many p-values with applications to microarray studies. Bioinformatics. 2005;21(9):1987–94.
Sakornsakolpat P, Prokopenko D, Lamontagne M, Reeve N, Guyatt A, Jackson V, et al. Genetic landscape of chronic obstructive pulmonary disease identifies heterogeneous cell-type and phenotype associations. Nat Genet. 2019;51(3):494–505.
Wang Z, Masoomi A, Xu Z, Boueiz A, Lee S, Zhao T. Improved prediction of smoking status via isoform-aware RNA-seq deep learning models. PLoS Comput Biol. 2021;17(10):1009433.
The authors would like to thank the maintainers at the Comprehensive R Archive Network (CRAN). It is our special thanks to the investigators (Wang Z and Castaldi P) who have graciously shared their data publicly available in the GEO database.
Research reported in this publication was supported in part by the National Cancer Institute Cancer Center Support Grant [NIH P30CA196521-01] awarded to the Tisch Cancer Institute of the Icahn School of Medicine at Mount Sinai and used the Biostatistics Shared Resource Facility. The content is solely the responsibility of S.A. and does not necessarily represent the official views of the National Institutes of Health.
Ethics approval and consent to participate
Consent for publication
The authors declare no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Ahn, S., Datta, S. PRANA: an R package for differential co-expression network analysis with the presence of additional covariates. BMC Genomics 24, 687 (2023). https://doi.org/10.1186/s12864-023-09787-3