Skip to main content

PRANA: an R package for differential co-expression network analysis with the presence of additional covariates

Abstract

Background

Advances in sequencing technology and cost reduction have enabled an emergence of various statistical methods used in RNA-sequencing data, including the differential co-expression network analysis (or differential network analysis). A key benefit of this method is that it takes into consideration the interactions between or among genes and do not require an established knowledge in biological pathways. As of now, none of existing softwares can incorporate covariates that should be adjusted if they are confounding factors while performing the differential network analysis.

Results

We develop an R package PRANA which a user can easily include multiple covariates. The main R function in this package leverages a novel pseudo-value regression approach for a differential network analysis in RNA-sequencing data. This software is also enclosed with complementary R functions for extracting adjusted p-values and coefficient estimates of all or specific variable for each gene, as well as for identifying the names of genes that are differentially connected (DC, hereafter) between subjects under biologically different conditions from the output.

Conclusion

Herewith, we demonstrate the application of this package in a real data on chronic obstructive pulmonary disease. PRANA is available through the CRAN repositories under the GPL-3 license: https://cran.r-project.org/web/packages/PRANA/index.html.

Peer Review reports

Background

The RNA-sequencing (RNA-Seq) leverages the rapid breakthroughs of the next-generation sequencing platform for profiling high-quality gene expression. Over the span of years, the RNA-Seq has emerged as an alternative to other gold standard techniques in transcriptomes [1, 2]. In contrast to microarrays, RNA-Seq achieves a higher resolution and lower technical variability [2,3,4] which leads to a higher reproducibility [5]. Another advantage of RNA-Seq relative to previously developed transcriptomic sequencing methods is that it has the ability to track transcriptomic dynamics (or gene expression changes) of tissues during physiological changes [5, 6], which thus allows a comparison of biological samples from patients with or without a specific disease or condition.

In response to these advantages, a vast number of statistical methods have become available to elucidate the genes or biological pathways associated with biological conditions or health outcomes, such as differential expression (DE) analysis [7, 8] and pathway enrichment (PE) analysis [9,10,11] of read counts (or gene expression) of an RNA-Seq data. However, it can be argued that results of DE analysis may provide limited information with the increased evidence that genes work in conjunction each other [12, 13]. The PE analysis appears to be a useful complement to the analysis of DE. The fundamental hypothesis in a PE analysis is that genes are regulated under common biological processes and clustered as a ‘pathway’ [13, 14], which borrows a priori pathway knowledge from the public repositories, namely, Gene Ontology [15], Kyoto Encyclopedia of Genes and Genomes [16], or Reactome [17]. To put it another way, PE analysis is primarily restricted to its use in a reference collection of well-studied biological processes only. Thereby, the idea of ‘network’ is introduced to pursue the veiled information that are obscured in those well-defined pathways [18].

The differential network (DN) analysis provides novel insights for identifying changes in gene-gene interactions under different biological conditions [19]. In theory, such changes are assessed through a comparison in characteristics of a network structure (i.e., network topology) between two or more networks that are perturbed by a specific biological condition such as the development of cancer.

Despite the growing popularity, none of existing methods [20,21,22] fully addresses how to adjust for additional covariates (e.g. patient-age, patient-reported family histories, and other comorbidities) that may be associated with network topology.

Recently, we have adopted a pseudo-value regression [23] that allows covariate adjustment for the DN analysis while maintaining a high precision and recall values via a Monte Carlo simulation comparing with other methods available in R packages such as DINGO [20] and dnapath [22]. In addition, the computation time of this approach was shown competitive. To date, this is the first attempt of statistical method for the DN analysis with the inclusion of additional covariates.

In this article, we describe the software built as an R package, namely PRANA (Pseudo-value Regression Approach in Network Analysis). PRANA is tailored to incorporate additional covariates information that may be associated with measures of connectivity of a gene (i.e. centrality) and a binary group indicator. This differs from previous statistical framework (or softwares) in DN analysis such as dnapath and DINGO.

Implementation

Algorithm

The algorithm below summarizes how the pseudo-value regression approach is embedded in a function named with PRANA. Briefly, the association measures are marginal quantities, such as degree centralities of each gene. Through the use of jackknife pseudo-values [24], we find the contribution of each individual data point to these quantities. Therefore, we could regress them on additional covariates as shown in studies with multi-state survival data [25, 26]. More details on methodological aspects are fully described elsewhere [23].

Algorithm 1 PRANA: Pseudo-value Regression Approach in Network Analysis

Details of functions in PRANA

Main function

The main R function to perform the pseudo-value regression for the DN analysis with additional covariates is PRANA. The PRANA function imports two R scripts for the calculation of (1) total connectivity of an association matrix estimated from an observed expression data (as in thetahats function) and (2) adjusted p-values with the empirical Bayes screening procedure (as in EBS function) [28]. A list of three data.frame objects (coefficient estimates, p-values, and adjusted p-values of each predictor variable included in the regression for each gene) are returned upon the execution of PRANA function.

Supporting functions

For user convenience, we provide six additional R functions for extracting adjusted p-values (adjpval, adjpval_specific_var), coefficient estimates (coeff, coeff_specific_var), and genes that are significantly DC (sigDCGnames, sigDCGtab) from the output from PRANA function.

Dependencies

The PRANA package is fully implemented in R statistical programming language. The package depends on the base R packages (parallel, stats) and other R packages from the Comprehensive R Archive Network library (CRAN; dnapath, dplyr, robustbase) and Bioconductor (minet). Of important note, minet package should be directly installed from Bioconductor for a full utilization of PRANA package. This can be done by executing the code below in the R console.

figure b

Results

In this section, we illustrate how PRANA can be applied in practice using the sample dataset available from the package. This case study is the same as the one analyzed in our methodology paper [23].

Sample dataset

The PRANA package includes a sample dataset named combinedCOPDdat_RGO with 406 samples. combinedCOPDdat_RGO consists of an RNA-Seq expression data for 28 genes that were spotlighted as associated with the chronic obstructive pulmonary disease (COPD) from a genome-wide association study [29]. It is a subset of the original study stored in the Gene Expression Omnibus (GEO) database with the accession number GSE158699 [30]. In this sample dataset, a phenotype data on six clinical and demographic variables is also available: current smoking status (main grouping variable), smoking pack years, age, gender, race, and FEV1 percent. The user can call the sample data into R by executing the following code:

figure c

Alternatively, the user can also assign the data to an object by running the code below:

figure d

Data processing

The PRANA function requires a user to provide each expression and phenotype data separately.

figure e

The main predictor variable in this example analysis is the current smoking status. As discussed in the Algorithm subsection, the estimation of association matrices (or networks) and the calculation of jackknife pseudo-values are carried out for each group separately. Hence, we add another step that locates the indices of subjects from each ‘current’ vs. ‘non-current smokers’ group. These indices are used to dichotomize expression dataset into ‘non-current smokers (Group A)’ and ‘current (Group B).’

figure f

Apply PRANA function for DN analysis with additional covariates

Once the data processing is complete, a user can perform a DN analysis with additional covariates. PRANA function takes an expression and phenotype data, separately, in which a user specifies each for RNASeqdat and clindat, respectively. To be more specific, the variables included in phenotype data are included in the regression. In addition, the group-specific indices for the main binary indicator variable are provided as groupA and groupB within the function.

figure g

The output of the PRANA function is a list containing three data.frame objects for coefficient estimates, p-values, and adjusted p-values of all covariates included in the fitted model for each gene. Results are shown as following:

figure h

Some supporting functions

The package offers some auxiliary features. A user can get a table of adjusted p-values and coefficient estimates for all variables with adjptab and coeff functions as following:

figure i

Suppose, for instance, we are interested in looking at the adjusted p-values for the current smoking status variable instead of a table with all variables. adjpval_specific_var function is available for that purpose:

figure j

Similarly, coeff_specific_var function can be executed to return a coefficient estimate for a specific variable (current smoking status in the example below). A cautionary note is that the user must provide the name of a variable as in varname within each adjpval_specific_var or coeff_specific_var functions.

figure k

Additionally, sigDCGtab and sigDCGnames functions take a data.frame object as an input, defined by adjpval function earlier, to output the names of DC genes (i.e. NCBI Entrez gene IDs in the first column) for the main binary grouping variable utilized for the DN analysis, as well as corresponding adjusted p-values. sigDCGnames returns the names of DC genes only. A user may adjust the level of significance (alpha), which is set to 0.05 by default. Please see the following commands below:

figure l

As a result, PRANA identified 23 genes that are significantly DC between current and non-current smokers while accounting for additional covariates such as smoking pack years, age, gender, race, and FEV1 percent.

As an additional step, a user can utilize rename_genes function from the dependency package (dnapath) to rename results with Entrez gene IDs into gene symbols. See below for the demonstration in R console. Results are summarized in Table 1.

figure m
Table 1 Results of DC genes obtained from PRANA. The sample dataset contains the NCBI Entrez gene IDs, so does the resulted DC genes (first column). dnapath::rename_genes is utilized to rename Entrez gene IDs to gene symbol (second column)

Discussion

The R package PRANA has been published in the CRAN (https://cran.r-project.org/web/packages/PRANA/index.html). This package has no operating system dependencies. A vignette is available on this package at https://cran.r-project.org/web/packages/PRANA/vignettes/UserManualPRANA.html or can be accessed by typing in an R console (browseVignettes(package="PRANA")). In this package, the sample dataset is provided with COPD-related genes, as well as clinical and demographic variables. The source code of the package can be found in GitHub: https://github.com/sjahnn/PRANA.

PRANA has some plans for future development. Firstly, although a user may attempt a classical regression-based variable selection such as stepwise selection, we have not yet validated this through a statistical simulation experiment. Secondly, the names of genes provided in the sample dataset are Entrez gene IDs. Further extension will include a function that convert from these gene IDS to gene symbols (i.e. 10370 to CITED2) and vice versa for user convenience.

In conclusion, PRANA is a user-friendly and novel regression-based method that accounts for additional covariates along with the main binary grouping variable for the DN analysis.

Conclusions

The differential network analysis identifies changes in measures of associations between genes under different biological conditions. Although there has been increasing volume of work in this subject, overall covariate adjustment remains underexplored. In this paper, we present PRANA, the first R package that adjusts for additional covariates for the differential network analysis. As a brief note on the usage, PRANA takes RNA-sequencing and phenotype data (metadata) as inputs and in return tables containing DC gene names and their corresponding adjusted p-values are produced for a main binary grouping variable to be adjusted with the presence of additional covariates. This software is easy to install and user-friendly.

Availability and requirements

Project name: PRANA

Project home page: https://cran.r-project.org/web/packages/PRANA/index.html

Operating system(s): Platform independent

Programming language: R

Other requirements: Install dnapath, dplyr, robustbase, and minetR packages

License: GNU GPL-3

Any restrictions to use by non-academics: No restrictions

Availability of data and materials

PRANA is freely available at (https://cran.r-project.org/web/packages/PRANA/index.html). The COPDGene data is available in the GEO database with accession number GSE158699 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE158699). Please reach out to the maintainer (Seungjun Ahn, seungjun.ahn@mountsinai.org) if you have any further inquiries on the data or code.

Abbreviations

COPD:

Chronic obstructive pulmonary disease

DC:

Differentially connected

DN:

Differential network

GEO:

Gene Expression Omnibus

References

  1. Kukurba KR, Montgomery SB. RNA Sequencing and Analysis. Cold Spring Harb Protoc. 2015;11:951–69.

    Google Scholar 

  2. Zhao S, Fung-Leung WP, Bittner A, Ngo K, Liu X. Comparison of RNA-Seq and microarray in transcriptome profiling of activated T cells. PLoS ONE. 2014;9:78644.

    Article  Google Scholar 

  3. Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y. RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 2008;18(9):1509–17.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  4. Han Y, Gao S, Muegge K, Zhang W, Zhou B. Advanced Applications of RNA Sequencing and Challenges. Bioinform Biol Insights. 2015;9(Suppl 1):29–46.

    CAS  PubMed  PubMed Central  Google Scholar 

  5. Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009;10(1):57–63.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  6. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008;5:621–8.

    Article  CAS  PubMed  Google Scholar 

  7. Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015;43(7):e47.

  8. Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26(1):139–40.

    Article  CAS  PubMed  Google Scholar 

  9. Ulgen E, Ozisik O, Sezerman OU. pathfindR: an R package for comprehensive identification of enriched pathways in omics data through active subnetworks. Front Genet. 2019;10:858.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  10. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci. 2005;102(43):15545–50.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  11. Yu G, Wang LG, Han Y, He QY. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS. 2012;16(5):284–7.

    Article  CAS  PubMed Central  Google Scholar 

  12. de la Fuente A. From ‘differential expression’ to ‘differential networking’ - identification of dysfunctional regulatory networks in diseases. Trends Genet. 2010;26(7):326–33.

    Article  PubMed  Google Scholar 

  13. Khatri P, Sirota M, Butte AJ. Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput Biol. 2012;8(2):1002375.

    Article  Google Scholar 

  14. Stanford BCM, Clake DJ, Morris MRJ, Rogers SM. The power and limitations of gene expression pathway analyses toward predicting population response to environmental stressors. Evol Appl. 2020;13(6):1166–82.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  15. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium Nat Genet. 2000;25(1):25–9.

    CAS  PubMed  Google Scholar 

  16. Kanehisa M, Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;28(1):27–30.

    Article  CAS  PubMed  Google Scholar 

  17. Joshi-Tope G, Gillespie M, Vastrik I, D’Eustachio P, Schmidt E, de Bono B, et al. Reactome: a knowledgebase of biological pathways. Nucleic Acids Res. 2005;33(Database issue):428–32.

  18. Creixell P, Reimand J, Haider S, Wu G, Shibata T, Vazquez M, et al. Pathway and network analysis of cancer genomes. Nat Methods. 2015;12(7):615–21.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  19. Shojaie A. Differential network analysis: a statistical perspective. WIREs Comput Stat. 2021;13:1508.

  20. Ha MJ, Baladandayuthapani V, Do KA. DINGO: differential network analysis in genomics. Bioinformatics. 2015;31(21):3413–20.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  21. McKenzie AT, Katsyv I, Song WM, Wang M, Zhang B. DGCA: a comprehensive R package for differential gene correlation analysis. BMC Syst Biol. 2016;10(1):106.

  22. Grimes T, Potter SS, Datta S. Integrating gene regulatory pathways into differential network analysis of gene expression data. Sci Rep. 2019;9(1):1–12.

    Article  CAS  Google Scholar 

  23. Ahn S, Grimes T, Datta S. A Pseudo-Value Regression Approach for Differential Network Analysis of Co-Expression Data. BMC Bioinformatics. 202;24(1):8.

  24. Efron B, Tibshirani RJ. An Introduction to the Bootstrap. Philadelphia: Chapman & Hall/CRC; 1993.

    Book  Google Scholar 

  25. Logan BR, Zhang MJ, Klein JP. Marginal models for clustered time-to-event data with competing risks using pseudovalues. Biometrics. 2011;67:1–7.

    Article  PubMed  PubMed Central  Google Scholar 

  26. Klein JP, Gerster M, Andersen PK, Tarima S, Perme MP. SAS and R functions to compute pseudo-values for censored data regression. Comput Methods Programs Biomed. 2008;89:289–300.

    Article  PubMed  PubMed Central  Google Scholar 

  27. Margolin AA, Nemenman I, Basso K, Wiggins C, Stolovitzky G, Dalla Favera R, et al. ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics. 2006;7(Suppl 1):S7.

  28. Datta S, Datta S. Empirical Bayes screening of many p-values with applications to microarray studies. Bioinformatics. 2005;21(9):1987–94.

    Article  CAS  PubMed  Google Scholar 

  29. Sakornsakolpat P, Prokopenko D, Lamontagne M, Reeve N, Guyatt A, Jackson V, et al. Genetic landscape of chronic obstructive pulmonary disease identifies heterogeneous cell-type and phenotype associations. Nat Genet. 2019;51(3):494–505.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  30. Wang Z, Masoomi A, Xu Z, Boueiz A, Lee S, Zhao T. Improved prediction of smoking status via isoform-aware RNA-seq deep learning models. PLoS Comput Biol. 2021;17(10):1009433.

    Article  Google Scholar 

Download references

Acknowledgements

The authors would like to thank the maintainers at the Comprehensive R Archive Network (CRAN). It is our special thanks to the investigators (Wang Z and Castaldi P) who have graciously shared their data publicly available in the GEO database.

Funding

Research reported in this publication was supported in part by the National Cancer Institute Cancer Center Support Grant [NIH P30CA196521-01] awarded to the Tisch Cancer Institute of the Icahn School of Medicine at Mount Sinai and used the Biostatistics Shared Resource Facility. The content is solely the responsibility of S.A. and does not necessarily represent the official views of the National Institutes of Health.

Author information

Authors and Affiliations

Authors

Contributions

S.D. conceived the original idea of integrating jackknife pseudo-values to differential co-expression network analysis. S.A. developed the methodology, completed statistical programming in R and performed data analyses of the study. S.A. drafted the manuscript. S.D. provided suggestions when writing the manuscript. All authors have reviewed and edited the manuscript.

Corresponding author

Correspondence to Seungjun Ahn.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Ahn, S., Datta, S. PRANA: an R package for differential co-expression network analysis with the presence of additional covariates. BMC Genomics 24, 687 (2023). https://doi.org/10.1186/s12864-023-09787-3

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s12864-023-09787-3

Keywords