CeTF: an R/Bioconductor package for transcription factor co-expression networks using regulatory impact factors (RIF) and partial correlation and information (PCIT) analysis
BMC Genomics volume 22, Article number: 624 (2021)
Finding meaningful gene-gene interaction and the main Transcription Factors (TFs) in co-expression networks is one of the most important challenges in gene expression data mining.
Here, we developed the R package “CeTF” that integrates the Partial Correlation with Information Theory (PCIT) and Regulatory Impact Factors (RIF) algorithms applied to gene expression data from microarray, RNA-seq, or single-cell RNA-seq platforms. This approach allows identifying the transcription factors most likely to regulate a given network in different biological systems — for example, regulation of gene pathways in tumor stromal cells and tumor cells of the same tumor. This pipeline can be easily integrated into the high-throughput analysis. To demonstrate the CeTF package application, we analyzed gastric cancer RNA-seq data obtained from TCGA (The Cancer Genome Atlas) and found the HOXB3 gene as the second most relevant TFs with a high regulatory impact (TFs-HRi) regulating gene pathways in the cell cycle.
This preliminary finding shows the potential of CeTF to list master regulators of gene networks. CeTF was designed as a user-friendly tool that provides many highly automated functions without requiring the user to perform many complicated processes. It is available on Bioconductor (http://bioconductor.org/packages/CeTF) and GitHub (http://github.com/cbiagii/CeTF).
Transcriptome analysis has become crucial to identify gene circuits involved in regulating cancer hallmarks . One of the intelligent ways to explore this type of data and obtain biologically relevant information about the mechanisms involved in modulating gene circuits is the inference of gene regulatory networks (GRNs). Conceptually, we can define GRN as the reconstruction of gene networks from gene expression data, revealing the connection of transcription factors (TFs) with their targets , aiming to highlight which gene interactions are the most relevant to the study. Despite the plethora of tools, new methods are needed to assess all possible interactions and their significance . Besides, the presence of TFs in interactions for gene-to-gene is functionally crucial because they may be playing an essential regulatory role in biological processes . TFs are considered key molecules that can regulate the expression of one or more genes in a biological system, thus determining how cells function and communicate with cellular environments . Furthermore, integrating genome-scale and network generation with the identification of main TFs provides new insights into their data function. In this article, we provide an R package that enables performing the Regulatory Impact Factors (RIF) and Partial Correlation with Information Theory (PCIT) analysis separately, or by applying the full pipeline.
We, therefore, developed an R package called CeTF, which would not only apply the RIF and PCIT analysis, but would also perform network diffusion analysis, generate circos plots for specifics TFs/genes, functional enrichment for network conditions, and others features. The biggest advantage is that the package is intuitive to use, and the main functions are written in C/C++, which provides faster analysis for large data.
CeTF is an C/C++ implementation in R for PCIT  and RIF  algorithms, which initially were made in FORTRAN language. From these two algorithms, it was possible to integrate them in order to increase performance and Results. Input data may come from microarray, RNA-seq, or single-cell RNA-seq. The input data can be read counts or expressions (TPM, FPKM, normalized values, etc.). The main pipeline (Fig. 1) consists of the following steps.
If the input data is a count table, data will be converted to TPM by each column (x) as follows:
The mean for TPM values different than zero and the mean values for each gene are used as a threshold to filter the genes. Genes with values above half of the previous averages will be considered for subsequent analyses. Then, the TPM data is normalized using:
If the input already has normalized expression data (TPM, FPKM, etc), the only step will be the same filter for genes that consider half of the means.
Differential expression analysis
There are two options for differential analysis of the gene expression, the Reverter method  and DESeq2 . In both methods, two conditions are required (i.e., control vs. tumor samples). In the Reverter method, the mean between samples of each condition for each gene is calculated. Then, subtraction is made between the mean of one condition concerning the other conditions. The variance of the subtraction is performed, then is calculated the difference of expression using the following formula, where s is the result of subtraction and var is the variance:
The DESeq2 method applies the Differential expression analysis based on the negative binomial distribution. Although both methods can be used on count data, it is strongly recommended to use only the Reverter method on expression input data.
Regulatory impact factors (RIF) analysis
The RIF algorithm is well described in the original paper . This step aims to identify critical Transcription Factors calculating for each condition the co-expression correlation between the TFs and the Differentially Expressed (DE) genes (from previously item). The result is RIF1 and RIF2 metrics that allow the identification of critical TFs. The RIF1 metric classifies the TFs as most differentially co-expressed with the highly abundant and highly DE genes, and the RIF2 metric classifies the TF with the most altered ability to act as predictors of the abundance of DE genes. The main TF is defined if:
Partial correlation and information theory (PCIT) analysis
The PCIT algorithm is also well described in the original paper from Reverter and Chan . Moreover, it has been used for the reconstruction of Gene Co-expression Networks (GCN). The GCN combines the concept of the Partial Correlation coefficient with Information Theory to identify significant gene-to-gene associations defining edges in the reconstruction of the network. At this stage, the paired correlation of three genes is performed simultaneously, thus making the inference of co-expressed genes. This approach is more sensitive than other methods and allows the detection of functionally validated gene-gene interactions. First, is calculated for every trio of genes x, y, and z the partial correlation coefficients:
And similarly, for rxz,y and ryz,x. After that, for each trio of genes is calculated the tolerance level (ε) to be used as a threshold for capturing significant associations. The average ratio of partial to direct correlation is computed as follows:
The association between the genes x and y is discarded if:
Otherwise, the association is defined as significant, and the interaction between the genes x and y is used in the reconstruction of the GCN. The final output includes the network with gene-gene and gene-TF interactions for both conditions, besides generating the main TFs identified in the network.
Functions of the package
There are 28 functions and 5 example datasets available in CeTF, which are described in Table 1. A working example for each of these functions is given in the package documentation in the Supplementary Material. The package allows the integration with many other packages and different types of genomics/transcriptomics analysis.
The CeTF package also includes additional features in order to visualize the results. After running PCIT and RIF analysis, it is possible to plot the data distribution, the distribution of differentially expressed genes/TFs that shows the average expression (in log2) by the difference of expression, the network for both conditions and the integrated network with genes, TFs and enriched pathways. Besides, it is possible to visualize the targets for specific TFs as a circos plot. It is also possible to perform the grouping of ontologies  without statistical inference and functional enrichment for several databases with the statistical inference of many organisms using WebGestalt database . Finally, it is possible to save all tables that include interaction networks, enrichment, differential expression, main TFs, and others.
CeTF is an R-based toolkit, and most of the code is written in R language. PCIT and tolerance functions were written in C/C++ using Rcpp (v1.0.5)  and RcppArmadillo (v0.10.1.2.2)  for better performance. The main R packages used for analysis and visualization of the results were the circlize (v0.4.10) , ComplexHeatmap (v2.6.0) , DESeq2 (v1.30.0) , ggplot2 (v3.3.2) , RCy3 (v2.10.0) , and others listed in the Supplementary Material.
To demonstrate the tool’s utility, we used stomach adenocarcinoma RNA-seq data from The Cancer Genome Atlas (TCGA) project  and applied all analyzes available in the CeTF package. Here, we compared samples from normal tissue (NT=36) and primary tumor (PT=408) of Stomach adenocarcinoma (STAD). The TFs-HRi are shown in Table 2 and the analysis of partial results in Fig. 2A.
Table 2 describes a list of 37 TFs-HRi. Among the main TFs-HRi identified, we highlight four TFs (SETD3, HOX3B, FOXA1, and SOX4) for being widely reported in association with stomach adenocarcinoma. Some studies show that high expression of the SETD3 gene is associated with poor survival in triple-negative breast cancer , while HOXB3 and FOXA1 were identified as indicators of better prognosis [20–22]. Interestingly, the elevated expression of the SOX4 gene has been described to regulate the epithelial-mesenchymal transition (EMT) mechanism mediated by TGF-beta . The Results presented below will be centered on the HOXB3 gene, as it is one of the HOX genes studied by our group [24, 25].
After filtering data, a total of 8,037 genes remained in the analysis and are represented in Fig. 2A, with 151 up-regulated genes (red dots) and 118 down-regulated genes (blue dots). On this set of genes, 7 TFs are up-regulated (green dots), 9 TFs are down-regulated (pink dots) and 504 are not differentially expressed. Figure 2B places the HOXB3 gene as a central hub and its 2520 gene-to-gene interactions obtained with the CeTF package. Seventy-six up-regulated targets, and 58 down-regulated targets were found.
Figure 2C shows the heatmap with all 163 HOXB3 targets, which revealed no correlation with the two main groups of samples with clinical and histopathological data. A graph with the enrichment of gene pathways only with HOXB3 targets (Fig. 2D) shows that only one biological process (muscle system process) was enriched with overexpressed HOXB3 targets. Nine other biological processes were enriched with downregulated targets associated with the cell cycle, corroborating with the biology of normal tissues (Fig. 2D). Furthermore, the Chip-seq data from one of our studies (unpublished data) were used to validate the 163 targets predicted. Although the CHIP-seq data were generated from placental tissue, 54% of the targets predicted by the CeTF package have been validated (Fig. 2D). In addition to the negative control of the cell cycle, the DUSP1 gene, which is upregulated in all cell cycle biological processes, is related to the negative regulation of cellular proliferation . A representation of the genomic distribution of the HOXB3 targets (located on chromosome 17) shows that the vast majority of targets are in different chromosomes. Ten targets are located on chromosome 17 (Fig. 2E). Finally, we built the network for HOXB3 and their targets (Fig. 2F). The targets validated by Chip-seq are highlighted in green color.
CeTF is a tool that assists the identification of meaningful gene-gene associations and the main TFs in co-expression networks, as demonstrated previously. It offers functions for a complete and customizable workflow from count or expression data to networks and visualizations in a freely available R package. We expect that CeTF will be widely used by the genomics and transcriptomics community and scientists who work with high-throughput data to understand how main TFs are working in a co-expression network and what are the pathways involved in this context. We employ RNA-seq data of stomach adenocarcinoma from the TCGA project to demonstrate all the CeTF package analyses. We believe that the present study will help researchers either identify transcription factors with a critical role in regulating gene pathways involved with tumorigenesis or other biological systems of interest.
Availability and requirements
Availability of data and materials
CeTF is a publicly available Bioconductor package available from http://bioconductor.org/packages/CeTF. Documentation is available on the Bioconductor website, and we provide vignettes describing more example analyses. We also maintain a public github repository (http://github.com/cbiagii/CeTF), and invite the community to submit or request additional functionality to incorporate into this package. This package requires R ≥4.0.0 and depends on several R/Bioconductor packages including circlize, ComplexHeatmap, clusterProfiler, DESeq2, GenomicTools, GenomicTools.fileHandler, ggnetwork, GGally, ggplot2, ggpubr, ggrepel, graphics, grid, igraph, Matrix, network, Rcpp, RCy3, S4Vectors, stats, SummarizedExperiment, utils and WebGestaltR. A web page is also available with tutorials and additional information: http://cbiagii.github.io/CeTF/. A docker image with the latest version is available in
Coexpression for Transcription Factors
Regulatory Impact Factors
Partial Correlation with Information Theory
The Cancer Genome Atlas
Transcription Factors with a High Regulatory impact
Gene Regulatory Networks
Transcripts Per Million
Fragments Per Kilobase Million
Hanahan D, Weinberg R. Hallmarks of cancer: the next generation. cell. 2011; 144(5):646–74.
Hu X, Hu Y, Wu F, Leung RWT, Qin J. Integration of single-cell multi-omics for gene regulatory network inference. Comput Struct Biotechnol J. 2020; 18:1925–38.
Yu D, Kim M, Xiao G, Hwang T. Review of biological network data and its applications. Genomics Inform. 2013; 11(4):200.
Farnham P. Insights from genomic profiling of transcription factors. Nat Rev Genet. 2009; 10(9):605–16.
Vaquerizas J, Kummerfeld S, Teichmann S, Luscombe N. A census of human transcription factors: function, expression and evolution. Nat Rev Genet. 2009; 10(4):252–63.
Reverter A, Chan E. Combining partial correlation and an information theory approach to the reversed engineering of gene co-expression networks. Bioinformatics. 2008; 24(21):2491–7.
Reverter A, Hudson N, Nagaraj S, Pérez-Enciso M, Dalrymple B. Regulatory impact factors: unraveling the transcriptional regulation of complex traits from expression data. Bioinformatics. 2010; 26(7):896–904.
Reverter A, Ingham A, Lehnert S, Tan S-H, Wang Y, Ratnakumar A, Dalrymple B. Simultaneous identification of differential gene expression and connectivity in inflammation, adipogenesis and cancer. Bioinformatics. 2006; 22(19):2396–404.
Love M, Huber W, Anders S. Moderated estimation of fold change and dispersion for rna-seq data with deseq2. Genome Biol. 2014; 15(12):550.
Consortium G. The gene ontology resource: 20 years and still going strong. Nucleic Acids Res. 2019; 47(D1):330–8.
Liao Y, Wang J, Jaehnig E, Shi Z, Zhang B. Webgestalt 2019: gene set analysis toolkit with revamped uis and apis. Nucleic Acids Res. 2019; 47(W1):199–205.
Eddelbuettel D, François R, Allaire J, Ushey K, Kou Q, Russel N, Chambers J, Bates D. Rcpp: Seamless r and c++ integration. J Stat Softw. 2011; 40(8):1–18.
Eddelbuettel D, Sanderson C. Rcpparmadillo: Accelerating r with high-performance c++ linear algebra. Comput Stat Data Anal. 2014; 71:1054–63.
Gu Z, Gu L, Eils R, Schlesner M, Brors B. circlize implements and enhances circular visualization in r. Bioinformatics. 2014; 30(19):2811–2.
Gu Z, Eils R, Schlesner M. Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics. 2016; 32(18):2847–9.
Wickham H. Elegant graphics for data analysis (ggplot2); 2009. https://ggplot2-book.org. Accessed 18 Nov 2020.
Gustavsen JA, Pai S, Isserlin R, Demchak B, Pico AR. RCy3: network biology using cytoscape from within R. F1000Research. 2019; 8:1774.
Weinstein J, Collisson E, Mills G, Shaw K, Ozenberger B, Ellrott K, Shmulevich I, Sander C, Stuart J, Network C, et al. The cancer genome atlas pan-cancer analysis project. Nat Genet. 2013; 45(10):1113.
Hassan N, Rutsch N, Győrffy B, Espinoza-Sánchez N, Götte M. Setd3 acts as a prognostic marker in breast cancer patients and modulates the viability and invasion of breast cancer cells. Sci Rep. 2020; 10(1):1–16.
Tomioka N, Morita K, Kobayashi N, Tada M, Itoh T, Saitoh S, Kondo M, Takahashi N, Kataoka A, Nakanishi K, et al. Array comparative genomic hybridization analysis revealed four genomic prognostic biomarkers for primary gastric cancers. Cancer Genet Cytogenet. 2010; 201(1):6–14.
Ren H, Zhang P, Tang Y, Wu M, Zhang W. Forkhead box protein a1 is a prognostic predictor and promotes tumor growth of gastric cancer. OncoTargets Ther. 2015; 8:3029.
Camolotto S, Pattabiraman S, Mosbruger T, Jones A, Belova V, Orstad G, Streiff M, Salmond L, Stubben C, Kaestner K, et al. Foxa1 and foxa2 drive gastric differentiation and suppress squamous identity in nkx2-1-negative lung cancer. Elife. 2018; 7:38579.
Peng X, Liu G, Peng H, Chen A, Zha L, Wang Z. Sox4 contributes to tgf- β-induced epithelial–mesenchymal transition and stem cell characteristics of gastric cancer cells. Genes Dis. 2018; 5(1):49–61.
Brotto D, Siena ADD, de Barros I, Carvalho SdCeS, Muys B, Goedert L, Cardoso C, Plaça J, Ramão A, Squire J, et al. Contributions of hox genes to cancer hallmarks: Enrichment pathway analysis and review. Tumor Biol. 2020; 42(5):1010428320918050.
Ramão A, Pinheiro D, Alves C, Kannen V, Jungbluth A, de Araújo LF, Muys B, Fonseca A, Plaça J, Panepucci R, et al. Hox genes: potential candidates for the progression of laryngeal squamous cell carcinoma. Tumor Biol. 2016; 37(11):15087–96.
Cheng C, Liu F, Li J, Song Q. Dusp1 promotes senescence of retinoblastoma cell line so-rb5 cells by activating akt signaling pathway. Eur Rev Med Pharmacol Sci. 2018; 22(22):7628–32.
We thank Regional Blood Center of Ribeirão Preto for all support. We appreciate Bioconductor reviewers and the GitHub community for constructive discussions about software usage, statistics and better performance.
(CAPES), grant #88882.378695/2019-01; São Paulo Research Foundation (FAPESP), #2013/08135-2, and by Research Support of the University of Sao Paulo, CISBi-NAP/USP Grant #12.1.25441.01.2.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Oliveira de Biagi, C., Nociti, R.P., Brotto, D.B. et al. CeTF: an R/Bioconductor package for transcription factor co-expression networks using regulatory impact factors (RIF) and partial correlation and information (PCIT) analysis. BMC Genomics 22, 624 (2021). https://doi.org/10.1186/s12864-021-07918-2
- R package
- Transcript factors