Inferring and analyzing gene regulatory networks from multi-factorial expression data: a complete and interactive suite

Background: High-throughput transcriptomic datasets are often examined to discover new actors and regulators of a biological response. To this end, graphical interfaces have been developed that allow a broad range of users to conduct standard analyses of RNA-seq data, even with little programming experience. Although existing solutions usually provide adequate procedures for normalization, exploration or differential expression, more advanced features, such as gene clustering or regulatory network inference, are often missing or do not reflect current state-of-the-art methodologies.
Results: We developed a user interface called DIANE (Dashboard for the Inference and Analysis of Networks from Expression data), designed to harness the potential of multi-factorial expression datasets from any organism through a precise set of methods. DIANE's interactive workflow provides normalization, dimensionality reduction, differential expression and ontology enrichment. Gene clustering can be performed and explored via configurable Mixture Models, and Random Forests are used to infer gene regulatory networks. DIANE also includes a novel procedure to assess the statistical significance of regulator-target influence measures, based on permutations for Random Forest importance metrics. Throughout the pipeline, session reports and results can be downloaded to ensure clear and reproducible analyses.
Conclusions: We demonstrate the value and benefits of DIANE using a recently published dataset describing the transcriptional response of Arabidopsis thaliana to combined temperature, drought and salinity perturbations. We show that DIANE can intuitively carry out informative exploration and statistical procedures on RNA-seq data, perform model-based clustering of gene expression profiles, and go further into gene network reconstruction, providing relevant candidate genes or signalling pathways to explore. DIANE is available as a web service (https://diane.bpmp.inrae.fr), or can be installed and launched locally as a complete R package.
Supplementary Information: The online version contains supplementary material available at 10.1186/s12864-021-07659-2.


Statistical procedure
To assess whether an importance value is significant, the rfPermute package [1] fits Random Forests and repeatedly shuffles the target gene's expression profile, so that the null distribution of each regulator's influence is estimated. The empirical p-value of a regulator-gene pair is then given by how extreme its importance is compared with the estimated null distribution.
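The permutation principle described above can be illustrated with a minimal Python sketch. It is not the rfPermute implementation: to stay self-contained, it replaces the Random Forest importance with a simple stand-in score (absolute Pearson correlation), and the function names `abs_correlation` and `empirical_p_value` are hypothetical. Only the shuffling logic and the way the p-value is read from the null distribution are meant to mirror the text.

```python
import random
from statistics import mean


def abs_correlation(x, y):
    """Stand-in importance score (assumption): absolute Pearson correlation.

    This is NOT a Random Forest importance; it only plays that role here.
    """
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return abs(cov) / (var_x * var_y) ** 0.5


def empirical_p_value(regulator, target, n_shuffle=1000, seed=0):
    """Proportion of null scores at least as large as the observed score.

    The null distribution is built by shuffling the target profile,
    which breaks any real regulator-target association.
    """
    rng = random.Random(seed)
    observed = abs_correlation(regulator, target)
    shuffled = list(target)
    n_extreme = 0
    for _ in range(n_shuffle):
        rng.shuffle(shuffled)
        if abs_correlation(regulator, shuffled) >= observed:
            n_extreme += 1
    return n_extreme / n_shuffle


# Hypothetical expression profiles across ten samples.
regulator_profile = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
target_profile = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0, 20.2]
p_value = empirical_p_value(regulator_profile, target_profile, n_shuffle=500)
```

With a finite number of shuffles, the smallest non-zero p-value that can be resolved is 1 / n_shuffle, which is why a larger number of permutations gives finer estimates.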
As biological networks are known for their pronounced sparsity [2,3,4], testing all possible regulator-target pairs would be of very little interest, as well as a waste of computation time. Besides, our preliminary analysis showed that corrections for multiple testing were made unreasonably conservative by the very large number of edges. We therefore propose to create a first graph, topologically consistent with biological network standards, which will be further refined by statistical testing.
More precisely, the steps of the method are:
1. Inference of the importance values for all regulator-target gene pairs using GENIE3. The importance metric returned by GENIE3's Random Forests is the total decrease in node impurities from splitting on the variable, averaged over all trees [5]. It requires the target gene expressions to be normalized to unit variance, so that their regulatory importance measures can be compared without bias. In the GENIE3 framework, this metric was shown to be faster than, and equivalent to, another importance metric: the prediction error on the out-of-bag permuted data. Although both can be used for this step, we recommend the second one, for consistency with the third step.
2. Selection of the number E of edges based on the inferred regulatory ranking. E is chosen so that it sets an upper limit on the network density. The total number of possible edges in a directed regulatory network being $E_{max} = N_{regulators}(N_{genes} - 1)$, and the density being defined as $d = E / E_{max}$, we deduce $E = d \, N_{regulators}(N_{genes} - 1)$.
Studies of state-of-the-art protein-protein interaction network structure, such as [4], found that typical density values in biological networks lie approximately between 0.001 and 0.1, which guides the user's choice for this parameter.
3. Empirical p-values are computed for the selected regulatory weights with the rfPermute package. For each gene involved in the selected edges, Random Forests are fitted using its connected regulators as variables, as defined in the network resulting from the previous step. The response variable is permuted nShuffle times to build the null distributions. The empirical p-value for an edge is consequently the proportion of the null importance values above the observed importance. We propose a default value of nShuffle = 1000, but it can be increased for more precise p-value estimations. The importance metric to use in the Random Forests for this step is the prediction error on out-of-bag (OOB) examples. Indeed, we observed (data not shown) that, unlike the node impurity measure, prediction error on OOB examples was robust both to the reduced number of regulators caused by the selection of E edges only and to over-fitting. Moreover, it does not require any expression normalization, as this is already handled within the metric definition.
4. FDR adjustment [6] for multiple testing is applied to the set of p-values.
5. Only the edges whose adjusted p-value falls below a chosen FDR threshold are kept in the final network.
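As an illustration of step 2, the following Python sketch derives the edge budget E from a target density and keeps the E highest-ranked regulator-target pairs. The function names `edge_budget` and `select_top_edges`, the dictionary layout of the importance values, and the choice to round E to the nearest integer are assumptions made for this example; they are not taken from the DIANE source code.

```python
def edge_budget(density, n_regulators, n_genes):
    """Number of edges E = d * N_regulators * (N_genes - 1).

    Rounding to the nearest integer is an assumption of this sketch.
    """
    if not 0 < density <= 1:
        raise ValueError("density must be in (0, 1]")
    return round(density * n_regulators * (n_genes - 1))


def select_top_edges(importances, n_edges):
    """Keep the n_edges pairs with the highest importance.

    importances: dict mapping (regulator, target) to an importance value.
    """
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [edge for edge, _ in ranked[:n_edges]]


# Worked example: 100 regulators among 1001 genes, target density 0.01.
# E_max = 100 * 1000 = 100000 possible edges, so E = 0.01 * 100000.
n_edges = edge_budget(0.01, n_regulators=100, n_genes=1001)

# Hypothetical importance values for a toy network.
toy_importances = {
    ("reg1", "geneA"): 0.40,
    ("reg1", "geneB"): 0.05,
    ("reg2", "geneA"): 0.10,
    ("reg2", "geneB"): 0.45,
}
toy_edges = select_top_edges(toy_importances, n_edges=2)
```

Only the pairs retained at this stage go on to permutation testing, which is what keeps the number of tests, and therefore the multiple testing burden, manageable.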
In brief, the main user-defined parameters are the estimated network density and the FDR cut-off. Together, they carry much more biological meaning and offer better decision support than an arbitrary importance threshold.
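The FDR adjustment and filtering of steps 4 and 5 can be sketched as follows. This example assumes that the adjustment cited as [6] is the Benjamini-Hochberg procedure, which the text does not state explicitly; the function names `benjamini_hochberg` and `keep_significant` and the example values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the
    # adjusted values stay monotone in the rank of the raw p-values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def keep_significant(edges, p_values, fdr=0.05):
    """Keep the edges whose adjusted p-value is at most the FDR cut-off."""
    adjusted = benjamini_hochberg(p_values)
    return [edge for edge, q in zip(edges, adjusted) if q <= fdr]


# Hypothetical edges and empirical p-values.
edges = [("reg1", "geneA"), ("reg1", "geneB"), ("reg2", "geneA"), ("reg2", "geneB")]
raw_p = [0.01, 0.04, 0.03, 0.005]
final_edges = keep_significant(edges, raw_p, fdr=0.03)
```

Because the adjustment scales each p-value by the number of tests, pre-selecting E edges before testing directly limits how conservative this correction becomes.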

Implementation
To implement this edge selection method, the source code of GENIE3 was modified to use the R implementation of Random Forests and to allow the importance metric to be changed. The testing procedure was implemented in a function that benefits from CPU multi-threading to reduce computation time, but it remains the most time-consuming step. Graphics showing the distribution of p-values and the final number of edges as a function of the chosen FDR are displayed, providing the user with additional decision guidance.
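The quantities behind these two diagnostic graphics can be sketched in a few lines of Python. This is only an illustration of what such plots summarize, not DIANE's plotting code; the function names `p_value_histogram` and `edges_per_threshold` and the example values are hypothetical.

```python
def p_value_histogram(p_values, n_bins=10):
    """Counts of p-values in n_bins equal-width bins over [0, 1]."""
    counts = [0] * n_bins
    for p in p_values:
        # A p-value of exactly 1.0 is placed in the last bin.
        counts[min(int(p * n_bins), n_bins - 1)] += 1
    return counts


def edges_per_threshold(adjusted_p_values, thresholds):
    """Number of edges that would be kept at each FDR cut-off."""
    return {
        t: sum(1 for q in adjusted_p_values if q <= t)
        for t in thresholds
    }


# Hypothetical adjusted p-values for four candidate edges.
adjusted = [0.001, 0.02, 0.04, 0.2]
kept = edges_per_threshold(adjusted, thresholds=[0.01, 0.05, 0.1])
histogram = p_value_histogram(adjusted, n_bins=10)
```

Reading the number of retained edges across several cut-offs shows how sensitive the final network size is to the FDR choice.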
The method is embedded in DIANE, available either through its user interface, or via functions to run from R scripts, as detailed in the package vignette.