Globally learning gene regulatory networks based on hidden atomic regulators from transcriptomic big data

Background Genes are regulated by various types of regulators and most of them are still unknown or unobserved. Current gene regulatory networks (GRNs) reverse engineering methods often neglect the unknown regulators and infer regulatory relationships in a local and sub-optimal manner. Results This paper proposes a global GRNs inference framework based on dictionary learning, named dlGRN. The method intends to learn atomic regulators (ARs) from gene expression data using a modified dictionary learning (DL) algorithm, which reflects the whole gene regulatory system, and predicts the regulation between a known regulator and a target gene in a global regression way. The modified DL algorithm fits the scale-free property of biological network, rendering dlGRN intrinsically discern direct and indirect regulations. Conclusions Extensive experimental results on simulation and real-world data demonstrate the effectiveness and efficiency of dlGRN in reverse engineering GRNs. A novel predicted transcription regulation between a TF TFAP2C and an oncogene EGFR was experimentally verified in lung cancer cells. Furthermore, the real application reveals the prevalence of DNA methylation regulation in gene regulatory system. dlGRN can be a standalone tool for GRN inference for its globalization and robustness.


Background
Gene regulatory networks (GRNs) play fundamental and central roles in response to endogenous or exogenous stimuli for maintaining the viability and plasticity of cells [1,2]. Although it has been acknowledged that aberrant gene networks can be a key driver of human diseases including cancer, little is known about the GRNs of cancer, which has largely impeded the development of cancer precision medicine [3][4][5]. In these years, a deluge of omics big data has been generated and accumulated worldwide, which provides an unprecedented opportunity for reverse engineering GRNs in a cost-efficient way [6,7]. Efficient computational models for inferring GRNs from these omics data are urgently needed theoretically and practically.
Generally, several key issues need to be carefully dealt with in inferring GRNs [7]: 1) Highly complex and heterogeneous networking. Various types of regulations, e.g., transcriptional, methylation or miRNA regulations, are involved and mutually interwoven in GRNs; 2) A large number of regulatory elements or variables unknown or hidden; 3) Discerning indirect and direct interactions; 4) Prior knowledge exploitation or integration of multi-omics data. Broadly speaking, according to the way of modeling transcriptional expression patterns [8], current GRN inference methods can be divided into two categories: parameterized topology paradigm (PTP) and un-parameterized topology paradigm (UPTP). The former attempts to model expression patterns of genes including TFs by parameterizing the topology of the GRNs with various methods [9,10], such as probabilistic graphical models, ODEs and Petri Nets, while the latter treats each pair or subset of regulators and target genes locally and then assembles them into a complete network [11]. In PTP, the use of generative network models allows to take advantage of prior knowledge and favors a global inference for GRNs recovery. A main disadvantage of the PTP methods, however, is the expensive computational cost raised by the heuristic or greedy search for network parameters in an extremely large space. For example, Gaussian graphical models need to estimate a partial correlation matrix of size at least square of the number of genes [12,13]. Compared with Gaussian graphical models, Bayesian networks can tell about both the strength and the direction of regulations, leading to a great prevalence in practice [14]. Friedman et al [10] firstly introduced Bayesian networks to reconstruct S. cerevisiae's gene networks. Recently, Siahpirani et al. [15] considered three types of prior biological knowledge, ChIP, motif and knockout, in an integrative Bayesian network for GRNs inference. For more related works, refer to other literature, e.g. [6,14,[16][17][18].
In contrast, UPTP methods often make use of similarity measures [19], e.g. Pearson correlation (PC), mutual information (MI), or their variants, to score the confidence of regulations between a pair of genes. For this reason, the resulting GRNs are also called dependency networks. Among the similarity measures, most commonly used is MI for its particular power of modeling complex dependencies [20]. For example, the ARACNE method, proposed by Margolin et al. [21], combined MI with the data procession inequality (DPI) to recover GRNs. Due to the transitive effect of correlations and the limited number of observations, ARACNE tends to be over-sensitive to the high noise in microarray data, often yielding plenty of false positives in practice. To overcome the over-sensitivity, Meyer et al. [22] introduced the maximum relevance/minimum redundancy filter for refinement, and Liu et al. [23] designed another two redundancy reduction algorithms specifically for eliminating weakly indirect and noise-induced regulations respectively. Compared with MI, conditional MI (CMI) can provide a constringent result by calculating the mutual information of two genes conditional on other genes [24]. Recent studies show that direct use of CMI, however, tends to have a too conservative result due to the rigid conditional constraint [25]. To relax the constringency, Zhang et al. [26] developed a new conditional MI, CMI2, for characterizing the causal associations between genes, which alternatively quantifies the conditional mutual information through calculating the Kullback-Leibler divergence. By combining CMI2 with path consistency algorithm, the Zhang's model can accurately measure the correlations between gene-pairs for keeping synergistic regulations, thus alleviating the underestimation problem of CMI. For these conditional measures, one more big challenge still remains, i.e. the optimal selection problem of conditional genes, due to lack of prior knowledge.
Recently, target gene-centric regression models (TGCR) have attracted increasing attentions for GRNs reconstruction [2,23,27]. They mainly rely on regression models, instead of the similarity measures described above, and can favourably bypass the challenging optimal selection problem of conditional genes in conditional correlation models like CMI. Briefly, a TGCR method regresses the expression levels of a target gene on known transcriptional factors (TFs) and reports TFs with non-zero coefficients to be a regulator for the target gene. Many regression models have been explored in this way for GRN inference [6], for example, sparse models including l 1 or l 0 regularized regression [28,29]. Compared with the pair-wise paradigm above, TGCR can approach a global inference of regulators for a target gene by trying to estimate a global objective. For example, the recently developed GENIE3 [30], which won the DREAM5 network inference challenge, decomposes the reconstruction of a p-gene regulatory network into p different regression problems. In each of the regression problems, a tree-based ensemble model, Random Forests or Extra-Trees [31], is applied to calculate a local ranking of genes, and the resulting p local rankings are finally aggregated to reach a global ranking of all gene pairs. Dictionary learning (DL) is a recently developed signal restoration model, which finds a dictionary of atomic vectors for a sparse representation of the observed data [32,33]. Extensive applications in different signal processing fields such as image denoising, audio processing as well as pattern classification, have witnessed the great success of DL in recovering hidden signals [34]. We here develop a DL-based GRN inference framework (dlGRN), which intends to learn a sparse representation of the gene regulatory system via a modified DL algorithm and then makes a global inference of the regulators for a target gene based on the sparse representation, independent of known or observed regulators. We argue that it is the first time to truly globally reverse engineering GRNs with the help of a sparse representation of the regulatory system. We demonstrated the effectiveness and efficiency of the proposed method on synthetic data and real-world data about two model organisms and human lung cancer. A novel predicted regulation of a TF, TRAP2C, on an oncogene, EGFR, was experimentally verified. dlGRN is also versatile to infer DNA methylation regulations besides the most concerned transcriptional regulations.

Results
Overview of dlGRN Figure 1a shows the pipeline of dlGRN. The proposed method first decomposes the expression matrix of target genes (TGs) using a modified DL algorithm to uncover atomic regulators (ARs), which as basic regulatory signals reflecting the whole regulatory landscape underpinning the expression data, as shown in Fig. 1b. The modified DL algorithm, called sfk-svd, fits the scale-free and sparse property of GRNs. Given a pair of TF tf, and TG g, dlGRN then estimates Pearson correlations (PCs) between tf and the resulting ARs associated with g and calculates a confidence score (cs) for the regulation of tf on g via the inverse function of the cumulative distribution of PCs, as shown in Fig. 1a. The confidence score is meaningful in systems biology and will be robust due to the globalization of ARs. To avoid small sample bias, we also devise a resampling procedure to wrap the inference model and obtain a final GRN as an average network over multiple runs (Fig. 1a).

Evaluation of the performance of uncovering hidden ARs
When applying sfk-svd to Simulation data I, we observed that root mean squared errors (RMSEs) gradually decreased and converged within~200 iterations in all the data scenarios (Figs. S1-S5 in S1 Notes), irrespective of the values of l, {25, 50, 100 and 150}, suggesting the convergence of the algorithm. With the learned ARs, we calculated RRs and PPVs against the 50 real regulators and averaged them over 20 random data sets in each scenario. Results show that both RRs and PPVs reached a maximum of > 90%, irrespective of sample sizes and noise levels ( Fig. 2a), indicating the super power of dlGRN in learning hidden regulatory signals. We observed that the parameter l took substantial impact on the power: l = 50, i.e., the real number of regulators, always led to the highest RRs and PPVs, while l > 50 or l < 50 decreased PPVs significantly, as shown in Fig. 2 b-c. This may relate to the efficacy of dictionary learning, which can be incomplete or over-complete depending on l, and on the other hand, suggests that an optimal l tends to be equal to or slightly larger than the real number of regulators. Similar results were obtained in simulation scenarios of SNR = 20, 25, 30 and 40 ( Fig.  S6 in Supplemental material SI Notes). Figure 2(b, c) also reveal that the power increases as the sample size becomes larger, especially when noise is high (Fig. S6). Figure 2d-e visualizes the changes of the average RRs and average PPVs over different samples sizes with SNR, showing a trend that the power increases as noise reduces, especially when l is large.

Evaluation of the performance of dlGRN in predicting gene regulations
Results reveals that on Simulation data I, dlGRN achieved higher average AUROCs and AUPRs than four state-of-the-art methods, GENIE3 [30], CLR [35], ARAC Ne-AP [11] and ARACNE [21] in all the scenarios of sample sizes and noise levels, as shown in Table 1 (and  Table S1 in Supplemental material SII Notes). We found that the optimal values of l are always around the number of real regulators [36], which is consistent with the pattern of the power of recovering hidden regulatory signals in simulation experiments (Fig. 2). The advantage of dlGRN over previous methods was almost completely kept on the non-linear Simulation data II, shown in Table 2. A main difference is that the maxima of AUROC (83.24%) and AUPR (24.39%) are reached at l = 500 (Table S2 in Supplemental material SI Notes), which is far larger than the number (195) of real regulators. This should be related to the increased nonlinear complexity in Simulation data II.
For the real two model organisms and three LUAD data sets, AUROCs and AUPRs were calculated against the corresponding experimentally-validated TF-target regulations, respectively. Results reveal that for the S. cerevisiae data set and all the three lung cancer data sets, dlGRN still achieved higher AUROCs and AUPRs than those of the four previous methods and competitive results for the E. coli data set, as shown in Table 2. For each of the three lung cancer data sets, we further sorted the predicted regulations in a decreasing order of cs and counted the numbers of true positives in the top num = 10, 50, 100, 150 and 200 for each method, finding that dlGRN called most true positives on all the three data sets and most common true positives, regardless of num, as shown in Fig. 3. Taken together, these results suggest the superior power of dlGRN in recovering regulations over state-of-the-art methods.
The inferred GRNs by dlGRN are globally scale-free, whose nodes tend to locally cluster Aberrant gene networks can drive the development of complex diseases such as cancer [37]. Following the known 2677 TF-target regulations, we then selected 2677 TFtarget pairs with top cs by each method and built GRNs for LUAD on each of the three lung cancer data sets (Fig. 4a-c and Fig. S7 in Supplemental material SI Notes). The lntransformed distributions of the degree of nodes in the resulting GRNs are examined (Fig. 4d-f). Following the power-law property, i.e., ln(P (deg))~γ ln (deg), where P (deg) represents the likelihood that a gene has a degree of deg, we found that the GRNs obtained by dlGRN have γ = − 6.97, − 6.33 and − 7.19 for the three data sets, GSE32863, GSE10072 and GSE7670, respectively, whose absolute values are larger than those by previous methods, indicating more sparse topological structures. In real-life networks, nodes tend to form tightly knit groups with a relatively high density of connectivity [38,39]. We calculated the average cluster coefficients (ACCs) of these GRNs, finding that the GRNs by dlGRN had larger ACCs (0.31, 0.32, 0.47 for GSE32863, GSE10072 and GSE7670 data sets respectively) than those by all the previous methods on all the three data sets. We further compared the numbers of correctly recognized TFs per target gene and the average numbers (AN) over all target genes among different methods. Results reveal that the GRNs by dlGRN had AN = 0.38, 0.32 and 0.41 for GSE32863, GSE10072 and GSE7670 data sets respectively, as ( Fig. 4a-c), which are larger than those by the four previous methods, confirming the higher sensitivity of dlGRN in recognizing regulations. Furthermore, Venn diagrams of the three sets of 2677 links for different methods (Fig. S8 in Supplemental material SI Notes) reveal that dlGRN resulted in significantly more shared links (484) than other methods (p-value< 0.001), suggesting the better reproducibility and consistency of GRNs by dlGRN.  Literature survey shows that many predicted TFs by dlGRN for each target gene were previously reported. Take as an example the target gene "ID2", whose encoding protein belongs to the inhibitor of DNA binding family with a helix-loop-helix (HLH) domain and plays important roles in cell proliferation, differentiation and angiogenesis [40]. Among the 11 experimentally-verified TFs for ID2, 4, 2 and 2 were successfully predicted by dlGRN on the three data sets, GSE32863, GSE10072 and GSE7670, respectively, all more than those by the four previous methods, as shown in Fig. 5a-e. Especially, dlGRN correctly called a known TF of ID2 (MEIS1) simultaneously on the three data sets, but the four previous methods none. Among other regulators predicted by dlGRN, CREB1, missed by all the four previous methods (Fig. 5f-h and Fig. S9 in Supplemental material SI Notes), has been previously reported to regulate ID2 in [41]. For EGR2, Kim et al. [42] experimentally observed that EGR2 transactivates ID2 by binding to the promoter of ID2 and knockdown of EGR2 represses ID2 gene expression in osteoclast-lineage cells. Both ETS2 and ETV4 encode proteins with ETS-domain which can bind to the gene family of IDs [43]. Both SMAD4 and SMAD7 are members of SMAD family, which have previously reported to suppress the expression of ID2 in tumorigenesis [44,45]. STAT5B is one of two STAT5 TFs from the STAT family, playing important roles in apoptosis and TCR signalling. Li et al. [46] observed that STAT5 proteins regulated ID2 transcription by recruiting STAT5B in a cis-regulatory element to the ID2 promoter in dendritic cells. Furthermore, Sun et al. [47] reported that STAT5 stimulates the expression of ID2 to control the CD103 + DC production and the pDC inhibition.

dlGRN is intrinsically distinctive of direct and indirect regulations
Cascade regulatory structure (CRS) is a basic type of regulatory motifs in GRNs [48], where gene A, for example, regulates gene B and gene B subsequently regulates gene C, denoted by A → B → C. In other words, gene A indirectly regulates gene C via two direct regulations. Due to the transitive effect of correlations, current methods often fail to infer CRS completely correctly. The background 2677-link networks of the lung cancer data contain totally 6678 CRSs, against which we investigated how dlGRN distinguishes direct and indirect regulations. Hypothetically, a CRS may be recovered in five patterns (Fig. S10 in Supplemental material SI Notes): Pattern 1 (P1), which wrongly calls a direct regulation between A and C, Pattern 2 (P2), which recovers the CRS completely correctly, and the rest three patterns, named P3, P4 and P5, which correctly recognize the indirect regulation between A and C but miss direct regulations, A → B, B → C, or both, respectively.  correctly called most CRSs (P2) with least recognition errors of indirect regulations (P1) on almost all the data sets. ARACNe missed most direct regulations (P5) on almost all the three data sets, which may be related to the over-trimming of links by DPI. These results suggest that dlGRN is intrinsically distinctive of direct and indirect regulations due to the modeling globalization.

A novel predicted transcriptional regulation of TFAP2C on EGFR in lung cancer
Considering that EGFR is one of hottest onco-genes in lung cancer, we looked into the predicted TFs for EGFR by dlGRN on the three lung cancer data sets. We averaged the resulting cs over the three data sets for each of the 55 known TFs (Supplemental material SIII Notes), and found that two TFs, LMO2 and TFAP2C, not recorded as TFs of EGFR in the UCSC and TRED databases (April, 2017), are with highest average cs (0.36 and 0.33), suggesting a high likelihood of regulating EGFR. To experimentally verify the predictions, we searched for the transcription factor binding sites (TFBSs) of the two TFs to the promoter of EGFR using the online JASPAR tool (http://jaspar.genereg. net/), finding 93 TFBSs for TFAP2C but none for LMO2. Based on the 93 TFBSs, we conducted TFAP2C siRNA knockdown experiments on lung cancer cell A549. As a result, we observed that EGFR significantly (p-value< 0.01) depressed its expression after knockdown of TFAP2C in the two repeats (Fig. 6a). Similar depression has been previously observed in luminal breast cancers [49]. A potential regulatory mechanism of EGFR by TFAP2C may be via three most highly scored TFBSs predicted by JAS-PAR, as shown in Fig. 6b. According to KEGG pathway database (https://www.genome.jp/kegg/pathway. html), activated EGFR can lead to cell growth and proliferation via RAS-MAPK signalling pathway. We found that many genes along the RAS-MAPK signalling pathway, e.g., RAS and MAP 2 K1, significantly over-expressed in tumors compared with normal tissues in the three lung cancer data sets (Fig. 6c).
dlGRN reveals the prevalence of DNA methylation regulation in gene regulatory system By replacing the expression levels of TFs with DNA methylation levels, dlGRN can be used to infer DNA methylation regulation of genes. For the lung cancer data set, GSE32863, Selamat et al. [50] monitored the DNA methylation profiles of the 116 samples at the same time. We downloaded the DNA methylation data (GSE32861) from GEO and applied dlGRN to jointly analyze it with the expression data, GSE32863. Results reveal that a considerable proportion of genes (60.5%) are significantly methylation-regulated at an ad hoc p-value cutoff of 0.05 (by a permutation test described in Supplemental material SI Notes), as shown in Fig. 7a (and Supplemental material SIV Notes). This coincides with the indispensible roles of DNA methylation in cellular activity [36]. Many of the inferred methylation regulations have been previously observed as hypo-or hyper methylations in cancer (Table  S3 in Supplemental material SI Notes). Take gene "RAB25" (cs = 0.6976, p-value<1e-3) as example. The gene belongs to the RAS superfamily of small guanosine triphosphatase (GTPase), which regulates tumor progression and aggressiveness during tumorigenesis. Figure 7b-c boxplots the expression and methylation levels of RAB25 in tumor and adjacent non-tumor tissues, showing that RAB25 is both significantly up-expressed (p-value< 2.2e-16) and significantly down-methylated (p-value< 2.2e- 16) in the LUAD. Correlation analysis confirms that RAB25 expression is significantly negatively correlated with its methylation (Pearson correlation is − 0.67 and p-value = 3.76e-16). We reason that the abnormal over-expression of RAB25 in LUAD may be driven by its aberrant hypomethylation, albeit needs to be experimentally verified.

Discussion
In this paper, we have proposed a global inference framework for reverse engineering GRNs based on deep learning, i.e. dlGRN. The framework interrogates the gene regulatory system using DL and predicts regulations between TFs and TGs in a global way. Specifically, a modified DL algorithm sfk-svd was developed for reliably uncovering ARs which reflect the whole regulatory mechanism. The modified DL algorithm fits the scalefree and sparse property of GRNs. Then, the regulation confidence of a TF on a target gene can be estimated by a correlation analysis between the TF and the ARs associated with the target genes. The use of ARs guarantees the globalization of the regulation inference. A resampling procedure was also designed to avoid sample biases for inference robustness. Experiments on simulation and real data sets show that dlGRN outperforms state-of-the-art methods with higher AUROCs and higher AUPRs in GRN reconstruction.
Discerning indirect and direct regulations is of importance in GRN reconstruction. Previous methods such as similarity criteria often call plenty of spurious direct regulations due to the transitive effect of correlations. In contrast, dlGRN works in a AR-based global way and thus is intrinsically distinctive of direct and indirect regulations, as illustrated in the experiment on lung cancer data (Figs. S8, S9), where dlGRN correctly recognized most CRS modules with least errors on almost all the three data sets.
We experimentally verified a novel predicted regulation, i.e., the regulation of TF TFAP2C on a hot once-gene EGFR, in lung cancer cell A549 and conceived a potential three TFBSs molecular mechanism. Over-expressed EGFR can stimulate cell growth and proliferation via RAS-MAPK signalling pathway. Many genes along the pathway, e.g., RAS and MAP 2 K1, were observed to be over-expressed in tumors in the three lung cancer data sets (Fig.  6c), confirming the downstream of tumor signals trigged by the abnormal TFAP2C-EGFR regulation. In addition, we also revealed the prevalence of DNA methylation regulation in gene regulatory system. Considering the pressing need of understanding GRNs in cells, we envision that our approach will be very useful and promise broad applications in biological and medical research.
Despite the success of recovering TF/DNA methylation regulations, gene regulatory system is complex and involves various types of expression regulations, for example, histone modification and miRNA degradation, which regulate target genes in different ways and may need more specific reverse engineering Fig. 6 Experimental verification of TFAP2C regulating EGFR in A549 cancer cells. a Relative expression levels of EGFR with or without knockdown of TFAP2C. "**" mean p-values< 0.01. b Potential transcriptional regulation mechanism of TFAP2C on EGFR. c Comparison of expression of genes along onco-signalling pathway activated by EGFR between tumor and normal tissues in the three data sets models. We also notice that TFs preferentially bind to a certain target sequence, and searching for that sequence or similar patterns in the regulatory regions of the target genes may help improve dlGRN. Future work will be addressing these issues for better performance.

A global model of gene regulatory systems (GGRM)
In cells, gene expression can be regulated and mediated in concert by various types of regulatory factors, such as TFs, microRNAs or epigenetic states. We hypothesize that the expression levels of a target gene are collectively shaped by a handful of basic regulators in a weighted linear way. Considering that plenty of regulators are unknown or unobserved, we intend to interrogate the whole regulatory mechanism by mining as many hidden basic regulatory signals as possible via dictionary learning, as shown in Fig. 1b. Theoretically, the resulting regulatory signals, referred to as atomic regulators (ARs), can represent all possible regulatory factors, such as TFs, microRNAs, epigenetic statuses, or even combinational regulatory modules. Mathematically, let Y∈R nÂp denote an observed expression matrix of p target genes in n samples, we reformulate Y as where D∈R nÂl represents the regulatory dictionary matrix of l ARs across n samples, incomplete or overcomplete; X∈R lÂp represents the sparse regulation coefficient matrix of the l ARs on target genes, of which element x ij represents the regulation effect of the i-th AR to the j-th target gene; ε is a random white noise subjecting to an i.i.d Gaussian distribution with mean of zero. The learned AR dictionary reflects a surrogate of the regulatory mechanisms underlying Y. The number of ARs (l) is an important parameter to learn all the ARs behind the expression data. However, no exact guidance exists for choosing the parameter in practice. Theoretically, the parameter should be large enough for a comprehensive Considering the sparse and power-law property of GRNs, we solve the model [1] by simultaneously optimizing D and X under a scale-free sparsity constraint: where x i is the i-th column of X and t i is a prior positive constant, referred to as scale-free sparsity parameter, that specifies the upper boundary of the number of ARs for the i-th target gene. We set t i by randomly sampling from a distribution P(t i ) ∝ t i −λ (λ = 2-4 and t i = 2-6). The optimization [2] guarantees the sparsity of the resulting GRNs and makes it under control in network structure. However, the objective function is not convex on both D and X together and is algorithmically NP-hard. Mathematically, for such optimization problems, no solutions are immediately available and local minima is often desirable in practice [51]. We then developed a modified k-SVD algorithm, named scale-free-constrained k-SVD (sfk-svd), to approximate a local solution for the optimization problem [2]. Briefly speaking, the algorithm repeats two steps, i.e., sparse approximation and dictionary update, until the error converges (See Section 1 in S1 Notes for details).
There are many methods that can be used to analyze the latent regulatory signals, such as PCA [52], ICA [53] or NCA [54]. However, these methods either impose strict statistical properties for learned latent regulators, e.g. orthogonality (PCA) and independence (ICA), or strongly needs priori connectivity information, e.g. NCA, which makes them not suitable and scalable for large scale biological systems inference, especially with limited number of samples [55,56]. Compared with these methods mentioned above, k-SVD-like dictionary learning methods are promising, because they hardly impose none statistical properties on the atomic regulators to be mined except the sparseness of the inferred network structure which is in coordination with the real-world GRN structure.

Inferring regulatory relationships based on ARs
Let ðD;XÞ represent a solution for the GGRM. For the g-th target gene, assume that the g-th columnx g ofX has n g non-zero elements with subscripts (1), (2), …, (n g ), we can have n g ARs that are associated with the target gene, whered ðiÞ is the (i)-th column ofD. For a given TF tf with an expression profile d tf , we then assess how it regulates g as follows: First, calculate Pearson correlation coefficients (pcc) between tf and each AR. Note that one can use Spearman correlation for nonlinear association. Second, estimate the regulation confidence score (cs) as where Γ −1 represents the inverse function of the cumulative distribution of |pcc| and 0 ≤ α ≤ 1 is a quantile cutoff (α = 0.9 as default). Larger αs lead to more sensitive results. Figure 1a illustrates the inference procedure.
A resampling procedure Considering that resampling can relieve sample bias in machine learning, especially when sample size is small or moderate [57], we also devise a resampling procedure for more reliable inference: 1) Randomly selecting a subset of s samples from the total samples without replacement and running dlGRN with the s samples; 2) Repeating 1) N times to obtain N cs by [4] for each pair of regulators and target genes; 3) Averaging the resulting N cs as final results. Specifically, we set s = 25% × n and N = 2 × n as default. The pseudo code of the proposed GRN inference approach dlGRN can be listed below:

Parameters of dlGRN
In the proposed GGRM, the parameter l represents the number of atomic regulators (ARs) and should approximate to the number of real-world regulators, including TFs, microRNAs and DNA methylation. Theoretically speaking, the value of l needs to be estimated based on the biological priors of the organism from which the transcriptomic data was collected. In our context, the value of l was set to range around the number of known regulators of genes in the dataset for fully demonstrating the performance of dlGRN. The parameter t i is a small positive constant to constrain the maximum l 0 -norm of the i-th regulatory coefficient vector. The resampling procedure makes the inference results insensitive to the selection of t i within a limited scope [29]. In our context, t i was set to range in 2-6 in all data scenarios.

Measures for method evaluation
We adopted two measures, i.e. recovery rate (RR) and positive predictive value (PPV), to evaluate the performance of recovering ARs from gene expression data. Another two measures, i.e. area under receiver operating characteristic curve (AUROC) and area under precision-recall curve (AUPR), were used to assess the performance of detecting regulatory relationships [58]. See Supplemental material S1 Notes for details of these measures.

Two simulation data sets
Simulation data I mimic a linear regulatory system with background networks following the sparse and scale-free property, consisting of p = 1500 target genes and k = 50 regulators. Totally, 30 data scenarios were considered: six noise levels times five sample sizes (See Supplemental material SI Notes for details of Simulation data I generation). Simulation data II were downloaded from the DREAM5 project (http://www.the-dream-project.org/), which are used to mimic a non-linear regulatory system with a background network drawn from known transcriptional regulatory networks of Yeast Strains. The data sets consist of the expression profiles of 1548 target genes and 195 TFs in 805 samples. See the literature [59] for more details of Simulation data II.

Five real data sets
First two data sets come from two model organisms, E. coli and S. cerevisiae, which consist of the expression profiles of 4511 target genes and 334 TFs in 805 samples and the expression profiles of 5950 target genes and 333 TFs in 536 samples, respectively. For the two data sets, 2066 and 3940 experimentally verified TF-TG regulations were collected from the literature [60][61][62] as silver standard, respectively. Three human lung adenocarcinoma (LUAD) transcriptional data sets, GSE32863, GSE10072 and GSE7670, were downloaded from GEO database and preprocessed (Supplemental material SI Notes) to have the expression levels of 4771 genes in 116, 107, and 54 samples, respectively. For the lung cancer data sets, 2677 TF-target genes regulations were collected from the UCSC database [63] and TRED database [64] (on April 1, 2017) as silver standard.

Cell culture
Human lung cancer cell A549 was purchased from American Type Culture Collection and cultured in DMEM medium supplemented with 10% fetal bovine serum (Biological Industries, Israel). Cells were cultured in a 37°C humidified atmosphere of 5% CO2 and planted in a 6-well plate after the cell states became well.

RNA extraction and quantitative real time PCR
Cancer cells were transfected with siRNAs and negative controls with liposome (lip3000 in our experiments) for 48 h. After the medium was removed, cancer cells were washed 3 times with PBS. Total mRNAs from cultured cells were extracted using Trizol (Invitrogen, UCA) according to the manufacture's instructions. cDNA was synthesized using the HiScript II 1st Strand cDNA Synthesis Kit (Vazyme Biotech, China), and the expression levels of mRNAs were quantified using ChamQ SYBR Color qPCR Master Mix (Vazyme Biotech, China). Quantitative Real Time PCR was performed using the Bio-Rad CFX Real-time PCR system (Bio-Rad, USA). Statistical comparison of the two groups each with triplicates was conducted using Student's t-test. Statistically significances were calculated and indicated. *: P < 0.05, **: P < 0.01.