Functional data analysis for identifying nonlinear models of gene regulatory networks
© Summer and Perkins. 2010
Published: 2 December 2010
Skip to main content
© Summer and Perkins. 2010
Published: 2 December 2010
A key problem in systems biology is estimating dynamical models of gene regulatory networks. Traditionally, this has been done using regression or other ad-hoc methods when the model is linear. More detailed, realistic modeling studies usually employ nonlinear dynamical models, which lead to computationally difficult parameter estimation problems. Functional data analysis methods, however, offer a means to simplify fitting by transforming the problem from one of matching modeled and observed dynamics to one of matching modeled and observed time derivatives–a regression problem, albeit a nonlinear one.
We formulate a functional data analysis approach for estimating the parameters of nonlinear dynamical models and evaluate this approach on data from two real systems, the gap gene system of Drosophila melanogaster and the synthetic IRMA network, which was created expressly as a test case for genetic network inference. We also evaluate the approach on simulated data sets generated by the GeneNetWeaver program, the basis for the annual DREAM reverse engineering challenge. We assess the accuracy with which the correct regulatory relationships within the networks are extracted, and consider alternative methods of regularization for the purpose of overfitting avoidance. We also show that the computational efficiency of the functional data analysis approach, and the decomposability of the resulting regression problem, allow us to explicitly enumerate and evaluate all possible regulator combinations for every gene. This gives deeper insight into the the relevance of different regulators or regulator combinations, and lets one check for alternative regulatory explanations.
Functional data analysis is a powerful approach for estimating detailed nonlinear models of gene expression dynamics, allowing efficient and accurate estimation of regulatory architecture.
A key problem in systems biology is estimating dynamical models of gene regulatory networks. The mathematical modeling of expression dynamics, combined with model parameter estimation, has been crucial to unraveling complex regulatory programs , to recognizing the robustness of the regulatory architecture of the segment polarity genes to variations in initial conditions and parametric variation [2–4], to studying mechanisms of robustness and evolution of the control of the cell cycle in yeast [5, 6], to identifying surprising shifts in the expression domains of the gap genes and the regulatory interactions responsible [7, 8], and to numerous other studies (e.g., [9–16]).
where x is the vector of expression levels of the N genes, and f produces a vector of time derivatives of expression depending on the current expression levels and on some adjustable parameters θ. These parameters typically encode such features as kinetic rates for the production and decay of gene products, and regulatory influences between the genes. The regulatory architecture of the system–that is, which genes' expression derivatives depend on which other genes' expression–may be made explicit in the function f (e.g., ), or it may be implicit in the parameters (e.g., [7, 8]), in which case optimizing the parameters implicitly determines network architecture. Extensions of our work to modeling both mRNA and protein levels of expression, for example, are straightforward, as would be extensions to functions f that depend on time or to delay differential equations, where the derivatives depend on the state of the system in the past. We will also assume that the expression data is collected from the wild type network, though initial conditions may vary. Knock-out or over-expression data has also proven useful in genetic network inference, both in theory  and in practice . However, wild-type data is far more common and easier to generate than genetic perturbation data.
where x (t i ) denotes the solution to the ODE (Eq. 1) with parameters θ. The initial conditions are often taken from the observed data, x (t 0) = y (t 0) , or they may be part of the parameters θ. Even when the ODE model is linear, so that for some matrix A, this optimization is not trivial. The solution to such an ODE is given by the matrix exponential x (t) = e At x (t 0), so that the dependence of the error on the parameters (A) is not straightforward. Still, linear differential equation models have been fit efficiently to expression data by various means, most prominently by recasting the problem into other more convenient forms [9, 19, 20]. When f is nonlinear, as is typically the case when trying to make more detailed models of network dynamics, then solving the minimization (Eq. 2) is all the more difficult.
There is another major approach to fitting ODE models, however, via functional data analysis (FDA) . The fundamental idea of FDA is to transform a data series (e.g., the time series y (t i )) into a continuous function (ŷ (t)). This transformation often involves "denoising" the data, using smoothing splines or some other basis function approximation. Various estimation problems can then be solved in terms of these functions. FDA approaches have made some inroads in the literature on gene expression analysis. It has particularly appeared in papers on dimensionality reduction, clustering or classification for microarray expression time-series data (e.g., [22–24]). More relevantly to the present paper, several works have proposed estimating linear dynamical models from (microarray) expression data [25, 26]. As will be explained in greater detail below, this approach to dynamical modeling allows the estimation problem to be reduced to one of regression, which carries both statistical and computational advantages.
Here, denotes the element of the vector corresponding to gene g, and similarly f g denotes the element of f pertaining to gene g. Thus, if there are N genes and, say, K free parameters per gene, the fitting problem is reduced from a single nonlinear N × K -dimensional problem to a set of N independent K -dimensional optimization problems. Such a reduction in dimensionality is typically very favorable when solving nonlinear optimization problems. Finally, although the derivative-based error criterion is still a nonlinear optimization problem for arbitrary dynamics functions f, informally, the optimization tends to be "less nonlinear" than for the trajectory-based error. In part, this is because the error involves only the evaluation of the dynamics function rather than solutions to the dynamics equation. Typically, f is not taken to be anything more complicated than a generalized linear model , so that minimizing derivated-based error is a generalized-linear least squares regression problem–a type of problem routinely solved in statistical analyses.
Despite the potential advantages of the FDA approach, we believe it has not been seriously evaluated on the problem of estimating nonlinear models of gene expression dynamics. In particular, neither its efficiency nor its ability to correctly estimate regulatory network architecture have been evaluated. Here, we formulate and test FDA approaches on data from two different real networks, the gap gene system of Drosophila melanogaster [7, 8] and the synthetic IRMA network , and on simulated data generated by the GeneNetWeaver program . We show that the FDA approach is extremely efficient at fitting nonlinear dynamical models of these data sets. In fact, it is so efficient that we can explicitly enumerate and test all possible regulatory architectures, which, to our knowledge, has never been achievable before for this type of modeling. These enumerations clarify the key regulatory factors, as well as interactions between factors, that explain the observed expression dynamics. We also assess the accuracy with which regulatory relationships are correctly extracted from the data, and compare it to other state-of-the-art fitting approaches. In general, the approach seems as successful as any other at determining which genes regulate which, and is very successful at discriminating the types of regulatory interactions–activation or repression.
We apply FDA methods for fitting differential equation models of data from two real gene networks and simulated data from a set of in silico systems. Here we briefly describe these systems and the expression data upon which our fits are based.
Reinitz and colleagues have made detailed measurements of the protein expression of these seven genes during development of the embryo  by confocal imaging of fluorescent antibody-labelled preparations. We use a data set that includes measurements at 8 different times spanning approximately one hour of actual time, and covering the trunk region of the embryo at a resolution of 1% of embryo length. This includes 7 genes × 8 times × 58 space points = 3248 total expression measurements. The data for the Hb gene are shown in Figure 1B.
Cantone et al.  reported on the construction and mRNA expression measurement of a synthetic gene network created in yeast called IRMA (for in vivo benchmarking of reverse-engineering and modeling approaches). The network was constructed from five genes: SWI5, CBF1, GAL4, GAL80 and ASH1. Each gene was given a known and well-characterized promoter responsive to one or more of the other genes in the network, as shown in Figure 1C. In their paper, Cantone et al. describe GAL80 as repressing GAL4, but this is via their natural protein interaction properties. At the mRNA level, GAL80 does not affect GAL4, and so the effect of GAL80 is seen only at the target of GAL4 protein, which is the promoter of SWI5. Hence, our canonical model has GAL80 repressing SWI5. The endogenous transcription factors were deleted from the organism, to limit the impact of external factors. The dependence of GAL80/GAL4 binding on galactose is used as an on/off switch for the network. In the presence of galactose, the SWI5 gene is activated to subsequently trigger expression in the rest of the network. We use five different time-series that they generated by switching the network on using galactose and measuring the mRNA levels every 20 min over a five hour interval by quantitative real-time RT-PCR (see Figure 1D).
GeneNetWeaver  can be used to generate in silico datasets of the expression dynamics of gene networks, and has been the basis for part of the DREAM network reverse engineering challenge for several years running . The tool allows one to generate data from estimated yeast or E. coli networks, or subnetworks thereof. The program generates a kinetic model of gene expression, and can output time-series or steady state data for wild-type and genetic perturbation conditions. We used GeneNetWeaver to generate four in silico networks, two of which are sparsely connected like the IRMA network, which we denote S1 and S2 (see Figure 1E for an example), and two of which are more densely connected like the gap gene network, which we denote D1 and D2. We generated 20 wild-type expression time series for each network as the basis for model estimation (see Figure 1F).
We smoothed and transformed the time series into continuous functions of time using the cubic spline functions built into the Matlab programming language (see Methods for details). For each of the data sets, this results in a set of functions ŷ i (t), where the superscript i indicates it is the i th such time series–one of 58 for the Drosophila data, corresponding to each space point, one of 5 for the Cantone data, and one of 20 for each GeneNet Weaver network. With the cubic spline representation, the temporal derivatives can be directly obtained from the spline coefficients, so that is readily computed.
where R g is the maximum rate of production of gene g 's protein or mRNA, is a sigmoidal function ranging between zero and one, T gg ′ is the regulatory weight describing the effect of gene g ′ on the production of g 's protein or mRNA, h g is a bias term, and λ g is the decay rate.
where E 0 is the original error function of Equation 6 and c is a parameter determining the relative import of fitting the data accurately and using "small" weights. The L 1 penalty is often used in an attempt to eliminate excess parameters in regression problems. If one is only concerned about prediction accuracy, and if one has statistically independent data points, then cross-validation can be used to choose a value of c that appropriately trades off model complexity and model accuracy on the training data. In our case, the data come from time series, so derivative estimates at different times are certainly not statistically independent. Nor is our primary concern the accuracy of the regression model. This is only a conduit to determining regulatory architecture. Thus, we experimented with a range of c values, as described in more detail below. Regulatory weights that remain nonzero for large values of c are the most important for explaining derivatives, and we give these the highest "confidence".
For the IRMA and GeneNetWeaver data sets, we fit models without autoregulatory links, as these systems do not include autoregulation. For the gap gene system, however, where autoregulation is believed to occur, we allowed autoregulatory links in the model. Three of the Drosophila genes and some of the genes in the GeneNet Weaver networks do not have any regulatory inputs–at least, not among the genes considered. We did not model these genes, restricting our modeling efforts (and accuracy assessments) to those genes that are regulated.
The gap gene network is densely connected, with 22 of the 28 possible links present in the gold standard model. Adding regularization to the optimization criterion risks eliminating true positives. Nevertheless, we tried optimizing the L 1-regularized error function E 1 for regularization constant c ranging from 0 to 10 in increments of 0.1. The results are summarized in Figure 2C. As one would hope, increasing c increases the number of true negatives from 1 (at c = 0) to 4 (at c = 10). At the same time, however, the number of true positives, and total correct links, drops drastically. The positive predictive value does not improve with c, as both true positives as well as false positives are dropped from the model (data not shown).
Figure 2D shows the network obtained by optimization of the E 0 criterion for the IRMA data. The IRMA network is sparse compared to the gap gene network, having only seven links among the five genes. The optimization correctly identifies six of those links, including their correct sign. It misses only the activation of GAL4 by CBF1, perhaps because the model also has the true regulators of CBF1 connected to GAL4-a case of mistaking direct versus indirect regulation. Without regularization, however, there are many false positive links in the estimated regulatory architecture. Figure 2E compares the performance of the E 0 optimization against the TSNI algorithm, which fits a linear differential equation model that is limited to at most two inputs per gene. This algorithm performed the best of several alternatives tested by Cantone et al. . The TSNI algorithm detected four of the seven true links in the network, attributing the correct sign to three of those links. Its overall fraction of correct links is much higher than that for the FDA fit, perhaps in part because the limitation to two inputs per gene ensures many true negatives–links absent in both the gold standard and the estimated model. (In the next section, we will see how FDA performs when limited to two inputs per gene.) Because of the large number of false positives in the FDA fit, it also has significantly lower PPV than TSNI. However, the FDA fit enjoys greater sensitivity and correct sign fraction.
Because the unregularized fit includes a large number of false positives, we hoped that adding the L 1-regularization would improve the accuracy of the estimated network architecture. Figure 2F shows the results for regularization constant c ranging between 0 and 10. Regularization was partly successful. For c ranging from roughly 3 to 5, one of the true positives was lost, but five false positives were also trimmed away, approximately halving their number. This still left seven false positives, however, which could only be eliminated by losing most of the true positives. From both our experience and the results of Cantone et al. , the IRMA dataset appears much more challenging than the gap gene data, perhaps due to the much smaller number of time series (only 5, compared to 58) or to noise in the data (the gap gene data incorporates significant smoothing and averaging across embryos, to eliminate observation noise and other sources of variability).
Broadly speaking, our results on the two sparse GeneNetWeaver networks mimicked our results on IRMA, and our results on the two dense GeneNetWeaver networks mimicked our results on the gap gene network. Figure 2G-J show the estimated network structures. For the sparse networks all (S2) or nearly all (S1) true links are detected and all are correctly signed. However, there are significant numbers of false positives–albeit less than in the IRMA fit. Conversely, the estimates for the dense networks include no (D1) or few (D2) false positives, but miss out identifying some true links. One particularly interesting case is gene G1 in network D1. This gene has five regulators, all of which act positively, and only two of which are identified by the FDA fit. The other regulators are difficult to detect because the gene is nearly always being activated, and so intuitively it appears almost as if it is unregulated–it is rare to observe the gene in a state that reveals anything about its regulation. The basic trends in true and false positives are reflected in the summary statistics shown in Figure 2K. PPV is moderate for the sparse network and high for the dense networks, while sensitivity is the opposite. One difference in comparision with the gap gene and IRMA results is that FDA obtains higher fractions of correctly signed links (activation, repression or no effect) on the sparse networks than on the dense networks. In all cases performance is significantly better than chance, which would only be right 1/3 of the time.
For the two sparse networks, where false positives were a concern, we evaluated the L 1-regularization approach to improving accuracy. The results are shown in Figure 2L. For both networks, regularization was able to eliminate the majority of the false positives, with little loss in true positives. For S1, the number of correct links (Corr) reached as high as 33 out of 36 links for regularization constant c around 4 or 5. For S2, the number of correct links was as high as 24 out of 30.
As mentioned above, the FDA approach to model fitting is computationally efficient. Part of its speed is due simply to the greater ease of evaluating the derivative-based error (Eq. 3) as opposed to the more traditional trajectory-based error (Eq. 2). We tested this in Matlab, comparing our implementation of the derivative-based error against a trajectory-based error function that uses the built-in ode45 function to solve the dynamics equation. Over a range of testing conditions, we found that the derivative-based error could be computed 300 ±40 times faster than trajectory-based error.
One of the advantages of the speed with which the FDA fits can be done is that we do not need to limit ourselves to unconstrained network architectures. We can explicitly test alternative architectures and, in fact, we are able to enumerate them all if the number of genes in the network is not too large. For the gap gene network, where all seven of the measured genes can act as input to any of the gap genes, there are 27 = 128 possible input combinations for any gene. Because each gene's model is fit independently, we can test all possible regulatory architectures with a total of 4 × 27 = 512 fits. This begins to be a significant computation, but on a 32-core computing cluster, it amounted to an overnight job. By enumerating all possible inputs for every gene, we are able to explicitly assess which regulators or combinations of regulators are most important for explaining each gene's observed expression. Enumeration also gives us another way to regularize the fit, by limiting the number of inputs per gene.
Figure 3D-I give summary statistics for the accuracy of the regulatory networks when limited to the best single-input, two-input, etc. models of each gene. By necessity, the fraction of correct links (CF) is low for the dense networks when only one or a few inputs are allowed, but increases as more inputs are allowed. The reverse happened for the IRMA network, where limiting to one or two inputs gave much better CF and PPV scores, though still not quite as high as achieved by TSNI. Interestingly, for the sparse networks S1 and S2, CF did not vary significantly with the number of regulators allowed. When extra regulators were allowed, the optimization could often drive superfluous weights to near zero, so that they fell below our significance threshold and were counted as zero links.
Our computational studies show that functional data analysis is a powerful approach to estimating nonlinear models of gene expression dynamics, and in particular, to estimating the regulatory relationships between genes. The accuracy of FDA was comparable to state of the art approaches on both the gap gene [7, 8] and IRMA  data sets, which have previously been analyzed by a number of methods, and it performed well on synthetic data sets generated by the GeneNetWeaver program . The FDA approach is computationally efficient because it transforms the estimation problem into a decomposable multiple regression problem. This efficiency enables in-depth analysis of the influences of different factors, as well as explicit exploration of all possible regulatory input combinations.
As with any estimation problem, overfitting-avoidance is an important consideration. We explored L 1-regularization as well as explicitly limiting the number of regulators allowed for each gene.
L 1-regularization was partly successful on the sparse IRMA network, and much more successful on the sparse GeneNetWeaver networks. L 1-regularization requires a constant, c, which determines the relative importance of accuracy of fit to the data versus model complexity (in the sense of summed absolute values of the regulatory weights). Testing a range of values for c allows us to identify links that are most important for accounting for the data. Large c means regulatory weights are highly penalized. Weights that remain significantly different from zero at large c are the best predictors, and thus represent the links in which we have the most confidence. Although there is no standard procedure for choosing a "best" value of c for this sort of data and fitting problem, empirically a value around c = 4 or c = 5 resulted in the highest accuracy of network reconstruction. That a common value worked for all networks is, no doubt, partly due to the similar scales of the expression data (after normalization) and the similar numbers of candidate regulators per gene. Still, it is surprising that common values of c emerged despite quite different numbers of time series for each network and different densities of regulatory links in the networks.
Explicitly evaluating all possible combinations of regulators allows one to see which combinations are the best predictors. In particular, this allows one to identify the best 1-input model of each gene, the best 2-input model, and so on. So, it provides another means for determining which candidate regulators are most important. At the same time, it reveals whether there are alternative solutions of nearly equal quality, and generally gives a more in depth view of the contributions of different regulators, especially when used in conjunction with visualizations methods, as shown in the Results section.
The approach that we have described for using FDA to estimate nonlinear differential equation models of gene expression dynamics can be extended in various ways. One important extension would be to accommodate genetic perturbation data, such as knock-outs, knock-downs or overexpression conditions. In the case of a complete knock-out, this is readily handled by hard-wiring expression of the knocked-out gene to zero in the model and otherwise fitting the data as usual. However, for partial knockdowns or overexpressions of unknown or time-varying magnitudes, more sophisticated procedures are needed. Another relevant extension would be to allow for delays in the differential equations. Cantone et al., for example, suggest that delays may be relevant to modeling their system . For known delays, the FDA framework extends trivially to accommodate delay differential equations, but when delays are unknown, more elaborate extensions are needed . Finally, another natural generalization to explore would be to Bayesian parameter estimation frameworks. Because FDA reduces the parameter estimation problem to one of nonlinear regression, standard methods for approximate computation of posteriors over parameters in nonlinear regression could be applied . Alternatively, if one is interested in Markov-chain Monte Carlo  or reversible-jump Markov-chain Monte Carlo  approaches to Bayesian parameter and/or network structure estimation for genetic networks, then the efficiency of evaluating the data likelihood under an FDA model, and the decomposition of the problem into separate genes, should be of great advantage.
To obtain the temporal derivatives of the time series data, it is necessary to obtain a functional representation of the data. We constructed continuous-time series by interpolating the data with cubic splines, as implemented in the Matlab Spline Toolbox. This toolbox also includes a function to compute the derivatives from the spline. Cubic splines are not wholly defined by the data, but also depend on assumptions at or near the boundaries–in our case, the start of the time series and the end of the time series. The default approach taken by the Matlab's spline function is to use the "not-a-knot" assumption, which states that the third derivative of the spline function should be continuous at the second knot point and the next-to-last knot point . Matlab offers other approaches for completing cubic splines. In pilot studies, we tried the default (not-a-knot) approach, natural cubic splines (which have second derivatives equal to zero at the endpoints; Matlab calls this the "variational" approach), and Matlab's "complete" approach (which sets first derivatives at the endpoints based on an estimate from the function values at the nearest four knots). We found that these different methods for completing the cubic splines had only small effects on the interpolated curves and negligible effects on parameter estimates for our models. So, throughout this paper we used the default not-a-knot approach.
For the IRMA and GeneNetWeaver data sets, we also experimented with smoothing the data first, using the smooth function of Matlab, but this did not affect results significantly. For the Drosophila data, we smoothed/simplified the data by eliminating certain time and space points before interpolating with cubic splines. The space and time points that we eliminated were selected by an evolutionary strategy that sought to minimize a criterion that combined the squared error between the original data and the cubic-splined interpolation at the same point and a penalty for small fluctuations in the derivative.
Minimization of the E 0 or E 1 criteria was done by the Matlab function fmincon. For each optimization, we did 1000 runs from different randomized starting conditions, initializing parameters uniformly within their allowed intervals. For Drosophila the expression data ranged between 0 and 255, regulatory weights were constrained to [–0.1, 0.1], production rates to [0, 25], and decay rates to [0,10]. The bias term was fixed at –3.5, following previous work . For the IRMA and GeneNetWeaver networks, the expression data was multiplied by 100, so that it fell in the range [0, 100]. Weights were constrained to [–0.2,0.2], production terms to [0, 25], decay terms to [0,10] and bias terms to [–25, 25]. Of the 1000 runs, the one resulting in the lowest error was reported as the solution. (Typically many runs found solutions with nearly the same weights and nearly the same error. We never observed two clearly distinct solutions of equal or near equal quality.) Weights less than 0.006 in magnitude were considered zero, and were otherwise considered positive or negative, depending on their sign. For the enumerations, we used the same fitting procedure except that links that were not part of the optimization were contrained to be zero.
For the GeneNetWeaver  experiments, we used the program 30 times to generate networks of seven genes. From these, we chose the two sparse networks and the two dense networks to generate data. The data were produced according to the rules of the DREAM5 contest, except without noise. Numerous papers have addressed the removal of noise from this data as part of inference (e.g., [32, 38]). The program generates time series by applying a perturbation from the steady state for a period of time T, and then removing the perturbation and letting the network relax back towards the steady state for an equal period of time. We used the second half of the time series, as it describes the wild-type (unperturbed) behavior of the network.
Ordinary differential equation
Functional data analysis
Correct sign fraction
Positive predictive value
A synthetic gene network created in yeast, and reported by Cantone et al. 
A fitting algorithm identified as the best-performing among several alternatives investigated by Cantone et al.  on the IRMA data
An annual contest on reverse engineering
We thank Gareth Palidwor for computing and Matlab support. This work was funded in part by grants from the National Sciences and Engineering Research Council of Canada, and the Ottawa Hospital Research Institute. GS was supported in part by a training grant from the Ontario Ministry of Research and Innovation, through its Ontario Research Fund - Research Excellence program.
This article has been published as part of BMC Genomics Volume 11 Supplement 4, 2010: Ninth International Conference on Bioinformatics (InCoB2010): Computational Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/11?issue=S4.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.