Analysis of functional importance of binding sites in the Drosophila gap gene network model

Background The statistical thermodynamics based approach provides a promising framework for construction of the genotype-phenotype map in many biological systems. Among important aspects of a good model connecting the DNA sequence information with that of a molecular phenotype (gene expression) is the selection of regulatory interactions and relevant transcription factor bindings sites. As the model may predict different levels of the functional importance of specific binding sites in different genomic and regulatory contexts, it is essential to formulate and study such models under different modeling assumptions. Results We elaborate a two-layer model for the Drosophila gap gene network and include in the model a combined set of transcription factor binding sites and concentration dependent regulatory interaction between gap genes hunchback and Kruppel. We show that the new variants of the model are more consistent in terms of gene expression predictions for various genetic constructs in comparison to previous work. We quantify the functional importance of binding sites by calculating their impact on gene expression in the model and calculate how these impacts correlate across all sites under different modeling assumptions. Conclusions The assumption about the dual interaction between hb and Kr leads to the most consistent modeling results, but, on the other hand, may obscure existence of indirect interactions between binding sites in regulatory regions of distinct genes. The analysis confirms the previously formulated regulation concept of many weak binding sites working in concert. The model predicts a more or less uniform distribution of functionally important binding sites over the sets of experimentally characterized regulatory modules and other open chromatin domains.


Background
The construction of a quantitative genotype-phenotype map is one of the most challenging problems in current biology. Mathematical models of gene regulatory networks explicitly connecting the DNA sequence level with that of gene expression or other molecular phenotypic traits represent a flexible framework for studying various aspects of genotype-phenotype mapping [1]. By using this approach, it is possible to assess the importance of various mechanisms involved in translation of genetic information into phenotype and their interplays in different contexts [2,3]. A promising approach for this type of modeling combines methods of the statistical thermodynamics and dynamical systems for model formulation and incorporates the latest experimental results on the sequence and gene expression analysis for its verification [4].
Along this research direction, we previously applied the thermodynamics based model to the Drosophila gap gene network, which is a sub-network of the segmentation gene network. The segmentation genes are expressed in early Drosophila embryo and determine positions of the body segments [5]. The gap gene subnetwork consists of the following genes: bicoid (bcd) and caudal (cad) are maternal coordinate genes, tailless (tll) and huckebein (hkb) are terminal gap genes, and hunchback (hb), Kruppel (Kr), giant (gt), and knirps (kni) are trunk gap genes. All these genes encode transcription factors. The model takes as input the potential regulatory regions of trunk gap genes, predicts transcription factor binding sites (TFBSs) and estimates their binding affinities for all transcription factors (TFs) regulating these genes by using positional weight matrices (PWMs), and then calculates as output the dynamics of gene products (both mRNAs and proteins) in the embryo nuclei at the blastoderm stage. The values of free parameters in the model were estimated by fitting the model solution to the in situ data on gap gene expression (described in Methods), and the quality of fit was quantitatively satisfactory. We aim to use this model in this study to analyze how functional important TFBSs are distributed in the regulatory regions and how their impacts on gene expression patterns correlate with each other.
A special question of interest is how this analysis depends on specific assumptions about the regulatory interactions between genes. In order to do that, we extend our model in several ways. The first extension concerns new binding sites added to the model. In the first variant of the model, we implemented only those high-affinity TFBSs that fall into the DNase I accessibility regions in the regulatory sequence, as these regions presumably correspond to the open chromatin domains. However, filtering by DNase accessibility might discard some of functional TFBSs, so we added to our model binding sites previously reported as functional (according to the study [6]) but whose coordinates do not intersect with the DNase I accessibility regions.
Another extension concerns new regulatory interactions between hb and Kr genes. Based on the analysis of fixed mutant expression patterns and in vitro data, it has been suggested that Hunchback (Hb) protein acts as a dual regulator of Kr : it activates Kr at low Hb concentrations and inhibits at high concentrations [7][8][9][10][11]. Based on the previous findings that Kr monomer can act as activator, while the homodimer can act as inhibitor [12], a recent study has suggested that a similar dual role (activation at low concentration and repression at high concentration) can be played by Kruppel (Kr) in its regulatory action on hb, and a model of this regulatory action is elaborated in that study [13].
The main purpose of this study is to understand the functional organization of the gap gene regulatory regions by developing and applying an advanced model that takes into account the above extensions of our previous model for the gap gene network. We implement both the extended set of TFBSs and the dual regulatory interactions between hb and Kr in a series of the model variants with increasing complexity. The new variants of the model correctly describe wild-type expression of the four gap genes in the mid-embryo. They also correctly predict gene expression for a larger number of experimentally characterized cis-regulatory modules (CRMs) in comparison to the previous model [5]. In the framework of the new model variants, we show that the TFBSs with the strongest impact on gene expressions appear in all domains of the regulatory regions considered in the model. The analysis demonstrates how TFBSs group into clusters of functionally important modules responsible for different spatial parts of the expression patterns, thus forming the in silico images of CRMs. We show how TFBSs work in concert by elucidating the correlation between impacts on gene expression from TFBSs of different TFs and from different regulatory regions. Taking all together, the study sheds more light on the binding site organizational level of the gap gene regulatory network.

Selection of binding sites
We predict binding sites using positional weight matrices (PWMs, see Additional file 1) in the region spanning 12 Kbp upstream and 6Kbp downstream of the transcription start site of each of four gap genes hb, Kr, kni, and gt and for eight transcription factors Bcd, Cad, Hb, Gt, Kr, Kni, Tll, and Hkb [5]. We apply two constraining factors for binding site selection in order to reduce false-positive predictions. According to the first constraint introduced in the previous work [5], we use only sites belonging to DNase I accessibility regions and not overlapping with the coding sequence. However, this set of sites has only partial overlap with the experimentally verified cis-regulatory modules (CRMs) [6], which drive expression in the trunk region of the embryo at the blastoderm stage, and thus misses a notable fraction of binding sites predictions. In this study, we additionally considered footprint sites and CRMs from RedFly database [14], which is a curated collection of known Drosophila CRMs and TFBSs. The full list of CRMs considered is given in Table 1 and the number of sites for each target gene is shown in Table 2. This new list also includes predicted binding sites which overlap with the TFBSs earlier verified by DNase I footprinting, thus allowing to assess their contribution to the model. The number of sites under consideration is 1419 which is 1.6 times more than 889 sites in the previous work.

Combined model of gap gene network
In order to explore the functional importance of the selected binding sites in different contexts, we elaborate our previous model of gap gene expression hereinafter referred to as Model 1 [5]. It describes the expression of four gap genes in a one-dimensional region on the central midline of the embryo (58 nuclei from 35% to 92% of embryo length, EL), and the time period covering cleavage cycles 13 and 14A that are about 20 min and 50 min long respectively. Cleavage cycle 14A is divided into 8 temporal classes (T1-T8) of 6.5 minutes each. Here, we briefly describe our approach and new incorporated enhancements.
The model derives the gap gene mRNA and protein concentrations from the regulatory sequence. At the first step, we calculate the gene activation probability based on the statistical thermodynamics approach [2]. This approach links the gene activation probability to all possible molecular configurations of the regulatory sequence, representing different combinations of free and bound binding sites. The model incorporates homotypic cooperativity of binding proteins [15] and short range repression [16,17]. At the second step, we use reaction-diffusion equations for the spatio-temporal dynamics of mRNA and protein concentrations of the gap genes hb, Kr, gt, and kni, with the time-delay parameter accounting for the temporal separation of transcription and translation processes. We enhanced the model in three steps with increasing complexity. First, we added the synergy to the model by allowing several DNA bound activators to interact with the basal transcription machinery (BTM) simultaneously ("multiplicative effect" model). The synergistic effects of activators are represented in the previous variant of the model only via the homotypic cooperativity mechanism, while only one bound activator is allowed to directly interact with the BTM in that model ("additive effect" model). It can be shown that the total synergistic effect from all activators will disappear at high activator concentrations under the cooperative binding model (activator binding has already been saturated under this condition, thus  cooperative interactions will not be further helpful), but not under the multiplicative model. The maximal number of activators interacting with the BTM is defined as a parameter N max . Another improvement in the new variant of the model concerns the way how the solution is calculated. The gene activation probability in Model 1 was calculated using the experimental data on protein concentration for TFs Bcd, Cad, Tll, Hkb and also Hb, Kr, Gt, Kni that are included in the model as target genes. The numerical solution of model equations for proteins is used in the new variant of the model as concentration profiles for TFs Hb, Kr, Gt, Kni, as generally expected when simulating differential equations. We use Model 2 as the name for the model variant incorporating the synergy and the new simulation method.
Secondly, we added to Model 2 the dual action of TF Hb on gene Kr by replacing the single parameter for the strength of Hb's action on Kr by three parameters: a threshold concentration, an activation strength when Hb concentration is smaller than the threshold concentration, and a repression strength when Hb concentration is larger than this threshold. This variant of the model is called Model 3. Finally, we modified Model 3 by introducing a similar dual action of Kr on hb and call this variant of the model Model 4.
Another novelty for Models 2-4 in comparison with Model 1 concerns the data used for model fitting. The output of Model 1 has previously been fitted to the data on gap protein concentrations from the FlyEx database [18,5], while for the new Models 2-4 we fit the output both to the protein data and to the RNA data from the SuperFly database [19]. The optimization of parameter values is carried out by the differential evolution (DEEP) method [20] to minimize a combined objective function. This function is a sum of the residual sum of squared differences between the model output and the data, the 'weighted Pattern generating potential' [21], and a penalty term, which limits a growth of regulatory weights (see Methods).
We performed about a hundred optimization runs in total for three model variants. It should be noted that several runs are needed due to the stochastic nature of the optimization method in order to find the parameters that provide the minimum value for the combined objective function. We obtained about ten solutions for each model variant with the objective function for proteins close to the value 350000 reported for Model 1. The values of combined objective function for these solutions were (75 ± 2) × 10 4 , (75 ± 2) × 10 4 and (70 ± 7) × 10 4 for Model 2, Model 3 and Model 4 respectively. We select the best model output from optimization runs which both has a small value of the objective function and contains a minimal number of visual defects in the expression patterns. The optimization results for Models 2-4 yield solutions of comparable quality as for Model 1 (for protein concentrations), and the corresponding gap gene expression patterns qualitatively match the data profiles in the wild type embryo ( Figure 1). Model 4 provides the best fit among the three new variants of the model. Despite the overall satisfactory correspondence between the solutions and the data, all model variants bear some common inconsistencies in positioning of some domain borders and an expanded Kni domain. On the other hand, the new solutions demonstrate a good balance between the fit qualities for mRNAs and proteins, in contrast to Model 1 whose fit quality has only been controlled for proteins. The typical defects in model solution are illustrated in Figure S42.
The regulatory parameters are presented in Tables 3, 4, 5. Most features of the gap gene network topology are in agreement with previous modeling results and data from the literature [22]. Bcd and Cad activate zygotic gap gene expression in a majority of circuits. The only exception is the repression of gt by Bcd in Model 2. The reciprocal interactions between the trunk gap genes Kr, hb, kni, and gt are repressive. Tll represses Kr, gt, and kni and acts as activator of hb. Hkb represses hb, Kr, gt, and kni. Gt autoactivates itself in all models. Kr and hb autoactivate themselves in Model 2, but repress themselves in Models 3 and 4. Kni autoactivates itself in Models 2 and 3 and represses itself in Model 4.

Prediction of gap expression in reporter constructs
As the regulatory sequence adopted in the model contains both TFBSs from known CRMs and other potentially strong TFBSs from the accessible chromatin, it is an important test to verify how the CRM's binding sites operate in the background of the rest of the TFBSs. We do it by in silico predicting the CRM driven expression in the model and comparing it to the experimental data. We model gap gene expression in reporter constructs by taking as input only those TFBS that overlap with a CRM contained in a reporter [14]. Tables 6, 7, 8 summarize results of sumulations.
The images of patterns of expression driven by CRMs cloned in reporter constructs and obtained by RNA in situ hybridization with lacZ probes are available for the majority of constructs under consideration [14]. From these data we determine what pattern elements of the endogenous gene are driven by the CRM and compare the corresponding parts of data profiles with simulations. We report that prediction is correct if domains of the simulated pattern overlap with those observed experimentally (see Additional files 2, 3, 4, 5).
Model 4 has the best prediction power as it successfully predicts gap gene expression patterns in 49 constructs out of 59 in comparison to 36 and 46 correctly predicted by Model 2 and Model 3, respectively.
Given the fact that the models correctly predict the basic features of gap gene expression in the majority of the constructs we conclude that all the model correctly account for the functional role of most of the included CRM's binding sites. These results also demonstrate the predictive power of the models. Interestingly, expression patterns driven by some CRMs (e.g. hb_HZ340, hb_HZ526, hb_distal_nonminimal, hb_matDm0.6-lacZ, hb_matDm0.5-lacZ, Kr_HI, Kr_HH, Kr_HE) are not predicted by any model. A possible reason may be that some regulatory mechanisms are still missing in our model, or it is the TFBS prediction quality that limits the model power.
If the expression of the construct is predicted by a model, it is also predicted by subsequent models (models with higher complexity) with 3 exceptions: kni_+1, kni_223+64, Kr_BdelNc0.8. The most deviations in the predictions are for CRMs from the gt and kni regulatory regions (Table 7): Models 2 and 3 are essentially less accurate than Model 4 in gt 's CRM simulation, and Models 3 and 4 are more accurate than Model 2 in kni 's CRM simulation.

Impact of individual TFBS on model solution
We study the functional importance of TFBSs under different modeling assumptions by calculating the regulatory weight (RW) w f r of a TFBS:  Rows 2-5 present regulatory parameters T ab for target genes while columns correspond to TFs. The action of Hb on Kr is characterized by two coefficients for intensities less and greater than the threshold value (given in brackets). K, ω and τ are affinity, cooperativity and delay constants respectively. N max and q BTM are the number of activators interacting with the BTM and the BTM affinity constant respectively. Range is the range for repressor action in bps. Rows 2-5 present regulatory parameters T ab for target genes while columns correspond to TFs. The action of Hb on Kr is characterized by two coefficients for intensities less and greater than the threshold value (given in brackets). K, ω and τ are affinity, cooperativity and delay constants respectively. N max and q BTM are the number of activators interacting with the BTM and the BTM affinity constant respectively. The range for the repression is shown in the same row. Range is the range for repressor action in bps.
where f ∈ {RSS,wPGP} is one of the two measures for the proximity of the solution to the data (see Methods), f ref is the value of f for the model solution for the full set of annotated sites, and f mut is the same value calculated with the site of interest excluded. Therefore, w f r quantifies the input of the TFBS in the solution. To explore how the regulatory weight depends on the context (either the solution quality assessment or the modeling assumptions), we analyze its values for all model variants, f measures and annotated sites.
We find that the RWs estimated with RSS and wPGP strongly correlate in the case of Models 2 and 3 with Pearson correlation coefficients r = 0.85 (P = 2.2 × 10 −16 ) and r = 0.57 (P = 2.2 × 10 −16 ), respectively ( Table 9). The RW does not correlate with the PWM score, a measure of the binding affinity of the site ( Table 9). The absence of correlation between the RW measures and PWM scores is not surprising. The later reflects the strength of TF binding per se to DNA and proceeds from the premise that neighboring positions in DNA are independent, while the former measure considers the impact of TF removal on phenotype that is mediated by gene interactions in the network. It is worthy of note, that recently the comparative analysis of 12 Drosophila genomes in a phylogenetic framework failed to find evidence that selective constraint across cis-regulatory sequences is correlated with predicted TF binding affinity [23].
For Model 4, the RWs estimated with RSS and wPGP have a weaker correlation, r = 0.21 (P = 1.11 × 10 −15 ), but also a bit more essentially correlate with the PWM score, r = 0.16 (P = 4.03 × 10 −9 ) and r = 0.15 (P = 2.044 × 10 −8 ), respectively ( Table 9). The loss of correlation between the RWs for the wPGP and RSS measures indicates that the TFBSs in Model 4 encode for those properties of the expression domains which cannot be captured by the standard RSS measure. As the wPGP measure is shown to be more than the RSS one oriented on quantifying the constitutive spatial characteristics of the expression patterns, this result suggests that this variant of the model represents the functional role of TFBSs more accurately. This also appears to be accompanied with the stronger connection between the site impact on expression and its binding affinity, although this connection remains generally weak, in accordance with our previous findings [5].
Interestingly, most of the inter-model correlations are weak or negligible except the one between w rss for Models 3 and 4, r = 0.28 (P = 2.2 × 10 −16 ) (see Table 10). The correlation between w wpgp for those models is almost twice lower, r = 0.16 (P = 2.304 × 10 −9 ). These  results demonstrate that Models 3 and 4 with dual regulatory actions are closer to each other than to Model 2.
In Figure 2, we plot the RWs of TFBSs estimated with the RSS measure in Model 4 relative to their positions in a regulatory region. Only a small number of sites demonstrate a high impact on the model solution, with the majority of sites having a relatively low individual influence. This finding supports our previous results [5].
Using the histograms of the TFBS RWs (Figure 3), we select the threshold value for w rss equal to 0.02 and for w wpgp equal to 0.05 and further analyze the sites with w r exceeding these thresholds. The complete site lists are presented in Tables S9-S19 in Additional file 1. There is a difference between the model variants in the way the strongest sites are distributed between distinct groups of TFBSs, namely belonging to CRMs, DNase I accessibility region, or both (Table 11). In Model 2, TFBSs from CRMs play the most important role. All regions contribute approximately similar in Model 3. In case of Model 4, the majority of important sites belong to both CRMs and the DNase I accessibility region. Figures 2 and 3 demonstrate that most of the TFBSs are relatively weak and only a few sites are strong in the model with respect to their influence on gene expression. On the other hand, it is not possible to neglect even a small portion of the weak sites without visible reduction in the model output quality (see Figure 7 in [5]). These facts together define the concept of many weak binding sites working in concert, as opposed to the concept of a small number of strong sites controlling everything. It has been shown previously with the help of evolutionary simulations in a similar modeling framework how such enhancer organization may eventually appear during the evolution [24].

Differential expression of individual TFBS
We further study how the impact of each TFBS on gene expression is distributed in space and time and how these spatio-temporal impact distributions correlate for distinct binding sites. We calculate the impact distribution for a given TFBS by setting the diffusion rate parameter to zero in the model equations and computing the difference in all nuclei and at all times between the solution for the case with all sites included and the solution for the case with the binding site of interest excluded. We do it for all sites and for all new model variants. Setting the diffusion rate equal to zero does not lead to significant perturbations of the expression patterns in the model (Figures S12-S17) and, hence, can be used for the analysis. Figures 4, S27, and S40 show the correlation matrices for the TFBS impact distributions calculated for Models 3, 4, and 2, respectively. The color in the figure represents the level of correlation between the impact distributions for each pair of TFBSs. If the correlation is positive, the affects of these two TFBSs on gene expression are similar in different nuclei and at different times (either both sites increase expression level or both decrease expression level), i. e. these affects have the same sign. If the correlation is negative, the impacts from the two sites are of different signs across time and space (if one site increases expression level, the other one decreases). The absence of correlation means that for some nuclei and time points the impact from the two sites can be of the same sign, while for others of the opposite signs. The sites are ordered in alphabetical order by target gene (gt, hb, kni, and Kr) and, secondly, by TF (Bcd, Cad, Gt, Hb, Hkb, Kni, Kr, and Tll) in each group.
The clusters of highly correlated sites appear in the figure as big square islands of yellow or red color. For Model 3, the majority of these clusters are located on Columns correspond to target genes, rows to models.  the main diagonal and for the regulatory regions of hb, kni, and Kr ( Figure 4). This is quite expected result as these TFBS belong to one regulatory region and have either the same or opposite impact on gene expression. In contrast, the Bcd sites in the gt regulatory region have high positive correlation with the Hb sites in the hb regulatory region (a part of the Figure 4 marked with asterisk). This correlation demonstrates an indirect interaction between the Bcd and Hb binding sites: Bcd activates gt, and Gt represses hb, so the net effect of the Bcd sites on hb is repression, and the same effect is true for the Hb sites on hb in Model 3. As the clusters of highly correlated sites are mainly located on main diagonal, we conclude that this model variant has low amount of indirect TFBS interactions. The correlation matrix for Model 2 leads to many yellow squares outside of the diagonal, corresponding to highly correlating sites in the regulatory regions of different genes due to indirect interactions ( Figure S33). The case of Model 4 is somewhere in between with respect to the number of the off-diagonal yellow and red islands ( Figure  S24). The color intensity of those islands (the correlation strength) is also lower than for Model 2. The moderate impact of indirect interactions is intuitive as they are known to be present but are side-effects of the primary regulatory mechanisms. As the number of sites with strong impact is less than 50 for each model, the clusters of sites with high positive or negative correlation can be seen as the illustration of relatively weak impact sites playing in concert. It should also be noted that these indirect effects cannot be captured with Model 1 from [5]. This is because, by definition, the indirect connection between a TFBS for a TF v a with a TFBS for a TF v b takes place only via the influence on another solution component v c . As mentioned above, the solution in Model 1 is calculated by solving equations with the synthesis term that depends only on the data TF profiles at any time, not on the solution, and the data profiles evidently are not influenced by the TFBSs.  (Figure 6; see also Figures S19, S21, and S23 for other genes and Figures S26, S28, S30, S32, S35, S37, S39, and S41 for other model variants).
As opposed to Figure 5, Figure 6 shows the spatial distribution of impacts of different TFs on hb expression.  Columns correspond to regions, rows to models.
Despite the zero diffusion coefficient that prevents the direct transfer of the gene products between the nuclei, the impact of a site may cover several adjacent nuclei with variable magnitude due to regulatory interactions as shown in the inset of Figure 6 by different color intensities for individual nuclei represented by small rectangles. It can be seen from Figures 5 and 6 that a single TFBS can act both as an activator and a repressor of hb depending on the spatial position. For example, some Bcd sites activate hb in the posterior domain and repress in the anterior one ( Figure 6). This change of the regulation sign may happen only due to interaction with an intermediate TF. As Bcd activates hb, the direct impact of its binding sites on hb expression can be only activating. However, Bcd also activates all other inhibitors of hb (see Table 5), so that its binding sites can indirectly repress hb expression in specific positions.

Conclusions
We presented new modeling results in the framework of the two-layer model for gap gene expression. The previously formulated model [5] had serious drawback, namely the gene activation probability was calculated using the experimental data on TF protein concentration for target genes hb, Kr, gt and kni. Here, we introduced several new model variants in which the numerical solution of model equations for proteins is used as concentration profiles for these TFs, as generally expected when simulating differential equations.
Other improvements include the addition of synergy by allowing several DNA bound activators to interact with the BTM simultaneously, dual action of TF Hb on gene Kr or dual regulatory interactions between hb and Kr. All the model variants consider the information about all experimentally characterized CRMs for this developmental system. For all assumptions, the model qualitatively describes the characteristic features of gene Figure 4 Correlation matrix between spatio-temporal distributions of TFBS impacts (model 3). The diffusion rate parameter was set to zero during calculation. The color in the figure reflects the correlation strength between impact distributions for each pair of TFBSs. The sites are ordered alphabetically -first by target gene (gt, hb, kni, and Kr) and then by TF (Bcd, Cad, Gt, Hb, Hkb, Kni, Kr, and Tll) in each group. The clusters of highly correlated sites appear as rectangles of yellow or red color. Arrow with asterisk marks the matrix region which shows high positive correlation between Bcd sites in the gt regulatory region and Hb sites in the hb regulatory region. expression patterns. However, from the obtained regulatory parameters we may conclude that the assumption about the dual interaction between hb and Kr leads to the most consistent results.
The defects present in the expression patterns predicted in all variants of the model can have multiple reasons. The thermodynamic modeling approach has its own limitation. For example, it does not take into account the dynamical effects of the enhanceosome assembly [25,26], which might influence the relative probabilities of different molecular configurations of the regulatory regions. The implemented methods for TFBS search may also contain errors, leading to possible mispredictions for binding sites. The imperfect expression pattern for hb can be explained by the complex twopromoter structure of this gene, with different enhancers influencing different promoters [27], which currently is not taken into account in our model. The difference in parameter values between the model variants represents the influence of different modeling assumptions in these variants and cannot be explained by the overfitting or parameter nonidentifiability issues, as we have shown previously that the model is relatively stable in this respect [5]. This difference underlines the disagreement between the models in the predicted correlation patterns for the impacts of TFBSs on gene expression and the functional organization of the regulatory regions. As the model variant with the dual interactions between hb and Kr (Model 4) is more consistent in terms of the topology of the regulatory network and in terms of the predicted CRM's expression, we believe the results obtained for this model variant deserves more attention. However, additional experimental validation is necessary to make a definite decision.
Our results are in agreement with the previously formulated regulation concept of many weak binding sites working in concert [24,5], following the idea of homotypic binding sites clusters [28]. It is hard to define a threshold for the functional importance of binding sites under this concept, and it makes the problem of selecting the complete set of important binding sites more vague. We applied a combined approach for this selection by joining two sets of potential binding sites, high-affinity TFBSs from the open chromatin domain and experimentally characterized TFBSs irrespective of their position in the sequence. Some TFBSs with high functional impact from the hb, Kr, and kni regulatory regions coincide with the strong sites annotated and verified in the DNase I footprint assays. As the majority of functionally important sites in the best model variant belongs to the intersection between the DNase I accessibility region and experimentally characterized CRMs, we conclude that there should be a balance in binding site selection for modeling gene regulation. The existing information about CRMs is essential, but is not enough to fully determine the expression patterns. On the other hand, our results support previous findings about importance of the DNase I accessibility regions usage for modeling [5,29].
Our model allows to estimate the effect of individual TFBS on molecular phenotype, gap gene expression patterns. This effect is mediated by interactions between different bound TFs. An important advantage of new model variants is their ability to account for indirect interactions between individual binding sites. The analysis reveals specific examples of such binding sites in the regulatory regions of the gap genes and elucidates the regulatory mechanism for their interplay. This mechanism provides a potential basis for the evolutionary compensatory effects such as the binding site turnover [30][31][32]. However, not all variants of the model demonstrate the indirect interactions between the sites. Comparing different modeling assumptions in this context, we conclude that the presence of the dual regulatory action probably obscures existence of such regulatory compensations.
The presented two-layer modeling framework might be applicable to other gene regulatory networks previously described by ODEs alone (e.g [33] (C. albipunctata), [34] (N. vectensis)) given the data on TFBSs, PWMs and expression patterns are available.

Transcription factor and gene expression data
Concentrations of transcription factors Hb, Kr, Gt, Kni, Bcd, Cad, Tll, and Hkb were taken from FlyEx database (http://urchin.spbcas.ru, [18]), and the mRNA data for those factors were taken from SuperFly database [19]. The resulting data had the form of the gene product concentration profiles in 30-58 nuclei on the anterior-posterior axis of the embryo at nine time points (cleavage cycle 13 and eight time classes in cycle 14A). The gene reporter constructs and their expression images were obtained from REDFly database [14].

Sequence data
The potential regulatory regions spanning 12 Kbp upstream and 6Kbp downstream of the transcription start site for the gap genes hb, Kr, kni, and gt were analyzed, and TFBSs in these regions for transcription factors Bcd, Cad, Hb, Gt, Kr, Kni, Tll, and Hkb were predicted by the method of position weight matrices (PWMs) [35]. The PWMs were described in [36] and can be found at http://www.autosome.ru/iDMMPMM/ (see also the Additional file 1). The PWM thresholds were selected as in [37]. Among all predicted TFBSs, only those were added to the model which satisfy at least one of the following conditions: (1) sites having high PWM score and being located in the DNase I accessibility regions, (2) sites overlapping with the CRMs from the regulatory regions, according to RedFly database [14], and (3) sites overlapping with the TFBSs individually confirmed by DNAse I footprints [14].

Model equations
The model formulation is presented in our previous paper [5]. Here, we briefly describe the main equations and the modifications introduced in the new study.
The structure of the model equations has two levels. At the first level, the probability of transcriptional activation for each target gene is calculated for each embryo nucleus and at each time moment. At the second level, the dynamics of the mRNA and protein concentrations is prescribed by differential equations which incorporate the activation probability as the synthesis term.
The probability of transcriptional activation is calculated following the thermodynamic approach [2,5]: where s enumerates all possible molecular configurations (sets of free and bound TFBSs) of the regulatory region for gene a, W a i (σ , t) is the statistical weight (relative probability) of configuration s for nucleus-time coordinate (i, t), and Q a (s) includes parameters quantifying the strength of interaction between bound TFs and the basal transcriptional machinery. These weights depend on the concentrations v b i (t) of all TFs b regulating gene a in nucleus i at time t, on the binding affinities of all TFBSs, and other parameters, and corresponding formulas for this dependence are given in [5] and Additional file 1.
The binding affinities of TFBSs are calculated as a part of weights W in (1) via the PWM-scores and scaled by a free parameter K a (S max ), which is the binding affinity constant for the consensus binding site sequence S max for TF a. The possible cooperativity between binding sites is parameterized by the cooperativity constants ω a (one constant per each TF a) and the distance d between the binding sites on which the cooperative interaction is possible. The strength of the influence of bound TF b on the target gene a is presented in E a i as parameters T ab , and their negative values correspond to a repressive action, while the positive values to an activating action. The repression is modeled via the short-range mechanism, so that the corresponding parameters T ab appear in W as weights at molecular configurations with bound repressors. The parameter d R prescribes the distance in the sequence on which the short-range repression is active in a vicinity of bound repressor. The parameters T ab for activators are included in Q terms in (1).
The activation probability (1) is translated to differential equations describing the synthesis and transport of mRNAs and proteins. These equations include production, diffusion, and decay terms: where u a i (t) and v a i (t) are concentrations of mRNA and protein, respectively, for target gene a in nucleus i at time t. n is the cleavage cycle number, R a v and R a u are the maximum synthesis rates, D a v and D a u are the diffusion coefficients, and λ a v and λ a u are the decay rates for protein and mRNA of gene a. The delay parameter τ accounts for the average time between events of transcription initiation and corresponding protein synthesis.
The dual transcriptional action of TF a on gene b is implemented as follows. Instead of a single interaction parameter T ab , we introduce three: a threshold concentration V, an activation strength T ab + ≥ 0 for the case when v a ≤ V , and a repression strength T ab − ≤ 0 for the case when v a >V. This type of transcriptional actions is incorporated in the model for the dual interaction between gap genes hb and Kr.

Model fitting
We fitted the model to the mRNA and protein concentration data for gap genes hb, Kr, gt, and kni. The values for all free parameters in the model were optimized by the differential evolution entirely parallel (DEEP) method. DEEP is a stochastic global optimization technique described in [20] capable of utilizing several objective functions that combine differential evolution, control of population diversity and the concept of individual age for population member substitution.
The following combined objective function was minimized during the optimization procedure: Error = 0.01 * RSS + 5 * 10 4 * wPGP + 0.005 * Penalty, where the weights were obtained empirically, and the components were defined as follows. The residual sum of squared differences between the model output and data (RSS) is calculated and summed up for both mRNA and protein concentrations u a i (t) and v a i (t): where a, i and t are gene, nucleus, and time point, respectively.
The weighted Pattern Generation Potential wPGP was introduced in [21] as a heuristic measure accounting for characteristic features of gap gene expression patterns that it is always less than or equal to 1. We minimize the sum of wPGP values calculated separately for mRNA u a i (t) and protein concentrations v a i (t) for each gene a and time point t: wPGP = ∀a,t:∃data a (t) wPGP a (t), wPGP a (t) = 1 2 + penalty a (t) − reward a (t) 2 where (omitting common variable t for time and index a for gene) and p i denotes model output u a i (t) for mRNA or v a i (t) for protein while the corresponding data are denoted as r i with its maximum level r max . Correctly predicted amount of expression that can be defined for each nuclei as min(p i , r i ) is weighted by the predicted expression level r i and 'rewarded' that is subtracted from the Error. The incorrect predictions defined as |p i − r i | are weighted by the extent of incorrect expression (r max − r i ) and added to the Error, i.e. 'penalized'.
The third component in the combined objective function penalizes the squared values of the regulatory parameters T ab :