Molecular and structural considerations of TF-DNA binding for the generation of biologically meaningful and accurate phylogenetic footprinting analysis: the LysR-type transcriptional regulator family as a study model
- Patricia Oliver†1,
- Martín Peralta-Gil†2,
- María-Luisa Tabche1 and
- Enrique Merino1
© The Author(s). 2016
Received: 18 April 2016
Accepted: 18 August 2016
Published: 27 August 2016
The goal of most programs developed to find transcription factor binding sites (TFBSs) is the identification of discrete sequence motifs that are significantly over-represented in a given set of sequences where a transcription factor (TF) is expected to bind. These programs assume that the nucleotide conservation of a specific motif is indicative of a selective pressure required for the recognition of a TF for its corresponding TFBS. Despite their extensive use, the accuracies reached with these programs remain low. In many cases, true TFBSs are excluded from the identification process, especially when they correspond to low-affinity but important binding sites of regulatory systems.
We developed a computational protocol based on molecular and structural criteria to perform biologically meaningful and accurate phylogenetic footprinting analyses. Our protocol considers fundamental aspects of the TF-DNA binding process, such as: i) the active homodimeric conformations of TFs that impose symmetric structures on the TFBSs, ii) the cooperative binding of TFs, iii) the effects of the presence or absence of co-inducers, iv) the proximity between two TFBSs or one TFBS and a promoter that leads to very long spurious motifs, v) the presence of AT-rich sequences not recognized by the TF but that are required for DNA flexibility, and vi) the dynamic order in which the different binding events take place to determine a regulatory response (i.e., activation or repression). In our protocol, the abovementioned criteria were used to analyze a profile of consensus motifs generated from canonical Phylogenetic Footprinting Analyses using a set of analysis windows of incremental sizes. To evaluate the performance of our protocol, we analyzed six members of the LysR-type TF family in Gammaproteobacteria.
The identification of TFBSs based exclusively on the significance of the over-representation of motifs in a set of sequences might lead to inaccurate results. The consideration of different molecular and structural properties of the regulatory systems benefits the identification of TFBSs and enables the development of elaborate, biologically meaningful and precise regulatory models that offer a more integrated view of the dynamics of the regulatory process of transcription.
Gene regulation is a key feature of all organisms in response to intracellular and environmental changes. Bacterial gene regulation primarily occurs at the beginning of transcription by transcription factors (TFs) that recognize specific regions near promoter sequences and results in the activation or repression of the transcription of the nearby genes. The number of TFs in prokaryote genomes typically scales as the square of the total number of their genes . For example, for the model organism Escherichia coli, with 4,405 genes, approximately 8 % of these genes have been estimated to code for predicted or known TFs , of which 35 % correspond to activators and 43 % to repressors, while 22 % have dual activities .
The in silico identification of transcription factor-binding sites (TFBSs) is a key issue for many molecular biology studies aimed at characterizing regulatory elements in genome sequences. These analyses have been performed by considering either different co-regulated genes in one genome  or a set of upstream regions of orthologous genes in closely related genomes, a procedure known as phylogenetic footprinting analysis [5–8]. In any case, it is assumed that the nucleotide conservation of a specific region in the set of sequences is indicative of a selective pressure required for the recognition of TFs for their corresponding TFBSs. Based on this principle, the goal of many programs that have been developed to find TFBSs has been the identification of discrete sequence motifs that are significantly over-represented in a given set of sequences where a TF is expected to bind. These motifs are considered to be part of the TFBSs and are commonly represented as position-specific scoring matrices (PSSMs). TFBSs and their corresponding PSSMs have been compiled in a number of different databases, such as RegulonDB , EcoCyc , RegPrecise , Prodoric  and Tractor_DB . To evaluate the significance of these TFBS predictions, different approaches have been developed based on theoretical models, such as log-odds, entropy-weighted values  or the combination of theoretical and empirical score distributions . Despite their extensive use, the accuracies reached with these programs remain low. In many cases, the true TFBSs are excluded from the identification process or are imprecisely identified, especially when they correspond to low-affinity but important binding sites of the regulatory systems. In other words, the significance of a motif given its over-representation in a set of sequences of co-regulated genes is not necessarily the best way to identify the set of TFBSs for a given regulon.
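As an illustration of how a PSSM summarizes a set of aligned binding sites, the following minimal sketch builds a log-odds matrix and scores candidate sequences against it. The aligned sites are invented toy data, and the uniform background, the pseudocount of 1 and the function names `build_pssm` and `score` are assumptions made for this example; they are not taken from any of the databases or programs cited above.

```python
import math

ALPHABET = "ACGT"


def build_pssm(sites, background=None, pseudocount=1.0):
    """Build a log-odds PSSM (one dict per column) from equal-length aligned sites."""
    background = background or {base: 0.25 for base in ALPHABET}
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    pssm = []
    for i in range(length):
        column = [site[i].upper() for site in sites]
        total = len(column) + pseudocount * len(ALPHABET)
        scores = {}
        for base in ALPHABET:
            # Pseudocounts avoid log(0) for bases never seen in a column.
            freq = (column.count(base) + pseudocount) / total
            scores[base] = math.log2(freq / background[base])
        pssm.append(scores)
    return pssm


def score(pssm, candidate):
    """Sum the per-column log-odds scores of a candidate site."""
    if len(candidate) != len(pssm):
        raise ValueError("candidate length must match the PSSM length")
    return sum(column[base] for column, base in zip(pssm, candidate.upper()))


# Toy aligned sites (invented for illustration only).
toy_sites = ["ATGAA", "ATGAA", "ATGAT", "TTGAA"]
toy_pssm = build_pssm(toy_sites)
```

In such a scheme, a low-affinity site that deviates from the consensus at several positions receives a low score and can fall below any fixed cutoff, which is the limitation discussed above.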
To evaluate the performance of our protocol, we analyzed the regulatory system of six members of the LysR-type transcriptional regulator family in Gammaproteobacteria. This family represents one of the most important families of TFs in bacteria with poorly conserved TFBSs. The members of this family have three domains: the N-terminal domain, which contains a helix-turn-helix motif for DNA binding; a central domain involved in co-inducer recognition; and a C-terminal domain required for both DNA binding and co-inducer response . For most of the cases in our study, we identified TFBSs with different sequence-conservations and, thus, different affinity strengths. In our study, we found that all identified TFBSs were biologically meaningful and allowed us to propose precise dynamic regulatory models.
To assess the performance of our protocol, we performed in silico identifications of the binding sites of TFs of six regulatory systems that are members of the LysR-type family in Gammaproteobacteria, with target genes (TGs) commonly transcribed in divergent orientations. For comparative purposes, we divided these systems into three different groups in accordance with the regulatory activity of the TF on its TG and the position of the TFBSs with respect to the promoter sequences of the regulated genes.
Group one: GcvA and MetR
The GcvA regulatory system
GcvA, Glycine Cleavage A, is a TF that regulates the transcription of genes involved in the serine-glycine pathway of E. coli [17, 18]. This regulator is encoded by the divergent operon gcvA-gcvB from overlapping promoters and has a common regulatory mechanism. In the presence of glycine, GcvA is negatively auto-regulated and coordinately increases the transcription of the gcvB divergent gene coding for a small RNA  by direct interaction with α and β RNAP . Additionally, GcvA regulates the transcription of the gcvTHP operon .
Using DNase I footprint analysis, Wilson et al. reported that in E. coli, GcvA protects a large 48-bp sequence in the intergenic region of gcvA-gcvB and two other sequences 35 and 57 bp upstream of the gcvTHP operon. Alignment of these sequences revealed a conserved 5′-CTAAT-3′ motif, which was subsequently shown by site-directed mutagenesis to be important for GcvA binding and the negative regulation of the gcvTHP and gcvA transcription units [18, 21, 22]. In general, the GcvA-binding sites do not present a clear sequence conservation, except for the short 5′-CTAAT-3′ motif. Additionally, the protected regions of GcvA contain the inverted repeat (IR) sequence 5′-ATTA-n7-TAAT-3′, which coincides with the GcvA-binding site reported in the RegPrecise database.
Our PProCoM analysis of the gcvB-gcvA intergenic region identified the presence of two 15-bp IR sequences (5′-ATTAG-n5-CTAAT-3′, see Fig. 2). These IR sequences include the previously mentioned motifs reported by Wilson et al., 5′-CTAAT-3′ and 5′-ATTA-n7-TAAT-3′ . Considering the E. coli gcvA-gcvB intergenic region, the central positions of the predicted IR1 and IR2 motifs are located -65 and -43 bp from the gcvB transcription start site (TSS), respectively (see Fig. 2). It is important to remark that the sequences shown in this figure are not the result of a standard sequence alignment but are obtained from the relative location of conserved motifs of different sizes to the E. coli gcvB-gcvA regulatory region (see the Methods section).
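The IR notation used throughout this section (a half-site, a spacer of n bases and the reverse complement of the half-site) can be made concrete with a short sketch. The code below scans a sequence for exact occurrences of such a pattern; the function names and the toy sequence are hypothetical, and exact matching is a simplification, because the natural TFBSs discussed here are imperfect repeats.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_inverted_repeats(sequence, half_site, spacer_len):
    """Find exact matches of half_site-n{spacer_len}-revcomp(half_site).

    Returns a list of (0-based start, matched text). A lookahead is used so
    that overlapping occurrences are also reported.
    """
    half_site = half_site.upper()
    pattern = "(?=({}[ACGT]{{{}}}{}))".format(
        re.escape(half_site), spacer_len, re.escape(reverse_complement(half_site))
    )
    return [(m.start(), m.group(1)) for m in re.finditer(pattern, sequence.upper())]


# Toy sequence (invented) carrying one 5'-ATTAG-n5-CTAAT-3' inverted repeat.
toy_region = "GGG" + "ATTAG" + "ACGTA" + "CTAAT" + "CC"
hits = find_inverted_repeats(toy_region, "ATTAG", 5)
```

For the half-site ATTAG, `reverse_complement` returns CTAAT, so the pattern searched is the 15-bp arrangement 5′-ATTAG-n5-CTAAT-3′ described for the gcvB-gcvA region.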
The MetR regulatory system
MetR is a TF that regulates the expression of genes involved in methionine biosynthesis and protection against nitric oxide [23–28]. The transcriptional activity of MetR is modulated by homocysteine, the metabolic precursor of methionine. In the presence of homocysteine, MetR activates the transcription of some genes, such as metE and glyA, and represses the transcription of a few others, such as metH, metA, and hmp, along with its own transcription [23–31].
In E. coli and Salmonella typhimurium, the metE and metR genes are divergently transcribed from overlapping promoters; thus, they share a common regulatory region [23–29]. DNase I footprint and mutational analyses in S. typhimurium showed that MetR binds, with different affinities, to two IR sequences arranged in tandem that match the consensus sequence 5′-TGAAnnTnnTTCA-3′. In E. coli, two binding sites with the same characteristics have been reported in the regulatory region of the divergently transcribed hmp-glyA genes regulated by MetR [30, 31]. The presence of homocysteine has been postulated to enhance the affinity of MetR for these contiguous DNA-binding sites to activate metE and repress metR transcription. To date, no experimental evidence supports the existence of two MetR-binding sites in the E. coli metR-metE intergenic region. Nevertheless, based on our PProCoM analysis, we identified two 15-bp IR sequences with the consensus sequence 5′-ATGAA-n5-TTCAT-3′; 15 bp is the reported size of the TFBSs of the LysR-type TF family. Based on the E. coli reference sequence, we localized the distal and more conserved IR1 site -63 bp from the metE TSS, while the proximal and less conserved site, IR2, was located -41 bp from the TSS of metE (Fig. 3). These central locations are among the preferred positions of the transcriptional activators in E. coli [32, 33]. As shown in Fig. 3, the E. coli IR1-IR2 inter-motif sequence is one base shorter than the IR1-IR2 inter-motif sequence of the overall PProCoM motif alignment (see Fig. 3, sloped-dotted lines). The effect, if any, of this one missing base in the E. coli metR-metE inter-motif space on the regulation of the system is not clear. Nevertheless, longer variations in the inter-motif space, such as 6 bases (half a helix turn), have been demonstrated to have a negative effect on S. typhimurium metE transcription.
Additionally, point mutations in either of the two proposed TFBSs have also been reported to decrease metE transcription, indicating that both TFBSs are required for full metE activation. The 15-bp consensus sequence obtained from our PProCoM analysis coincided with that reported for MetR in the RegPrecise database, i.e., 5′-ATGAAAATTTTTCAT-3′.
Group two: OxyR, IlvY and CynR
The OxyR regulatory system
OxyR is a TF that regulates the expression of genes involved in oxidative stress protection, redox balance, and manganese uptake [35–38]. The transcriptional activity of OxyR depends on its oxidized state, which determines the reversible disulfide bond formation of a pair of cysteine residues in its amino acid sequence . In its oxidized state, OxyR activates the transcription of the divergent small RNA gene oxyS. Additionally, OxyR represses its own expression under oxidizing and reducing conditions .
Based on DNase I footprint analyses, Tartaglia et al. showed that OxyR binds to an unusually long DNA region that spans over 45 bp, with putative OxyR-binding sites with no obvious sequence similarity . Using an in vitro binding assay of OxyR to random oligonucleotides and DNase I footprint analyses, Toledano et al. showed that the DNA recognition of OxyR depends on its oxidized/reduced states. In its oxidized form, OxyR recognizes a DNA region that includes four repetitions of the 5′-ATAGnt-3′ sequences located in four contiguous major grooves on one face of the DNA helix. In its reduced form, OxyR binds two repetitions of the 5′-ATAGnt-3′ sequences located at two pairs of major grooves separated by one helical turn .
Our PProCoM analysis of the oxyR-oxyS intergenic region identified the presence of three 15-bp IR sequences (5′-ATAG-n7-CTAT-3′). Considering the E. coli oxyR-oxyS intergenic region, the central positions of the predicted IR1, IR2 and IR3 motifs are located -66, -44 and -35 bp from the oxyS TSS, respectively (see Fig. 4).
The IlvY regulatory system
IlvY positively regulates the transcription of ilvC, a gene involved in isoleucine and valine biosynthesis. The transcriptional activation of ilvC by IlvY depends on the presence of an IlvY inducer, such as acetolactate or acetohydroxybutyrate. At the same time, IlvY negatively regulates its own transcription in an inducer-independent manner [42, 43].
The ilvY and ilvC genes are divergently transcribed from overlapping promoters. Using DNase I footprint analyses, Wek and Hatfield proposed that IlvY binds to two 27-bp operator sequences, named O1 and O2, in the ilvY-ilvC intergenic region . These regions are arranged in tandem and possess imperfect 21-bp inverted repeat motifs: O1, 5′-ACgTTGCAAaaaTTGCAAtGT-3′ (centered at position +17 relative to the ilvY TSS), and O2, 5′-aTATatCaatttccGcaATAa-3′ (which overlaps the proposed -10 and -35 promoter boxes of ilvY and the -35 promoter box of ilvC). The consensus IlvY-binding motif common to the O1 and O2 operators is 5′-A[C/T]ATTGCAA-3′ . These authors proposed that IlvY represses its own transcription in an inducer-independent manner when IlvY binds to O1 and activates transcription of ilvC when IlvY binds to the O1 and O2 operators in a cooperative dependent manner in the presence of the system inducers. In this condition, the transcriptional activation of ilvC was proposed to result from IlvY-RNAP interactions when IlvY was bound to O2 or by a change in the DNA conformation at the ilvC -35 promoter box. Following this reasoning, Rhee et al. proposed that the transcription of the divergent genes ilvY and ilvC is coupled in a DNA supercoiling-dependent manner that increases the binding of the RNAP at this promoter by nearly 100-fold .
Our PProCoM analysis of the ilvY-ilvC intergenic region identified the presence of three 15-bp IR sequences (5′-TTGCA-n5-TGCAA-3′; see Fig. 5). Considering the E. coli ilvY-ilvC intergenic region, the central positions of these predicted IR1, IR2 and IR3 motifs are located -65, -43 and -34 bp from the ilvC TSS, respectively (see Fig. 5).
The CynR regulatory system
CynR is a TF that regulates the transcription of the cynTSX operon, which is involved in cyanate detoxification. Cyanate is also used as a nitrogen source due to its hydrolysis to ammonia and bicarbonate . This activation of the cynTSX operon by CynR depends on the presence of cyanate. CynR also negatively regulates its own transcription in a cyanate-independent manner .
As in the case of the abovementioned LysR-type regulatory systems, the gene coding for the TF (cynR) and its regulatory TGs (cynTSX) are transcribed in opposite directions, and their corresponding promoters overlap [45, 46]. Using DNase I digestion analyses, Lamblin and Fuchs showed that CynR binds to a 60-bp region in the cynR-cynTSX intergenic region and proposed that this region contains two putative binding sites with different affinities . The first of these regions, R1 (5′-ATAAGTAAA-3′), was proposed to have the highest binding affinity, whereas the second region, R2 (5′-ATAAGGTAA-3′), was proposed to overlap the entire cynR promoter sequence and the -35 promoter region of the cynTSX operon [45, 46]. These authors suggested that in a first instance, a CynR dimer could bind to R1 (i.e., the most conserved region), and in a second but almost simultaneous instance, another CynR dimer could bind to R2 in a strong cooperative manner. These authors also proposed that the transcriptional activation of the cynTSX operon takes place in the presence of cyanate, which was believed to trigger a conformational change in CynR, modifying its interaction with DNA .
Our PProCoM analysis of the cynR-cynTSX intergenic region identified the presence of three 15-bp IR sequences (5′-ATAA-n7-TTAT-3′), including the sequences proposed by Lamblin and Fuchs (see Fig. 6). Considering the E. coli cynR-cynTSX intergenic region, the central positions of the predicted IR1, IR2 and IR3 motifs are located -66, -44 and -34 bp from the cynTSX TSS, respectively (see Fig. 6).
Group three: LysR
The LysR regulatory system
LysR is a TF that regulates the transcription of lysA, which encodes an enzyme that catalyzes the final step of lysine biosynthesis. LysR negatively regulates its own transcription and positively regulates the transcription of lysA in the presence of its inducer, diaminopimelic acid [47–49].
As in the previous cases, the genes coding for the TF (lysR) and its regulatory TG (lysA) are transcribed in opposite directions. The TFBSs of LysR and their regulatory mechanism have not yet been identified. However, the LysR-binding sites have been determined to be within a 73-bp DNA fragment located 48 bp upstream of the lysA structural gene . The intracellular concentration of active LysR could be limiting because its regulatory role is diminished when the abovementioned fragment is cloned on plasmids . Based on experimental analyses, the lysR TSS has been predicted to be located 26 bp upstream of its structural gene . However, a putative lysA promoter, with a -35 box (TTGcat) and a -10 box (TATTTT), has been predicted to be located 52 bp from the lysA coding region . The corresponding TSS has been proposed to be located 3 bp downstream of the -10 box of the predicted promoter .
Our PProCoM analysis of the lysR-lysA intergenic region identified the presence of three 15-bp IR sequences (5′-ATATC-n5-GATAT-3′, see Fig. 7). Considering the E. coli lysR-lysA intergenic region, the central positions of the predicted IR1, IR2 and IR3 motifs are located -64, -43 and -9 bp from the lysA TSS, respectively (see Fig. 7). Based on the positions of these predicted TFBSs, we postulate that the lysA TSS is located 22 bp upstream of its structural gene.
Common sequence motifs of the TFBSs of the LysR-type TF family
Dynamic models of regulation
The intergenic sequences of the regulatory systems of group one (metR-metE and gcvA-gcvB) contained two IR motifs, whereas the regulatory systems of group two (oxyR-oxyS, ilvY-ilvC, and cynR-cynTSX) and group three (lysR-lysA) contained three IR motifs. In all these cases, the IR motifs show different degrees of sequence conservation and, thus, different affinities. In group one, IR1 is the most conserved motif and IR2 is the least conserved. In groups two and three, IR1 and IR3 are the most conserved motifs and IR2 is the least conserved.
All the TFs analyzed (GcvA, MetR, OxyR, IlvY, CynR and LysR) adopt two different conformations depending on the presence or absence of their corresponding inducers: glycine, homocysteine, reactive oxygen species, acetolactate, cyanate and diaminopimelic acid, respectively.
Without the system inducers, the TFs bind as dimers, preferentially to IR1 in the case of group one, and to IR1 and IR3 in the case of groups two and three. In accordance with this binding, footprinting assays with LysR family members show a hypersensitive region 50 bp upstream of the TSS for IlvY [42, 43], CynR and OccR [54, 57]. Similar results have been observed in studies with other regulatory TFs of the LysR family, such as ClcR, CatR and PcaQ. In the case of CynR, this hypersensitive region corresponds to the region where the DNA bends upon CynR binding.
In the presence of the system inducers, the TFs bind DNA as dimers of dimers in a cooperative manner. Only through this cooperative binding can the TFs recognize IR2, the least conserved of the TFBSs. This kind of binding for members of the LysR TF family has been demonstrated by footprinting assays [18, 29, 34, 40, 42, 43, 54, 57] and site-directed mutagenesis analyses [21, 29, 34, 40, 53–56]. As a consequence of this binding, the hypersensitivity of the DNA regions located around 50 bp upstream of the TSSs markedly decreases. In addition, it has been shown that altering the distance between IR1 and IR2 reduces the cooperative binding of the TFs [40, 54, 57].
A TF acts as a transcriptional repressor or activator of the TF or TG genes depending on the position of the IR to which it binds.
The IR1 motifs are downstream of, or overlap, the -10 box of the TF promoters; therefore, auto-repression of transcription takes place when the TFs are bound to the IR1 sites.
The IR2 motifs overlap the TF promoters and are also immediately downstream of the -35 box of the TG promoters; therefore, a TF bound to IR2 represses TF transcription and activates TG transcription.
In the case of group two, the IR3 motif overlaps the TF and TG promoters; hence, a TF bound to this site simultaneously blocks the transcription of the TF and TG genes. In the case of group three, the IR3 motif only overlaps the TG promoter; accordingly, a TF bound to this site exclusively blocks TG transcription.
In addition to the above-mentioned regulatory outcomes, it is worth mentioning that in the case of group two, the IR2 and IR3 sites overlap; therefore, the binding of TFs to these two sites is mutually exclusive. In the absence of the system inducers, the TFs would preferentially bind IR3 because this site has greater sequence conservation than IR2; nevertheless, in the presence of the system inducers, the TFs would bind cooperatively as a dimer of dimers to IR1 and IR2. In this case, the binding of the TFs to IR2 would have two positive effects on TG transcription: a direct effect, through its interaction with the RNA polymerase, and an indirect effect, by blocking the binding of the TFs to IR3, an event that otherwise would repress TF transcription.
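The geometric argument above can be checked with simple arithmetic on the central positions reported for each system in the preceding sections. The sketch below computes the centre-to-centre distances between consecutive IR motifs and tests whether two 15-bp sites overlap. The helical repeat of 10.5 bp per turn is a standard approximation for B-DNA that we assume here; it is not a value given in the text, and the function names are illustrative.

```python
HELICAL_REPEAT_BP = 10.5  # assumed B-DNA helical repeat (not stated in the text)
SITE_LENGTH_BP = 15       # size of the IR motifs reported in this section

# Central positions (bp relative to the TSS of the divergent gene), as reported
# in the text for gcvB, metE, oxyS, ilvC, cynTSX and lysA, respectively.
IR_CENTERS = {
    "GcvA": [-65, -43],
    "MetR": [-63, -41],
    "OxyR": [-66, -44, -35],
    "IlvY": [-65, -43, -34],
    "CynR": [-66, -44, -34],
    "LysR": [-64, -43, -9],
}


def adjacent_spacings(centers):
    """Centre-to-centre distances (bp) between consecutive IR motifs."""
    return [b - a for a, b in zip(centers, centers[1:])]


def helical_turns(distance_bp):
    """Express a distance in helical turns under the assumed repeat."""
    return distance_bp / HELICAL_REPEAT_BP


def sites_overlap(distance_bp, site_length=SITE_LENGTH_BP):
    """Two sites of equal length overlap if their centres are closer than one site length."""
    return abs(distance_bp) < site_length


if __name__ == "__main__":
    for tf, centers in IR_CENTERS.items():
        for d in adjacent_spacings(centers):
            print(tf, d, round(helical_turns(d), 2), sites_overlap(d))
```

Under these assumptions, the IR1-IR2 distances are 21 or 22 bp (about two helical turns) in all six systems, whereas the IR2-IR3 distances of 9 or 10 bp in group two are shorter than the 15-bp site length, consistent with the mutually exclusive occupation of IR2 and IR3 described above. In LysR (group three), the 34-bp IR2-IR3 distance implies no overlap.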
Representative regulatory models of the LysR-type TF family in Gammaproteobacteria revealed by PProCoM analyses
Potential use of PProCoM to identify TFBSs of other regulatory systems different to those of the LysR-Family
Our PProCoM protocol can be used to identify TFBSs of almost any bacterial regulatory system if the characteristics of their TFs are considered. For example, in addition to the LysR regulatory systems, we are currently conducting a study to identify the binding sites of the TF members of the AraC/XylS family. These TFs usually bind DNA as dimers to contiguous, asymmetric direct-repeat TFBSs, of which the distal site is the most conserved and the proximal site the least conserved. The difficulty in identifying TFBSs of the AraC/XylS family lies in the low conservation and asymmetry of these proximal TFBSs. Nevertheless, we believe that PProCoM is particularly useful for identifying such poorly conserved binding sites because its accuracy does not depend exclusively on the sequence conservation of the TFBSs, but also on the molecular properties of the TFs and their interactions with each other, with the DNA and with the RNA polymerase. Regarding the use of our PProCoM protocol for identifying TFBSs in eukaryotic organisms with small intergenic regions, such as yeast, we consider it possible to obtain positive results similar to those obtained so far in prokaryotic organisms. We are currently performing site-directed mutagenesis and transcriptional quantification of our regulatory systems for experimental verification of our theoretical predictions.
PProCoM is an unconventional multiple motif alignment that represents a set of consensus sequences of increasing length, arranged according to a reference intergenic nucleotide sequence (E. coli sequences in our examples; Figs. 2, 3, 4, 5, 6 and 7). This strategy enables the merging of the over-represented motifs (with significant E-values) with less conserved motifs that play important roles in dynamic transcription regulation systems. These less conserved motifs have generally not been identified or included in previous studies, even in cases with experimental analyses, such as DNase footprinting analysis. Our PProCoM analysis of six members of the LysR-type TF family has made evident the high relevance of the less conserved motifs in the intergenic regions of their regulatory sequences. This approach takes into account the homodimeric nature of these TFs and provides a more integrated and complete picture of their regulatory processes.
Retrieval of orthologous non-coding regulatory sequences of non-redundant organisms
To avoid bias introduced by the preferential sequencing of model organisms, non-redundant genomes were selected from the KEGG database (release 2015) based on their phylogenetic distances, which were evaluated using the PROTDIST program from a multiple alignment of concatenated sequences of a set of 31 “house-keeping” proteins defined by Ciccarelli et al. (see Fig. 11). The phylogenetic group considered in our study was Gammaproteobacteria. Orthologous genes were defined using the “bidirectional best hits” criterion in BLAST. Only intergenic regions longer than 10 nucleotides were considered. In total, 150 intergenic regions were considered for analysis. The list of these organisms is presented in Additional file 1: Table S1.
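The “bidirectional best hits” criterion can be sketched as follows: gene a in genome A and gene b in genome B are taken as orthologs when b is the best BLAST hit of a and a is the best BLAST hit of b. The sketch assumes that the best hit of every gene has already been extracted from the BLAST output into two dictionaries; the gene identifiers and the function name are hypothetical.

```python
def bidirectional_best_hits(best_a_to_b, best_b_to_a):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit.

    best_a_to_b maps each gene of genome A to its best hit in genome B;
    best_b_to_a maps each gene of genome B to its best hit in genome A.
    """
    return sorted(
        (gene_a, gene_b)
        for gene_a, gene_b in best_a_to_b.items()
        if best_b_to_a.get(gene_b) == gene_a
    )


# Hypothetical best-hit tables (identifiers invented for illustration).
best_a_to_b = {"A_0001": "B_0107", "A_0002": "B_0450", "A_0003": "B_0450"}
best_b_to_a = {"B_0107": "A_0001", "B_0450": "A_0003"}

orthologs = bidirectional_best_hits(best_a_to_b, best_b_to_a)
```

In this toy example, A_0002 is discarded because its best hit, B_0450, points back to a different gene (A_0003).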
Obtaining the profile of significantly over-represented motifs from phylogenetic footprinting analysis
Length of the analysis window. Although MEME can automatically set the size of the analysis window to the value at which the over-representation of a motif is most significant, in our PProCoM protocol, the MEME analysis was performed repeatedly using analysis windows of different sizes, from the smallest, 10 bp, to the largest, 100 bp, in increments of 2 bp per cycle, or until the results of the analysis remained unchanged despite the 2-bp increment. The sizes of the analysis windows were defined using the -w argument of MEME. In addition, we also included the result of a MEME analysis without forcing the size of the analysis window. In Figs. 2, 3, 4, 5, 6 and 7, these motifs are indicated as dm (default motif without forcing the size of the analysis window).
E-value of the motifs. Unlike most computational methods, which use the E-value alone to define a motif as significant, in our PProCoM protocol the E-value is considered as one criterion among several for the selection of significant motifs. The reason is that the E-value of a motif might vary depending on the affinity of the TFBSs (high or low), the size of the analysis window, the number of sequences analyzed and the phylogenetic distances between the organisms in the study. Nevertheless, as a first filter to define a motif as significant, the E-value threshold was set to 1e-6 using the -evt argument of MEME. Figures 2, 3, 4, 5, 6 and 7 include the E-values obtained for each of the analysis windows of our six regulatory systems. In all these cases, the E-values were statistically significant (E-values < 1e-20).
Number of motifs identified. To build the PProCoM profile, only the most significant motif is considered per analysis window. This was specified by setting the -nmotifs argument of MEME to 1.
Motif symmetry. Considering that some homodimeric TFs, such as those of the LysR family, recognize palindromic DNA sequences, the -pal argument of MEME was used to force this symmetry in the identified motifs.
Distribution of motifs. To specify that the distribution of the motifs to be found by MEME in the set of regulatory sequences corresponded to zero or one per sequence, the -mod argument of MEME was set to zoops.
Background Markov model. To avoid the bias originating from the unbalanced distribution of nucleotides (i.e., low or high %GC) in the regulatory sequences, we built a Markov model file for each of our six regulatory systems. The names of these files were specified using the -bfile argument of MEME.
Alphabet of the sequences. The -dna argument of MEME was used to specify the nature of the nucleotide sequences used in our study.
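The parameter settings listed above can be combined into one MEME invocation per analysis window. The following sketch only assembles the command lines; it does not run MEME. The input and output file names are hypothetical, and the use of the MEME -oc option to give each window its own output directory is an assumption of this example rather than part of the protocol described in the text.

```python
def meme_commands(fasta, bfile, out_prefix, min_w=10, max_w=100, step=2, evt="1e-6"):
    """Build one MEME command per analysis window, plus one default-width run."""
    common = [
        "meme", fasta,
        "-dna",            # nucleotide alphabet
        "-mod", "zoops",   # zero or one occurrence per sequence
        "-nmotifs", "1",   # only the most significant motif per window
        "-pal",            # force palindromic (inverted repeat) motifs
        "-evt", evt,       # E-value threshold used as a first filter
        "-bfile", bfile,   # background Markov model file
    ]
    commands = []
    for width in range(min_w, max_w + 1, step):
        # -oc (assumed here) writes each run to its own output directory.
        commands.append(common + ["-w", str(width), "-oc", f"{out_prefix}_w{width}"])
    # The "dm" run: MEME chooses the motif width itself.
    commands.append(common + ["-oc", f"{out_prefix}_dm"])
    return commands


# Hypothetical file names, for illustration only.
commands = meme_commands("gcvA_intergenic.fasta", "gcvA_background.txt", "gcvA")
```

Each element of `commands` is an argument list that could be passed to `subprocess.run` on a system where MEME is installed. The early-stopping rule described for the analysis window is not implemented here, because it requires comparing the motifs returned by consecutive runs.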
Mapping the significant motifs onto a reference sequence
To identify the relative positions of the different motifs identified in the previous steps of our protocol, every motif was mapped to a reference intergenic region of a model organism. In our case, we selected E. coli K12 because it is one of the best-characterized organisms among the Gammaproteobacteria. As a result of this mapping step, a PProCoM was obtained.
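One simple way to carry out such a mapping is to slide each consensus motif along the reference intergenic region and keep the window with the fewest mismatches. The sketch below follows this idea; it is an assumption about how the mapping could be done, not a description of the procedure actually used in our protocol. The reference sequence is invented, `N` in the motif matches any base, and the coordinate convention (0-based index of the TSS) is chosen only for this example.

```python
def best_match(reference, motif):
    """Return (start, mismatches) of the best-matching window in the reference.

    'N' positions in the motif match any base. Ties are resolved in favour of
    the leftmost window.
    """
    reference, motif = reference.upper(), motif.upper()
    if len(motif) > len(reference):
        raise ValueError("motif is longer than the reference sequence")
    best = None
    for start in range(len(reference) - len(motif) + 1):
        window = reference[start:start + len(motif)]
        mismatches = sum(1 for r, m in zip(window, motif) if m != "N" and r != m)
        if best is None or mismatches < best[1]:
            best = (start, mismatches)
    return best


def center_relative_to_tss(start, motif_len, tss_index):
    """Position of the motif centre relative to a 0-based TSS index."""
    return start + motif_len // 2 - tss_index


# Invented reference region containing an imperfect 5'-ATGAA-n5-TTCAT-3' site.
toy_reference = "CCGG" + "ATGAACCCCCTTCAA" + "GGTT"
start, mismatches = best_match(toy_reference, "ATGAANNNNNTTCAT")
```

Repeating this step for the motif obtained with each analysis window yields, for every motif, a position on the reference sequence, which is the information needed to stack the motifs into a profile.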
Integration of the mapped motifs with biological knowledge of the regulatory system and construction of dynamic models of the regulatory system
To properly interpret the results obtained in the previous steps, represented as a PProCoM, the molecular characteristics of the TF under study were considered. The characteristics of the TFs belonging to the LysR-type family are listed in the Background section and include the following properties: the divergent transcriptional orientations of the TF and TG genes, the tandem arrangement of the TFBSs, the inverted repeat symmetry and length of the TFBSs, the cooperative binding of the TFs in the presence of a specific inducer, the relative degrees of sequence conservation (i.e., binding affinities) of the TFBSs and their positions with respect to their promoters, and the spaces between the TFBSs that determine their relative orientations in terms of helix turns.
We wish to thank Ricardo Ciria for computer support and Shirley Ainsworth for bibliographical assistance. Patricia Oliver is a doctoral student from Programa de Doctorado en Ciencias Biomédicas, Universidad Nacional Autónoma de México (UNAM), and received a CONACyT fellowship (23556), scholar No. 45230.
This work was supported by Consejo Nacional de Ciencia y Tecnología (CONACyT) (235817) and PAPIIT (IN201714) grants to EM.
Availability of data and materials
The datasets of intergenic sequences of the TF-TG of the set of non-redundant Gammaproteobacteria genomes that were used as input sequences in our study are available at our web page http://www.ibt.unam.mx/biocomputo/pprocom_organisms.html.
PO, MP and EM co-developed the project idea, designed and performed the analyses, interpreted the biological significance of the results, and wrote the manuscript. MLT assisted in the biological interpretation of the results. EM also coordinated the study. All authors participated in discussions and read and approved the final manuscript.
The authors have declared that no competing interests exist.
Consent for publication
Ethics approval and consent to participate
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Molina N, van Nimwegen E. Scaling laws in functional genome content across prokaryotic clades and lifestyles. Trends Genet. 2009;25:243–7. doi:10.1016/j.tig.2009.04.004.
- Martínez-Antonio A, Collado-Vides J. Identifying global regulators in transcriptional regulatory networks in bacteria. Curr Opin Microbiol. 2003;6:482–9. doi:10.1016/j.mib.2003.09.002.
- Pérez-Rueda E, Collado-Vides J. The repertoire of DNA-binding transcriptional regulators in Escherichia coli K-12. Nucleic Acids Res. 2000;28:1838–47. doi:10.1093/nar/28.8.1838.
- Huerta AM, Salgado H, Thieffry D, Collado-Vides J. RegulonDB: a database on transcriptional regulation in Escherichia coli. Nucleic Acids Res. 1998;26:55–9. doi:10.1093/nar/26.1.55.
- Tagle DA, Koop BF, Goodman M, Slightom JL, Hess DL, Jones RT. Embryonic epsilon and gamma globin genes of a prosimian primate (Galago crassicaudatus). Nucleotide and amino acid sequences, developmental regulation and phylogenetic footprints. J Mol Biol. 1988;203:439–55. doi:10.1016/0022-2836(88)90011-3.
- Tan K, Moreno-Hagelsieb G, Collado-Vides J, Stormo GD. A comparative genomics approach to prediction of new members of regulons. Genome Res. 2001;11:566–84. doi:10.1101/gr.149301.
- Tan K, McCue LA, Stormo GD. Making connections between novel transcription factors and their DNA motifs. Genome Res. 2005;15:312–20. doi:10.1101/gr.3069205.
- Janky R, van Helden J. Evaluation of phylogenetic footprint discovery for predicting bacterial cis-regulatory elements and revealing their evolution. BMC Bioinformatics. 2008;9:37. doi:10.1186/1471-2105-9-37.
- Salgado H, Peralta-Gil M, Gama-Castro S, Santos-Zavaleta A, Muniz-Rascado L, Garcia-Sotelo JS, et al. RegulonDB v8.0: omics data sets, evolutionary conservation, regulatory phrases, cross-validated gold standards and more. Nucleic Acids Res. 2013;41:D203–13. doi:10.1093/nar/gks1201.
- Keseler IM, Mackie A, Peralta-Gil M, Santos-Zavaleta A, Gama-Castro S, Bonavides-Martínez C, et al. EcoCyc: fusing model organism databases with systems biology. Nucleic Acids Res. 2013;41:D605–12. doi:10.1093/nar/gks1027. PubMed: 23143106.
- Novichkov PS, Brettin TS, Novichkova ES, Dehal PS, Arkin AP, Dubchak I, et al. RegPrecise web services interface: programmatic access to the transcriptional regulatory interactions in bacteria reconstructed by comparative genomics. Nucleic Acids Res. 2012;40:W604–8. doi:10.1093/nar/gks562.View ArticlePubMedPubMed CentralGoogle Scholar
- Grote A, Klein J, Retter I, Haddad I, Behling S, Bunk B, et al. PRODORIC (release 2009): a database and tool platform for the analysis of gene regulation in prokaryotes. Nucleic Acids Res. 2009;37:D61–5. doi:10.1093/nar/gkn837.View ArticlePubMedGoogle Scholar
- Pérez AG, Angarica VE, Vasconcelos AT, Collado-Vides J. Tractor_DB (version 2.0): a database of regulatory interactions in gamma-proteobacterial genomes. Nucleic Acids Res. 2007;35:D132–6. doi:10.1093/nar/gkl800. PubMed: 17088283.View ArticlePubMedGoogle Scholar
- Oberto J. FITBAR: a web tool for the robust prediction of prokaryotic regulons. BMC Bioinformatics. 2010;11:554. doi:10.1186/1471-2105-11-554.View ArticlePubMedPubMed CentralGoogle Scholar
- Medina-Rivera A, Abreu-Goodger C, Thomas-Chollier M, Salgado H, Collado-Vides J, Van Helden J. Theoretical and empirical quality assessment of transcription factor-binding motifs. Nucleic Acids Res. 2011;39:808–24. doi:10.1093/nar/gkq710.View ArticlePubMedGoogle Scholar
- Schell MA. Molecular biology of the LysR family of transcriptional regulators. Annu Rev Microbiol. 1993;47:597–626. doi:10.1146/annurev.mi.47.100193.003121.View ArticlePubMedGoogle Scholar
- Wilson RL, Steiert PS, Stauffer GV. Positive regulation of the Escherichia coli glycine cleavage enzyme system. J Bacteriol. 1993;175:902–4.PubMedPubMed CentralGoogle Scholar
- Wilson RL, Urbanowski ML, Stauffer GV. DNA binding sites of the LysR-type regulator GcvA in the gcv and gcvA control regions of Escherichia coli. J Bacteriol. 1995;177:4940–6.PubMedPubMed CentralGoogle Scholar
- Urbanowski ML, Stauffer LT, Stauffer GV. The gcvB gene encodes a small untranslated RNA involved in expression of the dipeptide and oligopeptide transport systems in Escherichia coli. Mol Microbiol. 2000;37:856–68. doi:10.1046/j.1365-2958.2000.02051.x.View ArticlePubMedGoogle Scholar
- Stauffer LT, Stauffer GV. GcvA interacts with both the alpha and sigma subunits of RNA polymerase to activate the Escherichia coli gcvB gene and the gcvTHP operon. FEMS Microbiol Lett. 2005;242:333–8. doi:10.1016/j.femsle.2004.11.027.View ArticlePubMedGoogle Scholar
- Jourdan AD, Stauffer GV. Genetic analysis of the GcvA binding site in the gcvA control region. Microbiol. 1999;145:2153–62. doi:10.1099/13500872-145-8-2153.View ArticleGoogle Scholar
- Wonderling LD, Urbanowski ML, Stauffer GV. GcvA binding site 1 in the gcvTHP promoter of Escherichia coli is required for GcvA-mediated repression but not for GcvA-mediated activation. Microbiology. 2000;146:2909–18. doi:10.1099/00221287-146-11-2909.View ArticlePubMedGoogle Scholar
- Cai XY, Maxon ME, Redfield B, Glass R, Brot N, Weissbach H. Methionine synthesis in Escherichia coli: effect of the MetR protein on metE and metH expression. Proc Natl Acad Sci USA. 1989;86:4407–11. doi:10.1073/pnas.86.12.4407.View ArticlePubMedPubMed CentralGoogle Scholar
- Weissbach H, Brot N. Regulation of methionine synthesis in Escherichia coli. Mol Microbiol. 1991;5:1593–7. doi:10.1111/j.1365-2958.1991.tb01905.x.View ArticlePubMedGoogle Scholar
- Flatley J, Barrett J, Pullan ST, Hughes MN, Green J, Poole RK. Transcriptional responses of Escherichia coli to S-nitrosoglutathione under defined chemostat conditions reveal major changes in methionine biosynthesis. J Biol Chem. 2005;280:10065–72. doi:10.1074/jbc.M410393200.View ArticlePubMedGoogle Scholar
- Membrillo-Hernández J, Coopamah MD, Channa A, Hughes MN, Poole RK. A novel mechanism for upregulation of the Escherichia coli K-12 hmp (flavohaemoglobin) gene by the “NO releaser”, S-nitrosoglutathione: nitrosation of homocysteine and modulation of MetR binding to the glyA-hmp intergenic region. Mol Microbiol. 1998;29:1101–12. doi:10.1046/j.1365-2958.1998.01000.x.View ArticlePubMedGoogle Scholar
- Maxon ME, Redfield B, Cai XY, Shoeman R, Fujita K, Fisher W, et al. Regulation of methionine synthesis in Escherichia coli: effect of the MetR protein on the expression of the metE and metR genes. Proc Natl Acad Sci U S A. 1989;86:85–9. doi:10.1073/pnas.86.1.85.View ArticlePubMedPubMed CentralGoogle Scholar
- Jafri S, Urbanowski ML, Stauffer GV. A mutation in the rpoA gene encoding the alpha subunit of RNA polymerase that affects metE-metR transcription in Escherichia coli. J Bacteriol. 1995;177:524–9.PubMedPubMed CentralGoogle Scholar
- Wu WF, Urbanowski ML, Stauffer GV. Characterization of a second MetR-binding site in the metE metR regulatory region of salmonella typhimurium. J Bacteriol. 1995;177:1834–9.PubMedPubMed CentralGoogle Scholar
- Lorenz E, Stauffer GV. Characterization of the MetR binding sites for the glyA gene of Escherichia coli. J Bacteriol. 1995;177:4113–20.PubMedPubMed CentralGoogle Scholar
- Lorenz E, Stauffer GV. Cooperative MetR binding in the Escherichia coli glyA control region. FEMS Microbiol Lett. 1996;137:147–52.View ArticlePubMedGoogle Scholar
- Harari O, del Val C, Romero-Zaliz R, Shin D, Huang H, Groisman EA, et al. Identifying promoter features of co-regulated genes with similar network motifs. BMC Bioinformatics. 2009;10 Suppl 4:S1. doi:10.1186/1471-2105-10-S4-S1.View ArticlePubMedPubMed CentralGoogle Scholar
- Collado-Vides J, Salgado H, Morett E, Gama-Castro S, Jiménez-Jacinto V, Martínez-Flores I, et al. Bioinformatics resources for the study of gene regulation in bacteria. J Bacteriol. 2009;191:23–31. doi:10.1128/JB.01017-08.View ArticlePubMedGoogle Scholar
- Urbanowski ML, Stauffer GV. Genetic and biochemical analysis of the MetR activator-binding site in the metE metR control region of Salmonella typhimurium. J Bacteriol. 1989;171:5620–9.PubMed CentralGoogle Scholar
- Anjem A, Varghese S, Imlay JA. Manganese import is a key element of the OxyR response to hydrogen peroxide in Escherichia coli. Mol Microbiol. 2009;72:844–58. doi:10.1111/j.1365-2958.2009.06699.x.View ArticlePubMedPubMed CentralGoogle Scholar
- Storz G, Tartaglia LA, Ames BN. The OxyR regulon. Antonie van Leeuwenhoek. 1990;58:157–61. doi:10.1007/BF00548927.View ArticlePubMedGoogle Scholar
- Zheng M, Wang X, Templeton LJ, Smulski DR, LaRossa RA, Storz G. DNA microarray-mediated transcriptional profiling of the Escherichia coli response to hydrogen peroxide. J Bacteriol. 2001;183:4562–70. doi:10.1128/JB.183.15.4562-4570.2001.View ArticlePubMedGoogle Scholar
- Mongkolsuk S, Helmann JD. Regulation of inducible peroxide stress responses. Mol Microbiol. 2002;45:9–15. doi:10.1046/j.1365-2958.2002.03015.x.View ArticlePubMedGoogle Scholar
- Zheng M, Aslund F, Storz G. Activation of the OxyR transcription factor by reversible disulfide bond formation. Science. 1998;279:1718–21. doi:10.1126/science.279.5357.1718.View ArticlePubMedGoogle Scholar
- Toledano MB, Kullik I, Trinh F, Baird PT, Schneider TD, Storz G. Redox-dependent shift of OxyR-DNA contacts along an extended DNA-binding site: A mechanism for differential promoter selection. Cell. 1994;78:897–909. doi:10.1016/S0092-8674(94)90702-1.View ArticlePubMedGoogle Scholar
- Tartaglia LA, Gimeno CJ, Storz G, Ames BN. Multidegenerate DNA recognition by the OxyR transcriptional regulator. J Biol Chem. 1992;267:2038–45.PubMedGoogle Scholar
- Rhee KY, Senear DF, Hatfield GW. Activation of gene expression by a ligand-induced conformational change of a protein-DNA complex. J Biol Chem. 1998;273:11257–66. doi:10.1074/jbc.273.18.11257.View ArticlePubMedGoogle Scholar
- Wek RC, Hatfield GW. Transcriptional activation at adjacent operators in the divergent-overlapping ilvY and ilvC promoters of Escherichia coli. J Mol Biol. 1988;203:643–63. doi:10.1016/0022-2836(88)90199-4.View ArticlePubMedGoogle Scholar
- Sung YC, Fuchs JA. Characterization of the cyn operon in Escherichia coli K12. J Biol Chem. 1988;263:14769–75.PubMedGoogle Scholar
- Lamblin AF, Fuchs JA. Expression and purification of the CynR regulatory gene product: CynR is a DNA-binding protein. J Bacteriol. 1993;175:7990–9.PubMedPubMed CentralGoogle Scholar
- Lamblin AF, Fuchs JA. Functional analysis of the Escherichia coli K-12 cyn operon transcriptional regulation. J Bacteriol. 1994;176:6613–22.PubMedPubMed CentralGoogle Scholar
- Stragier P, Richaud F, Borne F, Patte JC. Regulation of diaminopimelate decarboxylase synthesis in Escherichia coli. I. Identification of a lysR gene encoding an activator of the lysA gene. J Mol Biol. 1983;168:307–20. doi:10.1016/S0022-2836(83)80020-5.View ArticlePubMedGoogle Scholar
- Stragier P, Danos O, Patte JC. Regulation of diaminopimelate decarboxylase synthesis in Escherichia coli. II. Nucleotide sequence of the lysA gene and its regulatory region. J Mol Biol. 1983;168:321–31. doi:10.1016/S0022-2836(83)80021-7.View ArticlePubMedGoogle Scholar
- Stragier P, Patte JC. Regulation of diaminopimelate decarboxylase synthesis in Escherichia coli. III. Nucleotide sequence and regulation of the lysR gene. J Mol Biol. 1983;168:333–50. doi:10.1016/S0022-2836(83)80022-9.View ArticlePubMedGoogle Scholar
- Huerta AM, Collado-Vides J. Sigma70 promoters in Escherichia coli: specific transcription in dense regions of overlapping promoter-like signals. J Mol Biol. 2003;333:261–78. doi:10.1016/j.jmb.2003.07.017.View ArticlePubMedGoogle Scholar
- Goethals K, Van Montagu M, Holsters M. Conserved motifs in a divergent nod box of Azorhizobium caulinodans ORS571 reveal a common structure in promoters regulated by LysR-type proteins. Proc Natl Acad Sci U S A. 1992;89:1646–50. doi:10.1073/pnas.89.5.1646.View ArticlePubMedPubMed CentralGoogle Scholar
- Maddocks SE, Oyston PC. Structure and function of the LysR-type transcriptional regulator (LTTR) family proteins. Microbiology. 2008;154:3609–23. doi:10.1099/mic.0.2008/022772-0.View ArticlePubMedGoogle Scholar
- Parsek MR, Ye RW, Pun P, Chakrabarty AM. Critical nucleotides in the interaction of a LysR-type regulator with its target promoter region. catBC promoter activation by CatR. J Biol Chem. 1994;269:11279–84.PubMedGoogle Scholar
- Wang L, Winans SC. The sixty nucleotide OccR operator contains a subsite essential and sufficient for OccR binding and a second subsite required for ligand-responsive DNA bending. J Mol Biol. 1995;253:691–702. doi:10.1006/jmbi.1995.0583.View ArticlePubMedGoogle Scholar
- Akakura R, Winans SC. Mutations in the occQ operator that decrease OccR-induced DNA bending do not cause constitutive promoter activity. J Biol Chem. 2002;277:15773–80. doi:10.1074/jbc.M200109200.View ArticlePubMedGoogle Scholar
- MacLean AM, Haerty W, Golding GB, Finan TM. The LysR-type PcaQ protein regulates expression of a protocatechuate-inducible ABC-type transport system in Sinorhizobium meliloti. Microbiology. 2011;157:2522–33. doi:10.1099/mic.0.050542-0.View ArticlePubMedGoogle Scholar
- Wang L, Winans SC. High angle and ligand-induced low angle DNA bends incited by OccR lie in the same plane with OccR bound to the interior angle. J Mol Biol. 1995;253:32–8. doi:10.1006/jmbi.1995.0533.View ArticlePubMedGoogle Scholar
- Parsek MR, McFall SM, Shinabarger DL, Chakrabarty AM. Interaction of two LysR-type regulatory proteins CatR and ClcR with heterologous promoters: functional and evolutionary implications. Proc Natl Acad Sci U S A. 1994;91:12393–7. doi:10.1073/pnas.91.26.12393.View ArticlePubMedPubMed CentralGoogle Scholar
- MacLean AM, Anstey MI, Finan TM. Binding site determinants for the LysR-type transcriptional regulator PcaQ in the legume endosymbiont Sinorhizobium meliloti. J Bacteriol. 2008;190:1237–46. doi:10.1128/JB.01456-07.View ArticlePubMedGoogle Scholar
- Hryniewicz MM, Kredich NM. Stoichiometry of binding of CysB to the cysJIH, cysK, and cysP promoter regions of salmonella typhimurium. J Bacteriol. 1994;176:3673–82.PubMedPubMed CentralGoogle Scholar
- McFall SM, Klem TJ, Fujita N, Ishihama A, Chakrabarty AM. DNase I footprinting, DNA bending and in vitro transcription analyses of ClcR and CatR interactions with the clcABD promoter: evidence of a conserved transcriptional activation mechanism. Mol Microbiol. 1997;24:965–76.View ArticlePubMedGoogle Scholar
- Gallegos MT, Schleif R, Bairoch A, Hofmann K, Ramos JL. Arac/XylS family of transcriptional regulators. Microbiol Mol Biol Rev. 1997;61:393–410.PubMedPubMed CentralGoogle Scholar
- Felsenstein J. Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol. 1981;17:368–76. doi:10.1007/BF01734359.View ArticlePubMedGoogle Scholar
- Ciccarelli F, Doerks T, Von MC. Toward automatic reconstruction of a highly resolved tree of life. Science. 2006;311:1283–7.View ArticlePubMedGoogle Scholar
- Zhang M, Leong HW. Bidirectional best hit r-window gene clusters. BMC Bioinformatics. 2010;11 Suppl 1:S63. doi:10.1186/1471-2105-11-S1-S63.View ArticlePubMedGoogle Scholar
- Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–402.View ArticlePubMedPubMed CentralGoogle Scholar
- Bailey TL, Boden M, Buske FA, Frith M, Grant CE, Clementi L, et al. MEME suite: tools for motif discovery and searching. Nucleic Acids Res. 2009;37:W202–8. doi:10.1093/nar/gkp335.View ArticlePubMedPubMed CentralGoogle Scholar