Identification and analysis of DNA-binding transcription factors in Bacillus subtilis and other Firmicutes: a genomic approach

Background: Bacillus subtilis is one of the best-characterized Gram-positive bacteria. It represents a paradigm of gene regulation in bacteria due to its complex lifestyle, which can involve transitions between stages as diverse as the vegetative cell and spore formation. In order to gain insight into the organization and evolution of the B. subtilis regulatory network and to provide an alternative framework for further studies in bacteria, we identified and analyzed its repertoire of DNA-binding transcription factors in terms of their abundance, family distribution and regulated genes.
Results: A collection of 237 DNA-binding transcription factors (TFs) was identified in B. subtilis, about half of them with experimental evidence. Of these, 59% were predicted to be repressors, 17% activators and 17% dual regulatory proteins, while the remaining 6.3% could not be associated with a regulatory role. From this collection, 56 TFs were found to be autoregulated, most of them negatively, though a significant proportion of positive feedback circuits was also identified. TFs were clustered into 51 regulatory protein families and then traced across 58 Firmicutes genomes to detect their presence. From this analysis, three families were found to be conserved in all the Firmicutes; fifteen families were distributed in all Firmicutes except the Mollicutes; two were restricted to Bacillales; and two families were found to be specific to B. subtilis. Repression seems to be the most common regulatory mechanism in Firmicutes, judging from the high proportion of repressors detected in these genomes. In addition, six global regulators were defined in B. subtilis based on the number and function of their regulated genes.
Conclusion: In this work we identified and described the characteristics of the repertoire of DNA-binding TFs in B. subtilis. We also quantified their abundance, family distribution, and regulatory roles in the context of Firmicutes. This work should not only contribute to our understanding of the regulation of gene expression in bacteria from the perspective of B. subtilis but also provide a basis for comprehensive modeling of transcriptional regulatory networks in Firmicutes.


Background
Transcriptional regulation is an important mechanism for controlling many biological phenomena, such as development and cell proliferation. Regulation of gene expression in an organism involves a complex network mediated by DNA-binding transcription factors (TFs), which respond to changes in the cellular environment by altering the expression of relevant genes. Because of the crucial role of TFs in coordinating the gene expression kinetics of a genome, they are studied in many ways, including mutation analysis and the elucidation of numerous three-dimensional structures. Identifying the repertoire of regulatory proteins in a genome sequence is a prerequisite for understanding the regulation of gene expression and, on a global scale, for elucidating regulatory networks.
B. subtilis (GenBank: AL009126) is a sporulating Gram-positive bacterium that lives primarily in the soil and associated water sources. In its natural habitat, the bacterium is exposed to frequently changing environmental conditions. The high variability of the natural B. subtilis habitat is reflected in its complex gene regulatory apparatus, which enables fast and efficient adaptation of the cell to varying environmental factors. Additionally, B. subtilis has evolved the ability to enter a nearly inanimate physiological state, the spore [1]. Starvation and stress, as well as the initiation of spore formation (sporulation) and the subsequent germination of the spore into a vegetative cell, are associated with extensive changes in gene expression patterns. This results in qualitative and quantitative variation in the composition of the cellular mRNA pool [2].
B. subtilis is the best-characterized member of the Gram-positive bacteria. Its genome comprises 4,224 protein-coding genes. Of these, 53% are represented once, while a quarter of the genome corresponds to several gene families that have been greatly expanded by gene duplication, the largest family containing 77 putative ATP-binding transport proteins [3]. In addition, a large proportion of the genetic capacity is devoted to the utilization of a variety of carbon sources, including many plant-derived molecules. The publication of its genome sequence, the subsequent systematic functional analysis and experimental characterization of its specific gene regulatory programs, and an extensive understanding of its biochemistry and physiology make this microorganism an excellent candidate, second only to Escherichia coli K12, for modeling regulatory networks in silico.
In this work we are interested not only in the identification and classification of the DNA-binding transcriptional regulatory repertoire of B. subtilis but also in a comparative genomic analysis to deduce how the TFs and their evolutionary families are distributed among all sequenced Firmicutes. In the process we also characterize the TF repertoire in completely sequenced genomes of Firmicutes. We selected this bacterium because it is a model organism for Gram-positive bacteria, a group that includes a wide diversity of organisms, some of them pathogens and others important for the biotechnology industry. This analysis resulted not only in the identification of a core set of regulatory genes shared by B. subtilis and other organisms but also in the identification of a specific set involved in the regulation of gene expression in this bacterium alone. This work provides a basis for the analysis of transcriptional regulatory networks in Firmicutes and beyond.

Results and discussion
The repertoire of DNA-binding TFs in B. subtilis
The identification and characterization of the repertoire of DNA-binding TFs in B. subtilis is a key step toward understanding the transcriptional regulatory machinery of this bacterium, and it opens an excellent opportunity to explore other Firmicutes. We therefore scanned the whole genome sequence of B. subtilis using several computational approaches. Based on HMM-sequence comparisons and literature searches, we identified a total of 237 genes as the minimal repertoire of DNA-binding regulators that B. subtilis needs to regulate around 4200 genes (organized in approximately 2591 transcription units). Of these, 113 TFs have been experimentally characterized, while 124 are predicted in silico (see Methods) [See additional file 1]. This dataset represents around 6.0% of the total protein-coding genes of B. subtilis, in agreement with previous estimates suggesting that less than 10% of gene products in all bacteria are associated with gene regulation [4]. However, alternative regulatory mechanisms excluded from this analysis, such as sigma factors or attenuators, could contribute to the regulatory complexity in this genome, which otherwise appears to devote a smaller proportion of its genes to transcription factors than E. coli does [4,5]. The ratio of TFs to transcription units (TUs) is about 1:10; a similar ratio has been reported for E. coli. Additionally, based on the information detailing the regulation of 728 genes (promoters, TFs, and binding sites) deposited in DBTBS [6], SubtiList [7], and the Prodoric database [8], we found that 61% of these genes are regulated by a single TF, consistent with the observation that regulation by only one transcriptional regulator is the most frequent case in the bacterial systems described so far [9]. A further 24% of the genes are regulated by two TFs, 11% by three TFs and around 5% by four or more TFs.
However, it should be noted that the data on regulatory interactions in B. subtilis used in this analysis are far from complete, so the tendencies observed above, although consistent with those observed in E. coli, could be influenced by the incompleteness of the data for both genomes. To identify regulators that influence a large fraction of the regulatory network, we defined a TF as a global regulator if it regulates more than 20 different genes organized in different transcription units (TUs) and if those genes belong to at least four different functional categories, excluding hypothetical functions. This definition is based on previous observations made for the Gram-negative model organism E. coli [10]. Using this approach, we identified six TFs as global regulators (see Table 1), which together modulate the expression of around 60% of the total genes in B. subtilis. These proteins regulate diverse processes, including mechanisms related to the cell envelope, amino acid biosynthesis, energy, and transport. In contrast to this approach, a recent study identified global regulators, or hubs, in the transcriptional regulatory network of B. subtilis using only the outgoing connectivity of each TF [11]; nevertheless, the two sets of results overlap well. An intriguing observation concerns catabolite repression, which is mediated in B. subtilis by the global regulator CcpA (catabolite response regulator; GenBank: 16080026) and in E. coli by Crp (GenBank: 82583733): the two proteins do not share an evolutionary history, yet they regulate the same metabolic response. Based on these cases, one can infer that multiple mechanisms of catabolite control might have evolved independently of each other to respond to the same cellular condition, such as Crp (Crp family) in E. coli, CcpA (GalR/LacI family) in B. subtilis, and, as described more recently, Crc (GenBank: 15600525) (endonuclease/exonuclease/phosphatase family) in Pseudomonas aeruginosa [12].
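The working definition of a global regulator used above can be written as a simple predicate. The following is a minimal sketch rather than the procedure actually used in this study: the record layout (`RegulatedGene`), the label for hypothetical functions, and the reading of "different transcription units" as "more than one TU" are all assumptions introduced for illustration.

```python
from dataclasses import dataclass

# Assumed label for genes without an assigned function (not taken from the paper).
HYPOTHETICAL = "hypothetical"


@dataclass(frozen=True)
class RegulatedGene:
    gene: str       # gene name
    tu: str         # transcription unit the gene belongs to
    category: str   # functional category of the gene product


def is_global_regulator(targets, min_genes=20, min_categories=4):
    """Return True if a TF's targets satisfy the rule stated in the text.

    The TF must regulate more than `min_genes` distinct genes, those genes
    must lie in more than one transcription unit (our assumption), and they
    must span at least `min_categories` functional categories once
    hypothetical functions are excluded.
    """
    genes = {t.gene for t in targets}
    tus = {t.tu for t in targets}
    categories = {t.category for t in targets if t.category != HYPOTHETICAL}
    return (
        len(genes) > min_genes
        and len(tus) > 1
        and len(categories) >= min_categories
    )
```

Applied to the list of targets of each TF, a predicate of this kind separates candidate global regulators from local ones; the thresholds are exposed as parameters so that the sensitivity of the result to the cutoffs can be explored.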
In bacteria, the most common structure associated with TFs is the helix-turn-helix (HTH). The position of this structure in the sequence correlates with the regulatory role: most repressors tend to have the HTH at the N-terminus, whereas activators display it at the C-terminus [13]. To determine the proportions of repressors, activators and dual TFs in B. subtilis, we used this correlation to assign probable regulatory roles to the regulators in the dataset whose DNA-binding structure is an HTH; some of these assignments were corroborated by literature searches. From this analysis, 59% of the TFs were predicted to be repressors, 17% activators and 17% dual regulators, while for 6% a regulatory role could not be deduced (see Figure 1). This trend agrees with the observation that most promoters in bacteria are repressed, and with the fact that B. subtilis has more repressed than activated promoters. Indeed, a more detailed analysis of these promoters indicates that around 60% of the repressor sites lie between -1 and -60, the region occupied by RNA polymerase, whereas around 30% lie between +1 and +60 (see Figure 2). This suggests that repression by steric hindrance, in which the repressor-binding site overlaps core promoter elements and blocks recognition of the promoter by the RNA polymerase holoenzyme [14], is probably the most common regulatory mechanism associated with the TFs of B. subtilis, whereas repression by blocking elongation occurs in a smaller proportion of cases. On the other hand, 60% of the positive sites are located upstream of the promoter (between -40 and -100), suggesting class I and class II activation, in which the activator binds to a target adjacent to the promoter's -35 element and the bound activator interacts with the alpha subunit or with region 4 of σ70. A further 15% of the activator binding sites are located between -40 and -1, suggesting activation by conformational change, in which the activator binds at or near the promoter elements and realigns the -10 and -35 elements so that the RNA polymerase holoenzyme can bind to the promoter [14]. Similar trends have been observed in two independent studies conducted previously using data from E. coli [15,16].
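The correlation between HTH position and regulatory role described above lends itself to a small heuristic. The sketch below is illustrative only: the text does not state how "N-terminal" and "C-terminal" were delimited, so the cutoff used here (the midpoint of the HTH falling in the first or last third of the protein) and the function name are assumptions, and dual regulators cannot be recognized from position alone.

```python
def predict_role_from_hth(hth_start, hth_end, protein_length, terminal_fraction=1 / 3):
    """Guess a regulatory role from the position of the HTH motif.

    Coordinates are 1-based and inclusive. An HTH whose midpoint lies in the
    N-terminal fraction of the protein suggests a repressor; one in the
    C-terminal fraction suggests an activator; anything else is left as
    'unknown'. The one-third cutoff is an assumption, not a published value.
    """
    if protein_length <= 0 or not (1 <= hth_start <= hth_end <= protein_length):
        return "unknown"
    relative_midpoint = (hth_start + hth_end) / 2 / protein_length
    if relative_midpoint < terminal_fraction:
        return "repressor"
    if relative_midpoint > 1 - terminal_fraction:
        return "activator"
    return "unknown"


# Hypothetical coordinates, for illustration only.
print(predict_role_from_hth(5, 25, 300))     # HTH near the N-terminus
print(predict_role_from_hth(270, 290, 300))  # HTH near the C-terminus
```

Predictions of this kind are only a first approximation and, as in the analysis above, would need to be checked against the literature where experimental evidence exists.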
Additionally, we identified 56 TFs that are reported to be cross-regulated. Figure 3 shows the matrix of regulatory interactions for these TFs in B. subtilis. Among those whose autoregulatory role could be established, 69% are negatively autoregulated, 26% positively and 5% dually. A similar trend has been observed previously for E. coli K12 [4]. However, the proportion of positive autoregulatory feedback circuits in B. subtilis contrasts with that observed in E. coli, where positive autoregulation accounts for only 6.5%. This difference is probably a consequence of the more elaborate regulatory mechanisms of B. subtilis, which could have developed to recruit metabolic signals systematically, either to improve the response or to switch between the vegetative and spore stages of the life cycle. In fact, Thieffry et al. [17] propose that mixed metabolic/genetic positive circuits need the continuous presence of the metabolites involved to remain active, allowing the cell to monitor the presence of such metabolites continuously. In light of these findings we propose that the common autoregulatory organization observed in the TFs of E. coli and B. subtilis might play an important evolutionary and functional role in all bacteria, because perturbations in the expression of a particular TF would change the expression of a limited number of coordinated genes rather than of the whole network.
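The autoregulation percentages quoted above can be derived from a signed interaction matrix by looking only at its diagonal. The following sketch assumes the matrix is stored as a dictionary keyed by (regulator, target) pairs, with the signs "-", "+" and "*" standing for repression, activation and dual regulation; the data structure and the toy TF names are hypothetical.

```python
from collections import Counter


def autoregulation_summary(interactions):
    """Percentage of each sign among self-regulating TFs.

    `interactions` maps (regulator, target) to '-', '+' or '*'.
    Only entries where regulator == target (autoregulation) are counted.
    """
    signs = Counter(
        sign for (regulator, target), sign in interactions.items()
        if regulator == target
    )
    total = sum(signs.values())
    if total == 0:
        return {}
    return {sign: 100.0 * count / total for sign, count in signs.items()}


# Toy example with made-up TF names (not data from this study).
toy = {
    ("A", "A"): "-",
    ("B", "B"): "-",
    ("C", "C"): "+",
    ("D", "D"): "*",
    ("A", "B"): "+",  # cross-regulation, ignored by the summary
}
print(autoregulation_summary(toy))
```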

Identification of TF families in B. subtilis
In a previous analysis it was proposed that DNA-binding TFs can be grouped into protein families based on their amino acid sequence similarity [4]. To construct TF families in B. subtilis, we first identified and defined families based on HMM searches with family-specific HMMs derived from E. coli and with PFAM models [18] (see Methods). To expand the families, we then considered a protein a member of a family if it shared at least 25% identity with any member of the group in the DNA-binding domain (DBD) or if it had matches in the HMM searches. We then aligned each TF to its corresponding family-specific PFAM model using the hmmalign program from the HMMer suite [18].
Figure 1. Total number of TFs identified per genome for all the Firmicutes analyzed in this study. The proportion of activators (yellow), repressors (black), dual regulators (red) and proteins with unknown regulatory role (gray) is shown.
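The family-expansion rule described above (at least 25% identity in the DBD with any existing member, or a match in the HMM searches) can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used in this study: sequences are assumed to be already aligned with "-" as the gap character, identity is computed over columns in which neither sequence has a gap, and the function names are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def belongs_to_family(candidate_dbd, family_dbds, has_hmm_match, min_identity=25.0):
    """Apply the membership rule: an HMM match, or enough identity to any member.

    `candidate_dbd` and each entry of `family_dbds` are DBD sequences aligned
    to a common set of columns (an assumption of this sketch).
    """
    if has_hmm_match:
        return True
    return any(
        percent_identity(candidate_dbd, member) >= min_identity
        for member in family_dbds
    )
```

Other definitions of percent identity (for example, dividing by the length of the shorter sequence) would shift borderline cases, so the denominator should be fixed before the 25% threshold is applied.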
The whole repertoire of TFs was clustered into 51 families of varying sizes (see Figure 4). For instance, nine families contain more than 10 members, the most abundant being the Multiple Antibiotic Resistance Regulators (MarR) family (20 TFs) and the GntR family (14 TFs), whereas thirty-nine families include fewer than 9 TFs. An interesting observation is that the ArsR (arsenic resistance regulator) family contains nine members, while in E. coli this family is represented by only two proteins, suggesting several gene duplication events for members of this family in B. subtilis. Another notable difference relative to E. coli is the Crp family: in E. coli three members associated with global regulatory processes have been identified, while in B. subtilis only Fnr was identified. This observation raises the possibility that Fnr existed in the last common ancestor of these two genomes and that gene loss is responsible for the absence of other members; however, additional evidence is necessary. Twenty-six genes encoding response regulators were also identified (from the EBP, LuxR and OmpR families), most of which are arranged adjacent to genes encoding histidine kinases. This finding is interesting because it points to a probable process of co-evolution between the response regulators and the sensor genes. Indeed, recent coevolution events have been identified in members of this family in E. coli [19].
In addition, 19 families include only one member. These families, such as AbrB, ComK, and CodY, seem to play an important role in processes specific to this bacterial division, such as sporulation and competence. Families of local regulators, such as BirA (biotin biosynthesis), LexA (SOS response), Fur (ferric uptake regulator) and ArgR (arginine biosynthesis regulation), were also identified in few copies. A similar trend has been found in different bacterial genomes for these TFs [20].
Figure 3. Matrix of regulatory interactions between TFs in B. subtilis. Each filled box shows the sign of the regulatory effect of the TF in the corresponding column on the expression of the TF in the corresponding row. Repression is denoted by -, activation by + and dual regulation by *.
[...] biosynthesis and MarR resistance to antibiotics). We also found that there are specific regulators associated with B. subtilis and with Firmicutes; three of them are involved in sporulation and related processes (AbrB, ComK, and CodY). This difference in the proportion of TFs and families suggests that distinct regulatory mechanisms have probably arisen in B. subtilis for specific processes, such as sporulation, while a shared core of TFs maintains adequate homeostatic control of the remaining genes.

Structural assignments and Fold frequency of Transcription factors (TFs)
The helix-turn-helix (HTH) is known to be the most common structure associated with DNA-binding TFs in bacteria [5,20]. Alternative structures have been identified in smaller proportions. To determine the diversity of structural domains in the TF repertoire of B. subtilis, the transcriptional regulators were analysed using Superfamily HMM models [21]. This analysis shows the structural variability associated with HTH proteins. We found that forty-seven percent of the whole repertoire of TFs contain a "winged" HTH. This result is interesting because it corresponds to 21 of the 51 families identified in this bacterium. Around 14% of the TFs (representing four families) are associated with the "homeodomain-like" HTH superfamily domain. Only two families contain the "classical" HTH, yet they represent almost 12% of the whole repertoire, which shows that these groups (GalR/LacI and Xre) are two of the largest families in B. subtilis.
Alternative DNA-binding domains were also identified, though in much smaller proportions, such as the IHF-like structures or nucleic acid-binding structures (associated with the Cold Shock family). Finally, most B. subtilis TFs are two-domain proteins (a DBD and a multimerization/ligand-binding domain). A similar trend was reported recently for the repertoire of TFs in E. coli K12 [22], where the authors also suggest that almost three quarters of the TFs are two-domain proteins (as in B. subtilis) and are the result of diverse duplication events [See additional files 2 and 3].

Tracing the TF families in Firmicutes
In order to determine the abundance and distribution of TF families among Firmicutes, we examined their occurrence in 58 genomes: 27 Bacillales, 17 Lactobacillales, 10 Mollicutes, and 4 Clostridia (see Methods). We undertook this analysis because it might give us clues about the evolution of common cellular processes among organisms of this bacterial lineage. Below we summarize the prominent observations that emerged from this analysis:
a) We observed a rough correlation between the number of TFs and genome size: larger genomes contain more TFs than smaller ones. This is not surprising, since a genome with more coding regions would be expected to encode more DNA-binding transcriptional regulators, as has been proposed previously [5,20,23,24]. Thus, the proportion of TFs in larger genomes is consistent with the hypothesis that an increase in genome complexity and physiological functionality is generally associated with a more complex regulation of gene expression, since the additional genetic information has to be integrated into the existing regulatory networks that operate in a bacterial cell [25]. In this context it is interesting to note that the class Mollicutes contains a much smaller fraction of the TFs identified so far, probably because most of these organisms have lost many of their genes as a consequence of their lifestyle (see below and Figure 4).
b) When the proportion of TFs was analyzed as a function of the number of families per genome, we found that although some bacteria contain a high proportion of families, the sizes of these families seem to be reduced, whereas in bacteria with few families the family sizes seem to be larger, with a high proportion of TFs. This finding suggests that some families have been widely duplicated, whereas other families have been constrained to few members as a consequence of the bacterial lifestyle. This could also suggest that a fraction of the total TF repertoire in Firmicutes is a consequence of massive gene duplication constrained to only a few protein families.
c) We identified three families "universally" distributed among Firmicutes: HrcA, DnaA, and PurR (except in the mollicutes Mycoplasma genitalium and M. pneumoniae). These families are associated with the regulation of class I heat-shock genes (dnaK, groESL) for HrcA, the DNA replication process for DnaA, and the adenine nucleotide-dependent regulation of the pur operon for PurR. All of these are important informational processes, and these families might belong to the ancestral core of TFs in this bacterial division.
d) Fifteen families were identified as common to Bacillales, Lactobacillales and Clostridia: GntR, GalR/LacI, LysR, MarR, TetR, MerR, OmpR, RpiR, Rrf2, CtsR, LytR, AraC/XylS, Xre, IHF and PaiA. Interestingly, many of these families are highly represented in these three lineages of Firmicutes. Members of the GntR, AraC/XylS and GalR/LacI families generally respond to environmental changes that affect the carbohydrate metabolism of the cell [4]. It certainly makes sense that soil bacteria have a large diversity of DNA-binding transcriptional regulators that respond to changes in the carbohydrate composition of the environment. The MarR, TetR and MerR families regulate resistance to antibiotics and mercury, among others, while the OmpR family is associated with the regulation of membrane components by two-component systems. The large number of proteins grouped into the GntR and GalR/LacI families may give these bacteria the ability to grow in the presence of several carbon sources and to rapidly adapt their gene expression to changing nutrient conditions, as has been suggested previously [26].
e) Two families, Psq and ComK, are exclusive to Bacillales and were traced in all Bacillales genomes. Of these, ComK emerges as an essential TF for the development of genetic competence in B. subtilis and probably in all Bacillales. This protein contains an atypical DNA-binding structure, the "helical domain of sec23/24" [27]. This transcription factor is considered a global regulator, and its gene expression is strictly regulated by nutritional and growth phase-dependent control [28]. Additionally, its expression depends on its own gene product and on those of the TFs AbrB, ComA, ComP, DegU, Sin, Spo0A, Spo0H, Spo0K and SrfA. This system is highly regulated because it represents the final convergence of the signals that trigger competence development [29]. Highly regulated genes might be associated with key processes in the cell, such as competence or sporulation, suggesting that other highly regulated genes might also participate in important cellular mechanisms. It should be noted that most Bacillales include a phase of sporulation in their life cycle.
f) Finally, two families, GutR and SenS, were identified exclusively in B. subtilis. These families are very interesting because they seem to be closely related to important cellular processes: GutR, which contains an HTH motif and a nucleotide-binding domain in the N-terminal region [30], regulates the sorbitol dehydrogenase gene (gutB), and SenS, which comprises an HTH motif along its length of 65 amino acids [31][32][33], regulates extracellular enzyme genes (amyE, aprE, nprE). These TF families might be considered a regulatory signature of this bacterium.
In summary, the distribution and abundance of TF families was traced among fifty-eight genomes of Firmicutes from different lineages, opening diverse opportunities to understand the evolution of regulatory networks in this bacterial division and to define their precise role in maintaining an adequate control of gene expression. This repertoire of TF families will also pave the way to understand and analyze exhaustively other Firmicutes from the perspective of B. subtilis and to consider other specific and important questions not addressed here.
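The distribution classes used in observations c) to f) can be expressed as a classification over a presence/absence table. The sketch below is a simplified illustration, not the procedure used in this study: the input dictionaries, the class labels and the strict definitions (for example, "universal" here means present in every genome, with no exceptions allowed) are assumptions, and the toy data are made up.

```python
def classify_family(family, families_by_genome, lineage_by_genome):
    """Classify a TF family by its distribution across genomes.

    `families_by_genome` maps a genome to the set of families detected in it;
    `lineage_by_genome` maps a genome to its lineage (e.g. 'Bacillales').
    Returns 'absent', 'species-specific' (a single genome), 'universal'
    (every genome), 'lineage-specific' (one lineage only) or 'shared'.
    """
    carriers = [
        genome for genome in lineage_by_genome
        if family in families_by_genome.get(genome, set())
    ]
    if not carriers:
        return "absent"
    if len(carriers) == 1:
        return "species-specific"
    if len(carriers) == len(lineage_by_genome):
        return "universal"
    lineages_present = {lineage_by_genome[genome] for genome in carriers}
    if len(lineages_present) == 1:
        return "lineage-specific"
    return "shared"


# Toy table with placeholder genome and family names (not data from this study).
lineages = {"b1": "Bacillales", "b2": "Bacillales",
            "l1": "Lactobacillales", "m1": "Mollicutes"}
families = {"b1": {"U", "S", "B", "X"}, "b2": {"U", "S", "B"},
            "l1": {"U", "S"}, "m1": {"U"}}
for name in ["U", "S", "B", "X"]:
    print(name, classify_family(name, families, lineages))
```

A real analysis would need to tolerate exceptions, such as a family missing from one or two reduced genomes, for instance by requiring presence in a minimum fraction of genomes per lineage rather than in all of them.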

Conclusion
Based on analysis of the sequenced genome of B. subtilis, we identified its repertoire of DNA-binding TFs, which allowed us to identify TFs common to other Firmicutes and TFs specific to a few closely related lineages. We showed that the number of TFs reflects different lifestyles, and that families are distributed almost homogeneously among all Firmicutes. The diverse elements involved in the regulatory networks apparently have different evolutionary histories, sometimes showing functional conservation exclusive to specific lineages, such as the sporulation-specific TFs observed in Bacillales. Further research is necessary to determine the physiological function of species-specific and shared transcriptional regulators that might be involved in the regulation of cellular processes relevant to biotechnological production, or that might control the expression of genes involved, for instance, in the virulence of pathogenic bacteria. However, we must also take into account alternative regulatory mechanisms not considered here, such as attenuation or regulation mediated by riboswitches. For instance, when we compared the number of sigma factors in E. coli (7 sigma factors) and B. subtilis (17 sigma factors), we found a clear difference, possibly suggesting that sigma factors account for the relatively large proportion of DNA-binding proteins in B. subtilis in comparison to E. coli. The analysis presented here will help to understand the regulatory networks in different bacteria by using E. coli and B. subtilis as models, and to decipher the evolution of these networks in a global context.

Methods
Identification of DNA-binding transcription factors (TFs)
In order to identify the repertoire of TFs in Firmicutes, including B. subtilis, we used a combination of information sources and bioinformatics tools. A first set of 292 putative TFs was collected from DBTBS, a database devoted to gene regulatory mechanisms in B. subtilis strain 168 [6]. From this dataset we excluded, by sequence comparison against SwissProt and literature searches, 75 proteins annotated as terminators, antiterminators, and sigma factors, among others. This left 217 well-annotated TFs in this bacterium.
In the second phase, 90 family-specific Hidden Markov Models (HMMs) reported previously [20] were used to scan the whole B. subtilis genome sequence (E-value threshold ≤ 10^-3). We used the hmmsearch module from the HMMer suite of programs [18]. Briefly, these models were constructed using as seeds the TF families previously identified in E. coli K12. The models consider, almost exclusively, the DNA-binding domain sequence of each protein family (around 60 amino acids). We excluded proteins whose match covered less than 50% of the corresponding HMM, even when the match corresponded to the DBD. In this search, 181 proteins were identified as probable TFs [See additional file 4]. This search is important because proteins identified by these specific HMMs might not be included in the dataset retrieved in the previous phase.
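The two filters applied in this phase, an E-value cutoff and a minimum coverage of the model, can be sketched as follows. This is an illustrative example rather than the code used in this study: the `HmmHit` record and its field names are hypothetical, the hits are assumed to have been parsed already from the search output, and the E-value cutoff is read in the usual sense, that is, hits with an E-value of at most 10^-3 are retained.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HmmHit:
    protein: str      # identifier of the matched protein
    family: str       # name of the family-specific model
    evalue: float     # E-value of the match
    hmm_from: int     # first model position covered (1-based)
    hmm_to: int       # last model position covered (inclusive)
    hmm_length: int   # total length of the model


def model_coverage(hit):
    """Fraction of the HMM covered by the match."""
    return (hit.hmm_to - hit.hmm_from + 1) / hit.hmm_length


def filter_hits(hits, max_evalue=1e-3, min_coverage=0.5):
    """Keep hits that pass both the E-value and the model-coverage filters."""
    return [
        hit for hit in hits
        if hit.evalue <= max_evalue and model_coverage(hit) >= min_coverage
    ]


# Made-up hits for illustration (protein names are placeholders).
hits = [
    HmmHit("p1", "MarR", 1e-10, 1, 60, 60),  # significant, full coverage
    HmmHit("p2", "MarR", 0.5, 1, 60, 60),    # fails the E-value filter
    HmmHit("p3", "GntR", 1e-5, 1, 20, 60),   # fails the coverage filter
]
print([hit.protein for hit in filter_hits(hits)])
```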
In the third phase, the B. subtilis proteome was analyzed with the library of HMMs from the Superfamily database (E-value ≤ 10^-3) [21]. This HMM library is based on the sequences of domains collected in the Structural Classification of Proteins (SCOP) database [27] and is thus applicable to a structural classification of proteins. This step was intended to identify additional TFs not found in the previous phases.
The TFs identified in each of the three phases (DBTBS, E. coli-derived HMMs and Superfamily searches) were compared to define the final TF repertoire. The final dataset included the intersection of proteins identified by HMMs, Superfamily searches, and the manually curated repertoire of TFs described in DBTBS. Three confidence levels were defined to assess the quality of the identified TFs: a) high, for TFs identified by all three approaches; b) medium, for those identified by two methods; and c) low, for TFs identified by only one method. Additionally, literature information was used to find further TFs not identified by these automatic searches.
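The three confidence levels amount to counting how many of the three methods support each protein. A minimal sketch of that bookkeeping is given below; the function name, the level labels and the example identifiers are assumptions made for illustration, and the sets passed in would be the protein identifiers recovered in each phase.

```python
def confidence_levels(dbtbs, ecoli_hmm, superfamily):
    """Assign 'high', 'medium' or 'low' confidence to every candidate TF.

    A protein supported by all three methods is 'high', by two 'medium',
    and by a single method 'low'.
    """
    methods = (set(dbtbs), set(ecoli_hmm), set(superfamily))
    labels = {3: "high", 2: "medium", 1: "low"}
    levels = {}
    for protein in set().union(*methods):
        support = sum(protein in method for method in methods)
        levels[protein] = labels[support]
    return levels


# Placeholder identifiers, not proteins from this study.
levels = confidence_levels({"a", "b", "c"}, {"a", "b"}, {"a", "d"})
print(sorted(levels.items()))
```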
Finally, 237 proteins were deduced to be the minimal set of TFs that B. subtilis needs to regulate its gene expression. The identified proteins were classified into families using the HMMs deposited in the PFAM database [18] and aligned using the hmmalign program from HMMer. To identify TFs and families in other Firmicutes, we constructed family-specific HMMs from the B. subtilis families and ran them against 58 Firmicutes genome sequences (an E-value ≤ 10^-3 was used as the threshold) [See additional file 5].