Computational analysis of LexA regulons in Cyanobacteria

Background The transcription factor LexA plays an important role in the SOS response in Escherichia coli and many other bacterial species studied. Although the lexA gene is encoded in almost every bacterial group with a wide range of evolutionary distances, its precise functions in each group/species are largely unknown. More recently, it has been shown that lexA genes in two cyanobacterial genomes Nostoc sp. PCC 7120 and Synechocystis sp. PCC 6803 might have distinct functions other than the regulation of the SOS response. To gain a general understanding of the functions of LexA and its evolution in cyanobacteria, we conducted the current study. Results Our analysis indicates that six of 33 sequenced cyanobacterial genomes do not harbor a lexA gene although they all encode the key SOS response genes, suggesting that LexA is not an indispensable transcription factor in these cyanobacteria, and that their SOS responses might be regulated by different mechanisms. Our phylogenetic analysis suggests that lexA was lost during the course of evolution in these six cyanobacterial genomes. For the 26 cyanobacterial genomes that encode a lexA gene, we have predicted their LexA-binding sites and regulons using an efficient binding site/regulon prediction algorithm that we developed previously. Our results show that LexA in most of these 26 genomes might still function as the transcriptional regulator of the SOS response genes as seen in E. coli and other organisms. Interestingly, putative LexA-binding sites were also found in some genomes for some key genes involved in a variety of other biological processes including photosynthesis, drug resistance, etc., suggesting that there is crosstalk between the SOS response and these biological processes. In particular, LexA in both Synechocystis sp. PCC6803 and Gloeobacter violaceus PCC7421 has largely diverged from those in other cyanobacteria in the sequence level. It is likely that LexA is no longer a regulator of the SOS response in Synechocystis sp. PCC6803. Conclusions In most cyanobacterial genomes that we analyzed, LexA appears to function as the transcriptional regulator of the key SOS response genes. There are possible couplings between the SOS response and other biological processes. In some cyanobacteria, LexA has adapted distinct functions, and might no longer be a regulator of the SOS response system. In some other cyanobacteria, lexA appears to have been lost during the course of evolution. The loss of lexA in these genomes might lead to the degradation of its binding sites.


Background
The LexA protein was first characterized as the transcriptional regulator of the SOS response in Escherichia coli [1,2], and later in several other bacteria, including Bacillus subtilis [3,4] and Fibrobacter succinogenes [5]. In fact, the lexA gene is found in almost all eubacterial groups examined so far [5,6]. In E. coli, around 30 genes involved in the SOS response are under the regulation of LexA [2]. Under normal growth conditions, LexA represses the SOS response genes by binding to their promoter regions, and thus blocking their transcription. When DNA is damaged, the binding of RecA to the released single-stranded DNA induces the auto-cleavage of the Ala 84 -Gly 85 peptide bond [7,8] in LexA, thereby inhibiting the dimerization of LexA and preventing its binding to DNA [9][10][11]. In this manner, SOS response genes are de-repressed and expressed at different time points and different levels in a coordinated way [10].
LexA in E. coli consists of an N-terminal DNA-binding domain and a C-terminal dimerization domain [8,12].
The N-terminal contains three α-helices (I, II, III) and an anti-parallel β-sheet [12]. Helices II and III form a helixturn-helix DNA-binding motif, and all the DNA-contacting residues Ser 39 , Asn 41 , Ala 42 , Glu 44 and Glu 45 are located in helix III [13] as revealed by both NMR [12] and X-Ray crystallography analyses [8]. The LexA-binding sites in E. coli were found to be a 16-bp palindromic motif with the consensus sequence CTG(TA) 5 CAG [14]. It has been shown that two reactive residues Ser 119 and Lys 156 in E. coli LexA are critical for the auto-hydrolysis of the peptide bond Ala 84 -Gly 85 [1,9]. The core set of the SOS response system consists of lexA, recA, uvrABCD, umuCD and ruvB [10]. Upon the auto-hydrolysis of LexA, the uvrABCD operon is expressed first, whose products are responsible for the nucleotide excision repair (NER). Then recA and several other genes for homologous recombination are expressed, retrieving the excised DNA double strands. Next, the cell division inhibitor SfiA is induced to guarantee a sufficient time for the DNA repairing to be completed. In the end, if the DNA is not completely repaired, the operon umuCD encoding the mutagenic DNA repair polymerase Pol V will be induced to perform translesion DNA synthesis [9,14]. Since the lexA gene itself is also under the control of LexA, after the damaged DNA is repaired, the activity of RecA declines, the production of LexA surpasses its auto-cleavage. Consequently, the increased concentration of LexA restores the inhibition of the expression of the SOS response genes.
More recently, LexA homologs were also experimentally studied in a few cyanobacteria [15][16][17][18][19][20][21]. These studies suggest that LexA in Nostoc sp. PCC 7120 [16] binds to the promoter regions of lexA and recA; however, LexA in Synechocystis sp. PCC 6803 may regulate different genes/systems other than the SOS system. Domain et al. concluded from microarray gene profiling analysis [21] that LexA in this species might be involved in carbon metabolism. Later, LexA in Synechocystis sp. PCC 6803 was found to regulate the crhR gene encoding a RNA helicase [19]. Moreover, it has been shown that the transcription of the bidirectional hydrogenase genes hoxEFUYH was regulated by LexA in Synechocystis sp. PCC6803 [17]. In Nostoc sp. PCC 7120, hoxEFUYH genes are split into two separate operons, and LexA was found to bind to the upstream regions for both operons [15]. Mazon et al. [16] showed that the LexA-binding sites in Nostoc sp. PCC 7120 have a 14-bp pseudo-palindromic structure in the form of RGTACNNNDGTWCB, which are similar to those in B. subtilis. Additionally, Sjöholm et al. [15] found two putative palindromic LexA-binding sites: one in the promoter region of alr0750-hoxE-hoxF that resembles Mazon's LexA boxes [16], and another, TTACACTTTAA in the upstream region of hoxU in Nostoc sp. PCC 7120. Meanwhile, multiple putative LexA boxes were identified in Synechocystis sp. PCC6803: a 13-bp pseudo-palindromic segment AGTAACTAGTTCG in the upstream region of hoxE, which is similar to Mazon's site but with one base deletion [17]; another direct repeat pattern, CTA-N 9 -CTA proposed to be recognized by LexA in vitro [20]; and two putative LexA boxes that resemble none of the putative LexA boxes listed above [18]. Despite this progress, a more extensive study of LexA proteins and their binding sites and regulons in cyanobacterial genomes is still needed. In this study, we have predicted LexA-binding sites and regulons in all the sequenced cyanobacterial genomes that harbor a lexA gene, and analyzed the evolutionary changes in the LexA regulons in cyanobacteria, as well as their relationship with those in proteobacteria and gram-positive bacteria.

Conservation of the DNA-binding domain of LexA in cyanobacteria
We identified orthologs of the LexA protein in Nostoc sp. PCC7120 (alr4908) in 26 of the 33 sequenced cyanobacterial genomes using the bi-directional best hit (BDBH) method based on BLASTP search with an E-value cutoff 10 -10 (see Methods). Seven genomes appear not to harbor a lexA gene under this criterion, namely, Gloeobacter violaceus PCC7421, Synechococcus sp. JA-3-3Ab A-Prime, Synechococcus sp. JA-2-3B'a(2-13) B-Prime, Synechococcus elongatus PCC6301, Synechococcus elongatus PCC7942, Trichodesmium erythraeum IMS101 and Thermosynechococcus elongatus BP-1. We removed the Synechococcus elongatus PCC7942 genome from our study since Synechococcus elongatus PCC6301 is virtually identical to it [22]. However, an ortholog of the lexA gene (Gll0709) does exist in Gloeobacter violaceus PCC7421. The reason we failed to identify this ortholog is that it does not meet our BDBH criterion due to its largely divergent sequence. The phylogenetic tree of these 27 LexA amino acid sequences indicates that they can be clustered into three groups ( Figure 1A), corresponding to the previously described Clade A (containing Gloeobacter violaceus PCC7421), Clade C (containing small marine Prochlorococcus and Synechococcus), and Clade B (containing most remaining cyanobacteria) [23,24]. However, aside from Gloeobacter violaceus PCC7421, the DNA-binding domains (DBD) of LexA from these cyanobacteria are highly conserved ( Figure 1B), especially the helix III, where DNA-contacting residues are located [13]. This result is in agreement with earlier observations [16,21]. This provides the rationale of our analysis, including the phylogenetic footprinting analysis (next section) and genome-wide scanning for LexA-binding site predictions. On the other hand, since the DBD in Gloeobacter violaceus PCC7421 is quite different from those in other cyanobacteria, especially where the DNA-contacting residues locate, thus, we excluded it from our study, leaving 31 species/ strains for the putative LexA regulon prediction.

LexA-binding sites predicted by phylogenetic footprinting
We considered both an operon and a singleton gene as a transcription unit (TU). As Mazon et al. [16] have demonstrated the binding of LexA to the upstream regions of two genes, lexA and recA, and predicted LexA-binding sites for other four genes (uvrA, ssb, alr4905, and all4790) in Nostoc sp PCC7120, we used phylogenetic footprinting to identify possible LexAbinding sites in the pooled 118 inter-TU sequences associated with these six genes in Nostoc sp. PCC7120 [16] and their orthologs in the other 25 cyanobacterial genomes (excluding Gloeobacter violaceus PCC7421) that harbor a lexA gene (see Methods). We identified 49 high-scoring 14-bp palindromic sequences (Table 1) out of the 118 input sequences by applying the motif finding tools MEME [25] and BioProspector [26] and incorporating the best motifs found by these two programs (See Methods and Additional file 1). However, the putative LexA box AGTCCTAGAGTCCT (Additional file 1) identified in Synechocystis sp. PCC6803 was not identified by Patterson-Fortin et al. [20] using DNaseI footprinting assays or by Gutekunst et al. [17]. Therefore, we removed this site, leaving 48 putative LexA-binding sites (Table 1) for profile construction. The two LexA boxes that have been characterized in Nostoc sp PCC 7120 [16] were accurately recovered by the phylogenetic footprinting procedure (Table 1), suggesting that most of these high-scoring motifs are likely to be genuine LexA boxes. These putative LexA-binding sites show either a strong palindromic structure similar to the experimentally characterized LexA boxes in Nostoc sp. PCC7120 [16], or a tandem repeat structure with the consensus sequence AGTACWNWTGTACT. As demonstrated in Figure  S1 in Additional file 2, this pattern is rather similar to the consensus sequence of the LexA-binding sites previously identified in B. subitlis (CGAACN 4 GTTCG) [3], and to a less extent, to that of LexA-binding sites found in α-proteobacteria (GTTCN 7 GTTC and GAACN 7 GAAC) [27], but differs remarkably from that in E. coli CTG(TA) 5 CAG [14]. These results are consistent with our phylogenetic analysis of the 183 LexA proteins detected in 598 genomes, showing that LexA proteins in cyanobacteria are more closely related to those in gram-positive and α-proteobacteria bacteria than to those in γ-proteobacteria ( Figure S2 in Additional file 3). Accordingly, since the LexA-binding sites in B. subtilis [3,16] have a palindromic structure, it is not surprising that the LexA-binding sites in cyanobacterial genomes might have a similar palindromic structure.
Genome-wide prediction of LexA-binding sites and regulons in cyanobacterial genomes Both consensus sequence and position weight matrix (PWM) have been widely used to represent the pattern of similar sequences. The advantage of PWM (or profile)  methods over the consensus sequence methods is that the former can capture more quantitative information about the patterns by using a probabilistic model to represent the sequences. In this way, it can differentiate subtly conserved positions from the non-conserved ones [28]. In our study, We used the profile of these 48 LexA boxes (Table  1) to scan the 31 sequenced cyanobacterial genomes to predict additional putative LexA-binding sites and members of LexA regulons, using a scanning algorithm [29][30][31] that incorporates orthologous information and computes a log-odds ratio (LOR) score for evaluating the confidence of predictions in each genome (see Methods for details). The predicted results with a p-value < 0.01 for the 26 genomes harboring a lexA gene are listed in Table S1-26 (Additional file 4), while those for the five genomes without a lexA gene are listed in Table S27-31 (Additional file 5). The predicted results with a p-value < 0.05 for the 31 cyanobacteria are summarized in Table S32-62 (Additional file 6). The score of a detected putative LexA binding site for a TU is the sum of two terms: one evaluates the extent to which the putative LexA binding site resembles the scanning profile; the other evaluates the similarity of this binding site to those identified for the orthologs of genes within the TU in the other genomes. To evaluate the confidence of each motif score s, we used randomly selected coding sequences as the null model to test the statistical significance. A false positive rate was used to evaluate this statistical significance, which was defined as the fraction of the randomly selected coding sequences containing binding sites with a score higher than the cutoff s in the genome. We chose randomly selected coding regions as the null model based on the assumption that a coding sequence is less likely to contain cis-regulatory binding sites than an intergenic sequence. Although it might be possible for genuine LexA boxes to occur in coding regions [15,20], such kind of binding sites should be rare. The LOR function for a genome evaluates the ratio of the fraction of the inter-TU sequences containing a binding site with a score higher than s to the fraction of the randomly selected coding sequences containing a binding site with a score higher than the same s in the genome. Accordingly, positive LOR values that increase monotonically with the increase in binding site sores would suggest that an inter-TU sequence is more likely to contain a high-scoring LexA-binding site than does a randomly selected coding sequence in the genome.
As shown in Figure 2, when the motif score s increases beyond some value, the LOR is generally high for most of the 26 cyanobacteria that harbor a lexA gene, therefore those genomes with high LOR values are likely to contain some true binding sites. Exceptions exist in five genomes, namely, Cyanothece sp. PCC 8801, Synechocystis sp. PCC6803, Synechococcus RCC307, Synechococcus sp. PCC 7002, and Microcystis aeruginosa NIES-843, in which the LOR curves oscillate around zero when binding site score s increases. These poor LOR values might suggest that there are not more highscoring LexA-binding sites in the inter-TU regions than in the coding regions in the five genomes. The reason for this could be that our scanning algorithm rewards a binding site that is shared by orthologs in the other genomes. If a true binding site is unique to a genome, then it will not score high. In this sense, LexA is probably no longer a major SOS response regulator in these genomes. Instead, it might have become a specific local regulator during the course of evolution to adapt to their unique living environments (we will return to this subject later). In the case of Synechocystis sp. PCC6803, it is noted that the LexA-binding sites identified by Patterson-Fortin et al. [20] are totally different from those identified by Mazon et al. [16], and that the LexA sequence in this genome is largely divergent from those in the other genomes ( Figure 1A). Accordingly, the LexA binding sites in this genome might differ in some way from those in the other genomes, which can be another reason for its low LOR values.
In contrast, as shown in Figure S3 in Additional file 2, the LOR values in the five genomes that do not harbor a lexA gene (Synechococcus sp. JA-3-3Ab A-Prime, Synechococcus sp. JA-2-3B'a (2-13) B-Prime, Synechococcus elongatus PCC 6301, Thermosynechococcus elongates BP-1 and Trichodesmium erythraeum IMS101) oscillate around or decrease below zero when the motif score s increases beyond a certain value, implying that the chance to find a relative high-scoring putative LexA-binding site in an inter-TU region is not higher than in a randomly chosen coding sequence, suggesting that these genomes are unlikely to contain functional LexA-binding sites. On the other hand, the LOR values in the three genomes Synechococcus sp. JA-3-3Ab A-Prime, Synechococcus sp. JA-2-3B'a (2-13) B-Prime and Trichodesmium erythraeum IMS101 are relatively higher than those in the other two genomes (Additional file 2, Figure S3), or even could be comparable to those of the five poor-LOR-valued cyanobacteria that harbor a lexA gene ( Figure 2). In fact, the numbers of predicted binding sites in the three genomes are not too small (Table S27, S28, S31 in Additional file 5), which suggests that a few putative LexA-like binding sites exist in these genomes. A possible explanation for this phenomenon could be that these LexA-like binding sites are recognized by other transcription factors that have similar DNA-binding domains to that of LexA. The predictions of LexA regulons in the 26 cyanobacterial genomes that harbor a lexA gene are summarized in Table 2.

Conservation and diversity of the putative LexA regulons in cyanobacteria
To investigate how well the predicted LexA regulons in the 26 cyanobacterial genomes are conserved, we constructed a LexA regulon conservation tree based on the pairwise comparison of the predicted LexA regulons in these genomes (see Methods). As shown in Figure 3, these genomes are divided into two groups. Interestingly, one group is exclusively comprised of marine strains, and the other group contains the remaining genomes isolated from different non-marine habitats. In the former group, high light (HL) adapted and low light (LL) adapted ecotypes are largely grouped into two subgroups. The results suggest that the composition of LexA regulons is dependent on the habitat of the organisms to a large extent. The general topology of the tree ( Figure 3) is basically consistent with both the 16S rRNA gene tree ( Figure 4) and the LexA protein tree of these genomes ( Figure 1A). Furthermore, both the HL and LL adapted marine sub-groups are very compact, indicating that the predicted LexA regulons in both subgroups are relatively conserved. In contrast, the species in the non-marine habitats are not so close to one another (Figure 3), suggesting that the putative LexA regulons in these genomes share few genes with one another except for the closely related Anabaena variabilis ATCC 29413 and Nostoc sp. PCC7120. The tree also indicates that Microcystis aeruginosa NIES 843 and Synechocystis sp. PCC6803 have the most distinct LexA regulons from other cyanobacterial genomes (Figure 3).

Functional classification of putative LexA regulons in cyanobacteria
Predicted members of LexA regulons in the 26 cyanobacteria that harbor a lexA gene are listed in Tables S1-S26 in Additional file 4, their functions can be summarized as follows.

SOS response system
As shown in Table S63 in Additional file 6, all the 33 cyanobacterial genomes included in this study encode a few SOS response genes found in E. coli. Several of the SOS genes in some of the 26 genomes that harbor a lexA gene bear a high-scoring putative LexA-binding site in their regulatory regions (Table S1-26 in Additional file 4). In particular, two of the core SOS response genes [5,6], namely, recA and lexA, are among the most conserved putative LexA targets in the 26 cyanobacterial species/strains (Table S64 in Additional file  6). In addition, the umuC and umuD genes encoded in 13 genomes are also predicted to bear a putative LexAbinding site in their promoter regions (Table S64 in Additional file 6, Table 3 and Additional file 4). These results suggest that as in E. coli, the SOS response in most cyanobacteria might still be regulated by LexA. However, the other SOS response genes were found to bear a putative LexA-binding site only in a few genomes (Table 3) (Tables 3). Thus, it is likely that the NER process in the remaining genomes is regulated by some transcription factor other than LexA, given that uvr genes are present in all the 32 cyanobacterial genomes analyzed in this study, including those that do not encode a lexA gene (Table S63 in Additional file 6). These results are consistent with the earlier observation that LexA target genes in bacteria are highly diversified in order for them to adapt to different ecological niches [5,6]. On the other hand, in Synechococcus PCC7002, Synechococcus RCC307 and Synechococcus WH7803, LexA boxes were only detected for one of the core SOS response genes, i.e., SYNPCC7002_A0426 (recA) in Synechococcus PCC7002, SynRCC307_1756 (lexA) in Synechococcus RCC307 and SynWH7803_0439 (recA) in Synechococcus WH7803, although these genomes all encode a lexA gene and other core SOS response genes, such as recA (SYNPCC7002_A0426, SynRCC307_2111 and SynWH7803_0439,) and ruvB (SYNPCC7002_A1390, SynRCC307_1756 and SynWH7803_0185), umuC (SynRCC307_0043 and SynWH7803_1080,) and umuD (SynRCC307_0042 and SynWH7803_1081). Since only one single SOS response gene bears a putative LexA box in these genomes, it is likely that the role of LexA in the regulation of the SOS response in these genomes might have been attenuated. The case of Synechocystis sp.
PCC6803 seems to go even further in this direction as detailed below.
As indicated previously [16,18], the LexA protein of Synechocystis sp. PCC6803 is unusual in two aspects compared to those in the other genomes analyzed in this study. First, as shown in Figure S4 in Additional file 2, the Ala-Gly dyad in the N-terminus of LexA responsible for auto-cleavage of the protein in all other cyanobacteria as well as in E. coli and B. subtilis [16] are replaced by Gly-Gly in Synechocystis sp. PCC6803. Second, the reactive residue Ser (Ser 119 of LexA in E. coli) that attacks the Ala-Gly peptide bond is replaced by Asp of LexA in Synechocystis sp. PCC6803 [16,18]. It has been shown that SOS induction cannot be initiated by a non-cleavable LexA repressor [10,16]. Therefore, it is highly likely that LexA in Synechocystis sp. PCC6803 cannot undergo the auto-cleavage reaction in response to DNA damage, and it might have adopted a different function other than the canonical SOS response regulator seen in E. coli and B. subtilis [32]. This argument is consistent with the observation that Synechocystis sp. PCC6803 has a notably larger branch length in the 27 LexA protein tree ( Figure 1A), but this is not seen in the 16S rRNA gene tree (Figure 4).
Although the Synechocystis sp. PCC6803 genome harbors some core SOS response genes including lexA and recA (Table S63 in Additional file 6), none of them belongs to our predicted LexA regulon at a p-value < 0.01 (Table S26 in Additional file 4 and Table 3). The mutS (sll1772) gene is the only gene that is likely to be in involved in DNA mismatch repair, while bearing a putative LexA binding site in the genome. However, the orthologs of mutS is not under the regulation of LexA in E. coli [2,33] or within the putative LexA regulon of any other cyanobacteria (Table S1-26 in Additional file 4). These results suggest that at least most of SOS response genes are not under the regulation of LexA in Synechocystis sp. PCC6803. Indeed, using microarray gene expression profiling in response to lexA depletion, Domain et al. [21] concluded that LexA in Synechocystis sp. PCC6803 might be involved in carbon metabolism or controlled by carbon availability rather than the regulation of SOS response. However, our predicted LexA regulon in Synechocystis sp. PCC6803 (Table S26 in Additional file 4) has no intersection with the LexAresponsive genes identified by Domain et al. [21]. Since the LexA-binding sites that were experimentally characterized [17,20] in Synechocystis sp. PCC6803 are different from the sequences in our scanning profile, and considering the distinct nature of the LexA protein in Synechocystis sp. PCC6803 indicated above, it would be particular interesting to determine by experiment the function of the predicted sites in this genome.
Thus, although Synechocystis sp. PCC6803 clearly harbors the components of a basic SOS response system (Table S63 in Additional file 6), it is probably no longer under the regulation of LexA. Accordingly, LexA in this genome might have adopted a different function. Thus, the loss of the original function of LexA in Synechocystis sp. PCC6803 is coupled with the loss of the sequence constraint, thereby accelerating its divergence from other cyanobacterial LexA proteins, at both the sequence and functional levels. On the other hand, given the importance of the SOS response in cell survival, it is highly likely that the transcriptional regulator of the SOS response system in Synechocystissp. PCC6803 has been replaced by another protein.

Other cellular processes
Interestingly, we also found putative LexA-binding sites in the regulatory regions of genes that participate in various cellular processes in these 26 cyanobacterial genomes ( Table 3). The major cellular processes that are likely under the regulation of LexA are summarized below.

Photosynthesis
Putative LexA-binding sites were predicted for the following photosynthetic genes in the 26 cyanobacteria that harbor a lexA gene with p < 0.01 ( putative ABC transporter/multidrug efflux family protein SYNW2111 in Synechococcus sp WH8102; drug exporter-1 ABC transporter ATPase subunit AM1_4624 in Acaryochloris marina MBIC11017. These findings are interesting since it has been shown that the SOS response system is related to drug resistance in E. coli [34,35] and Staphylococcus aureus [35][36][37] by mechanisms that are not fully understood. It was reported that the vP2449 gene encoding a toxin exporter responsible for xenobiotic resistance in Vibrionales parahaemolyticus was under the direct control of LexA [38]. Therefore, it is likely that these drug resistance genes are regulated by LexA, thereby coupling the SOS response to drug resistance in these cyanobacteria.
The origin of the lexA gene in cyanobacteria As indicated earlier, 27 of the 32 cyanobacterial genomes analyzed evidently harbor a lexA ortholog, while the remaining five genomes do not, even when being scrutinized by more sensitive sequence search methods such as PSI-BLAST (data not shown). The five cyanobacteria lacking a LexA are Synechococcus sp. JA-3-3Ab A-Prime, Synechococcus sp. JA-2-3B'a(2-13) B-Prime, Synechococcus elongatus PCC6301, Trichodesmium erythraeum IMS101 and Thermosynechococcus elongatus BP-1. However, the core SOS response genes remain in these five genomes (Table S63 in Additional file 6). In the tree of 183 detected LexA proteins in 598 sequenced genomes ( Figure S2 in Additional file 3), the 26 cyanobacterial LexA proteins that are detected by BDBH (see Methods) form a monophyletic group while LexA in Gloeobacter violaceus PCC7421 is clustered with the group of α-proteobacteria. Furthermore, the topology of the 16S rRNA gene tree ( Figure 4) and the LexA tree/ subtree of 27 cyanobacterial genomes ( Figure 1A and Figure S2 in Additional file 3) are quite similar. This result suggests that lexA in the 26 cyanobacterial genomes (excluding Gloeobacter violaceus PCC7421) is likely to be vertically inherited from the last common ancestor of cyanobacteria. However, Gloeobacter violaceus PCC7421 might have lost its LexA protein during evolution and obtained an ortholog later through horizontal transfer from an α-proteobacterium. The five genomes that lack a lexA gene do not form a monophyletic group in the 16S rRNA gene-based phylogenetic tree of these 32 cyanobacteria (Figure 4). In particular, Synechococcus elongatus PCC6301, and Trichodesmium erythraeum IMS101 are spread in a clade whose members except these two genomes harbor a lexA gene. The most parsimonious explanation of this distribution would be that these two genomes Synechococcus elongatus PCC6301 and Trichodesmium erythraeum IMS101 lost their lexA genes through two independent events (one for each genome) to adapt to their corresponding environments during the course of evolution. Furthermore, the remaining three genomes, Thermosynechococcus elongatus BP-1, Synechococcus sp. JA-3-3Ab A-prime and Synechococcus sp. JA-2-3B'a (2-13) B-prime, which do not possess a lexA gene, branch earlier from the others (Figure 4). A plausible explanation of this distribution would be that these genomes lost their lexA genes inherited from the last common ancestor of cyanobacteria during the course of evolution. Interestingly, all these three genomes are thermophilic, their extreme ecological niches might facilitate the loss of the lexA gene. Since the core SOS response genes remain in these five genomes (Table  S63 in Additional file 6) after lexA was lost, they might have been hijacked by another transcription factor given the importance of the regulation of the SOS response genes for cell survival. The genomes that lost their lexA gene appear to have lost LexA-binding sites ( Figure S3 in Additional file 2). Alternatively, these five cyanobacteria might still harbor a lexA gene that has largely diverged from the others' during evolution to such a level that our method could not detect them. In addition, it has been suggested that the lexA gene was derived from gram-positive bacteria, which then spread into cyanobacteria and fibrobacteres. Then α-proteobacteria acquired lexA from cyanobacteria [5,6,16]. Our phylogenetic analysis of the LexA proteins and their binding sites supports such an argument. As mentioned before, cyanobacterial LexA proteins are more closely-related to those in gram-positive bacteria and α-proteobacteria than those in the other groups ( Figure S2 in Additional file 3), and the predicted LexAbinding sites in cyanobacteria are clustered together with those in the gram-positive bacterium B. subtilis and in α-proteobacteria, but are far away from those in E. coli ( Figure S1 in Additional file 2).
Moreover, Erill et al. [27] have suggested that there is a common set of genes in the LexA regulon of proteobacteria and gram-positive bacteria: recA, uvrA, ssb, and ruvC. However, our predicted LexA regulons in cyanobacteria do not always include this set of genes. Thus, the concept of a common set of SOS response gene in its more general form warrants further scrutinization.

Conclusions
In this study we have predicted LexA-binding sites and analyzed the putative LexA regulons in 26 cyanobacterial genomes that harbor a lexA gene using a highly efficient motif scanning and regulon prediction algorithm. In most lexA-containing cyanobacterial genomes, some SOS response genes bear a putative LexA box. Some genes involved in various cellular processes such as photosynthesis, drug resistance, etc. are also predicted to bear a putative LexA box in their promoter regions. However, in Synechocystis sp. PCC6803, LexA might have adopted a new function and no longer be in charge of the SOS response genes. In some genomes, lexA was likely lost during the course of evolution accompanied by the loss of its binding sites. The SOS response genes in these genomes that appear to lack a lexA gene might be regulated by another or multiple transcription factors. Moreover, we conclude that cyanobacteria inherited the lexA gene from their last common ancestor; however, substantial genome-wide turnover seems to have led to the high degree of variation of the LexA regulons in some species during evolution.

Materials
The sequences and annotation files of 33 sequenced cyanobacterial and the other genomes were downloaded from NCBI at ftp://ftp.ncbi.nih.gov/genomes/Bacteria/. The cyanbacterial genomes used in this study include:

Prediction of transcription units
We predicted the operon structures in cyanobacterial genomes using the operon prediction algorithm developed by Dam et al. [39]. The algorithm is based on the integration of both genome-specific and comparative genomic information. In this work, both the multi-gene operon and singleton operon (containing one gene) are considered as a transcription unit (TU), and the upstream intergenic sequence of the first open reading frame is not considered as a part of the operon.

Prediction of orthologs
We used the bi-directional best hit (BDBH) method based on BLASTP searches with an E-value cut-off of 10 -10 for both directions to predict orthologous protein pairs between any two proteomes. The BDBH method assumes that a cross-species protein pair are orthologous if each protein returns the other as the best hit in the whole proteome comparison [40].

Phylogenetic analysis
To construct the phylogenetic tree of LexA in cyanobacteria, multiple sequence alignment of the LexA amino acid sequences from 27 cyanobacterial genomes and the E. coli K12 genome were made using ClustalW implemented in MEGA [41] with default settings. The phylogenetic tree was then constructed using the neighbor-joining method with Poisson correction model in MEGA. E. coli LexA was placed as the outgroup of the tree. To construct the species tree, the DNA sequences of 16S ribosomal RNA genes from the 32 cyanobacteria and E. coli were aligned using ClustalW with manual adjustment by removing the unalignable regions. A neighbor-joining tree was then constructed with E. coli K12 being the outgroup using the Kimura 2-parameter model. Statistical significance at each node in the trees was evaluated using 500 bootstrap resamplings.
To construct the LexA protein tree across cyanobacteria, gram-positive bacteria, α-proteobacteria, δ-proteobacteria and γ-proteobacteria and some other bacteria strains/species ( Figure S2 in Additional file 3), we first downloaded 598 sequenced microbial genome sequences from NCBI, and then identified LexA orthologs in them by the BDBH method described above. Multiple sequence alignments of these LexA sequences were made using ClustalW implemented in MEGA [41] with default settings. The phylogenetic tree was then constructed in the same way as the 27 LexA protein tree.
The phylogenetic tree ( Figure S1 in Additional file 2) of LexA-binding sites in cyanobacteria, B. subtilis, α-proteobacteria and E. coli K12 was generated by the STAMP web tool [42] with the default alignment parameters: Pearson correlation coefficient for column comparison metric; ungapped Smith-Waterman for pair-wise alignment. The phylogenetic tree was constructed using the UPGMA method implemented in STAMP [42].

Phylogenetic footprinting and construction of LexAbinding sites in cyanobacteria
The previous study by Mazon et al. [16] characterized the LexA boxes associated with two genes: alr4908 (lexA) and all3272 (recA). Four putative LexA boxes were also identified in the promoter regions of alr3716 (uvrA), alr0088 (ssb), alr4905, and all4790 in Nostoc sp PCC7120 in that study. The orthologs (if they exist) of these six genes in PCC7120 were identified in the other 25 cyanobacterial genomes which harbor a lexA gene. We pooled the entire upstream inter-TU regions of these six genes in the target genome Nostoc sp. PCC7120 as well as those of the TUs containing at least one of the orthologs of these six genes in other cyanobacteria. If the length of the inter-TU region is longer than 800 bases, then only the immediate upstream 800 bases were extracted. Two motif finding programs, MEME [25,43] and BioProspector [26], were then applied to these pooled inter-TU regions to identify palindromic 14-mers as putative LexA-binding sites in these sequences according to previous studies [16]. MEME applies an expectation maximization method to fit a two-component finite mixture model and returns the identified motifs with an E-value [43], while Bio-Prospector employs a Gibbs sampling strategy and estimates the significance of the identified motif by a Monte Carlo method [44]. These two programs were selected as they are widely used and often have complementary predictions [45,46]. MEME identified 45 putative LexA-binding sites with an overall E-value of 1.4e-026 for its most significant predicted motif, while Bio-Prospector detected 39 putative LexA boxes in its most significant predicted motif (see Additional file 1 for details). High-scoring putative LexA-binding sites from either program were selected to build the LexA-binding sites profile (Table 1) in cyanobacteria. Sequence logos of binding sites were created using the Weblogo server [47].

Genome wide prediction of LexA-binding sites
We used the profile constructed above to scan the inter-TU regions of the genomes to predict all putative LexA-binding sites using a scanning algorithm that we previously developed [29][30][31]. This algorithm is briefly described as follows.
For each predicted TU U(g 1 ,g 2 ,...,g n ) composed of genes g 1 ,g 2 ,...,g n in genome G, we extracted its upstream inter-TU regions and the first 40 bases of coding region (if its length is longer than 800 bases, then only the immediate upstream 800 bases were extracted), denoted as I U (g1,g2,...,gn) . The set of all the I U(g1,g2,...,gn) in this genome is denoted as I U . To find the best matching substring in a sequence t in I U (t I U ) when scanned by profile M, we use the following scoring function: (1) where l is the length of the binding sites of profile M, h any substring of sequence t with length l (i.e. each l-mer of the sequence t), h(i) the base at the i-th position of h, p(i,b) the frequency of base b at position i in M, q(b) is the frequency of base b in the aggregated inter-TU regions for the organism, I i is basically the information content or the relative entropy of the column [28,48] divided by a normalization factor a, a is the upper limit of the information content I i for this column to keep I i [0, 1], n the number of binding sites for constructing the profile M. To avoid zero value of the numerator p(i,b), a pseudo count 1 is added to the counts of the each base {A, C, G, T} in column i.
To show the derivation of the normalization factor a, we considered the extreme case: for a column i of profile M containing n binding sites, the more conserved the column is, the higher its information content I i will be, and the maximum information content for column i occurs when this column is completely homogeneous. That is, all sequences have the same nucleotide, say, A at that position, and this nucleotide has the smallest background frequency, q(A), noted as q 0 . Thus, after adding one pseudocount to the counts of each of the four nucleotides to column i, the frequency of base A of column i in the profile will therefore be (n+1)/(n+4), and 1/(n+4) for the other three nucleotides. Then the upper limit a of the prenormalized I i as shown by formula (3) can be derived as follows. Intuitively, we slide a window of length l across sequence t with the profile M, and return the substring h with the highest score defined by (1).
Since true regulatory binding sites are likely to be more conserved than other inter-TU sequences and thus tend to be shared by closely related orthologous genes. For each genome (considered as a target genome), we reward its putative binding sites appeared to be conserved in regions upstream from orthologous genes in other genomes. To do this, we assume a transcription unit U(g 1 ,g 2 ,...,g n ) in the target genome G is composed of n genes. Gene g i (i = 1...n) has orthologs in m i genomes G 1 , G 2 , ..., G m i , and o k (g i ) is the upstream inter-TU sequence associated with the orthologous gene g i in genome G k (for a graphic explanation, see Figure  S5 in Additional file 2). Then the s M (t) score for the inter-TU sequence t upstream from U(g 1 ,g 2 ,...,g n ) in genome G can be increased by a term A max (g i ): where A max (g i ) is the value calculated for gene g i whose orthologs across other genomes have the maximum average of the product of two terms: where d i,k is the Hamming distance between the sequence h detected by the profile M in sequence t and the corresponding sequence in, o k (g i ) and l is the length of the binding sites in profile M.
Since the orthologs of genes of a transcription unit in one organism may not comprise a single transcription unit in another. For U(g 1 ,g 2 ,...,g n ) in target genome G, orthologs of g 1 ,g 2 ,...,g n may be separated into different TUs in other genomes, therefore we evaluated the orthologous inter-TU sequences for each gene in (g 1 , g 2 ,...,g n ), and chose the gene g i whose orthologs across other genomes have the maximum average of the product of two terms indicated above. Then by combining formula (5) and (6), the refined score of the best putative binding site in sequence t can be defined as: Statistical significance of predicted binding sites To evaluate the extent to which a putative binding site with a score s or higher can be found purely by chance, we randomly extracted coding sequence with the same length as I U(g1,g2,...,gn) , denoted as C U(g1,g2,...,gn) . All the C U(g1,g2,...,gn) extracted in genome G form the set C U . The score of an extracted sequence t (t C U ) scanned by a profile M is also defined by formula (1). Note that each randomly chosen C U(g1,g2,...,gn) has nothing to do with U(g 1 ,g 2 ,...,g n ). Therefore, when incorporating the additional score from reference genomes (formula (7)), the coding sequence o k (g i ) is unlikely the coding sequence associated with the orthologous genes of g 1 , g 2 ,...,g n in a reference genome G k as it is a randomly chosen one. To avoid possible biased sampling of coding sequences for each I U(g1,g2,...,gn) in I U , we randomly extracted 300 coding sequences C U(g1,g2,...,gn) sharing the same length as I U(g1,g2,...,gn) . These randomly chosen coding regions for all the U(g 1 ,g 2 ,...,g n ) in genome G form a sequence set C U , then each sequence in the set C U was scanned using formula (7). Let S(I U ) and S(C U ) be the set of scores of binding sites found in I U and C U , respectively, and P(S(t) >s) be the cumulative probability of finding a binding site in a sequence t (t I U (or) t C U ) with a score S(t) > s as defined by equation (7). Next, the false positive rate, p S s C U ( ) > can be used to evaluate the statistical significance of the motif score s of a inter-TU sequence. p S s is actually the fraction of coding sequences bearing a putative binding site with a score higher than s in the coding sequences set C U in genome G. In other words, it describes the extent to which one can find a motif with a score higher than s by chance. Thus, it can be considered as an empirical p-value for a binding site score s. A cut-off score s corresponding to a p-value <0.01 is used for the LexAbinding site and regulon prediction in each genome in this study.
To evaluate the confidence of our overall predictions in inter-TU regions in one genome, we used a log odds ratio (LOR) to compare the probability of finding a putative binding site in an inter-TU region with the probability of finding a putative binding site in a randomly extracted coding region by considering all the extracted I U s and C U s in a genome. We estimated the statistical significance of the predictions using the LOR function defined as The LOR function for a genome is the log-odds ratio of the fraction of the inter-TU sequences containing a binding site with a score higher than s to the fraction of the randomly selected coding sequences containing a binding site with a score higher than the same s in the genome. Accordingly, a monotonic increase in positive LOR with the increase in the motif score in a genome would suggest that this genome is likely to contain some high-scoring LexA-binding sites.

Analysis of the conservation of LexA regulons in cyanobacteria
We defined the conservation (c ij ) between two regulons R i and R j from genome i and j, respectively, as, Where |R i ∩ R j | is the number of orthologous genes shared by both regulons R i and R j . We took the reciprocal of 1 c ij as the distance d ij between the two regulons. A neighbor joining tree (Figure 3) based on a distance matrix such defined was constructed using PHYLIP [49] and displayed by MEGA [41].