A flexible platform for comparative reconstruction of bacterial regulons
CGB implements a complete computational workflow for the comparative reconstruction of bacterial regulons using available knowledge of TF-binding specificity (Fig. 1). Execution starts with the read-in of a JSON-formatted input file. This file contains the NCBI protein accession number and list of aligned binding sites for at least one transcription factor instance, accession numbers for chromids [20] or contigs mapping to one or more target species, and several configuration parameters. Reference TF-instances are used to detect orthologs in each target genome and a phylogenetic tree of TF instances is generated. The tree is used to combine available TF-binding site information into a position-specific weight matrix (PSWM) for each target species. Operons are predicted in each target species and promoter regions are scanned to identify putative TF-binding sites and estimate their posterior probability of regulation. Groups of orthologous genes are predicted across target species and their aggregate regulation probability is estimated using ancestral state reconstruction methods. CGB outputs multiple CSV files reporting identified sites, ortholog groups, derived PSWMs and posterior probabilities of regulation, as well as plots depicting hierarchical heatmap and tree-based ancestral probabilities of regulation. The following sections describe the novel strategies used to implement the different components of this computational workflow in order to generate an efficient and highly customizable comparative genomics platform.
Gene-centered, species-specific regulon reconstruction
Previous approaches to regulon reconstruction have focused on the operon as the fundamental unit of regulation [10, 11]. This poses problems for both analysis and reporting due to the frequent reorganization of operons. It is well known, for instance, that after an operon split, genes in the original operon may be regulated by the same transcription factor through independent promoters [2, 3, 21]. CGB uses instead a gene-centered framework, wherein operons become logical units of regulation, but the comparative analysis and reporting of regulons is based on the gene as the fundamental unit of regulation. This enables a rapid assessment of the regulatory state of each gene, while providing the user with detailed information on the operon setup in each organism.
Experimental information on TF-binding specificity is often available in different reference bacterial species, yet the problem of how to transfer and combine this information to target species in a comparative genomics analysis remains largely unaddressed. TF-binding motif information can be transferred across species, but the efficacy of this process decays with evolutionary distance [19]. CGB takes in prior knowledge in the form of a list of TF instances (NCBI protein accessions) in different bacterial strains, together with reported (or inferred) TF-binding sites for each of these TF instances. The collections of TF-binding sites for each TF instance must be aligned, so that the resulting motifs have the same dimensions (i.e. compatible PSWMs). This alignment can be performed manually by the user, or using dedicated tools [22]. CGB automates the transfer of TF-binding motif information from multiple sources across target species. CGB estimates a phylogeny of the reference and target TF orthologs, and uses the inferred distances between reference and target species to generate a weighted mixture PSWM in each target species, following the weighting approach used in CLUSTALW [23]. This provides a principled and reproducible approach for the dissemination of TF-binding motif information, forgoing the need to manually adjust inferred collections of TF-binding sites in each target organism [10].
Promoter scoring and probabilistic framework
The frequency information in a PSWM can be transformed into a position-specific scoring matrix (PSSM) and used to identify TF-binding sites in genomic sequences. The use of a PSSM score cut-off for predicting putative TF-binding sites in promoter regions has long been the de facto standard in regulon analysis [8,9,10, 24, 25]. However, this approach is not well-suited for the comparative genomics framework, because thresholds may often need to be tuned in different bacterial genomes owing to their particular distribution of oligomers [6]. To circumvent this problem, here we adopt a Bayesian probabilistic framework originally developed for regulon analysis in metagenomic sequences [26]. This framework estimates posterior probabilities of regulation that are easily interpretable and directly comparable across species.
For each position i of a promoter region, we first combine the PSSM scores obtained in the forward (f) and reverse (r) strands using the function [26]:
$$ PSSM\left({s}_i\right)={\log}_2\left({2}^{PSSM\left({s}_i^f\right)}+{2}^{PSSM\left({s}_i^r\right)}\right) $$
(1)
To estimate the posterior probability of regulation of a promoter, we define two distributions of PSSM scores within a promoter region. In a promoter not regulated by the TF, we expect a background distribution of scores (B). We approximate this distribution using a normal distribution parametrized by the statistics of PSSM scores in promoters genome-wide:
$$ B\sim N\left({\mu}_G,{\sigma}_G^2\right) $$
(2)
In a promoter regulated by the TF, however, we expect that the distribution of PSSM scores (R) be a mixture of both the background distribution (B) and the distribution of scores in functional sites. The latter can be approximated with a normal distribution parametrized by the statistics of the TF-binding motif (M):
$$ R\sim \alpha N\left({\mu}_M,{\sigma}_M^2\right)+\left(1-\alpha \right)N\left({\mu}_G,{\sigma}_G^2\right) $$
(3)
The mixing parameter α is a prior that corresponds to the probability of a functional site being present in an average-length regulated promoter and can be easily estimated from experimental data. Bacterial transcription factors regulate most of their target genes by binding to a given number of sites in the promoter region, and the average length of the promoter region in a given organism can be readily approximated as the average intergenic distance between the first genes in opposing directons. For a transcription factor known to typically bind one site per regulated promoter, and an estimated average promoter length of 250 bp, we obtain α = 1/250 = 0.004. This results in a mixture distribution for the regulated promoter (R) drawing predominantly (99.6% of the time) from the background distribution of scores (B).
For any given promoter, we can define the posterior probability of regulation P(R|D) given the observed scores (D):
$$ P\left(R|D\right)=\frac{P\left(D|R\right)P(R)}{P(D)}=\frac{P\left(D|R\right)P(R)}{P\left(D|R\right)P(R)+P\left(D|B\right)P(B)} $$
(4)
Assuming independence among the scores at each promoter position, the likelihood functions can be estimated for a given score si using the density function of the R and B distributions defined above:
$$ P\left(D|R\right)=\prod \limits_{s_i\in D}L\left({s}_i|\alpha N\left({\mu}_M,{\sigma}_M^2\right)+\left(1-\alpha \right)N\left({\mu}_G,{\sigma}_G^2\right)\right) $$
(5)
$$ P\left(D|B\right)=\prod \limits_{s_i\in D}L\left({s}_i|N\left({\mu}_G,{\sigma}_G^2\right)\right) $$
(6)
The priors P(R) and P(B) can be inferred from the reference collections. P(R) can be approximated as the number of known regulated promoters in a reference genome, divided by the total number of operons, and P(B) is trivially 1-P(R). Alternatively, the number of regulated promoters can be estimated from the information content of the species-specific TF-binding motif [6, 27].
Operon prediction
Operon prediction remains a challenging problem in bacterial genomics [28]. Available comparative genomics platforms rely on curated operon databases to improve accuracy, but this limits their applicability to a preselected set of complete bacterial genomes [10, 11]. To enable analyses including newly sequenced, complete or incomplete bacterial genomes, CGB implements a two tiered operon prediction sequence. Intergenic distance is an effective and widely-used predictor of operons. Genes pairs in a same directon (adjacent in the same orientation with no intervening genes in the opposite strand) are considered to belong to an operon if their intergenic distance is below a pre-stablished threshold [29,30,31]. Because different genomes can have different coding densities, CGB defines this threshold in an adaptive manner as the average intergenic distance in all directons within a given genome. We benchmarked this approach using experimental operon data from the ODB database for six bacterial species (Fig. 2), revealing that dynamically adapting the threshold to each bacterial genome significantly enhances the prediction accuracy.
When reconstructing regulons through comparative genomics, errors in operon prediction can yield two different scenarios: an operon may be split and regulation for some of its genes hence not properly detected, or spurious genes may be incorporated into an operon and their regulation inaccurately predicted. To circumvent this problem CGB predicts operons using a conservative threshold to minimize undesired operon splits (Fig. 2). It then scans the upstream region of all genes to detect any genes within a predicted operon harboring a high-scoring TF-binding site. This is defined as a site scoring above a PSSM score threshold that satisfies the equality between the negative logarithm of the false positive rate (FPR) and the information content of the TF-binding motif [32]. In such cases, the operon is split on the gene with the identified high-scoring site. This allows recovering relevant regulation information for genes that may have inaccurately included within an operon in the initial prediction.
Ortholog detection and ancestral state reconstruction
Ortholog detection remains a challenging, computationally-intensive problem in bioinformatics [33]. Available comparative genomics platforms make use of precompiled ortholog sets [10, 11], but this restricts the range of bacterial species that can be analyzed. To provide the user with greater flexibility, CGB implements automated detection of orthologous groups in the species under analysis. Orthologous genes are detected as reciprocal best BLAST hits for each pair of species using tBLASTX with default parameters and a 10− 10 cutoff e-value. Pairwise reciprocal BLAST hits are used to generate a graph where vertices correspond to gene products and edges denote best reciprocal BLAST hit relationships, and orthologous gene groups are detected as cliques in the graph [34]. Crucially, regulon reconstruction through comparative genomics does not require that all orthologs groups across target species be identified. CGB limits ortholog detection to those genes present in operons with a posterior probability of regulation higher than a user-specified cut-off in any of the target species. This dramatically reduces the complexity of the ortholog detection step, enabling it to be performed in real time.
Integrating the regulon information inferred from each target genome is a critical step in comparative genomics in order to generate insights on the overall makeup of the regulon and its evolutionary history. A common rule of thumb in many comparative genomic analyses has been to assume that the detection of putative TF-binding sites in the promoter region of orthologous operons from two or more sufficiently divergent genomes represents strong evidence of regulation [7,8,9]. More recently, comparative regulon reconstruction has been formally recast as an ancestral state reconstruction problem, wherein one seeks to infer the likelihood of regulation for a given operon on a phylogenetic tree [18]. CGB implements this approach through bootstrapping ancestral state reconstruction for any given gene on a phylogenetic tree of TF instances. For each bootstrap replicate, CGB assigns discrete regulated (s1) or non-regulated (s0) states to each target species by sampling randomly according to the inferred posterior probability of regulation. If the species does not encode an ortholog for the given gene, the absent (sa) state is assigned. CGB then uses BayesTraits to infer the discrete regulation states on ancestral nodes for each bootstrap replicate [35]. These inferred ancestral states are averaged over all bootstrap replicates to obtain ancestral posterior probabilities of regulation.
Comparative analysis of the LexA regulatory network in gram-positive bacteria
The SOS response is a transcriptional regulatory network that responds to DNA damage and activates the expression of genes to address DNA lesions and their effects. The SOS response was first described in Escherichia coli, where it was shown to regulate over 40 genes involved in three primary functions: DNA repair, inhibition of cell division and translesion synthesis [36, 37]. DNA damage is sensed by the recombination protein RecA, which can promote the autocatalytic cleavage and inactivation of the transcription factor LexA, leading to de-repression of its target genes [36, 38]. Later research has shown that the SOS response is widespread in bacteria but, in contrast with other regulatory networks, multiple LexA-binding motifs have been reported in different bacterial phyla [39]. Both the binding motif and the regulatory network for LexA have been amply documented the Actinobacteria and the Firmicutes [40,41,42,43,44], providing an ideal test case for assessing the performance of CGB. We performed a comparative analysis of the LexA regulon across seven bacterial species: five for which the SOS response has been reported (Corynebacterium glutamicum ATCC 13032, Bacillus subtilis 168, Staphylococcus aureus NCTC 8325, Listeria monocytogenes EGD-e and Mycobacterium tuberculosis H37Rv) and two related species where the SOS response remains uncharacterized (Leifsonia xyli CTCB07, Acidothermus cellulolyticus 11B).
In agreement with previous reports [9], our analysis reveals that the core SOS regulon in Gram-positive bacteria encompasses the LexA and RecA proteins, as well as error-prone polymerases, a radical SAM protein and a cell-division inhibitor (Fig. 3). The plot also illustrates how CGB distributes the available experimental information on TF-binding motifs across all target species, generating phylogenetically weighted mixture motifs that smooth out motifs with low experimental support. We assessed the accuracy of CGB at determining the regulatory state of target genes using the predicted posterior probability of regulation in different reference species. Our results show that, among genes predicted to be regulated in at least one species, the posterior probability generates sharp distinction between regulated and non-regulated genes, yielding accuracy across a broad range of thresholds. Analysis of the prediction accuracy for individual genes reveals that there is a consistent positive correlation (0.33 ± SD 0.08; Spearman correlation coefficient) between true positives in individual species and the number of species in which the gene was predicted to be regulated, supporting the fundamental assumption that conservation of regulatory elements is indicative of functionality.
Analysis of type III secretion system regulation by HrpB/HrpX
Regulation of the bacterial type III secretion system (T3SS) has been described as mediated by the orthologous transcription factors HrpX (Xanthomonas oryzae) and HrpB (Ralstonia solanacearum) [45, 46]. In both systems, this TF has been shown to regulate the expression of genes governing the assembly of the T3SS apparatus and several of the effectors that these pathogenic bacteria translocate into host cells. Here we used experimental TF-binding motif from Xanthomonas and Ralstonia species available in the CollecTF database (Fig. 4a) to analyze the evolution of this regulatory network in several groups of pathogenic bacteria harboring HrpB/X orthologs. This is accomplished by first propagating the experimental motif among all target species (Fig. 4b), and then performing the comparative analysis.
The results (Fig. 4c) reveal that the only elements of the HrpB/X regulon conserved across all the species analyzed are the members of the two core operons that define the structural elements of the type III secretion system. Beyond this core HrpB/X regulon, the analysis (Fig. 4c) shows the divergent uptake of different effectors and chaperones in Ralstonia, Burkholderia and Xanthomonas species, indicating that the core T3SS effectome under HrpB/X regulation in each of these genera diverged soon after each obtained its T3SS module, providing specialized functions in each group [47]. This divergent uptake can be traced to specific lineages through ancestral state reconstruction (Fig. 4d), as in the case of the Ralstonia genus for the virulence regulator PrhI [48]. Our results showcase the ability of CGB to automatically disseminate experimental TF-binding motif information among target species in a principled manner (Fig. 4a&b), to use all available sequence files for any given genome (multiple chromosomes and plasmids), to highlight the core elements of a transcriptional regulatory network spanning multiple bacterial orders (Fig. 4c), and to provide a formal inference of the ancestral state of regulation for any given gene (Fig. 4d).
Reconstruction of the SOS response network in the Balneolaeota phylum
Metagenomic, single-cell and systematic large-scale genomic sequencing studies have uncovered the existence of several new major bacterial phyla [13,14,15,16]. Genomic data corresponding to these phyla is often only available as whole-genome shotgun assemblies and is hence not amenable to comparative genomics studies using available platforms that rely on precompiled complete genome datasets. The Balneolaeota phylum comprises several genera of halotolerant bacteria, but there is scant experimental information on their physiology [49]. Here we coupled motif discovery with comparative genomics using CGB to reconstruct the LexA regulon of the Balneolaeota. Motif discovery (Fig. 5a) was performed on the promoter regions of Balneolaeota LexA homologs with MEME, and the resulting motif was used as input for CGB analysis without phylogenetic weighting. After performing the comparative analysis of the putative LexA regulon across all species with available genome sequence in the Balneolaeota phylum (Fig. 5b), we validated in vitro predicted LexA-binding sites in the promoter region of all genes with orthologs in at least six of the seven species analyzed and presenting an average inter-species posterior probability of regulation above 0.5 (Fig. 5c). The resulting EMSAs on Balneola vulgaris and Rhodohalobacter halophilus promoters (Fig. 5c) reveal that all the predicted LexA-regulated promoters are bound by LexA. The high precision illustrates the usefulness of leveraging the comparative genomics approach implemented in CGB to boost the accuracy of in silico prediction of TF-regulated genes on individual genomes.
Our results (Fig. 5b) reveal that the Balneolaeota LexA protein binds a novel direct repeat motif with consensus sequence TTACACATATTTTWTACATA (Fig. 5a; Additional File 1). In spite of substantial operon rearrangement, the Balneolaeota LexA regulon encodes a SOS response network encompassing the LexA repressor and the inducer of the system (RecA), as well as several translesion synthesis polymerases (polymerases IV and V), a helicase, a radical SAM protein and a predicted SOS peptidase [50] (Fig. 5c). These results represent the first description of the LexA regulon in the Balneolaeota phylum, reinforcing the notion that translesion synthesis is the primary role of the SOS response and that this system has convergently evolved to regulate error-prone polymerases following drastic changes to the LexA-binding motif [51]. They also illustrate the complex evolutionary history of the LexA protein, which appears to have independently evolved in several instances the ability to recognize a direct repeat motif structure, as opposed to the canonical palindromic LexA-binding motif [51,52,53]. This example illustrates the ability of CGB to leverage draft genome sequence data to infer regulons in novel phyla, and to operate in tandem with comparative motif discovery methods, using a single instance of the inferred TF-binding motif uniformly distributed across target species.