Predicted transcription factor binding sites as predictors of operons in Escherichia coli and Streptomyces coelicolor
© Laing et al. 2008
Received: 09 October 2007
Accepted: 12 February 2008
Published: 12 February 2008
Skip to main content
© Laing et al. 2008
Received: 09 October 2007
Accepted: 12 February 2008
Published: 12 February 2008
As a polycistronic transcriptional unit of one or more adjacent genes, operons play a key role in regulation and function in prokaryotic biology, and a better understanding of how they are constituted and controlled is needed. Recent efforts have attempted to predict operonic status in sequenced genomes using a variety of techniques and data sources. To date, non-homology based operon prediction strategies have mainly used predicted promoters and terminators present at the extremities of transcriptional unit as predictors, with reasonable success. However, transcription factor binding sites (TFBSs), typically found upstream of the first gene in an operon, have not yet been evaluated.
Here we apply a method originally developed for the prediction of TFBSs in Escherichia coli that minimises the need for prior knowledge and tests its ability to predict operons in E. coli and the 'more complex', pharmaceutically important, Streptomyces coelicolor. We demonstrate that through building genome specific TFBS position-specific-weight-matrices (PSWMs) it is possible to predict operons in E. coli and S. coelicolor with 83% and 93% accuracy respectively, using only TFBS as delimiters of operons. Additionally, the 'palindromicity' of TFBS footprint data of E. coli is characterised.
TFBS are proposed as novel independent features for use in prokaryotic operon prediction (whether alone or as part of a set of features) given their efficacy as operon predictors in E. coli and S. coelicolor. We also show that TFBS footprint data in E. coli generally contains inverted repeats with significantly (p < 0.05) greater palindromicity than random sequences. Consequently, the palindromicity of putative TFBSs predicted can also enhance operon predictions.
Genes within an operon are, in the majority of cases, transcribed from a single promoter . Therefore, any integrated approach to the understanding of gene expression in prokaryotic systems must consider operons as one of the basic units, which are regulated by promoters, transcription factors and associated proteins. Although over 500 bacterial genomes have been sequenced and microarray-based expression studies are increasingly common, defining and mapping operon structure is time-consuming and complex, and the ability to predict them ab initio is therefore important. Since operon structure is frequently unstable across species boundaries [2–8], most approaches have used non-homology based data to predict operon structure. Besides predicted promoters and terminators, intergenic distance and expression data are also often used in non-homology based operon prediction strategies with varying sensitivity [9–17]. However, since there is generally limited data on known promoter locations and/or unreliable prediction methods, the identification of promoters is often difficult and makes their use problematic for operon prediction. In an attempt to avoid the problems associated with prediction of promoters it is worthwhile investigating the usefulness of other regulatory binding sites as potential predictors of operons, something not previously attempted.
Aside from promoters and terminators, transcription factor binding plays a key role in controlling the expression level of a gene or group of genes. Therefore, to coordinate the expression of genes within an operon a transcription factor binding site (TFBS), just like promoters, should be found predominantly upstream of the first gene in an operon; although there are a small number of reported cases of secondary binding sites within operons e.g. lac operon  and in our previous study , the traditional, 'static' view of operons (a group or genes transcribed together in all conditions) is applied. Prokaryotic transcription factors are usually deemed as either activators or repressors (with some having a dual role) depending on the effect they have on the rate of transcription initiation of the downstream gene(s). The dimeric nature of repressor and activator binding in prokaryotes requires that a recognition site must be a dyad, and typically an inverted repeat [20, 21]. This 'structure' of a binding sequence coupled with their typically short length (6–15 bp) can be used to predict likely TFBSs themselves [21, 22] and therefore if this can be done with confidence, can be used in the prediction of operonic gene pairs.
TFBSs are often described as consensus sequences or regular expressions that represent the most common sequence in a set of closely related binding site examples. Though simple in their construction, such consensus sequences are unable to imitate the variability that exists within true TFBSs in nature, where some sites may tolerate a range of bases and retain function. When using simple regular expressions, TFBS prediction is therefore dependent on the amount of mismatches (if any) that are tolerated. Defining the best consensus sequence for predicting sites is also difficult . Position Specific Weight Matrices (PSWMs), first used for the characterisation of translation initiation sites  are an alternative to consensus sequences. The methods used to produce PSWMs differ mainly in the type of information used to collect a set of binding site examples. For example, the identification of regulatory motifs which can be used to predict regulons, commonly adopted after microarray data collection, uses a set of co-regulated genes as a prerequisite [c.f. [25, 26]]. Similarly, prior knowledge can be used where experimental characterisation of a common regulatory binding site leads to the searching of upstream regions of the genes for a similar motif . Knowledge of protein structure has also been applied to regulatory protein binding site identification through binding energy calculations between nucleotide and amino acid (e.g. [28, 29]). In this study however, we wished to develop a method applicable to Streptomyces coelicolor genomics and due to the relative paucity of comprehensive information on Streptomyces gene regulation an a priori method that is capable of producing PSWMs is desirable. It will also support an operon prediction method that integrates data from various sources (e.g. intergenic distance  and/or expression data ) by increasing the amount of information about an operon whilst still being applied generically to all sequenced prokaryotic species without a dependency on the availability of functional or structural information.
One such method was published by the Siggia group  which requires no prior knowledge, can in principle be applied to any genome, and can generate PSWMs of dyads. The algorithm has successfully been applied in E. coli , B. subtilis  and additionally to S. coelicolor . Here the published method  was implemented and applied in Escherichia coli and Streptomyces coelicolor.
By constructing PSWMs from over-represented words in a given data set it is possible to test the upstream regions of genes for putative TFBSs. Based on the 'classical' definition of an operon  intra-operonic genes (excluding the first gene of an operon) should not have a TFBS in their upstream region. Although this is increasingly viewed as overly simplistic as an operon definition [11, 19, 32, 33] it appears to hold for the large majority of operons in most conditions. Subsequently, adjacent gene pairs that have a TFBS in between them are not expected to be part of the same operon. Using positive and negative examples of genes which are considered to be operonic members, the use of TFBSs as predictors of operons can be tested.
Prior to this, the first part of this work characterises the palindromic tendency of footprints of known transcription factors in E. coli, a property which is later shown to improve operon prediction in S. coelicolor. The second part of this work concentrates on developing a strategy of implementing the Siggia group's algorithm  through the use of different thresholds and examines the differences in results obtained. Comparisons with the previous TFBS set predicted in S. coelicolor  are also made. Finally the usefulness of TFBSs as predictors of operons in both E. coli and S. coelicolor is discussed.
The algorithm published by Li and co-workers  is capable of taking any set of upstream sequences (referred to herein as UPSQs) and constructing PSWMs from significantly over-represented dimers within this set. This method can be summarised by five steps:
and n(W 1) and n(W 2) are the total number of occurrences of W 1 and W 2 in the UPSQs data set and is the number of independent positions in the data where a motif M of length L(M) can be placed (M can be W 1, W 2 or D). The summation is over the regulatory regions (r) over all genes, each with a length L(r) (as taken from ).
2) A similarity matrix is constructed via an all-vs-all comparison of significant dimer pairs scoring above a designated threshold. Similarity between pairs of dimers is calculated by a simple scoring method, evaluating the number of matches minus the number of mismatches. Matches to N (an unknown base) or to any other base or overhangs are ignored (a heuristic method of sliding the sequences along to get the best score is applied). This score is then normalised to between 0 and 1.
3) The similarity matrix is used to cluster the significant dimers using CAST (Clustering Affinity Search Technique ) where a similarity threshold for clustering is set.
4) For every cluster, the actual sequences in the UPSQs set matched by any member of the cluster are extracted along with up to 10 flanking nucleotides (dependent on the position of the match and length of the UPSQ sequence). A multiple alignment of these sequences using CONSENSUS  produces an alignment matrix for that cluster.
5) Each alignment matrix is converted to a PSWM using the background base counts for the organism of interest. Here the background frequencies of A = 0.2953, C = 0.2059, G = 0.2023, T = 0.2965 were used for E. coli and A = 0.1589, C = 0.3494, G = 0.3420, T = 0.1497 for S. coelicolor.
A full description of the algorithm is given in the original publication .
In this work the thresholds used in steps 2 (dimer significance) and 3 (clustering threshold) are of the most interest. It is these thresholds that can affect the integrity of the PSWMs, which in turn could affect operon prediction accuracy. The different thresholds of dimer significance used were: -log10 P > 6, approximating to -log10(1/amount of all dimers tested for), -log10 P > 3, approximating to -log10(1/amount of all perfectly palindromic (i.e. palindromic without any mismatches) dimers tested for), and two intermediate thresholds of -log10 P > 4 and -log10 P > 5. Different clustering thresholds of 0.6, 0.7 and 0.8 were also tested.
With each PSWM there is an associated distribution of scores obtained when applied to a given data set of sequences. Two distributions of scores are obtained when scoring sequences from either the UPSQs set that were used to build the PSWM (collected in step 4 of the algorithm), or sequences from the UPSQs set that were not used to build the PSWM (sequences that will have a 'score' but should not have a significant match to the PSWM (as they were not used to build the PSWM) and thus provide a means of estimating noise in the PSWM prediction). These two Gaussian distributions, describe the 'real hits' and the background random matches respectively.
Using the mean and standard deviation of the two distributions the specificity and sensitivity of searching for a potential TFBS can be controlled via two thresholds: threshold1 = μ background + xσ background threshold2 = μ real - xσ real
where x represents any real number and affects the false-positive and false-negative rates when using threshold1 and threshold2 respectively.
In this study for a sequence to be predicted to contain a given TFBS the score of the sequence against the PSWM must be greater than μreal-2σreal (giving 98% coverage) and μbackground+Sσbackground, where S is used to represent trials of different thresholds from 0 to 10 in 0.5 increments as discussed further in this work. Since this term is the only part of the threshold that changes, the TFBS prediction threshold used will only be referred to herein as μbackground+Sσbackground.
The sequence data used to form the UPSQs and operon prediction test sets were taken from E. coli K12 strain MG1655 from Genbank (accession number: NC_000913[36, 37]) and S. coelicolor chromosome (not plasmids) in EMBL (accession number: AL645882 version 2 [38, 39]). In E. coli the UPSQs set was created using the upstream intergenic sequences (maximum length 300 bp) of the first genes of known and predicted operons in the E. coli sequence taken from RegulonDB [40, 41] as described previously . For S. coelicolor upstream intergenic sequences (maximum length 300 bp) of all genes in the chromosome were used. Testing of the constructed PSWMs was conducted using the upstream intergenic sequences (maximum length 300 bp) of the positive and negative examples of operon members of the relevant organism.
Assuming that the entire polycistronic transcript is documented, non-operons of length 2 were formed by using the initial gene of the operon and its upstream neighbouring gene if that neighbour is transcribed in the same direction (e.g. gene pair d-e), and the last gene in the documented operon and its downstream neighbouring gene, again if it is transcribed in the same direction (e.g. gene pair g-h). These negative examples are termed as transcriptional boundaries. To increase the size of the non-operonic data set a non-operon of length 3 was also formed by collecting triplets of genes that are transcribed in the opposite direction (e.g. genes b-c-d), termed as directional examples, as the central gene (e.g. gene 'c' in Figure 2) is certainly not in the same transcriptional unit as the two flanking genes transcribed in the opposite direction (e.g. genes b and d in Figure 2). In total 35 operons and 1282 non-operons were collected for S. coelicolor, and 325 operons and 821 non-operons were collected for E. coli. Both positive and negative examples for S. coelicolor and E. coli are provided [see Additional file 1].
All examples of operonic genes (positive and negative) were split into singular genes such that the operonic connection status (in the same operon or not) of all genes with their respective upstream gene neighbours could be tested.
Coverage of inverted repeats in E. coli footprint data
Random average coverage %
Footprint coverage %
When considering all transcription factor footprints together, as depicted in Figure 3 and Table 1, the overall coverage obtained using palindromes and mismatches to characterise binding sites can be calculated. Coverage is defined as the percentage of all footprint sites identified as containing a palindrome. Here, 77% coverage can be achieved by searching for perfect palindromes and 92–93% coverage when allowing for 1 mismatch. However, it is important to note that tolerating mismatches decreases the specificity of binding site prediction. Of course, the occurrence of a palindromic inverted repeat within an upstream intergenic region does not necessarily mean that a sequence is a true binding site either, but it is expected to increase the likelihood. Indeed, in support of this, when comparing the coverage obtained (using Z-score) searching for perfect palindromes in true footprint sites to that obtained from 500 random footprint simulations a significant difference (p < 0.05) is observed (see Table 1). Here, random bases were chosen to create a sequence the same length of true, known footprint examples taken from those used to create Figure 3. No significant difference was observed when allowing 1 or 2 mismatches for Repressors and Dual regulators.
Implementation of the PSWM construction algorithm using E. coli sequences resulted in 4,102 significant dimers using a Poisson probability threshold of -log10 P > 6. Although this is a similar threshold to previous work , these authors identified only 1,775 significant dimers. Despite rigorous checks with the implementation of this algorithm as published, and variation of parameters and similar experiments to reproduce the published data, this was not achieved and the implementation here consistently defined a larger number of putative sites. However, our implementation did discover all the dimers predicted by the earlier study , and this set was also used in our evaluation protocol.
Different thresholds of dimer significance (as described in the methods section) were then applied and sets of significant dimers were clustered using CAST  (step 3 of the algorithm) producing 12 data sets in total. In addition to these 12 sets of clustered sequences, the 849 clusters published previously , referred to as "LiFlank", were put through steps 4 and 5 of the PSWM construction algorithm. The resultant PSWMs for all data sets and the original PSWM data , referred to as "Li", were then used to predict binding sites in the upstream intergenic sequences of the positive and negative examples of operon members in E. coli with 0.5 increments of the TFBS prediction threshold μbackground+Sσbackground (see methods). For each increment of S, counts of TFBS predictions for the positive and negative tests sets were made enabling the calculation of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN).
Throughout this work we introduce a simple nomenclature to distinguish the PSWM sets, x_y_z, which refers to the Poisson distribution dimer significance threshold x, the word clustering threshold y, and the final number of clusters z.
Area Under the Curve scores for PSWM data sets
The second best operon predictor set (3_08_1506 with 79% accuracy) is of one of the novel PSWM sets constructed in this study (using all steps of the algorithm). The dimer significance of -log10 P > 3 and clustering threshold of 0.8 was therefore used to construct PSWMs for S. coelicolor resulting in 22,359 significant dimers producing 3,628 clusters/PSWMs (referred to herein as S3_08_3628). Previous work applying the same approach on S. coelicolor  reported 2,497 putative TFBSs. However, using dyad word lengths of 3–5 nt (used here) instead of just 4 nt results in additional matrices, as well as all of the matrices found previously .
From Figures 6 and 7 it is apparent, particularly in S. coelicolor (Figure 6), that the higher a PSWM's Z-score (and thus higher specificity), the lower the PSWM operon prediction accuracy. Somewhat counter intuitively, this suggests that the more successful PSWMs in terms of operon prediction accuracy predict many TFBSs (due to a low Z-score indicating lack of specificity in the prediction of an actual TFBS) such that very few operonic predictions are made. Further evidence for this can be seen in Figures 6 and 7 which show accuracy is correlated to the amount of hits, whereby many hits equates to higher accuracy. This demonstrates that high operon prediction accuracy is obtained by maximising sensitivity (coverage) of the TFBS prediction, albeit at the expense of TFBS prediction specificity, and without over-predicting TFBSs either.
Figures 6 and 7 also consider nucleotide biases. In E. coli the AT-richness of a PSWM does not appear to be related to accuracy due to the wide distribution of scores (Figure 7). In S. coelicolor however the matrices that are more AT rich tend to have a higher operon prediction accuracy, seen as a right shift in Figure 6. AT tracts in the transcription initiation region of genes are not uncommon  but when the PSWM operon prediction is restricted to using PSWMs of the set S3_08_3628_0 that have an AT-richness greater than 40%, no effect on operon prediction accuracy was found (data not shown). Finally there does not seem to be any relatedness between a PSWM's accuracy and its bias towards non coding regions in both E. coli and S. coelicolor (Figures 6 and 7 respectively).
To date non-homology based operon prediction methods that search upstream regions for motifs have only included promoters and/or terminators [9–12, 17]. Data is presented here that uses a different upstream motif, transcription factor binding sites, to predict operons in both S. coelicolor and E. coli. Avoiding the need for experimentally validated examples of promoters/terminators and therefore dependency on the amount of prior knowledge available [9, 10, 12, 17] this novel approach is able to predict operons in E. coli and S. coelicolor with 83% and 93% accuracy respectively. This strongly suggests that predicted TFBSs are a highly suitable feature for inclusion in integrated operon prediction strategies, not only in well characterised systems but also in other poorly-annotated but sequenced prokaryotes, where good quality TFBS predictions can be obtained ab initio with the protocol described here. It should be noted here that our definition of non-operonic genes is expanded by considering triplets with genes transcribed in opposite directions and that the overall prediction accuracy falls to around 64% when only the transcriptional boundary set are used. However, this drop off also occurs using intergenic distance as the sole predictor, reducing to 62% accuracy with the same data set (and same threshold as that optimised for the negative set including directional examples). Thus, our claim is not that predicted TFBS alone are superior operon predictors, rather that they should be used as independent features that offer a complementary approach.
Indeed, TFBS may offer other advantages. Thus far intergenic distance has proven to be the most accurate single feature for operon prediction in E. coli and B. subtilis [9, 10, 12–14, 49]. Intergenic distance is implicitly linked to our method since TFBS identification/prediction is focused on intergenic non-coding regions only. Comparing the operon prediction accuracy in S. coelicolor of the TFBS and intergenic distance approach (using the same log-likelihood method documented in ) both features achieve similar accuracy (93–95%) when applied to the same positive and negative (both directional and transcriptional boundary) operon example data set. The operon predictions themselves however are not perfectly correlated between the two approaches, although there is a strong correlation - around 0.7 depending on thresholds used for TFBS and intergenic distance predictions. Hence, the two features are complementary, and their combination is expected to produce further improvement in operon prediction. Finally, although both methods perform equally well, TFBSs as predictors of operons are advantageous since additional regulatory information about a particular putative operon may be inferred.
This protocol derives genome-specific PSWMs (Position Specific Weight Matrices) representing significant dimers, which can subsequently be exploited for operon prediction. The majority (77% - 93%) of experimentally validated E. coli transcription factor binding footprints can be characterised as palindromic (either with or without mismatches allowed) and consequently the PSWM set was filtered prior to operon prediction. Restricting the PSWM data set to 'perfect' palindromes (no mismatches allowed) increased operon prediction accuracy in S. coelicolor but not E. coli and therefore, and supports the conclusion that genome-specific models for operon prediction should be employed [19, 50]. The improved performance in S. coelicolor compared to E. coli may be a result of the PSWM construction strategy. The complete upstream intergenic regions were used in S. coelicolor whilst a restricted set based on annotated TFBSs was used in E. coli to find over-represented dimers. This would be expected to lead to increased coverage in S. coelicolor, and perhaps increased specificity in E. coli. This is borne out by the results, where high TFBS prediction coverage leads to superior operon prediction performance.
The false negatives (FNs) in this operon prediction strategy are intra-operonic genes (genes known to be transcribed together with its respective upstream gene) that have a predicted binding site (given the set of predicted PSWMs and applied threshold) in their upstream non-coding regions. The existence of these is not wholly surprising given our previous results  which demonstrated that intra-operonic genes with increased expression relative to the first gene of the same operon are more likely to have a predicted TFBS in their upstream non-coding regions. This intra-operonic promotion of gene expression has been noted by other groups [11, 32, 33]. Using a μbackground+4σbackground as a PSWM threshold we found that 23% of the S. coelicolor and 43% of the E. coli false negatives (35 and 227 FNs respectively) had higher expression than the first gene of the same operon. Therefore, when microarray expression data is available, it is possible to combine this with high quality TFBS predictions to improve the overall accuracy of operon prediction and gain information on the dynamic nature of some operons (specific, isolated control within operons for specific responses). This would not be possible with the application of intergenic distance alone to predict operons, and offers a way forward to address the true complexity of operon regulation.
This work has demonstrated the suitability of predicted TFBSs as operon predictors independently. We are currently working on a genome specific operon predictor for both S. coelicolor and E. coli that is able to combine TFBSs and other features found to be useful in operon predictions (Laing et al, Manuscript in preparation). Integrated approaches to operon prediction in prokaryotes are clearly needed, since expression of genes within operons is affected by several factors, not least the presence of suitable transcription factor binding sites, and array-based data on its own is insufficient to understand this phenomenon. Indeed, a recently developed operon prediction method for S. coelicolor  that combines intergenic distance, expression data and terminators, whilst able to show comparable operon prediction accuracy to that presented here, is unable to predict 'dynamic' operons, defined as the isolated expression of a gene(s) from their 'common' (under 'normal' conditions) transcriptional unit to which they reside. The findings of this work suggests that TFBS are a useful feature for predicting operons of all prokaryotes in their own right but also have the capacity to allow the modelling of dynamic operon structures of an entire genome in a particular, specific condition (e.g. response to a stress) when combined with expression data .
EL would like to thank the MRC for funding in the form of a studentship and the EC FP6 ActinoGEN programme.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.