Predicted transcription factor binding sites as predictors of operons in Escherichia coli and Streptomyces coelicolor
© Laing et al; licensee BioMed Central Ltd. 2008
Received: 09 October 2007
Accepted: 12 February 2008
Published: 12 February 2008
As a polycistronic transcriptional unit of one or more adjacent genes, operons play a key role in regulation and function in prokaryotic biology, and a better understanding of how they are constituted and controlled is needed. Recent efforts have attempted to predict operonic status in sequenced genomes using a variety of techniques and data sources. To date, non-homology based operon prediction strategies have mainly used predicted promoters and terminators present at the extremities of transcriptional unit as predictors, with reasonable success. However, transcription factor binding sites (TFBSs), typically found upstream of the first gene in an operon, have not yet been evaluated.
Here we apply a method originally developed for the prediction of TFBSs in Escherichia coli that minimises the need for prior knowledge and tests its ability to predict operons in E. coli and the 'more complex', pharmaceutically important, Streptomyces coelicolor. We demonstrate that through building genome specific TFBS position-specific-weight-matrices (PSWMs) it is possible to predict operons in E. coli and S. coelicolor with 83% and 93% accuracy respectively, using only TFBS as delimiters of operons. Additionally, the 'palindromicity' of TFBS footprint data of E. coli is characterised.
TFBS are proposed as novel independent features for use in prokaryotic operon prediction (whether alone or as part of a set of features) given their efficacy as operon predictors in E. coli and S. coelicolor. We also show that TFBS footprint data in E. coli generally contains inverted repeats with significantly (p < 0.05) greater palindromicity than random sequences. Consequently, the palindromicity of putative TFBSs predicted can also enhance operon predictions.
Genes within an operon are, in the majority of cases, transcribed from a single promoter . Therefore, any integrated approach to the understanding of gene expression in prokaryotic systems must consider operons as one of the basic units, which are regulated by promoters, transcription factors and associated proteins. Although over 500 bacterial genomes have been sequenced and microarray-based expression studies are increasingly common, defining and mapping operon structure is time-consuming and complex, and the ability to predict them ab initio is therefore important. Since operon structure is frequently unstable across species boundaries [2–8], most approaches have used non-homology based data to predict operon structure. Besides predicted promoters and terminators, intergenic distance and expression data are also often used in non-homology based operon prediction strategies with varying sensitivity [9–17]. However, since there is generally limited data on known promoter locations and/or unreliable prediction methods, the identification of promoters is often difficult and makes their use problematic for operon prediction. In an attempt to avoid the problems associated with prediction of promoters it is worthwhile investigating the usefulness of other regulatory binding sites as potential predictors of operons, something not previously attempted.
Aside from promoters and terminators, transcription factor binding plays a key role in controlling the expression level of a gene or group of genes. Therefore, to coordinate the expression of genes within an operon a transcription factor binding site (TFBS), just like promoters, should be found predominantly upstream of the first gene in an operon; although there are a small number of reported cases of secondary binding sites within operons e.g. lac operon  and in our previous study , the traditional, 'static' view of operons (a group or genes transcribed together in all conditions) is applied. Prokaryotic transcription factors are usually deemed as either activators or repressors (with some having a dual role) depending on the effect they have on the rate of transcription initiation of the downstream gene(s). The dimeric nature of repressor and activator binding in prokaryotes requires that a recognition site must be a dyad, and typically an inverted repeat [20, 21]. This 'structure' of a binding sequence coupled with their typically short length (6–15 bp) can be used to predict likely TFBSs themselves [21, 22] and therefore if this can be done with confidence, can be used in the prediction of operonic gene pairs.
TFBSs are often described as consensus sequences or regular expressions that represent the most common sequence in a set of closely related binding site examples. Though simple in their construction, such consensus sequences are unable to imitate the variability that exists within true TFBSs in nature, where some sites may tolerate a range of bases and retain function. When using simple regular expressions, TFBS prediction is therefore dependent on the amount of mismatches (if any) that are tolerated. Defining the best consensus sequence for predicting sites is also difficult . Position Specific Weight Matrices (PSWMs), first used for the characterisation of translation initiation sites  are an alternative to consensus sequences. The methods used to produce PSWMs differ mainly in the type of information used to collect a set of binding site examples. For example, the identification of regulatory motifs which can be used to predict regulons, commonly adopted after microarray data collection, uses a set of co-regulated genes as a prerequisite [c.f. [25, 26]]. Similarly, prior knowledge can be used where experimental characterisation of a common regulatory binding site leads to the searching of upstream regions of the genes for a similar motif . Knowledge of protein structure has also been applied to regulatory protein binding site identification through binding energy calculations between nucleotide and amino acid (e.g. [28, 29]). In this study however, we wished to develop a method applicable to Streptomyces coelicolor genomics and due to the relative paucity of comprehensive information on Streptomyces gene regulation an a priori method that is capable of producing PSWMs is desirable. It will also support an operon prediction method that integrates data from various sources (e.g. intergenic distance  and/or expression data ) by increasing the amount of information about an operon whilst still being applied generically to all sequenced prokaryotic species without a dependency on the availability of functional or structural information.
One such method was published by the Siggia group  which requires no prior knowledge, can in principle be applied to any genome, and can generate PSWMs of dyads. The algorithm has successfully been applied in E. coli , B. subtilis  and additionally to S. coelicolor . Here the published method  was implemented and applied in Escherichia coli and Streptomyces coelicolor.
By constructing PSWMs from over-represented words in a given data set it is possible to test the upstream regions of genes for putative TFBSs. Based on the 'classical' definition of an operon  intra-operonic genes (excluding the first gene of an operon) should not have a TFBS in their upstream region. Although this is increasingly viewed as overly simplistic as an operon definition [11, 19, 32, 33] it appears to hold for the large majority of operons in most conditions. Subsequently, adjacent gene pairs that have a TFBS in between them are not expected to be part of the same operon. Using positive and negative examples of genes which are considered to be operonic members, the use of TFBSs as predictors of operons can be tested.
Prior to this, the first part of this work characterises the palindromic tendency of footprints of known transcription factors in E. coli, a property which is later shown to improve operon prediction in S. coelicolor. The second part of this work concentrates on developing a strategy of implementing the Siggia group's algorithm  through the use of different thresholds and examines the differences in results obtained. Comparisons with the previous TFBS set predicted in S. coelicolor  are also made. Finally the usefulness of TFBSs as predictors of operons in both E. coli and S. coelicolor is discussed.
PSWM construction algorithm
The algorithm published by Li and co-workers  is capable of taking any set of upstream sequences (referred to herein as UPSQs) and constructing PSWMs from significantly over-represented dimers within this set. This method can be summarised by five steps:
and n(W1) and n(W2) are the total number of occurrences of W1 and W2 in the UPSQs data set and is the number of independent positions in the data where a motif M of length L(M) can be placed (M can be W1, W2 or D). The summation is over the regulatory regions (r) over all genes, each with a length L(r) (as taken from ).
2) A similarity matrix is constructed via an all-vs-all comparison of significant dimer pairs scoring above a designated threshold. Similarity between pairs of dimers is calculated by a simple scoring method, evaluating the number of matches minus the number of mismatches. Matches to N (an unknown base) or to any other base or overhangs are ignored (a heuristic method of sliding the sequences along to get the best score is applied). This score is then normalised to between 0 and 1.
3) The similarity matrix is used to cluster the significant dimers using CAST (Clustering Affinity Search Technique ) where a similarity threshold for clustering is set.
4) For every cluster, the actual sequences in the UPSQs set matched by any member of the cluster are extracted along with up to 10 flanking nucleotides (dependent on the position of the match and length of the UPSQ sequence). A multiple alignment of these sequences using CONSENSUS  produces an alignment matrix for that cluster.
5) Each alignment matrix is converted to a PSWM using the background base counts for the organism of interest. Here the background frequencies of A = 0.2953, C = 0.2059, G = 0.2023, T = 0.2965 were used for E. coli and A = 0.1589, C = 0.3494, G = 0.3420, T = 0.1497 for S. coelicolor.
A full description of the algorithm is given in the original publication .
In this work the thresholds used in steps 2 (dimer significance) and 3 (clustering threshold) are of the most interest. It is these thresholds that can affect the integrity of the PSWMs, which in turn could affect operon prediction accuracy. The different thresholds of dimer significance used were: -log10 P > 6, approximating to -log10(1/amount of all dimers tested for), -log10 P > 3, approximating to -log10(1/amount of all perfectly palindromic (i.e. palindromic without any mismatches) dimers tested for), and two intermediate thresholds of -log10 P > 4 and -log10 P > 5. Different clustering thresholds of 0.6, 0.7 and 0.8 were also tested.
Prediction of TFBSs
With each PSWM there is an associated distribution of scores obtained when applied to a given data set of sequences. Two distributions of scores are obtained when scoring sequences from either the UPSQs set that were used to build the PSWM (collected in step 4 of the algorithm), or sequences from the UPSQs set that were not used to build the PSWM (sequences that will have a 'score' but should not have a significant match to the PSWM (as they were not used to build the PSWM) and thus provide a means of estimating noise in the PSWM prediction). These two Gaussian distributions, describe the 'real hits' and the background random matches respectively.
Using the mean and standard deviation of the two distributions the specificity and sensitivity of searching for a potential TFBS can be controlled via two thresholds:
threshold 1 = μ background + xσ background
threshold 2 = μ real - xσ real
where x represents any real number and affects the false-positive and false-negative rates when using threshold1 and threshold2 respectively.
In this study for a sequence to be predicted to contain a given TFBS the score of the sequence against the PSWM must be greater than μreal-2σreal (giving 98% coverage) and μbackground+Sσbackground, where S is used to represent trials of different thresholds from 0 to 10 in 0.5 increments as discussed further in this work. Since this term is the only part of the threshold that changes, the TFBS prediction threshold used will only be referred to herein as μbackground+Sσbackground.
UPSQs data sets used
The sequence data used to form the UPSQs and operon prediction test sets were taken from E. coli K12 strain MG1655 from Genbank (accession number: NC_000913[36, 37]) and S. coelicolor chromosome (not plasmids) in EMBL (accession number: AL645882 version 2 [38, 39]). In E. coli the UPSQs set was created using the upstream intergenic sequences (maximum length 300 bp) of the first genes of known and predicted operons in the E. coli sequence taken from RegulonDB [40, 41] as described previously . For S. coelicolor upstream intergenic sequences (maximum length 300 bp) of all genes in the chromosome were used. Testing of the constructed PSWMs was conducted using the upstream intergenic sequences (maximum length 300 bp) of the positive and negative examples of operon members of the relevant organism.
Positive and negative examples of operon members
Assuming that the entire polycistronic transcript is documented, non-operons of length 2 were formed by using the initial gene of the operon and its upstream neighbouring gene if that neighbour is transcribed in the same direction (e.g. gene pair d-e), and the last gene in the documented operon and its downstream neighbouring gene, again if it is transcribed in the same direction (e.g. gene pair g-h). These negative examples are termed as transcriptional boundaries. To increase the size of the non-operonic data set a non-operon of length 3 was also formed by collecting triplets of genes that are transcribed in the opposite direction (e.g. genes b-c-d), termed as directional examples, as the central gene (e.g. gene 'c' in Figure 2) is certainly not in the same transcriptional unit as the two flanking genes transcribed in the opposite direction (e.g. genes b and d in Figure 2). In total 35 operons and 1282 non-operons were collected for S. coelicolor, and 325 operons and 821 non-operons were collected for E. coli. Both positive and negative examples for S. coelicolor and E. coli are provided [see Additional file 1].
All examples of operonic genes (positive and negative) were split into singular genes such that the operonic connection status (in the same operon or not) of all genes with their respective upstream gene neighbours could be tested.
Characterisation of inverted repeat binding sites
Coverage of inverted repeats in E. coli footprint data
Random average coverage %
Footprint coverage %
When considering all transcription factor footprints together, as depicted in Figure 3 and Table 1, the overall coverage obtained using palindromes and mismatches to characterise binding sites can be calculated. Coverage is defined as the percentage of all footprint sites identified as containing a palindrome. Here, 77% coverage can be achieved by searching for perfect palindromes and 92–93% coverage when allowing for 1 mismatch. However, it is important to note that tolerating mismatches decreases the specificity of binding site prediction. Of course, the occurrence of a palindromic inverted repeat within an upstream intergenic region does not necessarily mean that a sequence is a true binding site either, but it is expected to increase the likelihood. Indeed, in support of this, when comparing the coverage obtained (using Z-score) searching for perfect palindromes in true footprint sites to that obtained from 500 random footprint simulations a significant difference (p < 0.05) is observed (see Table 1). Here, random bases were chosen to create a sequence the same length of true, known footprint examples taken from those used to create Figure 3. No significant difference was observed when allowing 1 or 2 mismatches for Repressors and Dual regulators.
Implementation of PSWM algorithm
Implementation of the PSWM construction algorithm using E. coli sequences resulted in 4,102 significant dimers using a Poisson probability threshold of -log10 P > 6. Although this is a similar threshold to previous work , these authors identified only 1,775 significant dimers. Despite rigorous checks with the implementation of this algorithm as published, and variation of parameters and similar experiments to reproduce the published data, this was not achieved and the implementation here consistently defined a larger number of putative sites. However, our implementation did discover all the dimers predicted by the earlier study , and this set was also used in our evaluation protocol.
Different thresholds of dimer significance (as described in the methods section) were then applied and sets of significant dimers were clustered using CAST  (step 3 of the algorithm) producing 12 data sets in total. In addition to these 12 sets of clustered sequences, the 849 clusters published previously , referred to as "LiFlank", were put through steps 4 and 5 of the PSWM construction algorithm. The resultant PSWMs for all data sets and the original PSWM data , referred to as "Li", were then used to predict binding sites in the upstream intergenic sequences of the positive and negative examples of operon members in E. coli with 0.5 increments of the TFBS prediction threshold μbackground+Sσbackground (see methods). For each increment of S, counts of TFBS predictions for the positive and negative tests sets were made enabling the calculation of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN).
Throughout this work we introduce a simple nomenclature to distinguish the PSWM sets, x_y_z, which refers to the Poisson distribution dimer significance threshold x, the word clustering threshold y, and the final number of clusters z.
Area Under the Curve scores for PSWM data sets
The second best operon predictor set (3_08_1506 with 79% accuracy) is of one of the novel PSWM sets constructed in this study (using all steps of the algorithm). The dimer significance of -log10 P > 3 and clustering threshold of 0.8 was therefore used to construct PSWMs for S. coelicolor resulting in 22,359 significant dimers producing 3,628 clusters/PSWMs (referred to herein as S3_08_3628). Previous work applying the same approach on S. coelicolor  reported 2,497 putative TFBSs. However, using dyad word lengths of 3–5 nt (used here) instead of just 4 nt results in additional matrices, as well as all of the matrices found previously .
TFBSs for operon prediction
From Figures 6 and 7 it is apparent, particularly in S. coelicolor (Figure 6), that the higher a PSWM's Z-score (and thus higher specificity), the lower the PSWM operon prediction accuracy. Somewhat counter intuitively, this suggests that the more successful PSWMs in terms of operon prediction accuracy predict many TFBSs (due to a low Z-score indicating lack of specificity in the prediction of an actual TFBS) such that very few operonic predictions are made. Further evidence for this can be seen in Figures 6 and 7 which show accuracy is correlated to the amount of hits, whereby many hits equates to higher accuracy. This demonstrates that high operon prediction accuracy is obtained by maximising sensitivity (coverage) of the TFBS prediction, albeit at the expense of TFBS prediction specificity, and without over-predicting TFBSs either.
Figures 6 and 7 also consider nucleotide biases. In E. coli the AT-richness of a PSWM does not appear to be related to accuracy due to the wide distribution of scores (Figure 7). In S. coelicolor however the matrices that are more AT rich tend to have a higher operon prediction accuracy, seen as a right shift in Figure 6. AT tracts in the transcription initiation region of genes are not uncommon  but when the PSWM operon prediction is restricted to using PSWMs of the set S3_08_3628_0 that have an AT-richness greater than 40%, no effect on operon prediction accuracy was found (data not shown). Finally there does not seem to be any relatedness between a PSWM's accuracy and its bias towards non coding regions in both E. coli and S. coelicolor (Figures 6 and 7 respectively).
To date non-homology based operon prediction methods that search upstream regions for motifs have only included promoters and/or terminators [9–12, 17]. Data is presented here that uses a different upstream motif, transcription factor binding sites, to predict operons in both S. coelicolor and E. coli. Avoiding the need for experimentally validated examples of promoters/terminators and therefore dependency on the amount of prior knowledge available [9, 10, 12, 17] this novel approach is able to predict operons in E. coli and S. coelicolor with 83% and 93% accuracy respectively. This strongly suggests that predicted TFBSs are a highly suitable feature for inclusion in integrated operon prediction strategies, not only in well characterised systems but also in other poorly-annotated but sequenced prokaryotes, where good quality TFBS predictions can be obtained ab initio with the protocol described here. It should be noted here that our definition of non-operonic genes is expanded by considering triplets with genes transcribed in opposite directions and that the overall prediction accuracy falls to around 64% when only the transcriptional boundary set are used. However, this drop off also occurs using intergenic distance as the sole predictor, reducing to 62% accuracy with the same data set (and same threshold as that optimised for the negative set including directional examples). Thus, our claim is not that predicted TFBS alone are superior operon predictors, rather that they should be used as independent features that offer a complementary approach.
Indeed, TFBS may offer other advantages. Thus far intergenic distance has proven to be the most accurate single feature for operon prediction in E. coli and B. subtilis [9, 10, 12–14, 49]. Intergenic distance is implicitly linked to our method since TFBS identification/prediction is focused on intergenic non-coding regions only. Comparing the operon prediction accuracy in S. coelicolor of the TFBS and intergenic distance approach (using the same log-likelihood method documented in ) both features achieve similar accuracy (93–95%) when applied to the same positive and negative (both directional and transcriptional boundary) operon example data set. The operon predictions themselves however are not perfectly correlated between the two approaches, although there is a strong correlation – around 0.7 depending on thresholds used for TFBS and intergenic distance predictions. Hence, the two features are complementary, and their combination is expected to produce further improvement in operon prediction. Finally, although both methods perform equally well, TFBSs as predictors of operons are advantageous since additional regulatory information about a particular putative operon may be inferred.
This protocol derives genome-specific PSWMs (Position Specific Weight Matrices) representing significant dimers, which can subsequently be exploited for operon prediction. The majority (77% – 93%) of experimentally validated E. coli transcription factor binding footprints can be characterised as palindromic (either with or without mismatches allowed) and consequently the PSWM set was filtered prior to operon prediction. Restricting the PSWM data set to 'perfect' palindromes (no mismatches allowed) increased operon prediction accuracy in S. coelicolor but not E. coli and therefore, and supports the conclusion that genome-specific models for operon prediction should be employed [19, 50]. The improved performance in S. coelicolor compared to E. coli may be a result of the PSWM construction strategy. The complete upstream intergenic regions were used in S. coelicolor whilst a restricted set based on annotated TFBSs was used in E. coli to find over-represented dimers. This would be expected to lead to increased coverage in S. coelicolor, and perhaps increased specificity in E. coli. This is borne out by the results, where high TFBS prediction coverage leads to superior operon prediction performance.
The false negatives (FNs) in this operon prediction strategy are intra-operonic genes (genes known to be transcribed together with its respective upstream gene) that have a predicted binding site (given the set of predicted PSWMs and applied threshold) in their upstream non-coding regions. The existence of these is not wholly surprising given our previous results  which demonstrated that intra-operonic genes with increased expression relative to the first gene of the same operon are more likely to have a predicted TFBS in their upstream non-coding regions. This intra-operonic promotion of gene expression has been noted by other groups [11, 32, 33]. Using a μbackground+4σbackground as a PSWM threshold we found that 23% of the S. coelicolor and 43% of the E. coli false negatives (35 and 227 FNs respectively) had higher expression than the first gene of the same operon. Therefore, when microarray expression data is available, it is possible to combine this with high quality TFBS predictions to improve the overall accuracy of operon prediction and gain information on the dynamic nature of some operons (specific, isolated control within operons for specific responses). This would not be possible with the application of intergenic distance alone to predict operons, and offers a way forward to address the true complexity of operon regulation.
This work has demonstrated the suitability of predicted TFBSs as operon predictors independently. We are currently working on a genome specific operon predictor for both S. coelicolor and E. coli that is able to combine TFBSs and other features found to be useful in operon predictions (Laing et al, Manuscript in preparation). Integrated approaches to operon prediction in prokaryotes are clearly needed, since expression of genes within operons is affected by several factors, not least the presence of suitable transcription factor binding sites, and array-based data on its own is insufficient to understand this phenomenon. Indeed, a recently developed operon prediction method for S. coelicolor  that combines intergenic distance, expression data and terminators, whilst able to show comparable operon prediction accuracy to that presented here, is unable to predict 'dynamic' operons, defined as the isolated expression of a gene(s) from their 'common' (under 'normal' conditions) transcriptional unit to which they reside. The findings of this work suggests that TFBS are a useful feature for predicting operons of all prokaryotes in their own right but also have the capacity to allow the modelling of dynamic operon structures of an entire genome in a particular, specific condition (e.g. response to a stress) when combined with expression data .
EL would like to thank the MRC for funding in the form of a studentship and the EC FP6 ActinoGEN programme.
- Jacob F, Monod J: Genetic regulatory mechanisms in the synthesis of proteins. J Mol Biol. 1961, 3: 318-356.PubMedView ArticleGoogle Scholar
- Dandekar T, Snel B, Huynen M, Bork P: Conservation of gene order: a fingerprint of proteins that physically interact. Trends Biochem Sci. 1998, 23: 324-328. 10.1016/S0968-0004(98)01274-2.PubMedView ArticleGoogle Scholar
- Itoh T, Takemoto K, Mori H, Gojobori T: Evolutionary instability of operon structures disclosed by sequence comparisons of complete microbial genomes. Mol Biol Evol. 1999, 16: 332-346.PubMedView ArticleGoogle Scholar
- Lathe WC, Snel B, Bork P: Gene context conservation of a higher order than operons. Trends Biochem Sci. 2000, 25: 474-479. 10.1016/S0968-0004(00)01663-7.PubMedView ArticleGoogle Scholar
- Mushegian AR, Koonin EV: Gene order is not conserved in bacterial evolution. Trends Genet. 1996, 12: 289-290. 10.1016/0168-9525(96)20006-X.PubMedView ArticleGoogle Scholar
- Snel B, Bork P, Huynen MA: Conservation of gene co-regulation in prokaryotes and eukaryotes. Trends Biotechnol. 2002, 20: 410-10.1016/S0167-7799(02)02040-1.View ArticleGoogle Scholar
- Wolf Y, Rogozin IB, Kondrashov AS, Koonin EV: Genome Alignment, Evolution of prokaryotic genomes organisation, and prediction of gene function using genomic context. Genome Res. 2001, 11: 356-372. 10.1101/gr.GR-1619R.PubMedView ArticleGoogle Scholar
- Xie G, Keyhani NO, Bonner CA, Jensen RA: Ancient origin of the tryptophan operon and the dynamics of evolutionary change. Microbiol Mol Biol Rev. 2003, 67: 303-342. 10.1128/MMBR.67.3.303-342.2003.PubMedPubMed CentralView ArticleGoogle Scholar
- Bockhorst J, Craven M, Page D, Shavlik J, Glasner J: A Bayesian network approach to operon prediction. Bioinformatics. 2003, 19: 1227-1235. 10.1093/bioinformatics/btg147.PubMedView ArticleGoogle Scholar
- Bockhorst J, Qiu Y, Glasner J, Liu M, Blattner F, Craven M: Predicting bacterial transcriptional units using sequence and expression data. Bioinformatics. 2003, i34-i43. 10.1093/bioinformatics/btg1003.Google Scholar
- Charaniya S, Mehra S, Lian W, Jayapal KP, Karypis G, Hu W: Transcriptome dynamics-based operon prediction and verification in Streptomyces coelicolor. Nucleic Acids Res. 2007, 35 (21): 7222-7236. 10.1093/nar/gkm501.PubMedPubMed CentralView ArticleGoogle Scholar
- Craven M, Page D, Shavlik J, Bockhorst J, Glasner J: A probabilistic learning approach to whole-genome operon prediction. Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology (ISMB): August 19–23 2000; San Diego. Edited by: Bourne P, Gribskov M, Altman R, Jensen N, Hope D, Lengauer T, Mitchell J, Scheeff E, Smith C, Strande S, et al. 2000, American Association for Artifical Intelligence Press, 116-127.Google Scholar
- de Hoon MJL, Imoto S, Kobayashi K, Ogasawara N, Miyano S: Predicting the operon structure of Bacillus subtilis using operon length, intergenic distance, and gene expression information. Pac Symp Biocomput. 2004, 9: 276-287.Google Scholar
- Sabatti C, Rohlin L, Oh M, Liao JC: Co-expression pattern from DNA microarray experiments as a tool for operon prediction. Nucleic Acids Res. 2002, 30: 2886-2893. 10.1093/nar/gkf388.PubMedPubMed CentralView ArticleGoogle Scholar
- Salgado H, Moreno-Hagelsieb G, Smith TF, Collado-Vides J: Operons in Escherichia coli: Genome analyses and predictions. Proc Natl Acad Sci USA. 2000, 97: 6652-6657. 10.1073/pnas.110147297.PubMedPubMed CentralView ArticleGoogle Scholar
- Tjaden B, Saxena RM, Stolyar S, Haynor DR, Kolker E, Rosenow C: Transcriptome analysis of Escherichia coli using high-density oligonucleotide probe arrays. Nucleic Acids Res. 2002, 30: 3732-3738. 10.1093/nar/gkf505.PubMedPubMed CentralView ArticleGoogle Scholar
- Yada T, Nakao M, Totoki Y, Nakai K: Modelling and predicting transcriptional units of Escherichia coli genes using hidden markov models. Bioinformatics. 1999, 15: 987-993. 10.1093/bioinformatics/15.12.987.PubMedView ArticleGoogle Scholar
- Reznikoff WS, Winter RB, Hurley CK: The Location of the Repressor Binding Sites in the lac Operon (competition binding assay). Proc Natl Acad Sci USA. 1974, 71: 2314-2318. 10.1073/pnas.71.6.2314.PubMedPubMed CentralView ArticleGoogle Scholar
- Laing E, Mersinias V, Smith CP, Hubbard SJ: Analysis of gene expression in operons of Streptomyces coelicolor. Genome Biol. 2006, 7: R46-10.1186/gb-2006-7-6-r46.PubMedPubMed CentralView ArticleGoogle Scholar
- Rigali S, Derouaux A, Giannotta F, Dusart J: Subdivision of the Helix-Turn-Helix GntR Family of Bacterial Regulators in the FadR, HutC, MocR, and YtrA Subfamilies. J Biol Chem. 2002, 277: 12507-12515. 10.1074/jbc.M110968200.PubMedView ArticleGoogle Scholar
- Van Helden J, Rios AF, Collado-Vides J: Discovering regulatory elements in non-coding sequences by analysis of spaced dyads. Nucleic Acids Res. 2000, 28: 1808-1818. 10.1093/nar/28.8.1808.PubMedView ArticleGoogle Scholar
- Li H, Rhodius V, Gross C, Siggia ED: Identification of the binding sites of regulatory proteins in bacterial genomes. Proc Natl Acad Sci USA. 2002, 99: 11772-11777. 10.1073/pnas.112341999.PubMedPubMed CentralView ArticleGoogle Scholar
- Stormo G: DNA binding sites: representation and discovery. Bioinformatics. 2000, 16: 16-23. 10.1093/bioinformatics/16.1.16.PubMedView ArticleGoogle Scholar
- Stormo GD, Schneider TD, Gold LM: Characterization of translational initiation sites in E. coli. Nucleic Acids Res. 1982, 10: 2971-2996. 10.1093/nar/10.9.2971.PubMedPubMed CentralView ArticleGoogle Scholar
- Brazma A, Vilo J: Gene expression data analysis. FEBS lett. 2000, 480: 17-24. 10.1016/S0014-5793(00)01772-5.PubMedView ArticleGoogle Scholar
- Van Helden J, André B, Collado-Vides J: Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotides frequencies. J Mol Biol. 1998, 281: 827-842. 10.1006/jmbi.1998.1947.PubMedView ArticleGoogle Scholar
- Bucca G, Brassington AM, Hotchkiss G, Mersinias V, Smith CP: Negative feedback regulation of dnaK, clpB and lon expression by the DnaK chaperone machine in Streptomyces coelicolor, identified by transcriptome and in vivo DnaK-depletion analysis. Mol Microbiol. 2003, 50: 153-66. 10.1046/j.1365-2958.2003.03696.x.PubMedView ArticleGoogle Scholar
- Kono H, Sarai A: Structure-based prediction of DNA target sites by regulatory proteins. Proteins. 1999, 35: 114-31. 10.1002/(SICI)1097-0134(19990401)35:1<114::AID-PROT11>3.0.CO;2-T.PubMedView ArticleGoogle Scholar
- Liu J, Stormo GD: Quantitative analysis of EGR proteins binding to DNA: assessing additivity in both the binding site and the protein. BMC Bioinformatics. 2005, 6: 176-10.1186/1471-2105-6-176.PubMedPubMed CentralView ArticleGoogle Scholar
- Mwangi MM, Siggia ED: Genome wide identification of regulatory motifs in Bacillus subtilis. BMC Bioinformatics. 2003, 4: 18-10.1186/1471-2105-4-18.PubMedPubMed CentralView ArticleGoogle Scholar
- Studholme DJ, Bentley SD, Kormanec J: Bioinformatic identification of novel regulatory DNA sequence motifs in Streptomyces coelicolor. BMC Microbiology. 2004, 4: 14-10.1186/1471-2180-4-14.PubMedPubMed CentralView ArticleGoogle Scholar
- de Hoon MJL, Makita Y, Nakai K, Miyano S: Prediction of Transcriptional Terminators in Bacillus subtilis and related Species. PLoS Comput Biol. 2005, 1 (3): e25-. 10.1371/journal.pcbi.0010025. Epub 2005 Aug 12PubMedPubMed CentralView ArticleGoogle Scholar
- Makita Y, Nakao M, Ogasawara N, Nakai K: DBTBS: Database of transcriptional regulation in Bacillus subtilis and its contribution to comparative genomics. Nucleic Acids Res. 2004, 32: D75-D77. 10.1093/nar/gkh074.PubMedPubMed CentralView ArticleGoogle Scholar
- Ben-Dor A, Shamir R, Yakhini Z: Clustering gene expression patterns. J Comput Biol. 1999, 6: 281-97. 10.1089/106652799318274.PubMedView ArticleGoogle Scholar
- Stormo GD, Hartzell GW: Identifying protein-binding sites from unaligned DNA fragments. Prco Natl Acad Sci USA. 1989, 86 (4): 1183-7. 10.1073/pnas.86.4.1183.View ArticleGoogle Scholar
- GenBank. [http://www.ncbi.nlm.nih.gov/Genbank]
- Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL: GenBank: update. Nucleic Acids Res. 2004, 32: D23-D26. 10.1093/nar/gkh045.PubMedPubMed CentralView ArticleGoogle Scholar
- EMBL. [http://www.ebi.ac.uk/embl]
- Kanz C, Aldebert P, Althorpe N, Baker W, Baldwin A, Bates K, Browne P, van den Broek A, Castro M, Caochrane G, et al: The EMBL Nucleotide Sequence Database. Nucleic Acids Res. 2005, 33: D29-D33. 10.1093/nar/gki098.PubMedPubMed CentralView ArticleGoogle Scholar
- RegulonDB. [http://regulondb.ccg.unam.mx/]
- Salgado H, Gama-Castro S, Peralta-Gil M, Diaz-Peredo E, Sanchez-Solano F, Santos-Zavaleta A, Martinez-Flores I, Jimenez-Jacinto V, Bonavides-Martinez C, Segura-Salazar J, et al: RegulonDB (version 5.0): Escherichia coli K-12 transcriptional regulatory network, operon organization, and growth conditions. Nucleic Acids Res. 2006, 34: D394-D397. 10.1093/nar/gkj156.PubMedPubMed CentralView ArticleGoogle Scholar
- Ecocyc. [http://ecocyc.org/]
- Keseler IM, Collado-Vides J, Gama-Castro S, Ingraham J, Paley S, Paulsen IT, Peralata-Gil M, Karp PD: EcoCyc: A comprehensive database resource for Escherichia coli. Nucleic Acids Res. 2005, 33: D334-7. 10.1093/nar/gki108.PubMedPubMed CentralView ArticleGoogle Scholar
- Sadler JR, Sasmor H, Betz JL: A perfectly symmetric lac operator binds the lac repressor very tightly. Proc Natl Acad Sci USA. 1983, 80: 6785-6789. 10.1073/pnas.80.22.6785.PubMedPubMed CentralView ArticleGoogle Scholar
- Galas DJ, Schmitz A: DNAse footprinting: a simple method for the detection of protein-DNA binding specificity. Nucleic acids Res. 1978, 5: 3157-3170. 10.1093/nar/5.9.3157.PubMedPubMed CentralView ArticleGoogle Scholar
- McGuire AM, Hughes JD, Church GM: Conservation of DNA Regulatory Motifs and Discovery of New Motifs in Microbial Genomes. Genome Res. 2000, 10: 744-757. 10.1101/gr.10.6.744.PubMedView ArticleGoogle Scholar
- Cavener DR: Comparison of the consensus sequence flanking translational start sites in Drosophila and vertebrates. Nucleic Acids Res. 1987, 15: 1353-61. 10.1093/nar/15.4.1353.PubMedPubMed CentralView ArticleGoogle Scholar
- Cheema AK, Choudhury NP, Das HK: A- and T-Tract-Mediated Intrinsic Curvature in Native DNA between the Binding site of the Upstream Activator NtrC and the nifLA Promoter of Klebsiella pneumoniae Facilitates Transcription. J Bacteriol. 1999, 181: 5296-5302.PubMedPubMed CentralGoogle Scholar
- Stormo GD, Tan K: Mining genome databases to identify and understand new gene regulatory systems. Curr Opin Microbiol. 2002, 5: 149-153. 10.1016/S1369-5274(02)00309-0.PubMedView ArticleGoogle Scholar
- Price MN, Huang KH, Alm E, Arkin AP: A novel method for accurate operon predictions in all sequenced prokaryotes. Nucleic Acids Res. 2005, 33: 880-892. 10.1093/nar/gki232.PubMedPubMed CentralView ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.