P2TF: a comprehensive resource for analysis of prokaryotic transcription factors
© Ortet et al.; licensee BioMed Central Ltd. 2012
Received: 5 September 2012
Accepted: 11 November 2012
Published: 15 November 2012
Transcription factors (TFs) are DNA-binding proteins that regulate gene expression by activating or repressing transcription. Some have housekeeping roles, while others regulate the expression of specific genes in response to environmental change. The majority of TFs are multi-domain proteins, and they can be divided into families according to their domain organisation. There is a need for user-friendly, rigorous and consistent databases to allow researchers to overcome the inherent variability in annotation between genome sequences.
P2TF (Predicted Prokaryotic Transcription Factors) is an integrated and comprehensive database relating to transcription factor proteins. The current version of the database contains 372,877 TFs from 1,987 completely sequenced prokaryotic genomes and 43 metagenomes. The database provides annotation, classification and visualisation of TF genes and their genetic context, providing researchers with a one-stop shop in which to investigate TFs. The P2TF database analyses TFs in both predicted proteomes and reconstituted ORFeomes, recovering approximately 3% more TF proteins than just screening predicted proteomes. Users are able to search the database with sequence or domain architecture queries, and resulting hits can be aligned to investigate evolutionary relationships and conservation of residues. To increase utility, all searches can be filtered by taxonomy, TF genes can be added to the P2TF cart, and gene lists can be exported for external analysis in a variety of formats.
P2TF is an open resource for biologists, allowing exploration of all TFs within prokaryotic genomes and metagenomes. The database enables a variety of analyses, and results are presented for user exploration as an interactive web interface, which provides different ways to access and download the data. The database is freely available at http://www.p2tf.org/.
Transcription factors (TFs) are DNA-binding proteins involved in the regulation of gene expression. They are found in all living organisms and activate or repress transcription by binding to specific DNA sequences. TFs are characterized by their DNA-binding domains (DBDs), of which the helix-turn-helix (HTH) domain is the most prevalent in prokaryotic genomes.
Many TFs are constitutively active and regulate gene expression through changes in their abundance inside the cell. For example, CarA of Myxococcus xanthus is a repressor of photoprotective carotenoid biosynthesis, and illumination results in the production of an anti-repressor which prevents CarA binding to its operator site. Other TFs (known as one-component systems, or OCSs) are switched on or off by the activity of a sensory domain within the protein. For instance, LexA of Escherichia coli inhibits the SOS response genes until binding to activated RecA leads LexA to autoproteolyse. More typically, the activity of OCS sensory domains is regulated by small-molecule binding: the PurR regulator, for instance, acts as a repressor of purine biosynthesis upon binding a purine co-repressor.
Another common form of TF is the response regulator (RR), which has an N-terminal phospho-acceptor domain. The TF activity of an RR is regulated by its phosphorylation state, which is governed by an environmentally-sensitive histidine kinase. For example, GacA is a pleiotropic regulator in Pseudomonadales, whose activity is modulated through its phosphorylation by the histidine kinase GacS. A histidine kinase-RR pair is known as a two-component system, and these signalling pathways are abundant in prokaryotes. The final major subset of TFs is sigma factors (SFs), which are eubacterial transcription initiation factors. SFs are a labile component of the RNA polymerase holoenzyme and direct the polymerase to specific subsets of promoters. SFs are often regulated by anti-sigma factors: environmentally sensitive inhibitors of SF activity. For instance, in M. xanthus the anti-SF CarR holds the SF CarQ inactive until cells are illuminated, and then CarQ is released to mediate expression of carotenoid biosynthesis genes.
TFs can be categorized into OCSs, RRs, and SFs (and sub-families thereof) through analyses of domain architecture. We defined a fourth category, Transcriptional Regulators (TRs), for TFs which are not OCSs, RRs or SFs. In addition to domain composition, gene organisation in the vicinity of a TF gene can be important for understanding the function of the TF. TFs tend to be located in the genome adjacent to the genes whose expression they regulate (which often include their own gene). Additionally, many TFs are regulated by adjacently encoded gene products such as histidine kinases (which regulate RRs) and anti-SFs (which regulate SFs). Thus a full computational analysis of TFs needs to provide information on both their domain architecture and their gene neighbourhood.
Completely sequenced prokaryotic genomes continue to become available at an ever-increasing rate in the primary databases, and as genome annotation standards still differ widely, there is also an escalating need for secondary databases which undertake rigorous and consistent analysis across all available genomes.
Of prokaryotic TF-related databases, ArchaeaTF contains data from a specific kingdom, a genome-wide survey of TFs in prokaryotes analyses a subset of available genomes, and DBD identifies TFs among a set of predicted (and experimentally defined) proteins. However, to our knowledge, there is no database that provides detailed information about mis-predicted or metagenomic TFs for all completely sequenced genomes and metagenomes.
We have constructed P2TF (Predicted Prokaryotic Transcription Factors), a user-friendly database of predicted TFs for all available completely sequenced prokaryotic genomes and metagenomes. In order to reliably detect TFs, we developed a method that analyses TFs in both predicted proteomes and reconstituted ORFeomes, recovering approximately 3% more TFs than screening predicted proteomes alone. P2TF also presents a thorough analysis of the properties of each predicted TF and implements a hierarchical classification scheme. The set of TFs in P2TF can be filtered in a variety of ways (e.g. by taxonomy, domain architecture, family membership or gene organisation), and user-defined subsets of TFs can be exported in a range of formats, to maximise its usefulness to the community. The result of the P2TF analysis process applied to microbial genomes and metagenomes is accessible via a web interface designed primarily for experimental biologists, at http://www.p2tf.org.
P2TF predicts TF candidates by performing a domain analysis of each protein sequence using RPS-BLAST, with an E-value cut-off of 0.01 and a minimum alignment coverage of 50% of each domain's length. We manually selected a pool of domains from the Pfam and SMART libraries, based on analysis of the literature on sequence-specific DBDs and their associated domains [8, 9, 13] (see the documentation page of the database and Additional file 1). The presence in a protein of a domain defined as a DBD leads to inclusion of the protein in P2TF. In rare cases experimentally-validated TFs will not be found in P2TF because they possess relatively novel DNA-binding sequences which have not yet been ascribed a domain profile by Pfam or SMART. When such novel DNA-binding domain profiles are published they will be added to P2TF, allowing recovery of further TFs.
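The two thresholds described above can be illustrated with a minimal Python sketch. It is not the P2TF implementation: the DomainHit record, its field names and the predict_tfs function are hypothetical, and the sketch assumes that coverage is measured as the aligned fraction of the domain profile and that a hit with an E-value equal to the cut-off is accepted.

```python
from dataclasses import dataclass

E_VALUE_CUTOFF = 0.01  # cut-off stated in the text
MIN_COVERAGE = 0.50    # minimum aligned fraction of the domain length, from the text


@dataclass(frozen=True)
class DomainHit:
    """One domain hit for a protein (hypothetical record layout)."""
    protein_id: str
    domain: str
    evalue: float
    domain_start: int   # 1-based, inclusive position on the domain profile
    domain_end: int     # 1-based, inclusive position on the domain profile
    domain_length: int  # full length of the domain profile


def passes_thresholds(hit):
    """Apply the E-value and coverage filters to a single hit."""
    coverage = (hit.domain_end - hit.domain_start + 1) / hit.domain_length
    # Assumption: a hit exactly at the cut-off is kept.
    return hit.evalue <= E_VALUE_CUTOFF and coverage >= MIN_COVERAGE


def predict_tfs(hits, dbd_domains):
    """Return the IDs of proteins with at least one accepted hit to a DBD."""
    return {
        h.protein_id
        for h in hits
        if h.domain in dbd_domains and passes_thresholds(h)
    }


example_hits = [
    DomainHit("p1", "HTH_8", 1e-5, 1, 40, 42),    # accepted
    DomainHit("p2", "HTH_8", 0.5, 1, 40, 42),     # E-value too high
    DomainHit("p3", "HTH_8", 1e-5, 1, 10, 42),    # covers under 50% of the domain
    DomainHit("p4", "Kinase", 1e-9, 1, 100, 100), # not a DBD in this example
]
candidates = predict_tfs(example_hits, {"HTH_8"})
```

In this toy input only p1 satisfies both filters and carries a domain from the DBD pool, so it is the only predicted TF candidate.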
Users are encouraged to suggest modifications to the database, such as the creation of new categories or families, and/or to validate and curate predicted proteins. To ensure the integrity of the database, we ask interested experts to download formatted data from P2TF; after manual curation, the same downloaded files can be used as an exchange format for updating the database. Four genomes have already been manually curated by the authors (Ramlibacter tataouinensis, Pseudomonas brassicacearum, Deinococcus deserti and Myxococcus xanthus).
The P2TF database can be queried via two modes: keyword searches, and BLAST searches. The first search mode allows users to request genes on the basis of their locus-tag, gene name, GI (GenBank Identifier) or domain possession. To restrict their search, users can browse predictions by querying a genome of interest or a group of genomes belonging to the same taxon. A taxonomic search can be achieved by using the species name, the taxon-id or the lineage name.
A BLAST search mode was also implemented to provide a similarity search against all or individual species within P2TF. Users can use the default BLAST parameters or modify them by entering new options. In addition, users can use P2TF proteins as queries in BLAST searches for similar proteins in UniProt or GenBank through a link on each P2TF protein page.
The search modules present their output as a tabular view that is linked to a full description and genomic context for each selected gene. A selection system has been implemented to add all or part of the data to a shopping cart, or to perform a multiple sequence alignment using the MUSCLE program. The resulting multiple sequence alignments can then be viewed using the Jalview applet.
P2TF also provides information regarding TF gene organisation, both as a CGView zoomable genome map on each genome page and as a zoomable genome browser linked from each gene page. Putative regulatory regions are predicted by considering the flanking genes located within 500 bp of the TF of interest. The categorization system uses the Cluster of Orthologous Groups (COG) classifications to define non-TF genes.
The ORFeome search is an important feature of the P2TF process, which on average allows the recovery of nearly 3% more TFs per replicon. However, some replicons feature extremely large numbers of mispredicted TF genes. For instance, 147 TFs and ODPs were identified in the proteome of Sodalis glossinidius str. ‘morsitans’, but a further 100 (68% of the original complement) were found by searching its ORFeome.
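The 500 bp neighbourhood criterion can be sketched as follows. This is only an illustration under stated assumptions: the Gene record, gap_bp and flanking_genes are hypothetical names, the distance is taken as the number of bases separating two genes on a linear coordinate system, and strand and replicon circularity are ignored.

```python
from dataclasses import dataclass

MAX_GAP_BP = 500  # neighbourhood size stated in the text


@dataclass(frozen=True)
class Gene:
    """Minimal gene record (hypothetical layout)."""
    locus_tag: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def gap_bp(a, b):
    """Number of bases between two genes; 0 if they overlap or abut."""
    return max(0, max(a.start, b.start) - min(a.end, b.end) - 1)


def flanking_genes(tf, genes, max_gap=MAX_GAP_BP):
    """Return the genes, other than the TF itself, lying within max_gap bp of it."""
    return [
        g for g in genes
        if g.locus_tag != tf.locus_tag and gap_bp(tf, g) <= max_gap
    ]


tf = Gene("tf", 1000, 2000)
replicon = [
    Gene("a", 100, 499),    # 500 bp upstream gap: kept
    Gene("b", 100, 498),    # 501 bp upstream gap: dropped
    tf,
    Gene("c", 2100, 3000),  # 99 bp downstream gap: kept
]
neighbours = [g.locus_tag for g in flanking_genes(tf, replicon)]
```

A production version would also need to handle genes that span the origin of a circular replicon, which this sketch deliberately leaves out.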
Performance statistics for P2TF. Genomes and reference studies used for comparison: Bacillus subtilis 168 (Moreno-Campuzano et al., 2006); Lactobacillus acidophilus NCFM (Wilson et al., 2008); Stenotrophomonas maltophilia R551-3 (Wilson et al., 2008); Corynebacterium efficiens YS-314 (Brune et al., 2005).
The seven bacteria with the largest number of TFs are all Actinomycetes, one of the best-represented phyla among completely sequenced bacterial genomes. In particular, the genomes of five Actinomycetes (Streptomyces bingchenggensis BCW-1, Amycolatopsis mediterranei S699, Amycolatopsis mediterranei U32, Streptomyces violaceusniger Tu 4113 and Streptosporangium roseum DSM 43021) each harbour more than 1000 DBD-proteins, and the 806 DBD-proteins in the proteome of another Actinomycete (Kribbella flavida DSM 17836) account for 11.6% of its total protein-coding genes. The relative and absolute numbers of OCSs, RRs, SFs and other TRs vary dramatically between genomes. For the 1,227 organisms whose genomes contain 100 or more DBD proteins, Burkholderia sp. possess the greatest numbers of OCSs at the expense of SFs, while Actinobacteria such as Catenulispora sp. possess large numbers of SFs, and Firmicutes (such as Paenibacillus sp.) possess relatively large numbers of RRs. The Archaebacteria (such as Sulfolobus sp.) tend to rely heavily on TRs for gene regulation, with most Archaebacterial genomes completely lacking SFs and RRs.
A search engine was developed that allows users to request genes on the basis of keywords, including locus-tag, gene name, GI or domain possession. Users can also restrict their search to a genome of interest or a group of genomes belonging to the same taxon. In addition, the database can be queried through a BLAST search, enabling users to identify homologs of a query sequence and to perform multiple sequence alignments using the MUSCLE program. The resulting multiple sequence alignments can then be viewed, edited and analysed through the incorporated Jalview applet. For instance, alignments can be edited, coloured, sorted and exported for publication, and phylogenetic trees can be produced from the alignment, providing insights into evolution and the conservation of amino acid residues.
The complement of domains within a protein provides information regarding its biological function, mode of action, and evolutionary heritage. We developed a categorization system which allocates TFs to families according to their domain architecture. Thus CarA is a member of the MerR family as it contains a MerR DBD.
Proteins with multiple DNA-binding domains are well characterised in eukaryotes, for instance HMG proteins and Pax/homeodomain proteins [26, 27]. In such proteins the multiple DNA-binding domains can result in DNA cross-linking and chromatin compaction, and can enable both specific and modular regulation of gene expression by single proteins. Park et al. found that in fungal genomes many TFs carry more than one type of DNA-binding domain, suggesting they interact with multiple regulatory sequences. This also seems to be the case for prokaryotes, as P2TF identifies TFs which possess multiple DNA-binding domains, in many cases of different types (where TFs contain more than one type of DNA-binding domain, they are assigned to a family in P2TF according to the domain which gives the lowest E-value). This was expected for SFs, which generally have two different DBDs, but not for other TFs. For instance, MXAN_0631 has a domain architecture of HTH_8, HTH_AraC, and ABC0986 has a domain architecture of HTH_IclR, Mga. Two proteins from Desulfotomaculum kuznetsovii DSM 6115, Desku_0047 and Desku_2016, possess three DNA-binding domains, all of the same type (HTH_Xre). We were unable to find examples in the literature of experimentally-characterised prokaryotic TFs which contain multiple DNA-binding domains. The observation of large numbers of such proteins encoded in bacterial genomes suggests that they represent a commonplace but unappreciated group of regulators in these organisms.
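The family-assignment rule for proteins carrying more than one type of DBD (the family of the domain with the lowest E-value is used) can be sketched in a few lines of Python. The function name, the E-values and the domain-to-family mapping below are illustrative assumptions, not values taken from P2TF.

```python
def assign_family(dbd_hits, domain_to_family):
    """Pick a family from the DBD hit with the lowest E-value.

    dbd_hits: list of (domain_name, evalue) pairs for one protein.
    domain_to_family: mapping from domain name to family name; domains
    missing from the mapping fall back to the domain name itself.
    """
    if not dbd_hits:
        return None  # no DBD, so the protein is not assigned to a family
    best_domain, _ = min(dbd_hits, key=lambda hit: hit[1])
    return domain_to_family.get(best_domain, best_domain)


# Illustrative mapping and E-values (assumptions for this example only).
mapping = {"HTH_AraC": "AraC"}
two_dbd_protein = [("HTH_8", 1e-4), ("HTH_AraC", 1e-12)]
family = assign_family(two_dbd_protein, mapping)
```

With these made-up E-values the HTH_AraC hit is the stronger of the two, so the protein would be placed in the AraC family; swapping the E-values would place it with HTH_8 instead.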
There is a strong taxonomic bias affecting the families of TFs that are found in particular genomes. For taxonomic classes with 100 or more DBD proteins across their sequenced genomes, some TF families were found to be unusually common in particular classes. For instance 26% of the TFs in the Epsilonproteobacteria belong to the OmpR family, 18% of Methanomicrobia TFs are ArsR family members, 25% of Betaproteobacterial TFs are LysR, and 19% of TFs in Actinobacteridae genomes belong to the TetR family. Even at lower taxonomic levels there are considerable differences between genomes in terms of their TF family composition. Within the class Actinobacteria, 19% of genus Bifidobacterium TFs belong to the LacI family, although for all other Actinobacterial genera LacI TFs represent less than 4% of TFs. Within the Gammaproteobacterial class, LuxR family TFs account for 17% of Photorhabdus and Xenorhabdus TFs, but at most 6% of the TFs in other Gammaproteobacterial genera.
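Percentages such as those quoted above are simple family shares within a taxon. A minimal sketch, assuming a flat list of family labels (one per TF) is already available for the taxon; the function name and the toy input are hypothetical.

```python
from collections import Counter


def family_shares(families):
    """Return the fraction of TFs belonging to each family.

    families: iterable of family labels, one per TF in the taxon.
    """
    counts = Counter(families)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {family: n / total for family, n in counts.items()}


# Toy input: 26 OmpR-family TFs out of 100 TFs in total.
shares = family_shares(["OmpR"] * 26 + ["other"] * 74)
```

Comparing such shares between taxa at the same rank is what reveals the biases described in the text, for example a family that is common in one genus and rare in its sister genera.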
The OCS LexA is involved in DNA repair as part of the SOS response in E. coli. Most of the LexA family TFs contain a LexA_DNA_bin domain and another domain, Peptidase_S24 (pfam00717, which was not originally specified in P2TF). Searches for TFs in P2TF containing the Peptidase_S24 domain gave 1978 TFs, of which 1078 were associated with a LexA_DNA_bin domain (and therefore classified as members of the LexA family). The other 900 TFs were associated with HTH_XRE or HTH_3 (Xre family) domains, and many of these are known to be true LexA repressors, for instance in the radiation-resistant bacterium Deinococcus radiodurans. Therefore it seems that the identification of LexA repressors is most probably linked to the association of a Peptidase_S24 domain with any DBD, not just with the LexA_DNA_bin domain. HTH_XRE and HTH_3 domains are very similar (98%), so proteins with either domain are now grouped together in P2TF within the Xre family. Similarly, any proteins with a DBD and a Peptidase_S24 domain are now classified as LexA proteins.
Experimental biologists need diverse types of information to guide their experiments and to interrogate large -omics datasets. For instance, lists of TFs within a genome can be vital in identifying candidate genes for disruption, for shortlisting potentially redundant orthologues, and thus for inferring the genetic mechanism underlying observed phenotypic data. In addition, the ability to routinely generate -omics datasets means that experimental biologists need tools for the rational exploration of such datasets, for instance by filtering data against rigorously-defined subsets of TFs. Furthermore, understanding the evolutionary events which have led to contemporary sets of proteins can provide important insights into the functioning of contemporary regulatory networks, especially by identifying strain- and species-specific changes in TF complements. The P2TF resource provides data and analysis tools which can assist all such applications and many more besides.
P2TF is an open resource for biologists and data are presented for user exploration as an interactive web interface. The database can be queried in a variety of ways: keyword and sequence-based searches are enabled in addition to browsing, and the resulting data can be exported in user-specified formats. Results from BLAST and domain architecture searches can be aligned using MUSCLE and the alignments edited, viewed and analysed with Jalview. P2TF uses a domain architecture-based classification scheme to categorize TFs into families, and a wide range of information is provided for each P2TF entry, including PubMed links. A P2TF cart is also provided to assist analysis by users. P2TF thus provides experimental biologists and bioinformaticians with a high-quality dataset for the investigation of prokaryotic TFs.
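The revised LexA and Xre rules described above can be written as a small decision function. This is a sketch rather than the P2TF code: the function name is hypothetical, the DBD set is an illustrative subset containing only the domains named in this paragraph, and proteins whose DBDs fall outside that subset are simply left unclassified.

```python
# Illustrative subset of DBDs; the real P2TF domain pool is much larger.
DBD_DOMAINS = frozenset({"LexA_DNA_bin", "HTH_XRE", "HTH_3"})
XRE_DOMAINS = frozenset({"HTH_XRE", "HTH_3"})  # grouped into one Xre family


def classify_lexa_xre(domains, dbd_domains=DBD_DOMAINS):
    """Classify a protein as 'LexA', 'Xre' or None from its domain names."""
    present = set(domains)
    dbds = present & set(dbd_domains)
    if not dbds:
        return None  # no DBD: not a TF under this scheme
    if "Peptidase_S24" in present:
        return "LexA"  # any DBD combined with Peptidase_S24
    if "LexA_DNA_bin" in dbds:
        return "LexA"  # LexA DBD without the peptidase domain
    if dbds <= XRE_DOMAINS:
        return "Xre"   # HTH_XRE and HTH_3 are treated as one family
    return None        # other DBDs are outside the scope of this sketch


lexa_like = classify_lexa_xre(["HTH_3", "Peptidase_S24"])
xre_like = classify_lexa_xre(["HTH_XRE"])
```

Under this rule a protein with an HTH_3 domain and a Peptidase_S24 domain is classified as LexA, matching the reclassification described in the text, whereas the same DBD on its own remains in the Xre family.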
The earlier description of how the P2TF LexA classification criteria developed illustrates how users can have input into the functionalities of P2TF. Innovations to the current P2TF scheme such as the inclusion of new domain profiles, and changes to the classification scheme can be readily implemented at the request of users. The developers hope that the user community will help develop future iterations of P2TF for everyone’s benefit.
P2TF is publicly available at http://www.p2tf.org, and runs on all web browsers tested, including Mozilla Firefox, Internet Explorer, Google Chrome and Apple Safari.
We are grateful to the DSV/IBITEC-S/GIPSI team, and particularly to Arnaud Martel and Jean-Marc Le Failler, for the hosting server installation.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.