In prokaryotes, biological processes tend to be regulated at the level of transcription, with subsets of genes/operons being up- or down-regulated by specific DNA-binding proteins known as transcription factors (TFs). TFs can be divided into a few major categories, including sigma factors (SFs), one-component systems (OCSs) and response regulators (RRs), and the DNA-binding activity of these proteins is often itself regulated. SFs are the specificity-conferring sub-units of RNA polymerase holoenzymes, and they direct the transcription machinery towards particular promoter sequences [1]. The activity of SFs is often regulated by accessory proteins such as anti-SFs, which bind to and inhibit specific SFs. In addition to DNA-binding domains, OCSs possess sensory domains, which modulate DNA-binding activity according to the presence/absence of a particular stimulus [2]. Finally, the DNA-binding activity of RRs is regulated by the phosphorylation state of their receiver domains, which can be phosphorylated by partner receptor kinase proteins called histidine kinases (HKs). Together, HKs and their partner RRs (including non-DNA-binding RRs) form two-component systems (TCSs), which are the dominant phosphorylation-dependent signal transduction pathways of prokaryotes [3].
A typical prokaryotic genome encodes around 5% TFs [4] and 1.5% TCS proteins [5], and for most regulatory proteins (RPs), multiple homologues are usually found in each genome. Therefore, for RPs, sequence similarity does not necessarily imply a similar functional role, and annotating RPs by sequence similarity has introduced many errors.
Over-specific annotation is a common problem. For example, the E. coli PhoB/OmpR family of RRs regulates diverse processes, including potassium homeostasis (KdpE), copper tolerance (CusR) and trimethylamine N-oxide respiration (TorR), in addition to phosphatase expression (PhoB) and osmoregulation (OmpR) [6–10]. However, multiple PhoB/OmpR family members in a genome are sometimes ascribed the same role. For example, Clostridium botulinum B str. Eklund 17B encodes 28 OmpR family RRs, of which seven are annotated as regulating phosphatase expression, and 11 are annotated as being VanR, which regulates vancomycin resistance [11].
Due to intrinsic problems in defining the physiological function of regulatory proteins by sequence homology, functional annotation by sequence similarity has now largely been superseded by categorisation on the basis of domain architecture [12–14]. In this manner, RPs can be divided into families, and family membership then correlates with mechanism of action rather than biological function. Several online databases now provide the results of such classification approaches as applied to whole genomes, for example P2CS [15], P2TF [4], MiST2 [16] and DBD [17].
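As an illustration of categorisation by domain architecture, the following minimal Python sketch assigns a protein to a broad RP category from the set of domains it contains. The domain labels (`transmitter`, `receiver`, `sigma`, `dna_binding`, `sensory`) and the decision order are assumptions made for this example, loosely following the category descriptions given above; they are not the rules used by P2CS, P2TF, MiST2 or DBD.

```python
from typing import Iterable


def classify_by_domains(domains: Iterable[str]) -> str:
    """Assign a broad regulatory-protein category from a set of domain labels.

    The labels and the order of the checks are illustrative assumptions,
    not the criteria of any published database.
    """
    found = set(domains)

    # A kinase (transmitter) domain marks a histidine kinase, even when a
    # receiver domain is also present in the same protein.
    if "transmitter" in found:
        return "HK"
    # A receiver domain without a transmitter marks a response regulator,
    # whether or not a DNA-binding domain accompanies it.
    if "receiver" in found:
        return "RR"
    if "sigma" in found:
        return "SF"
    # One-component systems pair a DNA-binding domain with a sensory domain.
    if "dna_binding" in found and "sensory" in found:
        return "OCS"
    if "dna_binding" in found:
        return "other TF"
    return "not classified as RP"


if __name__ == "__main__":
    examples = {
        "protein_A": ["receiver", "dna_binding"],
        "protein_B": ["sensory", "transmitter"],
        "protein_C": ["dna_binding", "sensory"],
    }
    for name, doms in examples.items():
        print(name, classify_by_domains(doms))
```

Because the category depends only on which domains are present, two homologues with the same architecture receive the same family label regardless of the physiological process each one regulates.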
However, problems with RP annotation remain. Many RPs contain multiple domains, and some domains are found in multiple categories of RP. This has led historically to the mis-annotation of many proteins. For instance, SAB1964 is an RR from Staphylococcus aureus RF122, yet it is annotated as a ‘two component sensor protein’, while YPA_3835 is a HK from Yersinia pestis Antiqua, which is annotated as an ‘ATPase-like ATP-binding protein’. Of all proteins currently classified as RRs in the P2CS database [15], 1.5% were originally described in some way as ‘sensor kinase’ proteins in the annotated genome files retrieved from RefSeq/GenBank. This problem is exacerbated by the current lack of a community-defined consensus set of categorisation criteria, or even a consensus naming system, for multi-domain RPs, although such a scheme has been established for one subset of RPs (RRs) by Galperin [13, 14].
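The kind of discrepancy described above can be detected automatically by comparing a domain-based category with the free-text product description from the genome file. The sketch below is a simple keyword heuristic written for illustration only; the keyword lists and the function name `annotation_status` are assumptions, and the approach is not the method used to derive the P2CS figure.

```python
# Keywords that contradict, or agree with, a domain-based category.
# Both tables are illustrative assumptions, not curated vocabularies.
CONFLICTING = {
    "RR": ("sensor", "kinase"),
    "HK": ("response regulator",),
}
SUPPORTING = {
    "RR": ("response regulator",),
    "HK": ("kinase",),
}


def annotation_status(domain_class: str, description: str) -> str:
    """Compare a domain-based class with a free-text product description."""
    text = description.lower()
    if any(word in text for word in CONFLICTING.get(domain_class, ())):
        return "conflicting"
    if any(word in text for word in SUPPORTING.get(domain_class, ())):
        return "consistent"
    # The description neither contradicts nor confirms the category.
    return "uninformative"


if __name__ == "__main__":
    records = [
        ("SAB1964", "RR", "two component sensor protein"),
        ("YPA_3835", "HK", "ATPase-like ATP-binding protein"),
    ]
    for locus, cls, desc in records:
        print(locus, cls, annotation_status(cls, desc))
```

With the two examples from the text, SAB1964 is flagged as conflicting and YPA_3835 as uninformative. A heuristic of this kind would mislabel descriptions of hybrid proteins that legitimately mention both a kinase and a response regulator, so flagged entries would still need manual inspection.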
Due to their multiplicity within genomes and their multi-domain architectures, RPs are non-trivial to identify and annotate. Currently, the annotation of regulatory genes/proteins in individual genomes and databases is often idiosyncratic, misleading or wrong, confounding between-genome comparisons, and naming conventions are also typically different between genomes/databases. There is consequently a profound need for the adoption of a consistent and harmonised categorisation and annotation system for RPs, which can be applied to any sequence dataset, whether newly derived sequences needing annotation, or previously annotated sequences which might benefit from re-annotation [3].
We have therefore developed P2RP (Predicted Prokaryotic Regulatory Proteins), primarily to help increase the consistency of RP (re-)annotation in published genomes, and to support experimental biologists who wish to investigate regulatory genes in their novel sequence data. P2RP accepts two types of input: DNA and protein sequences. For nucleotide queries there is an initial gene prediction step (using MED-Start) to generate a proteome, although .gbk (GenBank) files can also be supplied. Predicted and supplied proteomes are then screened for the presence of particular TF/TCS domains, and proteins are categorised and annotated according to their domain architecture [15, 18]. Every user query is given an ID, which allows later retrieval of results, and the results of the P2RP process can be viewed on a web server interface page or downloaded in a variety of user-specified formats. P2RP can be accessed at http://www.p2rp.org and is free and open to all users, with no login requirement.