BLAST screening of chlamydial genomes to identify signature proteins that are unique for the Chlamydiales, Chlamydiaceae, Chlamydophila and Chlamydia groups of species

Background Chlamydiae species are of much importance from a clinical viewpoint. Their diversity, in terms of both the number of species and their clinical involvement, is presently believed to be significantly underestimated. The obligate intracellular nature of chlamydiae has also limited their genetic and biochemical studies. Thus, it is of importance to develop additional means for their identification and characterization. Results We have carried out analyses of available chlamydiae genomes to identify sets of unique proteins that are either specific for all Chlamydiales genomes, or different Chlamydiaceae family members, or members of the Chlamydia and Chlamydophila genera, or those unique to Protochlamydia amoebophila, but which are not found in any other bacteria. In total, 59 Chlamydiales-specific proteins, 79 Chlamydiaceae-specific proteins, 20 proteins each that are specific for Chlamydia and for Chlamydophila, and 445 ORFs that are Protochlamydia-specific were identified. Additionally, 33 cases of possible gene loss or lateral gene transfer were also detected. Conclusion The identified chlamydiae-lineage specific proteins, many of which are highly conserved, provide novel biomarkers that should prove of much value in the diagnosis of these bacteria and in exploration of their prevalence and diversity. These conserved protein sequences (CPSs) also provide novel therapeutic targets for drugs that are specific for these bacteria. Lastly, functional studies on these chlamydiae or chlamydiae subgroup-specific proteins should lead to important insights into lineage-specific adaptations with regard to development, infectivity and pathogenicity.


Background
The Chlamydiales are clinically important intracellular parasites and endosymbionts of eukaryotic hosts, and cause a wide spectrum of diseases in humans and animals [1]. Recently, the taxonomic classification of this group was revised and it currently contains at least 4 distinct families (viz. Chlamydiaceae, Simkaniaceae, Parachlamydiaceae and Waddliaceae) based on >90% 16S rRNA identity and a common developmental cycle [2,3]. While chlamydial infections are best known for the genitourinary, ocular and respiratory infections they cause in humans, the chlamydiae are also quite common in many wild and domestic animals with potential for severe zoonotic disease [4]. The newly described chlamydial species Simkania and Parachlamydia have also been found associated with human respiratory infections, while Waddlia has been implicated in abortion in bovines [5][6][7]. These revisions reflect important changes in perceptions regarding chlamydial diversity in line with recent discoveries of novel animal isolates (Waddliae and Simkaniae) and more strikingly of chlamydia-related endosymbionts or "environmental chlamydiae" (i.e. Parachlamydiae) infecting free-living amoebae [8][9][10]. Free-living amoebae, important components of soil and water ecosystems, are being increasingly recognized as vectors for various bacterial human pathogens [4,7,11,12].
Species are currently recognized as belonging to the Chlamydiales according to 16S and 23S rRNA similarities and pathogenic traits [2]. As these organisms are antigenically and genetically diverse, it is important to develop additional means to identify and distinguish them unambiguously from all other bacteria. The intracellular nature of chlamydiae species has been a hindrance in understanding their genetics, physiology and development. Hence, other means that can provide some insight in these regards are needed. In recent years, the genomes of several chlamydiae species have been sequenced including five Chlamydiaceae members (viz. Chlamydia (Chl.) trachomatis, Chl. muridarum, Chlamydophila (Chlam.) pneumoniae, Chlam. caviae and Chlam. abortus) [13][14][15][16] and the environmental chlamydiae species Protochlamydia amoebophila (Parachlamydiaceae) [11]. Until recently when the P. amoebophila genome (formerly Parachlamydia UWE25; [17]) sequence was released, the information for chlamydiae-like organisms was practically non-existent, and it was virtually impossible to establish a core set of genes that are common to various Chlamydiales species. The Protochlamydia genome (2.41 Mb) was found to be about twice the size of various Chlamydiaceae species (1.04-1.23 Mb) and it showed a number of important differences including the presence of a complete TCA cycle and highly modified gene order [11]. The availability of these genomes has made it possible to carry out comparative studies. Horn et al. [11] reported 711 open reading frames (ORFs) or coding sequences that were shared among all chlamydiae genomes. However, many of these proteins have homologs in bacteria outside of the Chlamydiales and the proteins which were unique to only chlamydiae were not examined. Similarly, other studies which have determined proteins common to all Chlamydiaceae genomes also included peptides which were ubiquitous among bacteria [13][14][15][16]. While the work of Kalman et al. [18] revealed a number of potentially chlamydiae-specific genes, these studies were based on only 2 genomes, Chlam. pneumoniae and Chl. trachomatis, and the study was carried out when sequence information for other bacteria was limited.
Our recent work on comparative genomics is aimed at identifying novel and distinctive molecular characteristics of different groups of bacteria, including chlamydiae, which can be used for their identification, classification as well as genetic and biochemical studies [19][20][21][22][23] (see also [24]). We have recently described a number of molecular signatures consisting of conserved inserts and deletions (i.e. indels) in widely distributed proteins that are distinctive characteristics of all available chlamydiae species and are not found in any other bacteria [25]. In the present study, we describe a different kind of taxonomic marker, consisting of whole proteins that are specific for different groups of chlamydiae species, which provides additional powerful means for identifying these groups of species and for understanding their evolution as well as physiological characteristics. In this work, we have carried out systematic BLAST searches on all proteins or ORFs in the genomes of a number of chlamydiae species to compile profiles of proteins that are uniquely shared either by all Chlamydiales species or by particular subgroups (family or genera) within this phylum and can likely be used as distinct molecular markers for these groups. The study of these chlamydial group-specific proteins should prove instrumental in the discovery of novel physiological characteristics that are uniquely shared by this important group of pathogens.

Results
The present study was undertaken to identify unique proteins (or ORFs) which are found in the chlamydiae species at various taxonomic depths. Kalman et al [18] have previously examined the presence of chlamydiae-specific proteins in 2 sequenced genomes, Chl. trachomatis serovar D and Chlam. pneumoniae. Their study documented a number of proteins such as EUO, Gp6D, IncsB/C, LtuA/B, a number of Pmps and hypothetical proteins, which were specific for these species. However, because the number of species examined was so small, it was unclear how broadly these proteins were distributed among different chlamydiae species. At that time there was also no sequence information for the proteins from any chlamydiae-like organisms. Thus, based on this earlier study it was unclear whether the identified peptides were distinctive of different Chlamydiales, or only particular subsets of chlamydial species [18].
In order to identify chlamydiae-specific genes/proteins that are present at different phylogenetic depths, each ORF from the Chl. trachomatis and Chlam. caviae genomes was individually BLAST searched against all available databases. These species were chosen in order that sequences from both the Chlamydia and the Chlamydophila genera were represented in the searches. Results of the BLAST searches were inspected to identify proteins where either all of the observed hits were from other chlamydial species, or where all hits showing significant homology to the query protein (see Methods section) were members of the Chlamydiales. These studies have resulted in the identification of a large number of signature proteins, which are uniquely found in chlamydiae species. Some of these proteins are present in all available chlamydial sequences, indicating that they are likely specific for the entire Chlamydiales phylum. Other proteins were found to be specific for only the Chlamydiaceae family or the two genera, Chlamydia and Chlamydophila, which comprise this family. Additionally, large numbers of proteins that are unique to P. amoebophila were also identified. Most of the identified chlamydiae-specific proteins or ORFs are annotated as hypothetical, hence their functions remain unknown. The genes for most of these proteins are scattered throughout the Chl. trachomatis and Chlam. caviae genomes; however, a number of these hypothetical proteins were also found in clusters. Although it is not clear whether these genes form operons, the presence of these genes in clusters suggests that they could be involved in related functions [26]. In the description of these proteins that follows, the "CT", "CCA" and "PC" parts of the descriptors refer to the source of the original query protein sequence from the Chl. trachomatis (CT), Chlam. caviae (CCA) and P. amoebophila (PC) genomes, respectively.
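The inspection step described above can be sketched in code. The following is a minimal illustration only, not the authors' pipeline: the `Hit` record, the genus labels and the E-value cutoff of 1e-3 are all assumptions made for the example (the actual significance criterion is the one given in the Methods section).

```python
from typing import Iterable, NamedTuple, Set


class Hit(NamedTuple):
    """One BLAST hit, reduced to the two fields used here (hypothetical record)."""
    subject_genus: str  # genus of the organism the hit comes from
    evalue: float       # E value reported for the hit


# Assumed cutoff, for illustration only; not the threshold used in the study.
EVALUE_CUTOFF = 1e-3

# Genera with sequenced genomes that the text assigns to the Chlamydiales.
CHLAMYDIALES: Set[str] = {"Chlamydia", "Chlamydophila", "Protochlamydia"}


def is_group_specific(hits: Iterable[Hit], group: Set[str],
                      cutoff: float = EVALUE_CUTOFF) -> bool:
    """Return True if every significant hit comes from a genus in `group`.

    A query with no significant hits at all is not called group-specific.
    """
    significant = [h for h in hits if h.evalue <= cutoff]
    return bool(significant) and all(h.subject_genus in group for h in significant)


# Example with made-up hits: the weak hit outside the group is ignored
# because its E value is above the cutoff.
example_hits = [
    Hit("Chlamydia", 1e-80),
    Hit("Chlamydophila", 3e-45),
    Hit("Protochlamydia", 2e-12),
    Hit("Escherichia", 0.5),
]
print(is_group_specific(example_hits, CHLAMYDIALES))
```

Passing a narrower `group` (for example only the two Chlamydiaceae genera) applies the same test at a shallower taxonomic depth.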

Chlamydiales-specific proteins
Proteins were considered Chlamydiales-specific if they were present in all sequenced Chlamydia and Chlamydophila genomes, as well as P. amoebophila, but were absent in all other bacteria. A total of 59 Chlamydiales-specific proteins were identified (Table 1). Of these, 49 peptides were annotated hypothetical indicating that these organisms have created new strategies allowing them to pursue a parasitic lifestyle with a minimum gene complement. Upon examination of the positions of various Chlamydiales-specific proteins within the Chl. trachomatis and Chlam. caviae genomes, some of them were found to be clustered on the chromosomes (those with adjacent CT numbers). A few of these gene clusters are discussed below.

[Table note: These proteins are uniquely found in various Chlamydiaceae species but not detected in Protochlamydia or other bacterial groups. Other details as in Table 1. **OmcA is an annotated ORF in the P. amoebophila genome; however, it could not be detected using Chl. trachomatis as probe via BLASTp and so was not considered a Chlamydiales protein. SRP, sulfur rich protein.]
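The idea of spotting clusters from adjacent CT numbers, as mentioned above, can be illustrated with a short sketch. It assumes that locus tags consist of a letter prefix followed by an integer (tags such as CT444.1 would need extra handling) and that a "cluster" means a run of tags whose numbers differ by at most `max_gap`; both the function names and the `max_gap` parameter are hypothetical and not taken from the study.

```python
import re
from typing import List


def locus_number(locus: str) -> int:
    """Extract the numeric part of a locus tag such as 'CT546' or 'CCA00184'."""
    match = re.fullmatch(r"[A-Za-z]+(\d+)", locus)
    if match is None:
        raise ValueError(f"unsupported locus tag: {locus!r}")
    return int(match.group(1))


def find_clusters(loci: List[str], max_gap: int = 1) -> List[List[str]]:
    """Group locus tags from ONE genome into runs of neighbouring numbers.

    Only runs with at least two members are reported.
    """
    ordered = sorted(set(loci), key=locus_number)
    clusters: List[List[str]] = []
    current: List[str] = []
    for locus in ordered:
        # Start a new run when the distance to the previous tag is too large.
        if current and locus_number(locus) - locus_number(current[-1]) > max_gap:
            if len(current) > 1:
                clusters.append(current)
            current = []
        current.append(locus)
    if len(current) > 1:
        clusters.append(current)
    return clusters


# A few of the Chlamydiales-specific loci named in the text.
loci = ["CT546", "CT547", "CT548", "CT131", "CT273", "CT274", "CT443"]
print(find_clusters(loci))
```

Adjacency of locus numbers is only a proxy for physical proximity, and it says nothing about whether the genes are co-transcribed.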
Of the proteins annotated with a predicted function, only two represent outer membrane proteins (CT131 and CT546), which is surprising in light of large expansion of genes encoding chlamydial polymorphic membrane proteins [15,27]. One of these proteins CT546 is in a cluster with two other Chlamydiales-specific proteins (CT547 and CT548), whose functions are not known, but they could be involved in related functions [26]. Another important Chlamydiales-specific protein that is involved in the formation of their cell envelope is CT443 (OmcB). OmcB is not actually intercalated into a membrane. However, this protein is 4.4% cysteine, water-soluble and crosslinked to MOMP and another protein OmcA (CT444) to form the membrane complex [28,29]. OmcA and OmcB are two of the very few chlamydiae-specific proteins that have been functionally characterized [28,29]. The OmcA (CT444) protein is 15% cysteine and so has homology only to proteins with high cysteine content. It has a classic SPII signal sequence and is a proven lipoprotein [28,29]. OmcA and OmcB are transcribed together [28,29] and the proteins that are annotated as OmcA and OmcB are present in all genome sequenced chlamydiae [13][14][15][16], including P. amoebophila [11]. However, in BLASTp searches using either the Chl. trachomatis or Chlam. caviae OmcA (CCA00184) homologs as the query proteins (with or without the low complexity filter), an OmcA homolog was not detected in P. amoebophila although a protein PC0617 (accession number YP_007016) has been annotated as its OmcA homolog. Hence, by the criteria used in this study we regard CT444 as a Chlamydiaceae-specific and not a Chlamydiales-specific protein, and we have included it in Table 2. OmcA and OmcB annotation is presently based on the high cysteine content of the predicted protein products, as the cysteine residues are the basis for OmcA and OmcB function in Chlamydiaceae. 
Because OmcA and OmcB annotation has changed many times in the literature over the last 15 years [28,29], these proteins are best located by BLAST search. It should be mentioned that besides the Chlamydiales a homolog of CT443 with low E value (3e-25) is also found in Rhodopirellula baltica (Table 1). The R. baltica protein (accession number CAD72259) is larger in length (907aa) in comparison to the chlamydiae homolog and the sequence similarity between these two proteins is mainly seen in the C-terminal region. Importantly, the R. baltica homolog of OmcB contains no cysteine, hence the possible significance of the presence of this homologous protein in R. baltica (Planctomycetes) is presently unclear. A close relationship of chlamydiae to the Planctomycetes, which also lack peptidoglycan in their cell wall, has been noted in earlier studies [30,31]. Although such a relationship was not supported in a later study [32], in view of the presence of this commonly shared protein, it would be of interest to reexamine the relationship between these groups based on genomic sequences.
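Because the OmcA and OmcB annotation discussed above rests on cysteine content (4.4% and 15% in the text), the underlying calculation is sketched here. The function name is hypothetical and the sequences in the example are toy strings, not real chlamydial proteins.

```python
def cysteine_percent(sequence: str) -> float:
    """Return the percentage of residues in a protein sequence that are cysteine.

    Non-letter characters (whitespace, gap symbols, a trailing '*') are ignored.
    """
    residues = [aa for aa in sequence.upper() if aa.isalpha()]
    if not residues:
        raise ValueError("sequence contains no residues")
    return 100.0 * residues.count("C") / len(residues)


# Toy sequence with 1 cysteine among 10 residues.
print(cysteine_percent("ACDEFGHIKL"))
```

A simple composition measure like this can flag cysteine-rich candidates, but as the text notes for the R. baltica homolog of OmcB, sequence similarity and cysteine content do not necessarily go together.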
The finding of only a limited number of outer membrane proteins unique to various Chlamydiales suggests that the chlamydiae species most likely favour lineage-specific mechanisms or surface receptors for interacting with their different hosts rather than sharing a homologous system. Cell surface interactions are not limited to the outer membrane proteins. The LysM domain is a widespread protein module, which the available evidence suggests is a general peptidoglycan-binding module in cell surface proteins associated with cell wall degradation [33]. A LysM domain protein (CT474) was found to be common in all members of the Chlamydiales sequenced to date. The presence of this protein is intriguing as chlamydiae are not known to contain detectable levels of peptidoglycan (PG), despite housing genes for almost a complete PG biosynthetic pathway [34][35][36].
In addition to interacting with their hosts via cell surface molecules, the chlamydiae are also known to utilize a type III secretion system [37]. Genes encoding the basic type III secretory apparatus have been identified in all sequenced Chlamydia, Chlamydophila and Protochlamydia genomes [11][13][14][15]. These surface projections are used to inject chlamydiae-derived proteins into the host cytosol from within the inclusion body [38]. Orthologs or paralogs to effector proteins commonly found in other bacteria utilizing this apparatus have not been identified in chlamydial systems although they are likely to be critical for transducing the signals required to modulate host cell function. A type III secretion chaperone (CT274) was identified in all Chlamydiales members, which may facilitate a number of processes such as adhesion, cell signaling, transport and the perpetuation of infection. Another Chlamydiales-specific protein, CT273, was found in a tight cluster with CT274. The function of CT273 is not known, but it could also play a role in type III secretion [26]. Both of these genes lie in close proximity to CT271, another Chlamydiales-specific hypothetical protein which resides in the chromosome in the opposite orientation of the above mentioned ORFs.

[Table note: These proteins are uniquely found in species belonging to the Chlamydia genus and are absent in the Chlamydophila and Protochlamydia.]
Due to the reduction in metabolic capabilities of the chlamydiae, it is likely that many different permeases would be necessary for acquiring sufficient levels of different substrate molecules [18,39]. Members of the YjgP/YjqE family are predicted integral membrane proteins containing 6 transmembrane regions which are predicted to function as permeases [40]. Although YjgP/YjqE proteins are distributed throughout the major domains of life, a novel protein which is indicated to be related to this family, CT838, is uniquely found in various Chlamydiales. The iron-sulfur protein ferredoxin (Fd) is an electron acceptor which participates in redox-based metabolism in plastids, mitochondria and bacteria [41]. A novel predicted ferredoxin (CT312) is uniquely present in various Chlamydiales species, suggesting that it serves a common, evolutionarily adapted function in these groups of species.
Another Chlamydiales-specific protein, EUO (CT446), encoded by the euo gene (early upstream open reading frame), is found just upstream of the 2 genes encoding the lysine-rich proteins Hc1 and Hc2, which are highly similar to the eukaryotic H1 histone [42,43]. One study has shown that Hc1 binds to DNA, inducing the nucleoid compaction observed late in the chlamydial developmental cycle when vegetative reticulate bodies differentiate into the metabolically inert infectious particles known as elementary bodies [42]. The EUO protein, which is expressed early in the chlamydial cycle, has been shown to specifically cleave the C-terminal portion of Hc1, initiating dissociation of DNA-Hc1 complexes and DNA decondensation [42][43][44]. Little is known about the signals that trigger the different events throughout the chlamydial life cycle. However, the unique presence of the EUO protein (CT446) in various Chlamydiales species suggests a common mechanism for chromatin remodeling.
The Chl. trachomatis locus CT583 encodes Gp6D, a genomic paralog (31% similarity) of a plasmid-borne virulence factor pGp6D [45]. Three of the four sequenced Chlamydiaceae species (except Chlam. abortus) [2] have been found to contain extrachromosomal DNA elements (cryptic plasmids) believed to encode genes which might play a role in pathogenicity and/or modulating virulence [46]. P. amoebophila does not contain an extrachromosomal plasmid. The chlamydial plasmids encode 8 ORFs of significance (>100 amino acids), and although 5 have been characterized, the function of pGp6D is unknown [45]. While the plasmid-borne pGp6D is only 102aa long, the genomic Gp6D homolog found in all Chlamydiales is more than twice that size (263aa). The high degree of similarity among plasmids from both Chlamydia and Chlamydophila species suggests that an ancestral plasmid was acquired by the chlamydial lineage, perhaps before divergence of the two Chlamydiaceae genera. That plasmid-less chlamydiae are rare suggests that plasmid loss is selected against, and hence that the plasmids do have a function specific to this family, which is apparently absent in chlamydiae-like organisms. All of the proteins in this category, which are uniquely present in various Chlamydiales, have likely evolved in the last common ancestor of all chlamydiae and were subsequently passed on to various descendant species through vertical descent.

[Table note: These chlamydiae-specific proteins are either missing in some species from particular groups or show unexpected presence in some other groups. The E values for these proteins are not listed.]

Chlamydiaceae-specific proteins
The proteins in this grouping corresponded to those which were found strictly in species belonging to the Chlamydia and Chlamydophila genera, but whose homologs were not detected in P. amoebophila. Interestingly, among this category of proteins, 12 distinct gene clusters were located (described below). Of the 79 Chlamydiaceae-specific proteins identified in this work ( [...]

[Table note: The temporal expression patterns of these proteins are based upon the microarray analysis of Nicholson et al. [63]. According to the microarray profiles, significant differences were seen in the expression pattern of genes for these proteins at 4 different time points (6-24 hours post infection, 12-24 h.p.i., 18-24 h.p.i., and 24-36 h.p.i.) which correspond to different phases of the chlamydial developmental cycle where key phenotypic changes take place. It is of interest that the majority of the temporally regulated Chlamydiales and Chlamydiaceae-specific genes are expressed between 12-24 h.p.i. and 24-36 h.p.i. Such coordinate regulation suggests they play essential roles during these phases of the chlamydial cell cycle. Other chlamydiae-specific proteins which are not listed in this Table were not found to show significant differences in their temporal expression in the study by Nicholson et al. [63]. Abbreviation: h.p.i., hours post infection.]

[...] and CT080 respectively) are two other proteins, which are uniquely shared by members of the Chlamydiaceae family. Neither of these proteins is expressed until the RBs begin to reorganize to EBs. These molecules do not bear strong homology to any other known protein and their functions are not known. The ltuB gene exhibits unusual stability properties in that the 5' end appears to break down, while the 3' end accumulates as a stable fragment of about 240 bases [47]. It has been hypothesized that the shorter ltuB RNA functions in some way in the late stage of the chlamydial developmental cycle [47]. These unique Chlamydiaceae-specific proteins are likely playing distinctive stage-specific roles in the developmental cycles of the Chlamydia and Chlamydophila genera. The ltuB gene is found in close proximity to a cluster containing two other hypothetical Chlamydiaceae-specific proteins (CT082 and CT083).
A vast array of Chlamydiaceae-specific membrane proteins have been identified in our study, including the major outer membrane protein (MOMP or OmpA, CT681), the polymorphic membrane proteins PmpA (CT412) and PmpB (CT413), which lie in a cluster, and the putative outer membrane proteins OmpD (CT812), OmpE, OmpF, OmpG, OmpH (CT869-CT872, which are clustered together on the Chl. trachomatis chromosome) and OmpI (CT874). Several of these predicted outer membrane proteins have recently been shown to be translated and localized to the surface of the chlamydial outer membrane [27,48]. Outer membrane proteins of microbial pathogens serve essential roles in engaging the host environment and can be important immunotherapeutic targets [49]. Besides the above outer membrane proteins, several other members of the outer membrane complex were uniquely present in all Chlamydiaceae species. Some of these, e.g. PorB (CT713) or SRP (previously annotated as CrpA) (CT442) [50], have been described as being immunogenic. PorB (CT713) is the target of neutralizing antibody responses in vitro and it lies in a cluster with another Chlamydiaceae-specific protein (CT712) of unknown function [51]. OmcA (CT444) is a well characterized chlamydial envelope-related protein (discussed in the previous section) that is found to be Chlamydiaceae-specific in our work. None of these proteins could be detected in the Protochlamydia by BLASTp searches using Chl. trachomatis proteins. Unlike the Pmps and Omps, chlamydial Inc proteins are localized to the chlamydial inclusion membrane. Different Inc proteins share minimal primary sequence identity with each other or with other proteins in the databases, making the assignment of specific function difficult. Studies of individual Incs have shown that these peptides protrude from the surface of the inclusion and may mimic or bind to host cell proteins that selectively manicure the surface of the inclusion to avoid fusion with phagosomes or the host cell exocytic pathway [38,52,53]. Two Inc proteins, IncB (CT232) and IncC (CT233), were found to be unique to various Chlamydiaceae species. IncC is secreted by the type III secretion system and is expressed early during the initial stages of infection, and the gene for this protein lies adjacent to the incB locus in a cluster [37,54].

[Table note: The pair-wise amino acid sequence identity was determined for various chlamydial species for six Chl. trachomatis proteins that are specific for Chlamydiales (A) and seven other proteins that are broadly distributed in bacteria (B). The numbers in these Tables indicate the length over which the two sequences match each other as well as the % amino acid identity between the two sequences.]

Chlamydia- and Chlamydophila-specific proteins
The Chlamydia and Chlamydophila constitute two of the genera within the Chlamydiaceae family [1,2]. Thus, it is likely that unique proteins will be found exclusively in the species belonging to these two groups of bacteria. The genomes of two Chlamydia (Chl. trachomatis and Chl. muridarum) and three Chlamydophila species (viz. Chlam. pneumoniae, Chlam. caviae and Chlam. abortus) have been sequenced. In our work, we have identified twenty unique proteins for each of these two genera. A brief description of these results is presented below.
Of the twenty Chlamydia-specific ORFs identified in this work (Table 3), 16 were hypothetical. Those proteins of predicted function were Incs, specifically IncD, IncE, IncF and IncG (CT115-CT118). These 4 ORFs are transcribed from a single operon in Chlamydia species and when expressed, are not uniformly distributed [56]. Instead, these peptides appear to be concentrated at discrete sites in the inclusion membrane, primarily at sites of membrane-RB contact which can be visualized by both immunofluorescence and electron microscopy [56]. Unlike IncD-G localization, chlamydial MOMP appears to be depleted at the point of contact of RBs with the inclusion membrane but is found throughout the rest of the chlamydial outer membrane [56]. The asymmetric distribution of these proteins suggests that RBs exhibit polarity. The presence of such diversity of chlamydial polypeptides in the inclusion membrane lends support to the concept that chlamydiae actively control the interactions of the inclusion with the host cell to maintain a highly specialized environment favourable to chlamydial replication, in a lineage-specific manner. The other Chlamydia-specific proteins are of unknown function and some of these are arranged in 3 additional clusters, consisting of loci CT134/CT135, CT226-CT229 and CT357/CT358.
Of the 20 Chlamydophila-specific proteins identified in this work (Table 4), 9 are hypothetical while 11 have been minimally characterized previously. Among this group, four putative lipoproteins were identified, whose functions are not yet understood (CCA00246, CCA00575, CCA00578, CCA00738). Additionally, 6 proteins have been described as putative membrane proteins in Chlam. abortus (CCA00222, CCA00261, CCA00621, CCA00360, CCA00361 and CCA00434), of which the last three are thought to be localized to the inner membrane [16]. Loci CCA00360 and CCA00361 are part of a cluster, which is the only cluster found among Chlamydophila-specific ORFs. This group also contained a serine-rich protein of unknown function, annotated as YwbM (CCA00062), which is absent from other chlamydiae as well as all other organisms.
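The taxonomic categories used in the preceding sections amount to a rule on which of the six sequenced genomes contain a homolog. The sketch below encodes that rule under stated assumptions: the function name, the return labels and the `found_outside_chlamydiae` flag are hypothetical, and presence/absence calls are taken as given rather than derived from BLAST output.

```python
from typing import Iterable

# Genus assignment of the six sequenced genomes named in the text.
GENUS = {
    "Chl. trachomatis": "Chlamydia",
    "Chl. muridarum": "Chlamydia",
    "Chlam. pneumoniae": "Chlamydophila",
    "Chlam. caviae": "Chlamydophila",
    "Chlam. abortus": "Chlamydophila",
    "P. amoebophila": "Protochlamydia",
}


def species_of(genus: str) -> set:
    """All sequenced species assigned to the given genus."""
    return {species for species, g in GENUS.items() if g == genus}


def classify(present_in: Iterable[str], found_outside_chlamydiae: bool = False) -> str:
    """Assign a protein to a specificity category from its presence pattern."""
    if found_outside_chlamydiae:
        return "not chlamydiae-specific"
    present = set(present_in)
    unknown = present - set(GENUS)
    if unknown:
        raise ValueError(f"unknown species: {sorted(unknown)}")
    chlamydia = species_of("Chlamydia")
    chlamydophila = species_of("Chlamydophila")
    if present == set(GENUS):
        return "Chlamydiales-specific"
    if present == chlamydia | chlamydophila:
        return "Chlamydiaceae-specific"
    if present == chlamydia:
        return "Chlamydia-specific"
    if present == chlamydophila:
        return "Chlamydophila-specific"
    if present == species_of("Protochlamydia"):
        return "Protochlamydia-specific"
    # Anything else (e.g. missing from a single species) needs a closer look.
    return "sporadic distribution"


print(classify(["Chl. trachomatis", "Chl. muridarum"]))
```

Patterns that fall into the last branch correspond to the kind of irregular distributions that would be examined as possible gene loss or transfer.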

Protochlamydia-specific proteins
Protochlamydia amoebophila provides the only completed genome of a Chlamydiales species that belongs to a family other than the Chlamydiaceae. The comparative genomics studies by Horn et al. [11] determined which of the Protochlamydia proteins were not present in other sequenced chlamydiae genomes. Although the list compiled by these authors contained over 1000 ORFs, which were present in Protochlamydia but were absent in other Chlamydiales, their study did not examine which of these proteins were also present in other bacterial phyla. We have undertaken this analysis in order to determine Protochlamydia-specific proteins. A total of 445 Protochlamydial ORFs were identified, and all were of unknown function. The listing of these proteins is provided in Additional file 1. Most of these peptides were large in size (>300aa). Fifty-five gene clusters were found. These clusters were located at loci

Cases of putative gene loss or transfer
The impact of gene acquisition and loss on bacterial genome evolution is currently a central question in microbiology which remains unresolved. It was therefore of interest to identify cases in which a gene appears to have been lost or possibly transferred from chlamydial genomes. Such cases were detected when all other species of a particular chlamydial group (e.g. Chlamydiales, Chlamydiaceae etc.) contained a specific protein, with the exception of one or only a few species. In all, 33 such cases were identified among the different chlamydial groups (Table 5). Interestingly, 10 proteins which were found in the Chlamydophila and Protochlamydia were absent in the two available Chlamydia species (CCA00154, CCA00168, CCA00180, CCA00266, CCA00333, CCA00495, CCA00619, CCA00733, CCA00743, CCA00855). These genes may represent incidences of gene loss in the Chlamydia, or possible gene transfer between the Chlamydophila and Protochlamydia. Some phylogenies based on rRNA support the early separation of Chlamydia [57,58]. However, indel analysis and some protein phylogenies indicate that the Chlamydophila are ancestral to the Chlamydia [25]. Among the results, 6 proteins were found in all Chlamydiaceae species with the exception of Chl. trachomatis (CCA00231, CCA00396, CCA00425, CCA00538, CCA00827, CCA00910). Furthermore, 6 different proteins were found in all Chlamydiaceae species except for Chlam. pneumoniae (CT049, CT050, CT051, CT214, CT867, CT868). Also, 2 proteins were missing from Chlam. caviae which were found in all other available chlamydial species (CT244 and CT277), while 2 different proteins were absent in Chl. muridarum which were present in other Chlamydiaceae genomes (CCA00470, CCA00259). One protein which was unique to all Chlamydiales (CT550) and two proteins which were found in various Chlamydiaceae (CT444.1 and CT702) were also not found in the Chlam. abortus genome. One protein was found which was uniquely shared by all Chlamydia species and Chlam. pneumoniae, to the exclusion of all other Chlamydophila, which may represent a lateral gene transfer (CCA00075). A single protein was absent in Chlam. pneumoniae which was present in all other Chlamydophila genomes (Chlam. caviae and Chlam. abortus) (CT785). Since Chl. trachomatis and Chlam. caviae were used as probes, it is possible that other genes may have been lost from these genomes which were not identified during the searches. When the functions of these proteins are elucidated, these patterns of putative gene loss will likely shed light on their importance in regard to the differences in chlamydial biology, as well as why such gene losses may have occurred. Another well-studied example is that of the nonmitochondrial ADP-ATP translocases (CT065 and CT495), Chlamydiales proteins that are also present in various Rickettsiales as well as in various plant and algal plastids. Phylogenetic studies indicate that these genes have been transferred from a Chlamydiales ancestor to the other groups where these proteins are found [59,60].

[Figure: The branching pattern of the chlamydiae species based upon the distribution pattern of different chlamydiae-specific proteins.]

Figure 2 Multiple sequence alignments of a representative Chlamydiales-specific protein CT429 (A), and a Chlamydiaceae-specific protein CT712 (B). Both of these proteins are of unknown function and they are uniquely found in the indicated groups of species. The sequence alignments for all available homologs for these proteins are shown. Dashes (-) in the alignments indicate identity with the amino acid on the top line (i.e. Chl. trachomatis sequence). Both these proteins contain many highly conserved regions suitable for PCR primer design that can be used for diagnostic purposes.
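The detection rule for putative gene loss described in this section (a protein present in all members of a group except one or a few) can be written down compactly. The sketch below is illustrative only: the function name and the `max_missing` parameter are assumptions, since the text does not state how many absences count as "only a few".

```python
from typing import Iterable, List

# The six sequenced genomes considered in the text.
CHLAMYDIALES_SPECIES = [
    "Chl. trachomatis", "Chl. muridarum",
    "Chlam. pneumoniae", "Chlam. caviae", "Chlam. abortus",
    "P. amoebophila",
]


def putative_losses(group: Iterable[str], present_in: Iterable[str],
                    max_missing: int = 1) -> List[str]:
    """Return the group members lacking the protein if this looks like gene loss.

    A candidate is reported only when at least one but no more than
    `max_missing` members of the group lack a detectable homolog.
    """
    missing = sorted(set(group) - set(present_in))
    if 0 < len(missing) <= max_missing:
        return missing
    return []


# Pattern reported in the text for CT244: absent from Chlam. caviae only.
present = [s for s in CHLAMYDIALES_SPECIES if s != "Chlam. caviae"]
print(putative_losses(CHLAMYDIALES_SPECIES, present))
```

A presence pattern alone cannot distinguish loss in one lineage from lateral transfer into the others, which is why such cases are described as putative.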

Divergence rates of Chlamydiae-specific proteins
We have also examined the sequence divergence rates for chlamydiae-specific proteins as compared to more ubiquitous proteins to determine whether the former proteins are evolving at a much faster rate. A relative measure of this was obtained by determining the degree of sequence conservation (as measured by % amino acid identity) among different chlamydial species for a number of Chl. trachomatis proteins that are either Chlamydiales-specific or which are more broadly distributed among bacteria.
Results of these analyses for the two classes of proteins (i.e. Chlamydiales-specific vs broadly distributed) are presented in Tables 7A and 7B, respectively. As seen, in both cases, Chl. trachomatis proteins exhibited the highest identity (~75-90%) to the Chl. muridarum homologs. This was followed by the species from the Chlamydophila genus (41-62% identity for the chlamydiae-specific proteins vs 60-80% identity for the broadly distributed group) and then by P. amoebophila. The overall pattern that was observed was very similar for the two sets of proteins, although the proteins that are more broadly distributed were found to exhibit a 5-15% higher degree of conservation (i.e. slower rate of divergence) in comparison to the Chlamydiales-specific proteins.
(Table 8) The smaller numbers of unique ORFs that are found in the Chlamydiaceae species in comparison to P. amoebophila may be related to their smaller genome sizes and also to the fact that the former species grow in a more stable niche than the latter species [11]. Also, 31 cases of putatively lost genes from the different chlamydial genomes were detected. The simplest and most parsimonious explanation for these shared unique signature proteins is that they were introduced only once in a common ancestor of the indicated chlamydial taxonomic groups located at different phylogenetic depths and then vertically passed on to species that descended from that particular ancestor [64,65]. In our analysis, we have also found some examples of chlamydiae-specific proteins which show sporadic species distribution (e.g. present in only a few Chlamydia and Chlamydophila species, but not in other species from these genera). Such proteins could possibly arise from two different mechanisms. First, it is possible that such genes were introduced initially at a deeper phylogenetic level (i.e. order or family) and then selectively lost in some species from the later branching taxa.
Alternatively, their distribution can be explained if the gene in question was initially introduced in one chlamydia species and then laterally transferred to some others, as is the case for the nonmitochondrial ADP-ATP translocases [59,60]. The number of such proteins is relatively small in comparison to the proteins that show taxa-specific distribution.

Conclusion
We have recently described a number of molecular signatures consisting of conserved indels (i.e. inserts and deletions) in widely distributed proteins viz. RNA polymerase α subunit, elongation factor-Tu, elongation factor-P, DNA gyrase B and lysyl-tRNA synthetase, that are distinctive characteristics of all chlamydial species [25]. The sequence information for these proteins was also obtained from Simkania negevensis, Waddlia chondrophila, and in a number of cases for Neochlamydia hartmannellae, covering all families within the Chlamydiae phylum. The unique presence of these conserved indels in all chlamydiae species provides strong evidence that they are distinctive characteristics of the entire Chlamydiales order. In phylogenetic trees based on a combined dataset of these protein sequences as well as 16S rRNA, the traditional Chlamydiaceae species (i.e. Chlamydophila and Chlamydia) and the chlamydiae-like organisms (viz. Simkania, Waddlia and Parachlamydia) were found to form two distinct clades, indicating that these two groups diverged from each other very early in the evolution of chlamydiae. The chlamydiae-specific proteins identified in the present work provide additional powerful means for understanding and constructing a reliable phylogeny of the Chlamydiales. Based upon the shared presence of these chlamydial-specific proteins, different taxonomic clades or groups within the Chlamydiales could be identified. Figure 1 shows a tree indicating branch points marked by the different chlamydiae group-specific proteins. In addition to distinguishing P. amoebophila from other Chlamydiaceae species, these proteins also support the distinctness of the two genera (i.e. Chlamydophila and Chlamydia) within this family. The phylogenetic inferences based on these proteins are in complete agreement with the relationships observed within this phylum based on different proteins as well as 16S rRNA trees [2,8,25].
The genes encoding this set of peptides were likely acquired or evolved sometime after the divergence of the Protochlamydia and the other chlamydia-like organisms. Although sequence information for other chlamydiae (viz. Simkania and Waddlia) is lacking at present, it is likely that many of the Chlamydiales-specific proteins identified here will also be found in these species. Further, the clustering of the chlamydiae-like species in phylogenetic trees [2,25] suggests that many of the proteins which are presently indicated to be specific for P. amoebophila may also be found in the other chlamydiae-like organisms and they may provide unique molecular markers for these groups of species.
The analyses of bacterial genomes have indicated that a substantial proportion of their gene repertoire is comprised of signature proteins or ORFans i.e., open reading frames which have no known homologs and consequently no known function [64][65][66]. The mechanisms by which such novel genes arise in different genomes are presently not clearly understood and they have been discussed in other recent publications [67][68][69]. However, it is possible that homologs of some of these genes/proteins are also present in other bacteria but because of their extensive sequence divergence significant similarity to the chlamydiae homologs is not observed. One example where this may be the case is that of the OmcA (CT444) protein, which is listed as a Chlamydiaceae-specific protein in this study (Table 2). An OmcA homolog is annotated in P. amoebophila genome because it is cysteine-rich and upstream of a large cysteine-rich protein, but it does not show significant similarity to the Chlamydiaceae homologs in BLASTp searches. Another protein that we have not included in any of the Tables but is of interest is CT020 (signal peptidase I). The homologs of this protein are highly conserved in all chlamydiae (E values ranging from 0 to 1e-101) and there is a large drop in E value from chlamydiae to the first non-chlamydiae hit (E value 3e-09), which is also indicated as signal peptidase I. The sequence of this protein has again greatly diverged between chlamydiae and non-chlamydiae species.
Although the presence of these proteins could be explained by extensive sequence divergence, the possibility that they may have originated independently cannot be excluded. Our survey of the divergence rates of proteins that are either specific for the Chlamydiales or broadly distributed in different bacteria indicates that Chlamydiales-specific proteins exhibit only a slightly higher divergence rate (~10%) in comparison to the more broadly distributed proteins. This difference, in our view, is unlikely to be generally responsible for the failure to detect the orthologs of most of the chlamydial-specific proteins in other species.
Most of the chlamydiae-specific proteins that have been identified in the present work are of unknown function. Many of these genes are present in distinct clusters in genomes, indicating that these clusters could comprise functional units and that the corresponding proteins may be involved in related cellular functions [26,70]. Studies aimed at understanding the cellular functions of the Chlamydiales-specific proteins (i.e. their localization, expression and function) should prove highly informative for understanding chlamydial development, virulence and infectivity. Studies on the Chlamydiaceae-specific proteins should provide important insights regarding characteristics that distinguish them from other chlamydiae families. Additionally, a large number of protein coding sequences that are specific for either particular genera or different chlamydial species have been identified. Functional studies of such proteins could provide important information regarding the factors that are presumably responsible for their different properties such as tissue tropism and disease spectrum.
The identified chlamydiae-specific proteins also provide novel biomarkers that should prove of much value in the diagnosis of these bacteria and in exploration of their prevalence and diversity. The identification and screening of chlamydiae and chlamydiae-like organisms in the past has relied on a small number of biomarkers (e.g. 16S and 23S rRNA, MOMP, GroEL protein, lipopolysaccharide epitope Kdo, small cysteine-rich lipoprotein and 60-kDa cysteine-rich protein) [71][72][73][74]. In this context, our identification of numerous proteins which are uniquely found in the Chlamydiales, Chlamydiaceae, Chlamydophila or Chlamydia groups of species is of great interest, and these proteins provide novel and valuable means for detection of these groups of species. Many of these group-specific proteins are highly conserved, and sequence alignments for one Chlamydiales-specific protein (CT429) and one Chlamydiaceae-specific protein (CT712) are presented in Figure 2. The sequence alignments of both these proteins contain many highly conserved regions that are suitable for design of PCR primers that should enable successful amplification from all species belonging to these groups. Monoclonal and polyclonal antibodies based upon different chlamydial-specific proteins provide another means for diagnostic studies. Thus, based upon the different group-specific proteins that have been identified in the present work, it should be possible to develop novel diagnostics that are capable of distinguishing different groups of chlamydiae species (viz. Chlamydia, Chlamydophila, Chlamydiaceae and Chlamydiales) from each other with a high degree of accuracy.

Identification of Chlamydiae-specific proteins
In order to identify proteins which are chlamydiae-specific, systematic BLAST searches were carried out on different proteins in the genomes of Chl. trachomatis and Chlam. caviae against all available sequences in the databases. BLASTp searches were performed using each of the proteins or ORFs in these genomes as probes to identify all other bacteria which contain related protein sequences [75]. The searches were carried out using default search parameters, as set by the protein-protein BLAST program, which included the low complexity filter. The low complexity filter removes regions of query sequence which are highly repetitive and which could lead to spurious high similarity with unrelated proteins. The results obtained were visually inspected for homologs showing specificity to the chlamydiae with no other similar homologs present in any other bacteria. This was done by analyzing expect values (E values) for all proteins. The Expect value (E value) represents the statistical significance of a hit in a BLAST search. This number equates to the number of hits that one would find by chance when searching a database of a particular size. The E values depend upon the length of the protein as well as the extent of sequence homology that is observed with any given protein.
The E values which are very low (generally less than e-200) are indicated in the results obtained as '0'. In our extensive work, the E value for the top hit (i.e. to the query itself) is generally found to be '0' when the query protein is longer than 325 aa. For shorter proteins, the E value for the top hit has a finite value. The E value of top hits may be slightly affected by the application of the filter, which effectively shortens the query length [75]. Proteins were considered Chlamydiales-specific if all hits belonged to the chlamydiae species, or if the next best non-chlamydiae hit had an expect value which was not significant and could occur by chance. BLAST results with E values > 10^-4 are in the range where the observed similarity could occur by chance. All such hits were examined for protein length to ensure that it was similar to that of the query protein before a hit was considered as a related protein.
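As a rough illustration of how an E value scales with query and database size, the standard Karlin-Altschul relation E = m × n × 2^(-S'), where S' is the bit score, m the query length and n the database length, can be sketched as follows. The function names, the use of raw rather than effective lengths, and the example numbers are assumptions made for illustration; they are not the settings or results of the searches described here.

```python
E_CUTOFF = 1e-4  # E values above this are treated as chance-level, as stated in the text


def evalue_from_bit_score(bit_score, query_length, database_length):
    """Approximate E value from a bit score: E = m * n * 2**(-S').

    Simplification: BLAST itself uses edge-corrected "effective" lengths,
    so reported E values will differ somewhat from this estimate.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


def is_chance_level(evalue, cutoff=E_CUTOFF):
    """True when a hit falls in the range where similarity could occur by chance."""
    return evalue > cutoff


if __name__ == "__main__":
    # Hypothetical numbers: a 300 aa query searched against 1e9 residues.
    for score in (50, 60, 200):
        e = evalue_from_bit_score(score, 300, 1e9)
        print(score, e, is_chance_level(e))
```

Because the E value grows linearly with the search space but falls exponentially with the score, a short query, which can only reach a modest score, tends to yield finite and comparatively high E values even for genuine homologs.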
Occasionally, higher E values were permitted in the analysis when the query protein was short, since with fewer characters the E values are higher and this often produces hits with E values > 10^-4 that are nevertheless significant for the study. It should be mentioned that BLAST searches can sometimes indicate misleading relationships, particularly when no close relatives of the query species are in the database [76]. However, in the present study, where most of the reported BLAST hits are for chlamydiae species, for which there are several published genomes, such a possibility is highly unlikely. For all chlamydiae-specific proteins, E values for each chlamydiae BLAST hit, as well as for the first non-chlamydial hit, are reported here. Furthermore, the sizes of each protein in Chl. trachomatis (or other genomes that were used as BLAST probes) are also listed, and the sizes of other chlamydial homologs are highly similar unless otherwise stated. The same criteria were used for the detection of Chlamydiaceae-specific, Chlamydia-specific and Chlamydophila-specific proteins. For clarification purposes, the chlamydiae loci and accession numbers are given as reference for the different proteins. The genome of P. amoebophila was also examined in this manner to identify proteins that are specific for Protochlamydia. Those genes which were found in all species within a given group (i.e. Chlamydiales, Chlamydiaceae etc.) except for one or a few species were considered to be due to gene loss or lateral gene transfer.
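The screening criteria described above were applied by visual inspection, but their logic can be summarized in a short sketch. The `Hit` record, the function names and the `max_length_ratio` tolerance are hypothetical and are introduced only for illustration; the text states that hit lengths were checked for similarity to the query but gives no numerical tolerance.

```python
from dataclasses import dataclass

E_CUTOFF = 1e-4  # hits with E values above this are treated as chance-level


@dataclass(frozen=True)
class Hit:
    organism: str   # species the hit comes from
    evalue: float   # BLASTp expect value
    length: int     # length of the hit protein in amino acids


def significant_hits(hits, query_length, max_length_ratio=1.5):
    """Keep hits at or below the cutoff whose length is comparable to the query.

    max_length_ratio is an assumed tolerance, not a value from the study.
    """
    kept = []
    for h in hits:
        ratio = max(h.length, query_length) / min(h.length, query_length)
        if h.evalue <= E_CUTOFF and ratio <= max_length_ratio:
            kept.append(h)
    return kept


def is_group_specific(hits, query_length, group):
    """True if there are significant hits and all of them lie within `group`."""
    sig = significant_hits(hits, query_length)
    return bool(sig) and all(h.organism in group for h in sig)


def missing_from(hits, query_length, group):
    """Group members without a significant hit: candidates for gene loss."""
    found = {h.organism for h in significant_hits(hits, query_length)}
    return set(group) - found
```

A protein for which `is_group_specific` holds and `missing_from` returns one or a few species corresponds to the pattern interpreted here as possible gene loss or lateral gene transfer. The short-query exception mentioned above is not modeled; it would require relaxing `E_CUTOFF` case by case.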

Alignments of Chlamydiae-specific proteins and percentage identity determination
The global alignments of Chlamydiales-specific protein CT429 and Chlamydiaceae-specific protein CT712 were carried out by first retrieving chlamydiae sequences from the NCBI database [77] in FASTA format. The ClustalX program [78] was then used to construct both alignments using the default parameters. Pair-wise alignments of sequences for determining percentage identity between Chl. trachomatis proteins and those from other chlamydial species were carried out using the ALIGN Plus 4 program package (Scientific and Educational Software) with the BLOSUM 62 scoring matrix [79] and the default parameters of the alignment program.
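For readers who wish to reproduce the percentage identity values from their own pairwise alignments, a minimal sketch of the calculation is given below. It assumes two already-aligned sequences of equal length with '-' as the gap character, and it counts identical residues over all aligned columns, including columns with a gap in one sequence. The convention used by ALIGN Plus 4 is not stated here and may differ, so the function name and the choice of denominator are assumptions.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of alignment columns in which both sequences carry the same residue.

    Columns that are gaps in both sequences are ignored; a gap opposite a
    residue counts as a mismatch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    columns = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap and b == gap:
            continue  # uninformative column
        columns += 1
        if a == b:
            matches += 1
    if columns == 0:
        return 0.0
    return 100.0 * matches / columns


if __name__ == "__main__":
    # Toy sequences for illustration only, not taken from the study.
    print(round(percent_identity("MKT-LV", "MRTALV"), 1))
```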

Analysis of temporal expression of Chlamydiae-specific proteins
Lineage-specific chlamydial proteins were compared to the microarray data of Nicholson et al. [63]. In that work, the expression patterns of Chl. trachomatis genes were divided into different stages: early (6-24 hours post infection), midlate I (12-24 hpi), midlate II (18-24 hpi) or late (24-36 hpi). Proteins identified in that work which exhibited at least 3-fold differences in expression patterns at different stages, were then placed into one of these different groups. The information for stage-specific expression of chlamydiae-specific proteins was extracted from this study.

Authors' contributions
MSV was assigned to carry out the initial BLAST searches on Chl. trachomatis and Chlam. pneumoniae genomes to identify chlamydial specific proteins. Subsequently, EG was asked to confirm these results and she also carried out BLAST searches on Protochlamydia proteins to identify proteins that are unique to this species. EG also prepared an initial draft of the manuscript. RSG was responsible for conceiving and directing this study from the beginning to the end, for the final evaluation of all results, and for preparing the final submitted manuscript. All authors have read and approved the final manuscript.