Genomic identification of POP homologs in bacteria and archaea
We first applied direct and profile-based sensitive sequence search methods to identify POP homologs from 23 bacterial and 4 archaeal phyla (Additional file 1-S1a and Additional file 2). Hits were considered ‘true’ if the sequence search algorithms identified them with both β-propeller (POP_N) and α/β hydrolase (POP_C) domains, or with at least the α/β hydrolase domain. At a stringent E-value of 10⁻¹⁰, only 1,791 POP homologs could be identified, while relaxing the E-value to 10⁻³ could capture 3,387 additional POP homologs (Additional files 3 and 4). In total, 3,010 POP homologs were collected using exhaustive Phmmer, Jackhmmer and profile-based approaches, including 2,919 bacterial and 91 archaeal POP homologs [29, 30].
The collected hits included annotated POPs, POP family members and nearby hydrolases of the α/β hydrolase superfamily. Altogether, they are referred to as ‘POP homologs’ in this report. In certain bacterial (Aquificae, Deferribacteres, Elusimicrobium, Dictyoglomi, Tenericutes) and archaeal (Nanoarchaeota and Thaumarchaeota) lineages, no POP homologs could be identified. BLAST searches also failed to capture POP homologs in these phyla, except in Dictyoglomi [31]. However, sequence searches against the appended-PALI+ database could pick up at least one POP homolog in the above phyla, except for Nanoarchaeota [32].
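The acceptance rule described above (an E-value cutoff combined with a domain requirement) can be sketched as follows. This is a minimal illustration and not the pipeline used in the study: the `Hit` record, the function names and the toy accessions are hypothetical, and only the domain labels POP_N and POP_C and the two E-value cutoffs come from the text.

```python
from dataclasses import dataclass

# Domain labels as used in the text; the record layout below is hypothetical.
PROPELLER = "POP_N"   # beta-propeller domain
HYDROLASE = "POP_C"   # alpha/beta hydrolase domain


@dataclass(frozen=True)
class Hit:
    accession: str      # toy identifier, not a real database accession
    evalue: float       # E-value reported by the sequence search
    domains: frozenset  # domain labels detected on the hit


def is_true_hit(hit, max_evalue=1e-3):
    """Accept a hit that passes the cutoff and carries at least the hydrolase domain."""
    return hit.evalue <= max_evalue and HYDROLASE in hit.domains


def filter_hits(hits, max_evalue=1e-3):
    """Return the accepted hits, preserving input order."""
    return [h for h in hits if is_true_hit(h, max_evalue)]


hits = [
    Hit("seqA", 1e-40, frozenset({PROPELLER, HYDROLASE})),  # full-length, strong hit
    Hit("seqB", 1e-5, frozenset({HYDROLASE})),              # hydrolase only, weaker hit
    Hit("seqC", 1e-20, frozenset({PROPELLER})),             # propeller only: rejected
    Hit("seqD", 0.5, frozenset({PROPELLER, HYDROLASE})),    # fails both cutoffs
]

stringent = filter_hits(hits, max_evalue=1e-10)
relaxed = filter_hits(hits, max_evalue=1e-3)
print([h.accession for h in stringent])
print([h.accession for h in relaxed])
```

In this toy set, relaxing the cutoff admits the hydrolase-only hit, while the propeller-only hit is rejected at either cutoff because the hydrolase domain is the minimum requirement.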
Wide distribution of POP homologs in prokaryotic lineages
The collected POP homologs were widely distributed across all the major lineages of bacteria and archaea, with an apparent loss in Nanoarchaeota. The phylum Actinobacteria was the most populated, with ~1000 POP homologs (Figure 1), while in archaea, many POP homologs were captured from Euryarchaeota and Crenarchaeota. Within the POP family, POPs were more abundant (44%) in prokaryotic lineages than DPPs (24%) and OPBs (10%) (Figure 1c, Additional file 5). We could also capture all the 638 annotated POPs from prokaryotes.
Bacterial POP homologs are both secretory and membrane proteins
Earlier studies have shown that bPOPs are associated with signal peptides [13]. Signal peptides are sequence motifs that permit proteins to translocate across the endoplasmic reticulum in eukaryotes and to the cytoplasmic membrane in prokaryotes. Therefore, we examined all the collected POP homologs for the presence of signal peptides. Our results showed that 20% of the POP homologs were predicted to be associated with such signals, including 225 annotated POPs (35% of the 638 annotated POPs) (Figure 2). Bacteroides (78%) and Acidobacteria (75%) had the highest proportion of POP homologs with signal peptides, while in some bacterial phyla (e.g. Fusobacteria, Spirochaetes, Thermotogae and Synergistes) signal peptides were completely absent. POP homologs from Gram-positive bacterial phyla (Actinobacteria and Firmicutes) showed relatively fewer signal peptides.
Recently, membrane-bound forms of POPs isolated from synaptosomal membranes of bovine brain were also reported [33, 34]. Cytosolic and membrane forms differ in their sensitivity to inhibitors, relative molecular mass, affinity for the peptide substrate and the presence of a hydrophobic membrane anchor [33, 34]. Transmembrane helix prediction by TMHMM identified 236 annotated bPOPs with single transmembrane helices located at the N-terminus [35]. Transmembrane helices were absent in POPs of the phyla Spirochaetes and Fusobacteria.
Diverse domain architectures reveal putative functions of POP homologs
We then investigated the coexisting domains to understand the possible biological functions of POP homologs in the prokaryotic lineages. Bacterial and archaeal POP homologs were associated with 105 and 8 different domain architectures, respectively (Figure 3, Additional file 6). Archaeal and bacterial POP homologs share similar domain architectures, suggesting similar functions of POP homologs in these two domains of life. Domain architectures of POP homologs were also mapped onto the species trees of bacteria and archaea. As shown in Figure 4, POPs were associated with diverse domain combinations in Proteobacteria, while in mycobacterial species POPs were replaced by other hydrolases. Within a phylum, an anomalous distribution of POPs was observed. Mapping of domain architectures onto the archaeal species tree showed only the C-terminal POP domain in most of the organisms, while full-length POP domains were observed in a few species of Crenarchaeota (Figure 5).
POP homologs were frequently associated with protein-protein interaction domains, e.g. PDZ domains and tetratricopeptide repeats (TPR). Two of the ‘C-terminal processing peptidases’ (S41) had PDZ domains, which are associated with signaling proteins of bacteria, plants and higher order organisms. PDZ domains are involved in the assembly of large protein complexes, thereby coordinating and guiding the flow of regulatory information [36, 37]. PDZ domains present in peptidase S41 of Candidatus Solibacter usitatus (YP_821861) and Roseiflexus (YP_001276641) were associated with WD40 and DPP domains. TPR motifs facilitate interaction with other proteins. These motifs were also associated with the hydrolase domain in Candidatus Solibacter usitatus (YP_824720). TPR proteins are also associated with multi-protein complexes, and are involved in the functioning of chaperones, cell-cycle, transcription and protein transport complexes [38, 39].
POP homologs were also associated with signaling modules such as WD-repeats. Proteins with WD-repeats exhibit a high degree of functional diversity [40–42]. Some archaeal POPs were also predicted to be associated with WD-repeats, suggesting a conserved function of POPs in the two domains of life. Besides WD-repeats, POP homologs were also associated with Sel1 repeats, which are a subfamily of TPR sequences. In prokaryotes, these repeats allow proteins to attach to membranes and mediate interactions between bacterial and eukaryotic host cells [43, 44]. One of the POP proteins from Ferrimonas balearica (YP_003914375) was predicted to be associated with Sel1 repeats.
Bacterial POP homologs were also found to co-exist with several DNA-binding modules of transcription regulatory domains. Numerous bacterial transcription regulatory proteins bind DNA via a helix-turn-helix motif [45]. These are sequentially diverse transcriptional activators, and most of them are known to negatively regulate their own expression. The transcription regulatory domain is associated with the response regulator receiver domain and plays an important role in DNA binding and regulation of transcription [46, 47]. POP homologs that co-existed with bacterial regulatory domains include those from Candidatus Solibacter usitatus (YP_827731) and Caulobacter segnis (YP_003594106), and those with a transcription regulatory domain include four homologs, two each from Actinobacteria and Gammaproteobacteria: YP_888147 (Mycobacterium smegmatis), YP_001759306 (Shewanella woodyi), YP_735011 (Shewanella sp. MR-4) and YP_954217 (Mycobacterium vanbaalenii). Targeted deletions of the predicted accessory domains would help to clarify the related biological functions.
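Counting distinct domain architectures per lineage, and finding those shared between bacteria and archaea, can be illustrated with a short sketch. It assumes that an architecture is the ordered list of domains from N- to C-terminus; the input records and the function name are hypothetical and do not reproduce the data of this study.

```python
from collections import defaultdict


def distinct_architectures(records):
    """Map each lineage to the set of distinct domain architectures observed in it.

    Each record is (lineage, ordered list of domain labels). Domain order is kept,
    so PDZ+POP_C and POP_C+PDZ would count as two different architectures.
    """
    by_lineage = defaultdict(set)
    for lineage, domains in records:
        by_lineage[lineage].add(tuple(domains))
    return dict(by_lineage)


# Toy records for illustration only.
records = [
    ("Bacteria", ["POP_N", "POP_C"]),
    ("Bacteria", ["PDZ", "POP_C"]),
    ("Bacteria", ["POP_N", "POP_C"]),  # duplicate architecture, counted once
    ("Archaea", ["POP_C"]),
    ("Archaea", ["POP_N", "POP_C"]),
]

architectures = distinct_architectures(records)
counts = {lineage: len(archs) for lineage, archs in architectures.items()}
shared = architectures["Bacteria"] & architectures["Archaea"]
print(counts)
print(shared)
```

Storing each architecture as a tuple makes it hashable, so duplicates collapse automatically and the shared architectures follow from a set intersection.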
Different cellular localization of annotated bPOPs
We also examined the cellular localization of annotated bPOPs to infer the possible functions of POPs in more detail. Prediction of cellular localization using PSORTb indicated that the annotated POPs are predominantly cytoplasmic (176 POPs, versus 115 POPs predicted to be periplasmic) (Additional files 4 and 7) [48]. Interestingly, some of these POPs were predicted to be localized in the cell wall, cytoplasm and outer membranes of bacteria and archaea. Different bacterial phyla showed differences in the preferred cellular localization of POPs. For example, in the phylum Proteobacteria, most of the POPs were periplasmic. Clustering analysis of the predicted cytoplasmic and periplasmic POPs separated the two groups clearly, with a few exceptions.
Phylogenetic analysis of annotated bPOPs shows high co-clustering
To investigate the differences among the annotated bPOPs, we next performed phylogenetic clustering of the 638 annotated POPs, which yielded nine distinct clusters, with co-clustering among members of different phyla (Figure 6). This co-clustering trend and the absence of phylum-specific clusters suggested high conservation of POPs within bacterial lineages. The genus Shewanella of marine metal-reducing bacteria was well represented, with a considerable number of annotated bPOPs in all nine clusters. Similarly, archaeal POPs also co-clustered with bPOPs. This co-clustering suggested the possibility of lateral transfer of POP genes among bacteria and between archaeal and bacterial species (Additional files 4 and 8).
Unique sequence signature motifs depict diverse sequence properties
To further analyse the co-clustering trend of annotated bPOPs, we identified conserved class-specific sequence motifs. An alignment stretch was considered a ‘conserved motif’ if 95% of the sequences had conserved amino acids at three or more consecutive positions. From these highly conserved sequence motifs, we next identified class-specific motifs. A ‘class-specific motif’ was defined as a sequence motif in a cluster that was completely absent from all the other clusters. In the first and seventh clusters (Figure 6), no class-specific motifs were observed. Figure 7 shows a part of the alignment of the fifth cluster of bPOPs, highlighting class-specific motifs. Detailed analysis of the motifs of all the clusters was carried out to understand their relative positions on the structure of bPOPs. Class-specific motifs of the second, sixth, eighth and ninth clusters were localized in the hydrolase domain, while motifs of the third, fourth and fifth clusters were distributed over both domains (see Additional files 9 and 10 for details).
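The two motif definitions above can be made concrete with a minimal sketch. It rests on several assumptions that are not stated in the text: a column counts as conserved when its most frequent non-gap residue reaches the 95% threshold, each maximal run of three or more conserved columns is reported as one motif, and ‘absent from other clusters’ is tested as the motif string not occurring in any ungapped sequence of another cluster. All function names and the toy alignments are hypothetical.

```python
from collections import Counter


def conserved_columns(alignment, threshold=0.95):
    """Per column, return the consensus residue if it reaches the threshold, else None."""
    n = len(alignment)
    consensus = []
    for column in zip(*alignment):
        residue, count = Counter(column).most_common(1)[0]
        keep = residue != "-" and count / n >= threshold
        consensus.append(residue if keep else None)
    return consensus


def conserved_motifs(alignment, threshold=0.95, min_length=3):
    """Return (start_column, motif) for each run of at least min_length conserved columns."""
    motifs, run_start, run = [], 0, []
    # A trailing None acts as a sentinel so the last run is flushed.
    for i, residue in enumerate(conserved_columns(alignment, threshold) + [None]):
        if residue is not None:
            if not run:
                run_start = i
            run.append(residue)
        else:
            if len(run) >= min_length:
                motifs.append((run_start, "".join(run)))
            run = []
    return motifs


def class_specific_motifs(clusters, cluster_name, threshold=0.95, min_length=3):
    """Keep conserved motifs of one cluster that occur in no sequence of any other cluster."""
    others = [
        seq.replace("-", "")
        for name, seqs in clusters.items() if name != cluster_name
        for seq in seqs
    ]
    return [
        (start, motif)
        for start, motif in conserved_motifs(clusters[cluster_name], threshold, min_length)
        if not any(motif in seq for seq in others)
    ]


# Toy aligned clusters for illustration only.
clusters = {
    "cluster_a": ["MKGSWTLA", "MKGSWTVA", "MKGSWTIA"],
    "cluster_b": ["MKGAYTLP", "MKGAYTLP", "MKGCYTLP"],
}
print(conserved_motifs(clusters["cluster_b"]))
print(class_specific_motifs(clusters, "cluster_a"))
print(class_specific_motifs(clusters, "cluster_b"))
```

In the toy data, the motif MKG is conserved in cluster_b but also occurs in cluster_a, so it is not class-specific, whereas YTLP is.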
Classification of annotated bPOPs into eight subtypes
Detailed analysis of class-specific sequence motifs indicated high sequence variation in annotated bPOPs. Therefore, on the basis of the identified class-specific motifs, we propose a classification of bPOPs into eight different subtypes, as shown in Figure 8. Some of these class-specific motifs were surface-exposed, suggesting their possible involvement in protein-protein interactions (for details see Additional file 4), while other motifs were located in the core of the protein, near functionally important residues, which could possibly cause differences in interaction with the versatile substrates of POPs.
Subtypes of bPOPs differ in the conservation of functionally important residues
We then investigated the conservation of functionally important residues in different subtypes of bPOPs. Detailed analysis revealed high conservation of the catalytic triad residues in all the subtypes. However, a high number of non-permissible amino acid replacements was concentrated at two sites, Ser-571 and Thr-573 (numbering according to the bPOP crystal structure, PDB id: 2BKL), which are located at the interface of the two domains. These sites were replaced by non-polar and positively charged amino acids in most of the subtypes of bPOPs (Additional file 11). These two residues were also situated in the vicinity of Arg-572 and Ile-575, which were reported to be crucial for the incoming peptide substrate in bPOPs. W-575, which is important for substrate binding, was conserved in some of the bPOPs, while in a few other bPOPs it was substituted by other amino acids. Altogether, the hydrophobic environment required for substrate binding was not conserved in all the bPOPs. These findings strengthen our hypothesis that the proposed bPOP subtypes may also differ in their possible substrates. Mutation experiments on these functionally important residues could provide further insight into their role in catalytic activity and substrate specificity.
Divergence of POP family members
Besides analysing the co-clustering pattern of annotated bPOPs, we also examined the divergence of POP family members. From the 3,010 collected POP homologs, we obtained 1,421 POP family members, including 638 annotated POPs, 156 OPBs, 293 ACCs and 334 DPPs. These members were used to construct a joint phylogenetic tree, in which a set of bacterial carboxylesterases (20 sequences) was used as an outgroup. We observed that OPBs and DPPs were distinctly clustered, while the ACCs and POPs were dispersed all over the tree (Additional files 12 and 13). The bPOP family tree contradicted the tree reported earlier by Venäläinen et al., in which distinct clusters of POP family members from all the domains of life were observed [13].
The phylogenetic analysis also suggested high divergence of other POP family members. Some of the POPs diverged from the rest of the POP family members before the OPBs, followed by the divergence of ACCs and DPPs. ACCs diverged along with some of the other POPs, since a distinct cluster of ACCs could not be obtained. This clustering pattern was confirmed by generating an additional phylogenetic tree, in which only DPPs, OPBs and ACCs were considered, to understand their phyletic distribution (Additional file 4). When POPs were excluded from the phylogenetic tree, the other members of the POP family formed distinct clusters, which revealed that POPs were responsible for the observed co-clustering among the POP family members.
Anomalous distribution of annotated bPOPs revealed many multi-POP bacterial genomes
While performing the sequence analysis, we noticed high variation in the number of annotated POP genes in bacterial genomes, ranging from no POPs to multiple copies of POPs within a genome. Overall, out of 269 identified bacterial genomes with annotated POPs, 148 had a single copy of the POP gene. The overrepresentation of POPs was particularly evident in the genus Shewanella of Gammaproteobacteria, where most of the species had multiple copies of the POP gene. One interesting example of a multi-POP proteome was Shewanella woodyi, with 16 POPs sharing an average sequence identity of 15% (ranging from 8 to 35%). Moreover, we could identify 12 copies of the POP gene in Shewanella piezotolerans, and 10 copies each in Shewanella pealeana and Shewanella sediminis. Besides the genus Shewanella, 15 POP genes were also identified in Solibacter usitatus. High sequence variation among the POP paralogs suggested that they are not closely related to each other, except in the S. thermophilus genome (Figure 9, Additional file 4). These multiple POPs within a genome also differ in their cellular localizations (Additional file 1-S1a).
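The average and range of pairwise sequence identity among paralogs, as quoted above for S. woodyi, can be computed from a multiple alignment along the following lines. This sketch assumes that identity is measured over alignment columns in which neither sequence has a gap; the text does not state which definition was used, and the sequences and names below are toy values.

```python
from itertools import combinations


def percent_identity(a, b):
    """Percent identity over columns where neither aligned sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def identity_summary(aligned):
    """Return (mean, minimum, maximum) percent identity over all sequence pairs."""
    values = [
        percent_identity(aligned[p], aligned[q])
        for p, q in combinations(sorted(aligned), 2)
    ]
    return sum(values) / len(values), min(values), max(values)


# Toy aligned paralogs for illustration only.
paralogs = {
    "pop1": "MKTALLGA",
    "pop2": "MKSALIGA",
    "pop3": "M-TCLLG-",
}

mean_id, min_id, max_id = identity_summary(paralogs)
print(f"mean {mean_id:.1f}%, range {min_id:.1f}-{max_id:.1f}%")
```

Other conventions, such as dividing by the length of the shorter sequence or by the full alignment length, give different percentages, so the denominator should be reported alongside the values.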
Horizontal gene transfer as a driving force for the expansion of POP gene family in bacteria
Examination of the complete genomes of bacterial and archaeal lineages showed considerable variation in the number of annotated POP genes within a genome. Horizontal gene transfer (HGT) and gene duplication are the two driving forces that may lead to the expansion of gene families in prokaryotic systems [49]. We studied the expansion of the POP gene family in more detail using the POP-rich genus Shewanella. Members of the genus Shewanella have been described from diverse habitats, ranging from deep cold-water marine environments and shallow Antarctic ocean habitats to hydrothermal vents and freshwater lakes [50]. We examined sequence similarity and chromosomal positioning to determine whether HGT is prevalent in these genomes. Chromosomal mapping of the 16 annotated POP genes of S. woodyi showed that they are not co-localized, indicating possible HGT events during evolution (Additional file 14). Only two genes (6118839 and 6118846, sharing a low sequence identity of 20%) were found to be somewhat closer on the genome, though still separated by six other genes. Similar patterns were also observed in the POP genes of other Shewanella species. This suggests the possibility of multiple HGT events during the evolution of these bacteria (Additional file 1-S1b and 4).
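The co-localization check described above amounts to counting how many other genes lie between neighbouring family members along the chromosome. The sketch below assumes a linear list of genes in chromosomal order and ignores the circularity of bacterial chromosomes, which is a simplification; the gene labels and function names are hypothetical.

```python
def intervening_gene_counts(gene_order, family_genes):
    """For family genes that are neighbours in chromosomal order, count the genes between them."""
    positions = sorted(i for i, gene in enumerate(gene_order) if gene in family_genes)
    return [
        (gene_order[a], gene_order[b], b - a - 1)
        for a, b in zip(positions, positions[1:])
    ]


def colocalized_pairs(gene_order, family_genes, max_intervening=0):
    """Neighbouring family genes separated by at most max_intervening other genes."""
    return [
        (g1, g2)
        for g1, g2, gap in intervening_gene_counts(gene_order, family_genes)
        if gap <= max_intervening
    ]


# Toy linear gene order g01..g20 with three family members; illustration only.
gene_order = [f"g{i:02d}" for i in range(1, 21)]
pop_genes = {"g02", "g09", "g17"}

gaps = intervening_gene_counts(gene_order, pop_genes)
print(gaps)
print(colocalized_pairs(gene_order, pop_genes))                     # strict adjacency
print(colocalized_pairs(gene_order, pop_genes, max_intervening=6))  # looser window
```

With strict adjacency none of the toy genes are co-localized, matching the kind of pattern the text interprets as dispersed family members; the window size is a free parameter and the choice of threshold is left to the analyst.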
Annotation of uncharacterized POP homologs of bacterial lineages
During sensitive sequence searches, we identified many hypothetical proteins with POP-like signatures. We applied various approaches, such as protein domain identification, secondary structure prediction, protein fold prediction and GO annotation mapping, to characterize 38 hypothetical sequences as POPs and 159 proteins as α/β hydrolases [29, 51–53]. A hypothetical sequence was annotated as a POP if an annotated POP query picked up the sequence at an E-value of 10⁻³ or better and it had a similar domain architecture (with both α/β hydrolase and β-propeller domains). During this analysis, some partial POPs comprising only the catalytic domain were also identified (Additional file 1-S1c). RPS-BLAST (Reversed Position Specific BLAST) using four different profiles (annotated POPs, ACCs, DPPs and other hydrolases of the α/β hydrolase superfamily) was carried out to further scan each unannotated sequence, thereby confirming that these sequences are α/β hydrolase superfamily members [31].
Limitations of the computational methods used in this study
Although we used multiple methods for the detailed analysis of POP homologs from the bacterial and archaeal lineages, the current study has certain limitations. Instead of relying on any single sequence search method, we employed multiple sequence search algorithms to detect all possible homologs. We validated all the obtained hits by mapping functional domains and active site residues, yet the possibility of false positives cannot be ruled out.
The complete absence of POP genes from some of the bacterial phyla could be due to limitations of the sequence search algorithms. It is possible that the POPs of such phyla have diverged so far that most methods, including current remote homology detection methods, failed to identify them. Therefore, further experimental characterization of these genomes is essential to establish the presence or absence of POPs. The available computational methods for predicting protein domains, cellular localization and signal peptides are also prone to incorrect predictions.
During this study, we also encountered many incomplete POP sequences, either with a missing N-terminal hydrolase domain (e.g. YP_001519174.1, Acaryochloris marina), with an incomplete propeller domain (e.g. YP_003320038.1, Sphaerobacter thermophilus) or with only the hydrolase domain (e.g. Mycoplasma genomes). These partial POPs could be due to errors in the available gene prediction algorithms. Additionally, wrong or incomplete annotation of the collected protein sequences could be another source of error. Experimental validation of these reported sequences would help to improve the current annotations of the corresponding genomes.