Genome-wide protein localization prediction strategies for gram negative bacteria
© Romine; licensee BioMed Central Ltd. 2011
Published: 15 June 2011
Genome-wide prediction of protein subcellular localization is an important type of evidence used for inferring protein function. While a variety of computational tools have been developed for this purpose, errors in the gene models and use of protein sorting signals that are not recognized by the more commonly accepted tools can diminish the accuracy of their output.
As part of an effort to manually curate the annotations of 19 strains of Shewanella, numerous insights were gained regarding the use of computational tools and proteomics data to predict protein localization. Identification of the suite of secretion systems present in each strain at the start of the process made it possible to tailor the subsequent localization prediction strategies to each strain for improved accuracy. Comparisons of the computational predictions among orthologous proteins revealed inconsistencies in the computational outputs, which could often be resolved by adjusting the gene models or ortholog group memberships. While proteomic data were useful for verifying start site predictions and post-translational proteolytic cleavage, care was needed to distinguish cellular from sample processing-mediated cleavage events. Searches for lipoprotein signal peptides revealed that neither TatP nor LipoP is designed to identify lipoprotein substrates of the twin arginine translocation system and that the +2 rule for lipoprotein sorting does not apply to this genus. Analysis of the relationships between domain occurrence and protein localization prediction enabled identification of numerous location-informative domains, which could then be used to refine or increase confidence in location predictions. This collective knowledge was used to develop a general strategy for predicting protein localization that could be adapted to other organisms.
Improved localization prediction accuracy is not simply a matter of developing better computational algorithms. It also entails gathering key knowledge regarding the host architecture, translocation machinery, and associated substrate recognition through experimentation, and integrating diverse computational analyses of many proteins that, where possible, are derived from different species within the same genus.
Knowledge of the subcellular localization of proteins can provide important insights into protein function and thus is particularly useful in the annotation of genomes and the identification of candidate proteins having functions of interest. For example, microbial proteins that are secreted outside the cell are expected to perform functions associated with cell-cell communication and competition, hydrolysis of membrane impermeable polymers, or creating extracellular structures that enable cell motility, attachment to surfaces, or passage of materials between cells. The discovery of novel surface-localized proteins is useful for the development of drug targets, identification of microbial biomarkers and factors contributing to host invasion, and discovery of more efficient enzymes for use in bioprocesses associated with the breakdown of membrane-impermeable polymers, such as those released during the processing of plant materials for alternative fuel production. In some instances, unexpected localization of proteins belonging to a well studied functional class can lead to exciting new discoveries of cellular function. For example, the discovery that c-type cytochromes associated with Mn(IV) and Fe(III) reduction were localized to the cell surface of Shewanella oneidensis MR-1  rather than the inner membrane or periplasm where respiratory proteins are typically found, initiated a whole new field of research in extracellular respiratory metabolism.
A wide variety of computational tools have been developed as a rapid, inexpensive means to predict protein localization using only amino acid sequence information. New tools with improved accuracy or specificity continue to be developed, making it difficult to decide which one(s) to use for genome-wide prediction of protein locations. The primary improvements to predictive accuracy center on the identification of the substrates of the Sec inner membrane export system, which is responsible for translocation of the majority of extra-cytoplasmic proteins across the cytoplasmic membrane in bacteria, and the Tat inner membrane export system, which translocates a smaller number of proteins in a pre-folded state. However, bacteria with dual membranes also encode additional machinery for exporting proteins from the cytoplasm, inserting them in the outer membrane, or secreting them beyond the outer membrane. The protein substrates of these systems carry N- or C-terminal signal peptides that are distinct from those recognized by Sec and Tat, or lack them altogether, thus requiring the application of alternative computational tools or approaches to identify them. Consequently, prediction of protein localization at the genome scale requires combining multiple tools/methods to account for substrates of both the common export systems, such as Sec, and the less frequently used export or secretion systems.
Prior to applying available bioinformatics tools to predict protein localization, it is important to first establish what types of subcellular compartments are present in the organism of interest. The information can then be used to develop a strain-specific strategy for predicting protein localization at the genome scale. Electron microscopy is a useful resource for determining the compartmental organization of the host, but is limited to detection of structures present under the conditions used to generate the sample. When supplemented by information garnered from genome annotations, this limitation can be overcome. In the sequenced shewanellae, manual curation of the genome annotation suggested that 1) most of the strains harbor at least one bacteriophage within their genomes, some of which have been observed as distinct entities in stressed cultures [4, 5] and 2) under selected growth conditions S. benthica and S. putrefaciens strains CN-32 and W3-18-1 will produce cytoplasmic microcompartments that house specific enzymes and associated reactions that benefit from the resulting secluded environment. These observations and sequence-based predictions should be taken into account when predicting protein localization. Bacteriophages encode viral structural proteins that are not components of the cell and are therefore not appropriate targets for predicting subcellular localization. The genes that encode these structural proteins are frequently co-localized in operons and can often be identified through BLAST analysis against domains/proteins stored in the ACLAME database. Proteins likely to be encapsulated in microcompartments, on the other hand, can be identified by searching for proteins that exist only in organisms encoding microcompartment structural proteins (identified by hits to pfam00936) and that are frequently encoded in the same neighborhood as them.
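The neighborhood part of this search can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the gene order and per-gene domain hits are supplied by the user, the function name and the window size of five genes are hypothetical choices, and the second criterion from the text (presence only in organisms that encode shell proteins) is not implemented here.

```python
from typing import Dict, List, Set

BMC_SHELL_DOMAIN = "pfam00936"  # structural-protein domain named in the text


def genes_near_shell_proteins(
    gene_order: List[str],
    domains: Dict[str, Set[str]],
    window: int = 5,  # assumed neighborhood size, not taken from the article
) -> Set[str]:
    """Return genes lying within `window` positions of a pfam00936 hit.

    gene_order: gene identifiers in chromosomal order (one replicon).
    domains:    gene identifier -> set of domain accessions found in it.
    """
    shell_idx = [
        i for i, gene in enumerate(gene_order)
        if BMC_SHELL_DOMAIN in domains.get(gene, set())
    ]
    nearby: Set[str] = set()
    for i in shell_idx:
        lo = max(0, i - window)
        hi = min(len(gene_order), i + window + 1)
        nearby.update(gene_order[lo:hi])
    # The shell proteins themselves are structural, not candidate cargo.
    return nearby - {gene_order[i] for i in shell_idx}
```

Candidates returned this way would still need the cross-genome check described above before being treated as likely microcompartment contents.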
Table 1. Core components and associated domains of gram negative export and sorting systems
Translocation of unfolded proteins across the inner membrane
Signal peptides cleaved by LspA tend to be shorter than those cleaved by LepB
Twin arginine (Tat)
Translocation of folded proteins across the inner membrane
Respiratory proteins that require cytoplasmic enzymes to covalently attach metal cofactors (e.g. iron-sulfur, copper, molybdopterin) are expected substrates
Translocation of phage endolysin across the inner membrane
Numerous, but Genus-specific
Encoded near endolysin in double-strand phage
Insertion of lipoproteins in the outer membrane
beta and gamma Proteobacteria also have LolA having TIGR00547 and pfam03548 domains
Insertion of beta barrel proteins in the outer membrane
With the exception of proteins having large periplasmic domains, expect a genus-specific C-terminal sorting motif
In addition to these export and sorting systems, gram negative bacteria may also encode protein secretion systems, named T1SS-T8SS, that translocate proteins to sites beyond the outer membrane. Secretion systems are often poorly annotated by automated pipelines because certain components of different classes of secretion systems (e.g. T2SS and T4SS components) have significant sequence similarity to one another, while others that belong to the same class and are functionally equivalent have little similarity to one another (e.g. pilin proteins). In addition, many secretion systems have not yet been characterized and/or informative domains that detect their signature components have yet to be defined and deposited in public databases. Fortunately, the genes encoding the key components of these systems are typically co-localized on the genome, and thus one can often use genome context analysis to readily identify their constituents and assign them to appropriate secretion classes.
Table 2. Domains that Identify Secretins and Ushers in Shewanella
Short Model Descriptor
Predicted Localization in Shewanella
Outer membrane efflux protein
type I secretion outer membrane protein, TolC family
pilus (MSHA type) biogenesis protein MshL
Secretin N-terminal domain
type IV pilus secretin (or competence protein) PilQ
Bacterial type II and III secretion system protein
GspD, PilQ, MshL, YscC, RcpA, SspD
Bacterial type II/III secretion system short domain
GspD, PilQ, YscC
Flagellar L-ring protein
Conjugal transfer protein
YadA-like C-terminal region
OM | extra
OM | extra
Secretin and TonB N terminus short domain
type III secretion outer membrane pore, YscC/HrcC family
type-F conjugative transfer system secretin TraK
type VI secretion lipoprotein, VC_A0113 family
Fimbrial Usher protein
Curli production assembly/transport component CsgG
While suitable for detecting many of the secretion systems, the domains listed in Table 2 were not able to detect all of the predicted outer membrane protein translocases in the sequenced shewanellae, requiring that other approaches be taken to identify them. For example, protein localization predictions (described below) and comparative genome context analysis can be used to identify commonly occurring genomic loci that encode putative extracellular proteins along with putative outer membrane proteins or lipoproteins. Other types of functional evidence (e.g. domain content, sequence similarity, and literature searches for experimental data on similar proteins) can then be gathered and reviewed for further clues that are indicative of protein secretion machinery. This approach led to the discovery of a conserved five gene locus in two Shewanella that includes proteins (previously annotated as hypothetical) with similarity to the recently identified components of the Fap amyloid fiber.
T5SS systems, in which the secretin and extracellular function are encoded in the same protein, were particularly difficult to identify confidently since the channel forming domain of these systems is highly variable in sequence and currently only detectable by two domains, PF03797 and PF03895. A review of the literature revealed a new T5SS subclass (T5dSS) that is present in all of the sequenced Shewanella and lacks these domains, instead having C-terminal domains (PF07244 and PF01103) that are characteristic of the BamA component of the outer membrane protein assembly complex and an N-terminal patatin domain, which is frequently found in extracellular proteins. The orthologous Shewanella proteins were all predicted to have a Sec signal peptide by SignalP and to reside in the outer membrane by Bomp, but were predicted to localize extracellularly by PsortB and Subloc or to a mixture of outer membrane and extracellular environment by Cello and SosuiGramN. Phobius also detected a signal peptide, but suggested that a single transmembrane span remains at the C-terminus. This region matches the TIGR03501 gamma proteobacterial enzyme C-terminal transmembrane domain, an extracellular location informative domain that is predicted to be proteolytically removed prior to protein secretion (Dan Haft, personal communication). These observations suggest that additional novel T5SS can potentially be identified by searching for proteins with similar mixed evidence of location. Another feature to look out for is the occurrence of exceptionally long Sec leaders known to occur in some T5SS proteins [19, 20]. Since the length of such a leader may preclude its detection by computational tools designed to detect signal peptides (see below), manual inspection of candidate dual domain T5SS translocases for Sec leaders may be necessary.
Table 3. Characteristics of signal peptidases and target signal peptides
Model Signal Peptidase
Merops Family & domains
Signal peptide domain
PilD PulO GspO
pfam02501 pfam03934 pfam08334 pfam12019
Signal peptides similar, except ones in IVb pili are longer (~25 aa) than others (~7 aa), GspK, PilX, PilW are not detected by pfam07963
type IVa pili
type IVb pili
class Ia & IIa-b bacteriocin, microcins
pfam01721 pfam10439 pfam10439
The signal peptidase activity is encoded in the permease component of the T1SS system that exports the bacteriocin
T4bSS - IncF
mature TraA is ~68 aa in length with two TM spans that circularizes
T4bSS - IncH
Substrates have Sec signal peptide that is cleaved by LepB
T4bSS - IncJ
T4bSS - IncP
Because the detection of signal peptides is an important step in localization prediction, errors in prediction of the 5’ end of a gene can displace or truncate N-terminal signal peptides and thus impact the accuracy of localization predictions. Significant improvements have been made in ORF calling algorithms since the advent of whole genome sequencing. The gene models for genomes produced with the earlier generation of ORF calling algorithms can therefore be readily improved by comparing the output of the newer algorithms to the models in the original GenBank deposit, or by simply using the newer gene model predictions. The output of several of these newer algorithms (Glimmer v. 3, Prodigal v. 2, GeneMarkHMM-2.6r, and GeneMark-2.5m) is pre-computed and available to the research community via FTP from NCBI RefSeq (ftp://ftp.ncbi.nih.gov/genomes/Bacteria).
Another means to improve the gene model is to map the termini of transposons, insertion sequences, and other mobile elements in the genome as we reported previously for S. oneidensis MR-1 . This task is not routinely part of the automated genome annotation process and results can reveal that seemingly intact genes are truncated at their 5’ end or interrupted and hence localization predictions can be erroneous. Identification of mobile elements is facilitated by the use of resources like ISfinder  and ACLAME  that provide information regarding the sites targeted by and characteristics of the termini of insertion elements and prophage, respectively. Programmed recoding of genes, whereby genes are translated by non-standard rules (e.g., programmed ribosomal frameshifting, translational bypassing, and utilization of alternative tRNAs to decode stop codons as an amino acid) can also be missed during automated annotations, sometimes even resulting in their erroneous annotation as pseudogenes. The Recode database (http://recode.ucc.ie/)  has compiled numerous examples of recoded genes and thus provides a useful resource for identifying genes likely to be subject to recoding.
Comparative analysis of the protein size, domain content, and localization predictions among orthologous proteins can also prove useful for identifying errors in gene models. Inconsistencies in these values among orthologous Shewanella proteins could often be eliminated by adjusting gene start/stop positions or membership within a predicted orthologous group. In some cases, inconsistencies suggested that one or more members of the group possessed longer signal peptides than detectable by programs such as SignalP or LipoP or that a proposed signal peptide was more likely an uncleaved N-terminal transmembrane domain. As mentioned earlier, unusually long leaders would be expected in some T5SS autotransporters and the secreted component of T5SS two partner secretion systems since some members of this class have signal peptides that are preceded by an additional charged (n-region) and hydrophobic domain (h-region) .
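As a rough illustration of this kind of comparison, the following Python sketch flags members of one ortholog group whose length or predicted location departs from the rest of the group. The function name, the input layout, and the 20% length tolerance are assumptions made for the example; the article does not specify thresholds, and flagged members are only candidates for manual review of start/stop positions or group membership.

```python
from collections import Counter
from statistics import median
from typing import Dict, List, Tuple


def flag_inconsistent_members(
    group: Dict[str, Tuple[int, str]],
    length_tolerance: float = 0.2,  # assumed cutoff, not from the article
) -> List[str]:
    """Return ids of ortholog group members that look inconsistent.

    group maps protein id -> (length in amino acids, predicted location).
    A member is flagged when its length differs from the group median by
    more than `length_tolerance` (as a fraction of the median) or when its
    predicted location differs from the most common one in the group.
    """
    med = median(length for length, _ in group.values())
    majority_loc = Counter(loc for _, loc in group.values()).most_common(1)[0][0]
    flagged = []
    for pid, (length, loc) in sorted(group.items()):
        length_off = abs(length - med) > length_tolerance * med
        location_off = loc != majority_loc
        if length_off or location_off:
            flagged.append(pid)
    return flagged
```

A member flagged only for length may point to a mis-called start codon, whereas one flagged only for location may carry a signal peptide that the predictor missed or may not belong in the group at all.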
Proteomic data can prove especially useful for improving the gene model, but there are several caveats to their use in validation of gene models that one should be aware of. Trypsin, which specifically cleaves proteins C-terminal to arginine (R) or lysine (K) residues, is the most common enzyme used to digest proteins into fragments of suitable number, size, and charge for subsequent sequence identification by gel-free mass spectrometry-based methods for global characterization of proteins. The C-terminus of each peptide generated is expected to be an R or K and the N-terminus should map adjacent to an R or K in the parent protein. In theory, the only peptides with ends that do not match these criteria should result from host-mediated proteolytic processing (e.g. by LepB) of the parent protein prior to its tryptic digestion. Detection of partially tryptic peptides should therefore be indicative of host-mediated post-translational processing of proteins or incorrect assignment of a start codon. However, in practice, partially tryptic peptides can also result from the harsh conditions associated with sample processing, sample fragmentation during ionization, or erroneous peptide identification [43–45]. Therefore, when using proteome data for identifying the N-terminus of mature proteins it is prudent to consider only partially tryptic peptides that, among all peptides detected, are the ones mapping most closely to the N-terminus of the parent protein. Furthermore, the N-termini of these peptides should map to a site that is consistent with predicted protease cleavage sites. The most frequently encountered proteolytic processing event detected in Shewanella was due to cleavage by AmpP or Map (both present in all the Shewanella genomes), which remove the N-terminal methionine when it is adjacent to Pro or a small amino acid (Ala, Ser, Gly, Cys, Thr, Pro, or Val), respectively [46, 47].
In most cases where a partially tryptic peptide did not map to position 2 of the parent protein (AmpP or Map processed), the detected partially tryptic peptides mapped to signal peptide cleavage sites predicted by SignalP or TatP. A notable exception was the long signal peptide (68 amino acids) found in the small subunit of the NiFe hydrogenase, an expected Tat substrate whose cleavage was not recognized by TatP (except in 1 out of 17 strains having this protein) but for which validating partially tryptic peptides were detected in 4 different strains of Shewanella (see Additional file 1) (M. Romine, unpublished results).
Global analyses of cellular proteomes by mass spectrometry use the protein sequences deduced from the genomic sequence for peptide matching, and thus peptides that map outside of the defined gene termini go undetected. Therefore, searches of MS-MS spectra against protein sequences derived from translations between all stop codons (stop-to-stop databases) or between each stop codon and the furthest upstream start codon (start-to-stop databases) have also been used to increase the number of identifiable peptides in hopes of validating earlier start sites or missed open reading frames. However, non-standard start codons, such as GTG and TTG, are frequently used in bacteria and archaea, but would not be translated as methionine in stop-to-stop in-silico translations. Therefore, N-terminal peptides produced from proteins whose translation is initiated at alternative start codons would still go undetected, and consequently the returns from such an effort are diminished. Furthermore, since these databases are significantly larger, the chance of erroneous peptide matching is significantly increased, which warrants manually evaluating each peptide that maps outside a pre-defined open reading frame, especially when the peptide is infrequently detected in the samples analyzed.
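The peptide-end criteria described above lend themselves to a short worked example. The Python sketch below classifies a mapped peptide by its termini, checks whether N-terminal methionine removal would be expected, and applies the "most N-terminal peptide" precaution. The function names and the coordinate convention (0-based, end-exclusive) are assumptions for illustration, and the sketch deliberately ignores refinements such as reduced trypsin cleavage before proline.

```python
from typing import List, Optional, Tuple

# Residues at position 2 that the text lists as permitting Met removal.
SMALL_OR_PRO = set("ASGCTPV")


def classify_peptide(protein: str, start: int, end: int) -> str:
    """Classify protein[start:end] by whether its ends are tryptic sites."""
    # N-terminus is tryptic if preceded by K/R (or it is the protein start).
    n_tryptic = start == 0 or protein[start - 1] in "KR"
    # C-terminus is tryptic if it ends in K/R (or it is the protein end).
    c_tryptic = end == len(protein) or protein[end - 1] in "KR"
    if n_tryptic and c_tryptic:
        return "fully tryptic"
    if n_tryptic or c_tryptic:
        return "partially tryptic"
    return "non-tryptic"


def met_removal_expected(protein: str) -> bool:
    """True if residue 2 is one of those listed for AmpP/Map processing."""
    return len(protein) > 1 and protein[0] == "M" and protein[1] in SMALL_OR_PRO


def candidate_mature_start(
    protein: str, peptides: List[Tuple[int, int]]
) -> Optional[int]:
    """Return the start of the most N-terminal detected peptide when its
    N-terminus is not a tryptic site; otherwise return None."""
    if not peptides:
        return None
    start, _ = min(peptides)
    n_tryptic = start == 0 or protein[start - 1] in "KR"
    return None if n_tryptic else start
```

For a hypothetical protein `MSKTAYRLLGK` with detected peptides at `(1, 3)` and `(3, 7)`, the first peptide is partially tryptic and begins at the second residue, which agrees with `met_removal_expected` returning `True`. A candidate start returned this way would still need to coincide with a predicted cleavage site before being accepted.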
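The start codon caveat can be made concrete with a small Python sketch of stop-to-stop translation. This is a simplified illustration, not the procedure used in the cited studies: it handles only the forward strand and upper-case A/C/G/T input, the function names are hypothetical, and the minimum segment length is an arbitrary assumption.

```python
from itertools import product
from typing import List

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
START_CODONS = {"ATG", "GTG", "TTG"}  # start codons named in the text plus ATG


def translate(dna: str) -> str:
    """Translate codon by codon; a trailing partial codon is ignored."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def stop_to_stop(dna: str, min_len: int = 3) -> List[str]:
    """Forward-strand stop-to-stop segments from the three reading frames."""
    segments = []
    for frame in range(3):
        for segment in translate(dna[frame:]).split("*"):
            if len(segment) >= min_len:
                segments.append(segment)
    return segments


def as_initiated(orf_dna: str) -> str:
    """Translate an ORF as the cell would: any start codon gives Met."""
    protein = translate(orf_dna)
    if orf_dna[:3] in START_CODONS and protein:
        return "M" + protein[1:]
    return protein
```

For the hypothetical sequence `TAAGTGAAACGTTAA`, `stop_to_stop` yields the segment `VKR`, while `as_initiated("GTGAAACGT")` gives `MKR`. A spectrum from the true N-terminal peptide would match only the latter, which is why a stop-to-stop database cannot confirm starts at GTG or TTG.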
Table 4. Computational Tools used in Studies of Shewanella
primarily prediction of Sec signal peptides that are cleaved by LspA but also provides prediction of inner membrane or cytoplasmic localization as well as LepB cleavage
does not detect Tat substrates
prediction of Sec signal peptides that are cleaved by LspA
does not detect Tat substrates
prediction of Sec signal peptides that are cleaved by LepB
does not detect Tat substrates
prediction of alpha helices in inner membrane proteins, distinguishing N-terminal TM from signal peptides
prediction of alpha helices in inner membrane proteins
Signal peptides are often erroneously counted as TM spans
prediction of beta barrel spans in outer membrane proteins
prediction of localization (Cyt, IM, Peri, OM, Extra)
does not predict lipoprotein location in OM or IM
prediction of localization (Cyt, IM, Peri, OM, Extra) in gram negatives only
does not predict lipoprotein location in OM or IM, no scores given
prediction of localization (Cyt, Peri, Extra)
not appropriate for membrane bound proteins
prediction of localization (Cyt, IM, Peri, OM, Extra)
does not predict lipoprotein location in OM or IM, many proteins assigned
prediction of Tat and Sec signal peptides
does not detect lipoproteins that have Tat signal peptide; some very long signal peptides not detected
obtained from Dr. Pohlschroder 1
Prediction of Tat signal peptides
does not require the presence of an adjacent LepB or LspA site or that it occurs at the protein N-terminus (though this can be advantageous when the start codon prediction is wrong)
Table 5. Tool Performance Across 19 Proteins in Each of the 1990 Core Ortholog Groups
Groups with no match
Disagree with Curation
Groups with match
Disagree with Curation
Groups with mixed predictions
Curated as having Match
Sig Pep cleaved by LspA
Sig Pep cleaved by LepB
Sig Pep cleaved by LepB
Sig Pep recognized by TAT
Inner membrane protein
Inner membrane protein
Outer membrane protein
Table 6. Performance of Localization Predictors Across 19 Proteins in Each of the 1990 Core Ortholog Groups
The prediction schema is initiated with the curation of secretion systems, whose components often have distinct signal peptides that are not recognized by the predictors listed or that are secreted during assembly of the machinery. In addition, the structural components of bacteriophage are identified at this stage, as they would otherwise often be erroneously predicted to localize to the cell envelope. Next, automated searches for signal peptides are conducted, working first on the less common signal peptides associated with lipoproteins and Tat substrates, followed by searches for transmembrane spans and Sec signal peptides. A comparison of the latter two results assisted in distinguishing signal peptides from transmembrane spans, but additional information (e.g., expected location of a protein based on annotation, detection of peptides that map at or near the N-terminus) was generally needed for deciding whether the N-terminus was removed or retained to anchor the protein in the membrane. Domain content and functional annotations were used throughout this decision tree to increase the confidence and accuracy of the predictions. Location informative domains were identified by searching for Pfam and TIGRfam domains that consistently occurred only in proteins predicted to localize to the same site and/or had a known association with proteins found in specific subcellular or extracellular compartments. In addition, results of searches for a C-terminal outer membrane localization signature were used to enhance outer membrane location predictions, recognizing that proteins having large periplasmic domains (e.g. TolC family proteins) are expected to lack these signatures or contain them at internal sites instead. This species-specific C-terminal signature consists of alternating hydrophobic residues at positions 5, 7, and 9 from the C-terminus and a Phe or Tyr at the terminus [51, 52].
Since shewanellae have numerous TonB receptors (620 in 19 genomes) we used their C-termini to develop a Shewanella-specific signature that could be used to search for additional substrates of this system.
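Three of the checks described above (location informative domains, the C-terminal signature, and tallying C-termini to derive a genus-specific signature) can be sketched in Python as follows. This is a minimal illustration under explicit assumptions: the terminal residue is counted as position 1, the hydrophobic residue set is a guess rather than the set used by the author, the minimum protein count is arbitrary, and all function names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Set, Tuple

HYDROPHOBIC = set("AVLIMFWY")  # assumed residue class, not from the article


def location_informative_domains(
    proteins: Iterable[Tuple[Set[str], str]],
    min_proteins: int = 3,  # assumed support threshold
) -> Dict[str, str]:
    """Map each domain to a location if every protein carrying it shares
    that predicted location. Input: (domain accessions, location) pairs."""
    seen: Dict[str, List[str]] = defaultdict(list)
    for domain_set, location in proteins:
        for domain in domain_set:
            seen[domain].append(location)
    return {
        domain: locs[0]
        for domain, locs in seen.items()
        if len(locs) >= min_proteins and len(set(locs)) == 1
    }


def has_om_cterm_signature(seq: str) -> bool:
    """Phe/Tyr at the C-terminus (position 1) and hydrophobic residues at
    positions 5, 7 and 9 counted back from the C-terminus."""
    if len(seq) < 9:
        return False
    terminal_ok = seq[-1] in "FY"
    alternating_ok = all(seq[-pos] in HYDROPHOBIC for pos in (5, 7, 9))
    return terminal_ok and alternating_ok


def cterm_position_counts(seqs: Iterable[str], n: int = 10) -> Dict[int, Counter]:
    """Tally residues at each of the last n positions (1 = last residue),
    e.g. across TonB receptor C-termini, as raw material for a signature."""
    counts = {pos: Counter() for pos in range(1, n + 1)}
    for seq in seqs:
        for pos in range(1, min(n, len(seq)) + 1):
            counts[pos][seq[-pos]] += 1
    return counts
```

Because proteins with large periplasmic domains may lack the signature or carry it internally, a negative result from `has_om_cterm_signature` should not by itself exclude an outer membrane location.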
Characterization of the protein content of subcellular fractions by mass spectrometry is also a useful type of evidence for assessing protein localization. This information is particularly useful for identifying proteins that are tethered to the membrane via protein-protein or protein-lipid interactions or for detecting condition-specific changes in protein localization, neither of which can be revealed by analyses of protein sequence content alone. However, results must be interpreted with caution, as there can be significant cross-contamination between subcellular fractions, which may vary depending on the protocol used to fractionate and analyze the protein content or the cell type being studied. In reviewing the data from LC/MS-MS analysis of S. oneidensis MR-1 subcellular fractions prepared with a sarkosyl-based method, we found that the fractions in which the greatest abundance of peptides partitioned were usually consistent with the predicted locations of the parent protein, with the notable exception that many more lipoproteins partitioned to the inner membrane than expected. Sarkosyl was chosen over other detergents because of its compatibility with high-throughput MS-based proteomic analysis and the reduced time and labor required to conduct the cellular fractionation. While this detergent has been shown to preferentially solubilize inner membrane proteins, thus allowing efficient separation of inner and outer membranes, it is possible that it also solubilizes loosely associated outer membrane lipoproteins.
Alternatively, the predicted localization of these proteins is incorrect. The rules for predicting lipoprotein sorting are based on extensive research on Escherichia coli lipoproteins and suggest that lipoproteins with an aspartic acid (D) at position +2 (D+2) of the mature protein are retained in the inner membrane while the remainder are attached to the outer membrane by Lol. However, numerous exceptions have been found in other organisms [56–58], suggesting that these rules likely only apply to enterobacteria. Indeed, our analysis of over 3000 predicted lipoproteins in this genus revealed a lack of consistency in occurrence of D+2 in orthologs and that only 5 out of 112 efflux pump membrane fusion lipoproteins, which are expected to be anchored to the inner membrane, have D+2. Furthermore, like selected other bacteria [59–62], Shewanella can also localize lipoproteins to the outer face of the outer membrane and thus must use alternative sorting signals. While it is known that the T2aSS machinery is responsible for their surface translocation in Shewanella [32, 63, 64], the characteristics of the sorting signals used are currently unknown. The large number of putative lipoproteins identified in this genus and the combined knowledge available regarding their localization (experimentally validated as well as predicted based on function or domain content), however, provided a more sensitive means to search for conserved sequences that are characteristic of surface lipoproteins. In Shewanella such analyses suggest that enrichment in glycine and serine residues coincides with predicted surface localization (Romine, unpublished results). These same amino acids have recently been reported to be enriched in extracellular proteins and are commonly found in other sorting signals used for secretion of proteins [66, 67].
While the methodological process described here was derived from studies of a genus that shares many structural and functional features with the organisms from which much of our current understanding of translocation has been developed, the overall strategy described for predicting protein localization should prove useful for studying other microbes as well. Knowledge gathered regarding distinctive architectural features or unusual translocation machinery content (e.g. missing components, duplications) prior to applying automated sequence analysis methods can significantly impact the choice of computational tools to use and subsequent interpretation of the results. Proteomic analyses can be especially useful for confirming predictions or discovering novel sorting signals, while less costly computational localization predictions, conducted at the genome scale, can reveal novel characteristics of an organism that might not be readily derived from functional annotations based solely on sequence similarity.
Subcellular localization and ortholog grouping predictions (Additional file 2) and associated protein sequences (Additional file 3) that were used for the calculations presented in Tables 5 and 6 are provided in the supplementary material so that interested parties can compare their own prediction strategies with those used by the author. However, it should be noted that updating the gene models and ortholog memberships is an ongoing process, with the most current versions available at http://shewanella-knowledgebase.org:8080/Shewanella/. Updated localization predictions are available through the author.
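For reference, the E. coli +2 rule discussed above, and the kind of tally used to test it against a set of lipoproteins with an expected location, can be written as a short Python sketch. The function names are hypothetical, and the input is assumed to be the mature sequence beginning at the lipidated cysteine (+1), that is, after the signal peptide has been removed.

```python
from typing import Iterable


def ecoli_plus2_rule(mature: str) -> str:
    """Apply the E. coli-derived +2 rule to a mature lipoprotein sequence.

    `mature` must start at the lipidated Cys (+1). Asp at +2 predicts
    inner membrane retention; anything else predicts outer membrane.
    As the text notes, this rule does not appear to hold in Shewanella.
    """
    if len(mature) < 2 or mature[0] != "C":
        raise ValueError("mature lipoprotein must start with Cys and have a +2 residue")
    return "inner membrane" if mature[1] == "D" else "outer membrane"


def count_d_plus2(mature_seqs: Iterable[str]) -> int:
    """Count sequences carrying Asp at +2 (e.g. among proteins that are
    expected to be inner membrane anchored)."""
    return sum(1 for seq in mature_seqs if ecoli_plus2_rule(seq) == "inner membrane")
```

Applied to a family expected to be inner membrane anchored, a low count from `count_d_plus2` (such as the 5 of 112 membrane fusion lipoproteins reported above) indicates that the rule is a poor predictor for that organism.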
This research was supported by the U.S. Department of Energy (DOE), Office of Biological and Environmental Research (BER), as part of BER’s Genomic Science Program (GSP). This contribution originates from the GSP Foundational Scientific Focus Area (FSFA) at the Pacific Northwest National Laboratory (PNNL). The Pacific Northwest National Laboratory is operated for the DOE by Battelle Memorial Institute under Contract DE-AC05-76RLO 1830. I would like to thank Tatiana Karpinets, Guru Kora, Denise Schmoyer, and Michael Lueze for developing the ortholog and genome editors that I use for curating gene models and ortholog groups and also Mustafa Syed who conducted some of the automated localization predictions. In addition, I would like to thank Margrethe Serres for conducting domain analyses and assisting in curating the functional predictions.
This article has been published as part of BMC Genomics Volume 12 Supplement 1, 2011: Validation methods for functional genome annotation. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/12?issue=S1.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.