Skip to main content

Comparison of theoretical proteomes: Identification of COGs with conserved and variable pI within the multimodal pI distribution



Theoretical proteome analysis, generated by plotting theoretical isoelectric points (pI) against molecular masses of all proteins encoded by the genome show a multimodal distribution for pI. This multimodal distribution is an effect of allowed combinations of the charged amino acids, and not due to evolutionary causes. The variation in this distribution can be correlated to the organisms ecological niche. Contributions to this variation maybe mapped to individual proteins by studying the variation in pI of orthologs across microorganism genomes.


The distribution of ortholog pI values showed trimodal distributions for all prokaryotic genomes analyzed, similar to whole proteome plots. Pairwise analysis of pI variation show that a few COGs are conserved within, but most vary between, the acidic and basic regions of the distribution, while molecular mass is more highly conserved. At the level of functional grouping of orthologs, five groups vary significantly from the population of orthologs, which is attributed to either conservation at the level of sequences or a bias for either positively or negatively charged residues contributing to the function. Individual COGs conserved in both the acidic and basic regions of the trimodal distribution are identified, and orthologs that best represent the variation in levels of the acidic and basic regions are listed.


The analysis of pI distribution by using orthologs provides a basis for resolution of theoretical proteome comparison at the level of individual proteins. Orthologs identified that significantly vary between the major acidic and basic regions maybe used as representative of the variation of the entire proteome.


A protein's Isoelectric point (pI) – the pH at which a protein has no net charge – is the basis for its isolation using isoelectric focussing and along with Molecular Mass (Mr) is exploited in two dimensional gel electrophoresis used to seperate a cell's protein content. Bjellquist [1, 2] has shown that a the pI of a denatured linear protein can be calculated with high accuracy using the pK values of the amino acids responsible for charge. Using these calculations, it is posible to create an image of an organisms theoretical proteome, by plotting the theoretical pI against their theoretical Mr. The distribution of pI in these plots have a multimodal distribution. Early results on bacteria demonstrated a bimodal distribution with peaks centered around pH 5.5 and pH 9 [3]. This bimodality was explained as being caused by the fact that as proteins are least soluble at their pI, they have evolved to have pI's away from neutral pH – which was assumed to be the intracellular pH. Schwartz et al [4], further showed the presence of a trimodal pI distribution in Eukaryotes, and observed a correlation of pI to intracellular localisation. Cytoplasmic, nuclear and membrane proteins seemed to lie largely in the acidic, neutral and basic portions of the trimodal distribution, respectively.

Exhaustive work on virtual proteomes has been completed recently by two groups. Weiller et al [5] have proved that the multimodal distribution of protein pI is present in randomly generated sequences, and is a function of allowed combinations of amino-acid pKa values, rather than a cause of sequence evolution. A trimodal distribution is present in the virtual proteome of most organisms, with minima at 7.4 and 8.0. Knight et al [6] have shown that the variation in proteomes though the trimodal distribution is largely maintained – is influenced by the ecological niche of the organism.

Environmental influences of the proteome are known[7]. Amino acid usage is influenced by the G+C content of a genome [8]. Acidic residues predominate over basic residues in halophilic bacteria [9, 10], and compositional properties are further distinguished in thermophilic and mesophilic bacteria – with a preference for salt-bridges (residues with opposite charges) and long-chain hydrophobic residues in the former for increased stability [11].

All studies so far have used the gross properties of the proteome, or broad functional groups (e.g membrane proteins), and not attempted to resolve the multimodal distribution on the basis of individual proteins. This analysis could provide answers to the apparant conflict that though the pI multimodal distribution is caused by the properties of amino acids and not evolutionary factors, environmental influences induce variation in the sizes of each cluster in the distribution. By mapping the variation of pI in orthologs, one can in principal identify proteins whose pI is conserved as well as those whose pI does not seem to be responsible for its function, and whose variation maybe used as markers for an organism's environment.

The COG (Cluster of Orthologous Groups) database [12, 13] lists orthologs present across completed genomes. In this study, we consider the subset of an organism's proteome, as specified by the COG database, which contain orthologs present in other genomes – providing a basis to study the variation of pI of individual proteins across genomes.

Results and discussion

Comparison of virtual proteomes among different Bacteria organisms

The predicted proteomes, using values of Mr and pI calculated from the protein ortholog sequence, all displayed a trimodal distribution, with minima at 7.4 and 8.1. For convenience, the three major peaks demarkated by these minima are referred to as the acidic cluster (pI less than 7.4), "neutral" cluster (pI between 7.4 and 8.1) and basic cluster (pI greater than 8.1). This observation of a trimodal distribution is consistent with earlier results calculated from the complete genome [46], showing that orthalogs maybe used as representative samples of the complete genome. Figure 1 shows representative plots of proteomes using whole genomes and orthalog sub-sets for Escherichia coli K12 and Helicobacter pylori, Buchenara and Halobacterium. As only orthologs are used, it is possible to compare the pI of two genomes, by using a scatter plot. Diagonal points show invariant pI between the organisms, while a shift in pI is visible as an off-diagonal point. As the trimodal distribution is dominated by the large acidic and basic clusters, A shift in the pI from one cluster to the other is visable on the upper left and lower right quadrants of the plot. Figure 2 shows the pairwise comparison of the pI and Mr for the proteomes of two variants of E. coli (K-12 and 0157), between E. coli K12 and H. pylori which have different environmental pI – and extremophiles Buchenara and Halobacterium. Closely related variants of the same species do not have much variation in both pI and Mr, however the effect of a changed pH environment – specifically the acidic stomach for H. pylori and the basic intestine for E. coli – causes a large proportion of orthologs to change pI to compensate for the external pH change. Organisms that have extreme proteome pI distributions are Halobacterium – which has a large acidic cluster and Buchnera – a large negative cluster. As expected, most orthologs shared by the two organisms show a shift in their pI from one cluster to the other, however a baseline of pI conservation is maintained, implying that the pI maybe a conserved property for a few orthologs. The Mr is much more highly conserved, with the scatter plot clustered along the diagonal.

Figure 1

Ortholog (left) and whole genome (right) isoelectric point (pI) frequency distributions for (A) Escherichia coli K 12, (B) Helicobacter pylori (C) Buchenera APS and (D) Halobacterium sps NRC-1

Figure 2

Pairwise comparison of isoelectric point and molecular mass of orthologs. (A) Escherichia coli K12 and 0517; (B) Escherichia coli K12 and Helicobacter pylori (C) Halobacterium sps NCR-1 and Buchnera sps APS

Variation in pI among functional categories of COGS

The COG database is functional classified into eighteen categories [12] (Table 1, shows a listing of these functions, which the single letter code used in the figures). The distribution of pI values across all organisms for each function, summarised by the mean and standard deviation, is shown in Figure 3. The mean values for each function is close to neutral pH, with a deviation spreading across both clusters. Again the use of orthologs provides the means to correlate conservation of pI between pairs of organisms. The mean pairwise-correlation of ortholog pI corresponding to each function was also computed (Figure 4A), along with the corresponding pairwise correlation of ortholog Mr (Figure 4B). We have used Kruskal-Wallis multiple comparison testing [14] to identify groups that significantly deviate from the expected distribution. Five functional groups deviate significantly, three towards the basic cluster and two towards the acidic cluster (Table 1).

Table 1 Variation in pI in functionally classified groups of COGs. The table lists functional groups of COGs as defined in the COG website [22] along with the results of the Dunn Multiple Comparison test. Ri – mean rank for the group, Rt – mean rank for sample representing the complete distribution. Significant values are those where |Rt - Ri| > 467.35 which is calculated for α = 0.2
Figure 3

Frequency distribution for COG mean pI (grey vertical bars). The variation of pI (mean and standard deviation) for functionally classified groups of COGs is overlayed. Functional groups denoted by single letters is expanded in Table 1

Figure 4

Mean correlation of pI (A), Mr (B) and sequence distance (C) for all organism pairs for the functional groups of COGS. Functional groups are denoted by single letters, expanded in Table 1

Membrane proteins are known to have a preference for the basic cluster, caused by the larger proportion of basic charged residues to compensate for the negatively charged membrane bilayer [4, 6]. A functional requirement for either basic or acidic charges could thus influence the protein's pI. Orthologs, by definition, perform the same function in different organisms, and if preference for either basic or acidic charged residues is related to this function, this should be reflected in the protien's pI having a bias for the respective cluster. However, the conservation of pI maybe accidental, especially if the proteins are highly conserved. We have calculated the pairwise distance of all proteins within a COG as a measure of their sequence similarity. The distributions of distance for each functional group is shown in Figure 4C.

Proteins associated with the group "J", involved with translation, ribosomal structure and biogenesis, are dominated by highly conserved ribosomal proteins – and this high level of conservation and intolerance to mutation is reflected in an invariance of pI. This conservation is also reflected in the higher correlation for this group of proteins in their Mr. Other groups which are polarised to either the acidic or basic modes of the pI distribution do not show such a high level of conservation, and pI conservation is dictated by the nature of their function.

Analysis of pI variation for individual COGS

The general function of a group of proteins and their level of conservation may dictate a preference for a specific range of pI, as had been shown earlier for membrane proteins, and for functional groups of COGs in the previous section of this paper. Individual proteins maybe identified that have a preference for either the acidic or basic cluster. The scatter plot of the pI of all ortholog proteins used in this analysis is sorted by COG mean pI and plotted in Figure (5). The allowed trimodal regions are clearly visable. However at both the left and right extremes, the scatter plot show that COGs do exist with a preference for the acidic or basic cluster respectively. We have computed the frequency distribution for each COG in the acidic, neutral and basic clusters. On analysis, no COG is conserved in the neutral cluster, the largest frequency being 0.5 for COGs 1689, 3016, 3783 and 3801, however their frequency of occurance among all organisms is less than ten percent. This is an expected result as the pI distribution is caused by an interplay of acidic and basic residues which acount for the large acidic and basic clusters. The "neutral" cluster is not caused by the absense or balance of charged residues but because allowed pK combinations of charged residues are minimum at pH 7.1 and 8.4 [5].

Figure 5

Isoelectric point distribution of orthologs sorted by mean isoelectric point value. Black dot – mean value for COG, grey bar – standard deviation for COG, red point – pI value of individual genes classified under the COG. The sorted list of COG's used for the X-axis is available as an additional file.

Proteins conserved in the acidic and basic clusters however have a preponderance of acidic and basic residues respectively, which maybe required for their function. A complete list of COGs whose protein pI are conserved in these clusters is listed as an additional file. We have scaled the frequency of conservation by the frequency of occurance, so that only COGs which are maximally represented across the organisms in the dataset are used as markers of each cluster. These are listed in Table 2 and Table 3 respectively. The dominant groups of proteins which seem to require an acidic pI are the amino acid tRNA synthetases. Among those proteins whose pI is highly conserved in the basic cluster are a large number of ribosomal proteins. Although highly conserved and intolerant to mutation, ribosomal proteins interweave with negatively charged RNA to form the ribosome, and being positively charged will be a requirement for strong electrostatic interactions.

Table 2 COGs conserved in the acidic cluster of the multimodal distribution. P(a) and P(b) are the frequency of occurance in the acidic and basic clusters respectively, F is the function group. The COGs are listed with their P(v) which is score of their variability between the acidic and basic clusters of the pI distribution weighted by their frequency of occurance across all genomes in the dataset. pI values of four organisms – the extremophiles Buchnera (environment bacteriocyte), Halobacterium (highly acidic – high salt), and E. coli and H. pylori – intestinal bacteria with different environmental pH.
Table 3 COGs conserved in the Basic cluster of the multimodal distribution. Table headers are the same as in Table 2.

A majority of COGs however show distributions across both the basic and acidic clusters. Knight et al, have shown a correlation to a change in proteome patterns with the organism's ecological niche. Since the proteome always exists in a trimodal distribution, it will only vary from organism to organism in the relative amounts of proteins which are present in each of the three clusters of the pI distribution. We have computed the probability of being in both the acidic and basic clusters weighted by the frequency of occurance in the organisms under consideration in order to identify COGs which are highly represented and show no particular preference for either cluster. This list of COGs, with a joint probability greater than a cutoff value of 0.6 is tabulated in table 4. For reference, the individual values of genes corresponding to the four organisms is also listed, and in a majority of cases show good agreement with the shift in an organism's total theoretical proteome towards either the basic or acidic clusters. Except for some ribosomal proteins, which appear on the list because of their high frequency of occurance, all other COGs are membrane based proteins, which would have direct contact with the external environment. These COGs best represent an organism's shift from expected levels of the acidic and basic clusters of the multimodal distribution, and it is possible that they maybe used as markers to predict an organisms ecological niche, with particular reference to its environmental pH in free living microorganisms.

Table 4 Highly occuring COGs with pI varying across both the acidic and basic clusters. Table headers are the same as in Table 2.

Extrapolating results obtained from using the theoretical proteome must be viewed with caution as predicted pI values are for unfolded proteins obtained from sequence and not the native folded proteins in the cellular microenvironment, which remain unknown. We have resisted attempting to correlate observations of an ortholog's thoeretical pI with its function, unless clearly obvious, for this reason. The theoretical proteome is also generated from the total set of proteins present in an organism's genome, while only a subset maybe expressed at any given time and will vary in response to external stimuli. An organism may also respond to evolutionary pressure such as environmental pH by increasing the number of copies of a charged protein, a ploy frequently adopted in drug resistance. The effect of a shift in the relative levels of the acidic and basic peaks maybe replicated by an increase in copy number of proteins belonging to the relevant cluster. As theoretical proteome studies are not weighted by the copy number of the individual proteins, for lack of relevant data related to the quantity of individual proteins for the entire proteome, these observations are impossible to make.


Analysis of ortholog pI across forty two microorganism genomes which contain reprentatives of free living archea and bacteria are analysed to identify orthologs which are invariant in pI as well as those amenable to changes in pI. Orthologs with a high frequency of occurance and variation in pI are shortlisted, and maybe used as markers in future studies which attempt to map proteome properties to variation in the organisms ecological niche.


Genome sequences and the COG database generated with 43 organisms with sequences and alignments were sourced from NCBI [15]. Sacharomyces cerevisae was removed from the dataset so that it contained only archeae and bacteria which were single-celled organisms. This was intentionally to include only genomes of organisms which would have their membrane directly in contact with the external environment and included a range of organisms with diverse ecological niches.

The isoelectric point is determined from an equation subtracting the sum of all negative charges (of all acidic residues) from the sum of all positive charges (of all basic residues), by varying the pH by bisectional nesting of intervals on the pH scale until the interval is smaller than a given level (0.001), as described in [2]. Fixed values for pK are considered to calculate the isoelectric point, and are the same as used in the molecular weight/isoelectric point program at[16].

The molecular weights and isoelectric point of all the predicted open reading frames from the completed genomes of a number of prokaryotes were generated. All ORFs that are classified by respective COG identities and assigned to one of the 18 functional groups [12] were used in comparitive analysis between genomes.

Correlation of molecular weights and pI were performed, using scripts written in house, for all COGs classified to a particular function for each pair of organisms. Pairwise distances were calculated for all pairs of proteins belonging to the specified COG. Sequences assigned to each COG were aligned with ClustalW [17] using default parameters, and the resulting multiple alignment used to generate a distance matrix with the program PROTDIST, using default parameters (Jones-Taylor-Thornton model, where the distance is scaled in units of the expected fraction of amino acids changed), from the Phylip package [18]. Perl scripts used to automate this protocol were written using BioPerl Modules [19].

Individual molecular weight and isoelectric point along with pairwise distances were stored in a MySQL database v 4.1.9, running on a Fedora Core Linux dual Xeon Server, to enable easy retrieval. Statistical properties on the data (mean, variance and correlation coefficients) were calculated using standard methods after sectioning the data at the levels of COGs and functional groups. Figures plotted in this paper were made using Sigmaplot (Systat Software Inc, Richmond, California, USA).

Statistical significance of functional groups was assessed using non-parametric Kruskal-Wallis tests and Dunn's Multiple Comparison [20] tests (alpha = 0.2), after randomly sampling three hundred sequences from each group, and from the total population to form the sample sets.

To score the COGs both for variation and conservation, probabilities were estimated from frequencies in the dataset. The frequency of occurance of a COG i, F(i), calculated as

Where no is the number of organisms in which the COG is present and nt is the number of organisms in the sample dataset (42).

where i is the cog, j is the cluster – either acidic, neutral or basic, and r the range corresponding to the cluster – 0–7.4 for acidic, 7.4–8.1 for basic, and 7.4–14 for basic. nr is the number of proteins in the given cluster, ni total number of proteins corresponding to i. Pi (j) is the conservation score for the cluster j, and is the joint probabiity of conservation and occurance of the COG i.

We estimate the joint probability of variation between the acidic and basic clusters for a COG i, Pi(v), as the probability of variation into its frequency of occurance calculated as

where na and nb are the number of proteins belonging to the COG i in the acidic and basic cluster of the distribution respectively, ni being the total number of proteins classified under COG i .

Membrane proteins were predicted using TOPPRED [21].


  1. 1.

    Bjellqvist B, Hughes GJ, Pasquali C, Paquet N, Ravier F, Sanchez JC, Frutiger S, Hochstrasser D: The focusing positions of polypeptides in immobilized pH gradients can be predicted from their amino acid sequences. Electrophoresis. 1993, 14 (10): 1023-31. 10.1002/elps.11501401163.

    PubMed  CAS  Article  Google Scholar 

  2. 2.

    Altland K, Becher P, Rossmann U, Bjellqvist B: Isoelectric focusing of basic proteins: the problem of oxidation of cysteines. Electrophoresis. 1988, 9 (9): 474-85. 10.1002/elps.1150090906.

    PubMed  CAS  Article  Google Scholar 

  3. 3.

    Van Bogelen RA, Schiller EE, Thomas JD, Neidhardt FC: Diagnosis of cellular states of microbial organisms using proteomics. Electrophoresis. 1999, 20: 2149-2159. 10.1002/(SICI)1522-2683(19990801)20:11<2149::AID-ELPS2149>3.0.CO;2-N.

    CAS  Article  Google Scholar 

  4. 4.

    Schwartz R, Ting CS, King J: Whole proteome pI values correlate with subcellular localization of proteins for organisms within the three domains of life. Genome Res. 2001, 11: 703-709. 10.1101/gr.GR-1587R.

    PubMed  CAS  Article  Google Scholar 

  5. 5.

    Weiller GF, Caraux G, Sylvester N: The modal distribution of protein isoelectric points reflects amino acid properties rather than sequence evolution. Proteomics. 2004, 4 (4): 943-9. 10.1002/pmic.200200648.

    PubMed  CAS  Article  Google Scholar 

  6. 6.

    Knight CG, Kassen R, Hebestreit H, Rainey PB: Global analysis of predicted proteomes: functional adaptation of physical properties. Proc Natl Acad Sci U S A. 2004, 101 (22): 8390-5. 10.1073/pnas.0307270101.

    PubMed  CAS  PubMed Central  Article  Google Scholar 

  7. 7.

    Brocchieri L: Environmental signatures in proteome properties. Proc Natl Acad Sci. 2004, 101 (22): 8257-8258. 10.1073/pnas.0402797101.

    PubMed  CAS  PubMed Central  Article  Google Scholar 

  8. 8.

    Tekaia F, Yeramian E, Dujon B: Amino acid composition of genomes, lifestyles of organisms, and evolutionary trends: a global picture with correspondence analysis. Gene. 2002, 297 (1–2): 51-60. 10.1016/S0378-1119(02)00871-5.

    PubMed  CAS  Article  Google Scholar 

  9. 9.

    Kennedy SP, Ng WV, Salzberg SL, Hood L, DasSarma S: Understanding the adaptation of Halobacterium species NRC-1 to its extreme environment through computational analysis of its genome sequence. Genome Res. 2001, 11 (10): 1641-50. 10.1101/gr.190201.

    PubMed  CAS  PubMed Central  Article  Google Scholar 

  10. 10.

    Karlin S, Brocchieri L, Trent J, Blaisdell BE, Mrazek J: Heterogeneity of genome and proteome content in bacteria, archaea, and eukaryotes. Theor Popul Biol. 2002, 61 (4): 367-90. 10.1006/tpbi.2002.1606.

    PubMed  Article  Google Scholar 

  11. 11.

    Kumar S, Nussinov R: How do thermophilic proteins deal with heat?. Cell Mol Life Sci. 2001, 58 (9): 1216-33.

    PubMed  CAS  Article  Google Scholar 

  12. 12.

    Tatusov RL, Koonin EV, Lipman DJ: A Genomics Perspactive on Protein Families. Science. 1997, 278 (5338): 631-7. 10.1126/science.278.5338.631.

    PubMed  CAS  Article  Google Scholar 

  13. 13.

    Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Smirnov S, Sverdlov AV, Vasudevan S, Wolf YI, Yin JJ, Natale DA: The COG database: an updated version includes eukaryotes. BMC Bioinformatics. 2003, 4 (1): 41-10.1186/1471-2105-4-41.

    PubMed  PubMed Central  Article  Google Scholar 

  14. 14.

    Kruskal WH, Wallis WA: Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association. 1952, 47: 583-621. Addendum 1953, 48:907–911

    Article  Google Scholar 

  15. 15.

    FTP sites for COG and Genome data used in this analysis. [] and []

  16. 16.

    Compute pI/MW tool at ExPASy proteomics server of the Swiss Institute of Bioinformatics. []

  17. 17.

    Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research. 1994, 22 (22): 4673-4680.

    PubMed  CAS  PubMed Central  Article  Google Scholar 

  18. 18.

    Felsenstein J: PHYLIP: Phylogeny inference package. Cladistics. 1989, 5: 164-166.

    Google Scholar 

  19. 19.

    Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, Dagdigian C, Fuellen G, Gilbert JGR, Korf I, Lapp H, Lehväslaiho H, Matsalla C, Mungall CJ, Osborne BI, Pocock MR, Schattner P, Senger M, Stein LD, Stupka E, Wilkinson MD, Birney E: The Bioperl Toolkit: Perl Modules for the Life Sciences. Genome Research. 2002, 12 (10): 1611-1618. 10.1101/gr.361602.

    PubMed  CAS  PubMed Central  Article  Google Scholar 

  20. 20.

    Dunn DJ: Multiple Comparisons Using Rank Sums. Technometrics. 1964, 5: 241-252.

    Article  Google Scholar 

  21. 21.

    Claros MG, von Heijne G: TopPred II: an improved software for membrane protein structure predictions. Comput Appl Biosci. 1994

    Google Scholar 

  22. 22.

    COG help file showing functional categories of orthologs. []

Download references


The authors thank Department of Biotechnology, Government of India for financial support.

Author information



Corresponding author

Correspondence to Alok Bhattacharya.

Additional information

Authors' contributions

Mr Soumyadeep Nandi did all computation and database work, Mr. Nipun Mehra started this work, and generated preliminary results, Dr. Andrew Lynn guided the implementation of the project and wrote the manuscript, Prof. Alok Bhattacharya provided overall guidance and interpretation of results.

Soumyadeep Nandi, Nipun Mehra contributed equally to this work.

Electronic supplementary material

List of COGS sorted by mean pI.

Additional file 1: File with rows containing COG, mean, standard deviation and pI of individual ortholog proteins that were used in the calculation described in this paper. The rows are sorted by mean pI (column 2) and used to plot figure 5 of the manuscript. The file is tab-delimited and will open in any spreadsheet program. (CSV 726 KB)

Additional file 2: COGs with pI conserved in acid and basic clusters. Text file containing the comprehensive list of COGs which are conserved in either the acidic or basic cluster of the multimodal distribution. (TXT 19 KB)

Authors’ original submitted files for images

Rights and permissions

Reprints and Permissions

About this article

Cite this article

Nandi, S., Mehra, N., Lynn, A.M. et al. Comparison of theoretical proteomes: Identification of COGs with conserved and variable pI within the multimodal pI distribution. BMC Genomics 6, 116 (2005).

Download citation


  • Basic Residue
  • Charged Residue
  • Basic Cluster
  • Multimodal Distribution
  • Amino Acid Usage