Comparison of theoretical proteomes: Identification of COGs with conserved and variable pI within the multimodal pI distribution

Background Theoretical proteome analysis, generated by plotting theoretical isoelectric points (pI) against molecular masses of all proteins encoded by the genome show a multimodal distribution for pI. This multimodal distribution is an effect of allowed combinations of the charged amino acids, and not due to evolutionary causes. The variation in this distribution can be correlated to the organisms ecological niche. Contributions to this variation maybe mapped to individual proteins by studying the variation in pI of orthologs across microorganism genomes. Results The distribution of ortholog pI values showed trimodal distributions for all prokaryotic genomes analyzed, similar to whole proteome plots. Pairwise analysis of pI variation show that a few COGs are conserved within, but most vary between, the acidic and basic regions of the distribution, while molecular mass is more highly conserved. At the level of functional grouping of orthologs, five groups vary significantly from the population of orthologs, which is attributed to either conservation at the level of sequences or a bias for either positively or negatively charged residues contributing to the function. Individual COGs conserved in both the acidic and basic regions of the trimodal distribution are identified, and orthologs that best represent the variation in levels of the acidic and basic regions are listed. Conclusion The analysis of pI distribution by using orthologs provides a basis for resolution of theoretical proteome comparison at the level of individual proteins. Orthologs identified that significantly vary between the major acidic and basic regions maybe used as representative of the variation of the entire proteome.


Background
A protein's Isoelectric point (pI) -the pH at which a protein has no net charge -is the basis for its isolation using isoelectric focussing and along with Molecular Mass (Mr) is exploited in two dimensional gel electrophoresis used to seperate a cell's protein content. Bjellquist [1,2] has shown that a the pI of a denatured linear protein can be calculated with high accuracy using the pK values of the amino acids responsible for charge. Using these calculations, it is posible to create an image of an organisms theoretical proteome, by plotting the theoretical pI against their theoretical Mr. The distribution of pI in these plots have a multimodal distribution. Early results on bacteria demonstrated a bimodal distribution with peaks centered around pH 5.5 and pH 9 [3]. This bimodality was explained as being caused by the fact that as proteins are least soluble at their pI, they have evolved to have pI's away from neutral pH -which was assumed to be the intracellular pH. Schwartz et al [4], further showed the presence of a trimodal pI distribution in Eukaryotes, and observed a correlation of pI to intracellular localisation. Cytoplasmic, nuclear and membrane proteins seemed to lie largely in the acidic, neutral and basic portions of the trimodal distribution, respectively.
Exhaustive work on virtual proteomes has been completed recently by two groups. Weiller et al [5] have proved that the multimodal distribution of protein pI is present in randomly generated sequences, and is a function of allowed combinations of amino-acid pKa values, rather than a cause of sequence evolution. A trimodal distribution is present in the virtual proteome of most organisms, with minima at 7.4 and 8.0. Knight et al [6] have shown that the variation in proteomes though the trimodal distribution is largely maintained -is influenced by the ecological niche of the organism.
Environmental influences of the proteome are known [7]. Amino acid usage is influenced by the G+C content of a genome [8]. Acidic residues predominate over basic residues in halophilic bacteria [9,10], and compositional properties are further distinguished in thermophilic and mesophilic bacteria -with a preference for salt-bridges (residues with opposite charges) and long-chain hydrophobic residues in the former for increased stability [11].
All studies so far have used the gross properties of the proteome, or broad functional groups (e.g membrane proteins), and not attempted to resolve the multimodal distribution on the basis of individual proteins. This analysis could provide answers to the apparant conflict that though the pI multimodal distribution is caused by the properties of amino acids and not evolutionary factors, environmental influences induce variation in the sizes of each cluster in the distribution. By mapping the variation of pI in orthologs, one can in principal identify proteins whose pI is conserved as well as those whose pI does not seem to be responsible for its function, and whose variation maybe used as markers for an organism's environment.
The COG (Cluster of Orthologous Groups) database [12,13] lists orthologs present across completed genomes. In this study, we consider the subset of an organism's proteome, as specified by the COG database, which contain orthologs present in other genomes -providing a basis to study the variation of pI of individual proteins across genomes.

Comparison of virtual proteomes among different Bacteria organisms
The predicted proteomes, using values of Mr and pI calculated from the protein ortholog sequence, all displayed a trimodal distribution, with minima at 7.4 and 8.1. For convenience, the three major peaks demarkated by these minima are referred to as the acidic cluster (pI less than 7.4), "neutral" cluster (pI between 7.4 and 8.1) and basic cluster (pI greater than 8.1). This observation of a trimodal distribution is consistent with earlier results calculated from the complete genome [4][5][6], showing that orthalogs maybe used as representative samples of the complete genome. Figure 1 shows representative plots of proteomes using whole genomes and orthalog sub-sets for Escherichia coli K12 and Helicobacter pylori, Buchenara and Halobacterium. As only orthologs are used, it is possible to compare the pI of two genomes, by using a scatter plot. Diagonal points show invariant pI between the organisms, while a shift in pI is visible as an off-diagonal point. As the trimodal distribution is dominated by the large acidic and basic clusters, A shift in the pI from one cluster to the other is visable on the upper left and lower right quadrants of the plot. Figure 2 shows the pairwise comparison of the pI and Mr for the proteomes of two variants of E. coli (K-12 and 0157), between E. coli K12 and H. pylori which have different environmental pI -and extremophiles Buchenara and Halobacterium. Closely related variants of the same species do not have much variation in both pI and Mr, however the effect of a changed pH environment -specifically the acidic stomach for H. pylori and the basic intestine for E. coli -causes a large proportion of orthologs to change pI to compensate for the external pH change. Organisms that have extreme proteome pI distributions are Halobacterium -which has a large acidic cluster and Buchnera -a large negative cluster. As expected, most orthologs shared by the two organisms show a shift in their pI from one cluster to the other, however a baseline of pI conservation is maintained, implying that the pI maybe a conserved property for a few orthologs. The Mr is much more highly conserved, with the scatter plot clustered along the diagonal.

Variation in pI among functional categories of COGS
The COG database is functional classified into eighteen categories [12] (Table 1, shows a listing of these functions, which the single letter code used in the figures). The distribution of pI values across all organisms for each function, summarised by the mean and standard deviation, is shown in Figure 3. The mean values for each function is close to neutral pH, with a deviation spreading across  Figure 4B). We have used Kruskal-Wallis multiple comparison testing [14] to identify groups that significantly deviate from the expected distribution. Five functional groups deviate significantly, three towards the basic cluster and two towards the acidic cluster (Table  1).
Membrane proteins are known to have a preference for the basic cluster, caused by the larger proportion of basic charged residues to compensate for the negatively charged membrane bilayer [4,6]. A functional requirement for either basic or acidic charges could thus influence the protein's pI. Orthologs, by definition, perform the same function in different organisms, and if preference for either basic or acidic charged residues is related to this function, this should be reflected in the protien's pI having a bias for the respective cluster. However, the conservation of pI maybe accidental, especially if the proteins are highly conserved. We have calculated the pairwise distance of all proteins within a COG as a measure of their sequence similarity. The distributions of distance for each functional group is shown in Figure 4C.
Proteins associated with the group "J", involved with translation, ribosomal structure and biogenesis, are dominated by highly conserved ribosomal proteins -and this high level of conservation and intolerance to mutation is reflected in an invariance of pI. This conservation is also reflected in the higher correlation for this group of proteins in their Mr. Other groups which are polarised to either the acidic or basic modes of the pI distribution do not show such a high level of conservation, and pI conservation is dictated by the nature of their function.

Analysis of pI variation for individual COGS
The general function of a group of proteins and their level of conservation may dictate a preference for a specific range of pI, as had been shown earlier for membrane proteins, and for functional groups of COGs in the previous section of this paper. Individual proteins maybe identified that have a preference for either the acidic or basic cluster. The scatter plot of the pI of all ortholog proteins used in this analysis is sorted by COG mean pI and plotted in Figure (5). The allowed trimodal regions are clearly visable. However at both the left and right extremes, the scatter plot show that COGs do exist with a preference for the acidic or basic cluster respectively. We have computed the frequency distribution for each COG in the acidic, neutral and basic clusters. On analysis, no COG is conserved in the neutral cluster, the largest frequency being 0.5 for COGs 1689, 3016, 3783 and 3801, however their frequency of occurance among all organisms is less than ten percent. This is an expected result as the pI distribution is caused by an interplay of acidic and basic residues which acount for the large acidic and basic clusters. The "neutral" cluster is not caused by the absense or balance of charged residues but because allowed pK combinations of charged residues are minimum at pH 7.1 and 8.4 [5].
Proteins conserved in the acidic and basic clusters however have a preponderance of acidic and basic residues respectively, which maybe required for their function. A complete list of COGs whose protein pI are conserved in these clusters is listed as an additional file. We have scaled the frequency of conservation by the frequency of occurance, so that only COGs which are maximally represented across the organisms in the dataset are used as markers of each cluster. These are listed in Table 2 and Table 3 respectively. The dominant groups of proteins which seem to require an acidic pI are the amino acid tRNA synthetases. Among those proteins whose pI is highly conserved in the basic cluster are a large number of ribosomal proteins.
Although highly conserved and intolerant to mutation, ribosomal proteins interweave with negatively charged RNA to form the ribosome, and being positively charged will be a requirement for strong electrostatic interactions.
A majority of COGs however show distributions across both the basic and acidic clusters. Knight et al, have shown a correlation to a change in proteome patterns with the organism's ecological niche. Since the proteome always exists in a trimodal distribution, it will only vary from organism to organism in the relative amounts of proteins which are present in each of the three clusters of the pI distribution. We have computed the probability of being in both the acidic and basic clusters weighted by the frequency of occurance in the organisms under consideration in order to identify COGs which are highly represented and show no particular preference for either cluster. This list of COGs, with a joint probability greater than a cutoff value of 0.6 is tabulated in table 4. For reference, the individual values of genes corresponding to the four organisms is also listed, and in a majority of cases show good agreement with the shift in an organism's total Frequency distribution for COG mean pI (grey vertical bars) Figure 3 Frequency distribution for COG mean pI (grey vertical bars). The variation of pI (mean and standard deviation) for functionally classified groups of COGs is overlayed. Functional groups denoted by single letters is expanded in Table 1  These COGs best represent an organism's shift from expected levels of the acidic and basic clusters of the multimodal distribution, and it is possible that they maybe used as markers to predict an organisms ecological niche, with particular reference to its environmental pH in free living microorganisms.
Extrapolating results obtained from using the theoretical proteome must be viewed with caution as predicted pI values are for unfolded proteins obtained from sequence and not the native folded proteins in the cellular microenvironment, which remain unknown. We have resisted attempting to correlate observations of an ortholog's thoeretical pI with its function, unless clearly obvious, for this reason. The theoretical proteome is also generated from the total set of proteins present in an organism's genome, while only a subset maybe expressed at any given time and will vary in response to external stimuli. An organism may also respond to evolutionary pressure such as environmental pH by increasing the number of copies of a charged protein, a ploy frequently adopted in drug resistance. The effect of a shift in the relative levels of the acidic and basic peaks maybe replicated by an increase in copy number of proteins belonging to the relevant cluster. As theoretical proteome studies are not weighted by the copy number of the individual proteins, for lack of relevant data related to the quantity of individual proteins for the entire proteome, these observations are impossible to make.

Conclusion
Analysis of ortholog pI across forty two microorganism genomes which contain reprentatives of free living archea and bacteria are analysed to identify orthologs which are invariant in pI as well as those amenable to changes in pI. Orthologs with a high frequency of occurance and variation in pI are shortlisted, and maybe used as markers in future studies which attempt to map proteome properties to variation in the organisms ecological niche.

Methods
Genome sequences and the COG database generated with 43 organisms with sequences and alignments were sourced from NCBI [15]. Sacharomyces cerevisae was removed from the dataset so that it contained only archeae and bacteria which were single-celled organisms. This was intentionally to include only genomes of organisms which would have their membrane directly in contact with the external environment and included a range of organisms with diverse ecological niches. , by varying the pH by bisectional nesting of intervals on the pH scale until the interval is smaller than a given level (0.001), as described in [2]. Fixed values for pK are considered to calculate the isoelectric point, and are the same as used in the molecular weight/isoelectric point program at http://www.expasy.org [16].
The molecular weights and isoelectric point of all the predicted open reading frames from the completed genomes of a number of prokaryotes were generated. All ORFs that are classified by respective COG identities and assigned to one of the 18 functional groups [12] were used in comparitive analysis between genomes.
Correlation of molecular weights and pI were performed, using scripts written in house, for all COGs classified to a particular function for each pair of organisms. Pairwise distances were calculated for all pairs of proteins belonging to the specified COG. Sequences assigned to each COG were aligned with ClustalW [17] using default parameters, and the resulting multiple alignment used to generate a distance matrix with the program PROTDIST, using default parameters (Jones-Taylor-Thornton model, where the distance is scaled in units of the expected fraction of amino acids changed), from the Phylip package [18]. Perl scripts used to automate this protocol were written using BioPerl Modules [19].
Individual molecular weight and isoelectric point along with pairwise distances were stored in a MySQL database v 4.1.9, running on a Fedora Core Linux dual Xeon Server, to enable easy retrieval. Statistical properties on the data Isoelectric point distribution of orthologs sorted by mean isoelectric point value Figure 5 Isoelectric point distribution of orthologs sorted by mean isoelectric point value. Black dot -mean value for COG, grey barstandard deviation for COG, red point -pI value of individual genes classified under the COG. The sorted list of COG's used for the X-axis is available as an additional file.
(mean, variance and correlation coefficients) were calculated using standard methods after sectioning the data at the levels of COGs and functional groups. Figures plotted in this paper were made using Sigmaplot (Systat Software Inc, Richmond, California, USA).
Statistical significance of functional groups was assessed using non-parametric Kruskal-Wallis tests and Dunn's Multiple Comparison [20] tests (alpha = 0.2), after randomly sampling three hundred sequences from each group, and from the total population to form the sample sets. To score the COGs both for variation and conservation, probabilities were estimated from frequencies in the dataset. The frequency of occurance of a COG i, F(i), calculated as

COG P(a) P(b) P(v) F Role
Where n o is the number of organisms in which the COG is present and n t is the number of organisms in the sample dataset (42).
where i is the cog, j is the cluster -either acidic, neutral or basic, and r the range corresponding to the cluster -0-7.4 for acidic, 7.4-8.1 for basic, and 7.4-14 for basic. n r is the number of proteins in the given cluster, n i total number of proteins corresponding to i. P i (j) is the conservation score for the cluster j, and is the joint probabiity of conservation and occurance of the COG i.   We estimate the joint probability of variation between the acidic and basic clusters for a COG i, P i (v), as the probability of variation into its frequency of occurance calculated as where n a and n b are the number of proteins belonging to the COG i in the acidic and basic cluster of the distribution respectively, n i being the total number of proteins classified under COG i .

Authors' contributions
Mr Soumyadeep Nandi did all computation and database work, Mr. Nipun Mehra started this work, and generated preliminary results, Dr. Andrew Lynn guided the implementation of the project and wrote the manuscript, Prof. Alok Bhattacharya provided overall guidance and interpretation of results.