Protein abundance profiling of the Escherichia coli cytosol
© Ishihama et al. 2008
Received: 24 January 2008
Accepted: 27 February 2008
Published: 27 February 2008
Knowledge about the abundance of molecular components is an important prerequisite for building quantitative predictive models of cellular behavior. Proteins are central components of these models, since they carry out most of the fundamental processes in the cell. Thus far, protein concentrations have been difficult to measure on a large scale, but proteomic technologies have now advanced to a stage where this information becomes readily accessible.
Here, we describe an experimental scheme to maximize the coverage of proteins identified by mass spectrometry of a complex biological sample. Using a combination of LC-MS/MS approaches with protein and peptide fractionation steps we identified 1103 proteins from the cytosolic fraction of the Escherichia coli strain MC4100. A measure of abundance is presented for each of the identified proteins, based on the recently developed emPAI approach which takes into account the number of sequenced peptides per protein. The values of abundance are within a broad range and accurately reflect independently measured copy numbers per cell.
As expected, the most abundant proteins were those involved in protein synthesis, most notably ribosomal proteins. Proteins involved in energy metabolism as well as those with binding function were also found in high copy number while proteins annotated with the terms metabolism, transcription, transport, and cellular organization were rare. The barrel-sandwich fold was found to be the structural fold with the highest abundance. Highly abundant proteins are predicted to be less prone to aggregation based on their length, pI values, and occurrence patterns of hydrophobic stretches. We also find that abundant proteins tend to be predominantly essential. Additionally we observe a significant correlation between protein and mRNA abundance in E. coli cells.
Abundance measurements for more than 1000 E. coli proteins presented in this work represent the most complete study of protein abundance in a bacterial cell so far. We show significant associations between the abundance of a protein and its properties and functions in the cell. In this way, we provide both data and novel insights into the role of protein concentration in this model organism.
Proteins fulfill a wide variety of functions and are central to almost all processes in living cells. In order to improve our understanding of the complex network of protein interactions in the cell, it is of central importance to obtain information about the activities of the individual components; these are directly linked to their cellular concentrations. The fast development of genomic and proteomic methods has already revealed the basic protein inventory of a few hundred different organisms, but large scale quantitative information on protein concentrations is still largely missing. Comprehensive analyses of cellular mRNA levels have proven to be highly useful tools to monitor the state of a cell, but by design they are missing all influences of the vast amount of posttranscriptional regulations. One of the few organisms where direct protein concentrations are available on a nearly proteome wide level is the yeast Saccharomyces cerevisiae. It has been subject to large scale protein quantification using epitope tagging of virtually the whole proteome followed by quantitative western blotting  and to single cell based quantitative proteomic analysis using flow-cell cytometry and a library of GFP-tagged yeast strains . While both methods provided high-quality abundance data for nearly the entire proteome, their dependence on the availability of a strain library containing tagged versions of all proteins of interest presents a serious limitation. Depending on the organism under study, to generate such a library may involve an immense amount of work or may even be impossible to achieve.
The proteomics field and its key technology mass spectrometry are developing rapidly from qualitative towards quantitative measurements without the need for individual tagging of proteins. These efforts, however, are mostly restricted to the comparison of relative concentrations of the same proteins in different samples. Direct, non-relative abundance data of proteins, allowing a comparison of different proteins within and between samples, are still difficult to obtain on a large scale.
Mass spectrometry, in combination with protein and peptide separation methods, allows the efficient qualitative identification of proteins in complex mixtures. As an alternative to two-dimensional gel electrophoresis (2-DE) and mass spectrometric analysis of the resulting individual spots, shotgun approaches have been developed as suitable tools for large scale proteome analysis [3, 4]. These are based on protease digestion of the sample as a whole and subsequent peptide separation and identification by multidimensional LC-MS/MS. However, in contrast to the 2-DE approaches, information about protein abundances is initially unavailable in the shotgun approaches. Relative quantification for abundance comparison of the same protein in different samples can be realized by incorporation of stable isotopes into the samples [5–7] which is utilized in methods like cICAT , iTRAQ™ , 18O-labeling  or SILAC . Relative changes in concentration of the same protein between different experimental setups can be very accurately determined by these methods, but a major disadvantage is the absence of a direct measure of protein concentrations. Abundance comparison of different proteins is hence not possible.
Several mass spectrometric strategies have been reported to overcome this limitation. The more traditional ones utilize internal standards, e.g. spiking the complex mixture with peptides of known concentration [11, 12], and typically require calibration for each protein to be quantified. A more recently introduced method describes a new parameter to express protein concentrations without the need of introducing labels or internal standards. It is calculated from the averaged ion intensities of the three most intense tryptic peptides per protein, as extracted from the ion current chromatograms. This parameter is called 'xPAI' for 'extracted ion intensity-based protein abundance index'. It has been shown to correlate well with known protein concentrations in the human RNA polymerase II complex  and rat mitochondria . However, xPAI is limited to samples of low complexity since selection of only the three most intense peptides becomes unreliable with an increasing number of different proteins in the sample. Additionally, it is difficult to apply the xPAI approach to samples which were pre-fractionated at the peptide level, due to carry-over effects between the different fractions. A similar method has been described using an alternate scanning LCMS method (LCMS(E)), which is available on certain mass spectrometer instruments . Here, all peaks in the MS spectra are selected as precursor ions for subsequent MS/MS scans resulting in lower peak intensity dependence of peptide identification as is the case for conventional data-dependent MS/MS scans. If the MS device allows this kind of detection mode it is preferable to xPAI, but it is still presented with the mentioned basic challenges of this approach.
Other label free ways of large scale protein quantification by MS make use of correlations between the number of actually identified tryptic peptides per protein and the theoretical number of tryptic peptides , or the molecular weight of the proteins . These ratios have been termed 'protein abundance index' (PAI). More recently, we found empirically that PAI correlates better with the logarithm of protein concentration and defined an exponentially modified PAI (emPAI) . Although such a method of concentration determination may not be expected to be overly precise, the accuracy of emPAI-derived concentration measurements has been shown to lie within an error range of only a factor of maximally 3.4 for 46 proteins in whole cell lysates of murine neuroblastoma (N2A) cells  and is therefore in the same range or better than protein concentration measurements based on staining methods. A major advantage is that the emPAI based protein concentration is automatically and quickly available for all proteins identified by MS without the need of any additional experimental setup. A similar approach was reported recently for the membrane proteome of S. cerevisiae, where protein concentrations were estimated by using the number of obtained spectra per protein divided by the length of the protein .
Determination of emPAI-based direct protein abundances can also be carried out in combination with some of the more accurate relative abundance measurement methods, e.g. iTRAQ, 18O-labeling or SILAC, since these do not introduce a detection bias towards certain peptides in the protease digested samples. ICAT, on the other hand, is dependent on the presence of a cysteine in a peptide in order for it to be detected, and cannot therefore easily be combined with the emPAI approach. The specificity to only a subset of all peptides renders this relative quantification method less well suited for concurrent direct protein quantification.
Protein identification in whole proteome analyses by mass spectrometry is still far from reaching complete coverage. Using state-of-the-art methods, up to ~50% of all expressed S. cerevisiae proteins could be identified by MS in a recent study , and it was concluded that current MS sensitivity and speed would still need to improve about tenfold to approach a proteome identification of 100%. Expected coverage should be a little higher for smaller proteomes and thus less complex samples. Nevertheless, to our knowledge, reported protein identification coverage values have not yet exceeded 61% for any proteome. The highest reported value so far was achieved by LC-MS analysis of the ionizing radiation-resistant bacterium Deinococcus radiodurans . The study employed accurate mass and elution time tags to avoid time-consuming MS/MS events. The obtained coverage, however, was still far from complete, and, importantly, protein abundance information was absent. Here, we describe an approach to maximize MS based proteome identification coverage in an application to the E. coli cytosol, in combination with a reliable and quick concentration estimation of the identified proteins.
E. coli is a Gram-negative bacterium of the family Enterobacteriaceae. Due to its simple cellular structure and the relative ease of its cultivation and biological modification, it has become the standard 'workhorse' of molecular biology, genetics and biotechnology. As a result, E. coli has become one of the most completely characterized organisms in biology. The genome of the laboratory strain E. coli K12 was among the first to be completely sequenced . It has a relatively small size of ~4.6 Mb, and is predicted to code for approximately 4300 proteins. The genes, proteins, biochemical pathways and molecular interactions in E. coli have been the subject of countless experimental studies, and the growing amount of information available in large scale databases like GenBank and Swiss-Prot, as well as in more specialized database projects such as EcoCyc  or EchoBase , allows easy access to a wealth of information. However, in spite of the combined efforts of the scientific community, the complex network of molecular interactions within living organisms, including E. coli, is still far from being fully understood. Deciphering these interaction networks will be a major task of biology in coming years, and detailed knowledge about the concentrations of the individual parts in the system will be an important step on the way to accomplishing this goal.
Pioneering studies on two-dimensional electrophoresis of the E. coli proteome  were followed by 2-DE coupled MALDI-TOF approaches, which led to the identification of 381 E. coli proteins . The first shotgun approach towards the identification of the E. coli proteome was reported by Gevaert et al. . This study focused on methionine-containing peptides and identified approximately 800 proteins from an unfractionated E. coli lysate. It has, however, been suggested that such an approach may result in biased protein abundance data [13, 17–19]. Corbin et al.  and Taoka et al.  then performed LC/LC-MS/MS approaches using multidimensional ion-exchange/reversed phase separation prior to MS/MS analysis. They reported protein expression profiling and protein abundance estimations, but based these purely on the number of identified peptides of each protein.
In order to extend the proteomic coverage of the E. coli cytosol and concurrently obtain minimally biased emPAI derived protein concentrations, we employed approximately 200 LC-MS/MS runs in combination with a variety of peptide/protein fractionation methods, different protease digestion schemes, LC-MS conditions and MS/MS fragmentation. Following this shotgun approach we identified more than a thousand different proteins. We also report abundance data for these proteins based on emPAI, thereby providing the largest protein abundance set of the E. coli cytosol available to date.
Table 1. Protein fractionation, peptide separation and mass spectrometric identification strategies for enhancement of proteome identification coverage explored in this study.
(A) Protein fractionation
(1) SDS-PAGE slicing
(2) Serial ultrafiltration
(3) No protein fractionation
(B) Tryptic digestion
(1) In-solution digestion
(2) In-gel digestion
(C) Peptide chromatography
(1) Strong cation exchange chromatography
(2) Strong anion exchange chromatography
(3) C18 ion pair chromatography
(4) PSDVB with NH4OH, using StageTip
(5) No peptide chromatography
(D) Parent ion selection in LC-MS
(1) Simple repetition
(2) Sequential static exclusion
(3) Different ion pair reagents in subsequent runs
(4) Subdivided scan range
(5) Shallow gradient elution
(E) CID for MS/MS
(2) Linear ion trap
Evaluating the efficiency of the different protein and peptide separation methods and MS approaches listed in Table 1 (for details [see Additional file 1]), we found the following scheme to be optimal for our shotgun analysis of a cytosolic lysate of E. coli MC4100: Initial SDS-PAGE of the lysate sample with subsequent slicing of the gel lanes into five fractions was followed by in-gel tryptic digestion. The resulting peptide mixtures were subjected to strong cation exchange chromatography (SCX) (5 fractions, stepwise elution) and threefold ion pair chromatography (IPC) (60 min gradient) which was directly coupled to LC-MS/MS for peptide identification. Following this procedure with a quadrupole-TOF mass spectrometer, we identified a total of 810 non-redundant proteins in a single E. coli cytosolic lysate sample. Including the results of all previous runs during method comparison with this MS instrument type led to the detection of a total of 1324 unique proteins of the E. coli cytosol. Note, however, that these numbers were preliminary and were based on a criterion where peptides with probability scores p < 0.05 and rank = 1 were temporarily accepted, even if only a single peptide was observed per protein. This acceptance criterion was subsequently tightened to a minimum of two peptides per protein for compilation of the final list, as described in Materials and Methods.
We also performed experiments with a linear ion trap (LIT) with faster scan cycles. Parent ion selection with this device differed from the quadrupole-TOF instrument, leading to an increased applicable m/z range (Supplementary Table S1 [see Additional file 1]). Measurements with unfractionated samples of the E. coli cytosol revealed a considerably better performance of LIT when compared to quadrupole-TOF, as shown in Supplementary Figure S1 [see Additional file 1]. Combining the results from both types of MS instruments with protein and peptide pre-fractionation [see Additional file 1] further improved identification coverage and resulted in a total of 1655 proteins of the cytosolic lysate of E. coli MC4100, grown in rich medium.
This combined and more stringent dataset yielded a total of 1103 proteins, quantified by emPAI, based on 13469 observed peptides with unique parent ions (10339 unique sequences) from 209 LC-MS/MS runs with less than 5% false positive rate (see Supplementary Tables S2 [see Additional file 2] and S3 [see Additional file 3], for all identified proteins and peptides, respectively). Our measurements thus provide ~32 - 41 % coverage of the approximately 2680 cytosolic proteins in E. coli, depending on the exact definition of the cytosolic dataset, as defined in Materials and Methods.
To test for potential biases in the peptide identification process we compared a number of physico-chemical properties of the observed peptides with all predicted peptides from the corresponding proteins. These parameters are expected to influence peptide behavior during many of the employed fractionation and separation steps, for instance chromatography. As listed in Supplementary Table S4 [see Additional file 1], the two sets did not exhibit a significant difference in peptide length, mass, pI or hydrophobicity. Peptide identification should therefore not be largely influenced by the separation and fractionation methods, which is a basic requirement for valid estimation of protein abundance by the emPAI approach .
Independent measurements of emPAI values from biological replicates revealed a good reproducibility with a Pearson correlation coefficient of 0.78 (Supplementary Figure S5 [see Additional file 1]). To further validate the protein abundance values based on emPAI and also test for potential biases introduced by the protein and peptide fractionation schemes, we compared the emPAI based concentrations of 40 proteins from our final set with independently determined concentrations. This was achieved by isotope dilution with a lysate of the E. coli K12 strain BW25113, for which accurate concentrations of these 40 proteins are known  (see Methods section for details). As shown in Figure 2, emPAI correlates well with the copy numbers per cell of these proteins over a range of approximately four orders of magnitude, with a Pearson correlation coefficient of 0.84 and a p-value < 10^-10. The achieved accuracy of emPAI derived protein abundance in E. coli is therefore similar to the reported values  and the employed protein and peptide fractionation schemes did not introduce a detectable bias for the tested 40 proteins.
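The validation above rests on a Pearson correlation between emPAI-derived values and independently measured copy numbers. A minimal sketch of such a comparison is given below. It assumes that both quantities are log10-transformed before correlating, which the text does not state explicitly; the function names `pearson` and `log_pearson` are illustrative and not part of the original analysis.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def log_pearson(empai_values, copy_numbers):
    """Correlate on a log10 scale (assumption: suits data spanning orders of magnitude)."""
    log_x = [math.log10(v) for v in empai_values]
    log_y = [math.log10(v) for v in copy_numbers]
    return pearson(log_x, log_y)
```

With real data, `empai_values` and `copy_numbers` would hold one entry per validated protein, in the same order.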
Table 2. Comparison of the experimental cytosolic sample with the complete predicted E. coli proteome with respect to the number of predicted transmembrane segments (TMS), cellular localization from the PSORT database and experimental localization data (EXP). Shown is the number of unique proteins and its relation to the measured number of molecules in the cell.
E. coli complete
Experimental cytosolic dataset
% Abundance d
TMS = 0
TMS = 1
TMS = 2
TMS = 3
TMS = 4
TMS = 5
TMS = 6
TMS = 7
TMS = 8
PSORT = Cytoplasmic (C)
PSORT = CytoplasmicMembrane (CM)
PSORT = Periplasmic (P)
PSORT = OuterMembrane (OM)
PSORT = Extracellular (E)
PSORT = Unknown (U)
PSORT = Unknown (multiple sites) (UM)
PSORT = C | CM | U | UM
PSORT = C | U
TMS = 0 & PSORT = C
TMS = 0 & PSORT = C | CM
TMS = 0 & PSORT = C | CM | U
TMS ≤ 1 & PSORT = C
TMS ≤ 1 & PSORT = C | CM
TMS ≤ 1 & PSORT = C | CM | U
TMS ≤ 1 & PSORT = C | U
EXP = C
EXP = IM
EXP = OM
EXP = P
TMS ≤ 1 & EXP = C
TMS ≤ 1 & EXP = IM
TMS ≤ 1 & EXP = OM
TMS ≤ 1 & EXP = P
TMS ≤ 1 & (PSORT = C | U | EXP = C)
(TMS ≤ 1 & PSORT = C | U) | EXP = C
The most abundant functional groups in the E. coli cytosol.
FunCat category description
Distinct proteins in this group
Rank (by mean copy number)
nucleic acid binding
Protein with binding function or cofactor requirement (structural or catalytic)
protein folding and stabilization
other rRNA-transcription activities
The most abundant protein folds in the E. coli cytosol.
Number of distinct proteins with this fold a
Rank (by mean copy number)
Ribonuclease H-like motif
NAD(P)-binding Rossmann-fold domains
DNA/RNA-binding 3-helical bundle
P-loop containing nucleoside triphosphate hydrolases
Class II aaRS and biotin synthetases
Adenine nucleotide alpha hydrolase-like
Periplasmic binding protein-like II
Comparison of features associated with protein aggregation between highly abundant proteins and the remaining detected proteins. The highly abundant group is defined as described in Materials and Methods.
Low abundant proteins Mean (Median)
High abundant proteins Mean (Median)
P-value KS-, MW-test
Protein length (in amino acids)
Number of alternating hydrophobic/hydrophilic stretches (≥ 5 aa)
pI distance from neutrality
Hydrophobicity (Kyte-Doolittle scale)
In agreement with Greenbaum et al. , greater frequencies of the small amino acids Ala, Gly and Val were found in highly abundant proteins. Additionally we determined that Leu, Gln, Pro, Ser and Trp are more common in low abundance proteins whereas Lys and Glu are more common in the high abundance group. These compositional differences are a direct consequence of the functional bias observed in abundant and scarce proteins, as described above. Amino acid preferences in proteins of different functionality have been utilized before for coarse function prediction from sequence alone (e.g. ).
The extent to which protein abundance correlates with the level of gene expression has been the subject of intensive studies in the past, primarily based on available yeast data. Early studies made on relatively small sets of abundance measurements were either inconclusive  or reported only a weak correlation between protein and mRNA abundance due to different rates of translation and protein degradation as well as various post-translational modifications . In a more recent study Beyer et al.  hypothesized that a stronger correlation between mRNA and protein abundance may exist within functional modules such as "Metabolism", "Energy", and "Protein synthesis" and within cellular compartments.
In this study we have developed a scheme to maximize the coverage of a shotgun proteomic study within a reasonable timeframe and number of experimental steps. A combination of both protein and peptide separation methods before application to LC-MS/MS has proven to be the most efficient way to obtain a large and unbiased dataset. For the E. coli cytosol we found a combination of SDS-PAGE protein separation, strong cation exchange chromatography of the in-gel tryptic digest and LC-MS/MS with exchange of ion-pair reagents in subsequent runs to be most efficient. We show that our method is sensitive enough to identify and quantify even proteins with extremely low copy numbers. For samples of different origin, the scheme would probably have to be slightly adapted, but it may serve as a good starting point for the experiments.
Calculation of the emPAI values from the mass spectrometric data allowed us to obtain concentration information for all identified proteins, and we thereby generated the most complete dataset on protein abundance in E. coli to date. Based on available experimental data as well as theoretical predictions of protein localization we estimate that our abundance measurements cover at least 32% of the E. coli cytosolic proteins by identity, with a contamination of non-cytosolic proteins of less than 0.1% by mass. The 197 identified proteins predicted not to reside in the cytosol are all very low abundance proteins representing less than 5% of the protein copies of the cell even if the most stringent criteria are applied and ribosomal proteins are excluded.
Abundance of E. coli proteins strongly correlates with gene expressivity and displays a very broad dynamic range, from as high as 10^5 for molecular components of the biosynthetic machinery to a mere 65 typical for enzymes. There is also a marked bias in the occurrence of structural folds as a function of protein abundance. We found the barrel-sandwich fold as defined by the SCOP database to be the most characteristic topology for high-abundance proteins, while P-loop, TIM barrel, and Rossmann folds are associated with less copious gene products. Other traits distinctive for highly abundant proteins are less pronounced and include a lower aggregation propensity and a significantly higher chance of being essential.
E. coli MC4100 cells were grown at 37°C in rich or minimal medium to exponential phase (OD600nm ~0.4), as described . Lysis was induced by dilution of the spheroplasts into an equal volume of 25°C hypo-osmotic lysis buffer (50 mM Tris-HCl (pH 8), 0.01% (w/v) Tween 20, 10 mM MgCl2, 25 U/ml benzonase, 2 mM Pefabloc (Roche), 10 mM glucose and 20 U/ml hexokinase (Roche)). The supernatant was cleared at 30,000 × g for 10 min.
See Supplementary Materials and Methods [see Additional file 1].
MS peak lists were created by scripts in Analyst QS (MDS-Sciex) or by Bioworks 3.1 (Thermo Electron) on the basis of the recorded fragmentation spectra and were submitted to the Mascot database search engine (Matrix Science, London, UK) against the E. coli SwissProt database to identify proteins. The following search parameters were used in all Mascot searches: maximum of one missed trypsin cleavage, cysteine carbamidomethylation, methionine oxidation, peptide tolerance ± 0.2 Da for QSTAR data and ± 2.0 Da for LTQ data, MS/MS tolerance ± 0.2 Da for QSTAR data and ± 0.8 Da for LTQ data. All peptides with scores less than the identity threshold (p = 0.05) or a rank > 1 were automatically discarded. We also used the parent ion mass accuracy (mass deviation < 50 ppm for QSTAR data), the predicted retention times  (difference < 10 min), and protein molecular weight estimated from the gel slice as additional requirements for protein identification. Finally, using peptides that met the above criteria, we only accepted proteins with two or more peptide hits. For decoy database searching, all peak lists were merged into two files to create QSTAR and LTQ peak lists. These merged peak lists were searched against a decoy database created by the Mascot script 'decoy.pl' supplied by Matrix Science. The false positive proteins obtained from the two searches were merged and the final false positive rate was estimated to be 4.26% for the final protein identification list (containing a total of 1103 proteins).
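The core acceptance rules described above (score at or above the identity threshold, rank 1, and at least two peptide hits per protein) can be sketched as a simple filter. This is an illustration under stated assumptions, not the pipeline used in the study: the dictionary keys are hypothetical, "two or more peptide hits" is interpreted here as two distinct peptide sequences, and the additional mass accuracy, retention time and gel-slice molecular weight criteria are omitted.

```python
from collections import defaultdict

def accepted_proteins(peptide_hits, min_peptides=2):
    """Return {protein: set of accepted peptide sequences}.

    Each hit is a dict with the hypothetical keys 'protein', 'sequence',
    'score', 'identity_threshold' and 'rank'.
    """
    by_protein = defaultdict(set)
    for hit in peptide_hits:
        # Discard peptides scoring below the identity threshold or not ranked first.
        if hit["score"] < hit["identity_threshold"] or hit["rank"] > 1:
            continue
        by_protein[hit["protein"]].add(hit["sequence"])
    # Keep only proteins supported by enough distinct peptides (assumed reading).
    return {
        protein: sequences
        for protein, sequences in by_protein.items()
        if len(sequences) >= min_peptides
    }
```

A protein supported by a single accepted peptide is dropped, which mirrors the tightening from the preliminary to the final protein list.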
Protein abundance on the emPAI scale was calculated from the number of observable peptides and the number of observed parent ions. To calculate the number of observable peptides per protein, proteins were digested in silico and the obtained peptide masses were compared with the scan range of the mass spectrometer. In addition, the expected retention times under our nanoLC conditions were calculated according to the procedure of Meek  and Sakamoto et al.  with our own coefficients based on results of approximately 1500 peptides. Peptides that were too hydrophilic or hydrophobic were eliminated. In-house software was used to calculate emPAI values; the program is accessible at the Keio University web site. Redundancy of unique parent ions in the entire dataset was removed and the number of the unique parent ions per protein was counted. emPAI values were calculated as follows:
$$\mathrm{emPAI} = 10^{N_{\mathrm{obsd}}/N_{\mathrm{obsbl}}} - 1$$
where Nobsd and Nobsbl are the number of observed parent ions per protein and the number of observable peptides per protein, respectively.
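The emPAI calculation (ten raised to the ratio of observed to observable peptides, minus one) can be sketched as follows. This is not the in-house program mentioned above. The digestion rule (cleavage after Lys or Arg unless followed by Pro, no missed cleavages) is the common trypsin convention, and the peptide length window is only a stand-in, assumed here, for the mass scan range and retention time filters described in the text.

```python
def tryptic_peptides(sequence):
    """In-silico trypsin digest: cleave after K or R unless the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal peptide without K/R
    return peptides

def count_observable(sequence, min_len=6, max_len=30):
    """Count peptides inside a length window (assumed proxy for scan-range filtering)."""
    return sum(min_len <= len(p) <= max_len for p in tryptic_peptides(sequence))

def empai(n_observed, n_observable):
    """emPAI = 10 ** (N_observed / N_observable) - 1."""
    if n_observable <= 0:
        raise ValueError("protein has no observable peptides")
    return 10 ** (n_observed / n_observable) - 1
```

For example, a protein with 10 observable peptides of which 5 were observed would receive an emPAI of 10^0.5 - 1, roughly 2.16.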
E. coli MC4100 cells were grown at 37°C in SILAC minimal medium containing Leu-D3 instead of Leu. A stock sample of unlabeled E. coli BW25113 cell pellet, including 59 enzymes with known amounts ranging from 9 to 70,000 copies per cell , was kindly provided by Drs. N. Sugiyama and K. Nakahigashi (Keio Univ). Based on total protein contents, these two samples were mixed at 1:1, 1:10 and 10:1, and were digested with trypsin. After desalting with C18-StageTip, each sample was analyzed by LC-MS/MS using QSTAR as described and was quantified by Mass Navigator version 1.2 (Mitsui Knowledge Industry, Tokyo, Japan). According to the dynamic range of the instrument, peptides with SILAC ratios of 0.1–10 were accepted for calculation of protein concentrations. A total of 40 proteins with at least two quantified peptides per protein were directly quantified from the three samples.
Amino acid sequences of all proteins identified in this study were obtained from Swiss-Prot . Throughout this work the primary Swiss-Prot accession code in conjunction with the Swiss-Prot entry name are used as unique protein identifiers. Codon Adaptation Index values (CAI) according to the method of  were used as reported by . Classification of E. coli genes into three groups - (E) genes essential for cell growth (essential), (N) those dispensable for cell growth (non-essential), and (U) those unknown to be essential or non-essential - was based on the comprehensive experimental analysis of . In the latter work, 630 genes were identified as being essential and 3126 as being dispensable using a genetic fingerprinting technique. Data on predicted expression measure of E. coli proteins  were downloaded from the Stanford University web server. Proteins possessing significant sequence similarity (BLAST  E-value threshold 0.001) to one or several domains of known three-dimensional structure as classified in the SCOP database  were attributed to the corresponding SCOP fold. Assignment of genes to functional roles as defined by the MIPS functional catalog version 1.3  was conducted manually at Biomax Informatics AG. Where necessary, correspondence between published protein datasets and the SwissProt database was established based on sequence identity (at least 98%), with some ambiguous cases resolved manually. Minor discrepancies such as a missing methionine at the sequence start or a single amino acid replacement were tolerated.
To compare the coverage of our experimental cytosol sample with the theoretical protein content of cytosol we combined several recent sources of data as well as bioinformatics prediction techniques. For 13% (568 out of 4289) of E. coli proteins experimentally determined cellular localization information has been reported by Lopez-Campistrous et al. . We further utilized the PSORT database  version 2.0 that provides localization annotation for 62% of the complete E. coli proteome (2678 proteins). The remaining E. coli proteins are classified in the PSORT database as "unknown" or "unknown with multiple possible localizations". We complemented this information with the number of transmembrane segments predicted using TMHMM  version 2.0.
Proteins with a high number of predicted transmembrane segments can be safely assumed not to be located within the cytosol. However, the TMHMM predictions may lead to an overprediction of cytosolic proteins, as this method reliably excludes only those proteins that have multiple integral membrane segments. Furthermore, the possibility of falsely predicted membrane segments needs to be considered. We therefore combined the three data sources described above (the number of transmembrane segments, PSORT localization, and experimental localization) to find the most accurate definition of the E. coli cytosol proteome. First we consider all proteins that have at most one predicted membrane region and are annotated as "cytosolic" or "unknown" in the PSORT database. This criterion would predict 61.46% (2636 of 4289) of the E. coli proteome to be cytosolic (Table 2). The advantage of this estimate is twofold. On the one hand a false positive prediction of one membrane region is still tolerated and thus does not lead to loss of information. On the other hand the intersection with the independent PSORT data ensures that an overprediction of cytosolic proteins is avoided as much as possible. Finally we extend our previous definition and add all proteins that were experimentally determined to be cytosolic. This results in 2680 proteins that we adopt as our final estimate of the E. coli cytosol proteome. It is notable that the experimental localization data hardly increase the number of the defined cytosolic proteins (only 44 additional proteins, less than 2% of the 2680). This shows the almost complete overlap of the first definition with the experimentally confirmed protein set and confirms the validity of our approach.
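The final cytosol definition corresponds to the last row of Table 2, (TMS ≤ 1 & PSORT = C | U) | EXP = C, and can be written as a small predicate. The sketch below uses the localization codes from Table 2; the function name and argument layout are illustrative assumptions. The last line rechecks the quoted percentage against the proteome size of 4289 used in the preceding paragraph.

```python
def is_cytosolic(tms, psort, exp=None):
    """Cytosol definition: (TMS <= 1 and PSORT in {C, U}) or EXP == C.

    tms   -- number of predicted transmembrane segments (TMHMM)
    psort -- PSORT code, e.g. "C", "CM", "P", "OM", "E", "U", "UM"
    exp   -- experimental localization code, or None if not available
    """
    predicted_cytosolic = tms <= 1 and psort in {"C", "U"}
    return predicted_cytosolic or exp == "C"

# Share of the proteome selected by the prediction-only criterion.
predicted_share = round(100 * 2636 / 4289, 2)
```

Tolerating a single predicted segment keeps proteins with one false positive membrane prediction, while the experimental term rescues proteins that the predictions would otherwise exclude.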
For convenience, we considered proteins with copy numbers greater than 2050 (emPAI > 29.0) to be highly abundant and assigned all remaining proteins to the low abundance category. This threshold was found automatically by clustering the log copy number values with the Expectation Maximization algorithm as implemented in the WEKA machine learning workbench, version 3.5.6, using default parameters with the number of clusters set to two. Because the copy number values follow an extreme value distribution, they were log-transformed so that the Gaussian approximation used in the clustering process would be applicable.
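To illustrate the idea of splitting log copy numbers into two clusters, the sketch below fits a two-component one-dimensional Gaussian mixture by Expectation Maximization in plain Python. It is not the WEKA implementation, its initialization and stopping rule are simplifications chosen here, and the copy numbers are synthetic values drawn from two arbitrary log-normal populations.

```python
import math
import random


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_two_gaussians(values, iterations=200):
    """EM for a two-component 1-D Gaussian mixture.

    Returns (weights, means, sds); component 0 starts from the lower half
    of the sorted data and component 1 from the upper half.
    """
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    means = [sum(xs[:half]) / half, sum(xs[half:]) / (n - half)]
    overall = sum(xs) / n
    sd0 = math.sqrt(sum((x - overall) ** 2 for x in xs) / n)
    sds = [sd0, sd0]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: posterior probability of each component for each value.
        resp = []
        for x in xs:
            p = [weights[k] * normal_pdf(x, means[k], sds[k]) for k in (0, 1)]
            total = (p[0] + p[1]) or 1e-300  # guard against underflow
            resp.append((p[0] / total, p[1] / total))
        # M-step: re-estimate weight, mean and sd of each component.
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            sds[k] = max(math.sqrt(var), 1e-6)
    return weights, means, sds


def is_high(log_value, weights, means, sds):
    # Assign a value to the component with the larger posterior probability.
    low = weights[0] * normal_pdf(log_value, means[0], sds[0])
    high = weights[1] * normal_pdf(log_value, means[1], sds[1])
    return high > low


# Synthetic copy numbers (NOT data from this study).
random.seed(0)
copy_numbers = [math.exp(random.gauss(math.log(200), 0.8)) for _ in range(300)]
copy_numbers += [math.exp(random.gauss(math.log(8000), 0.6)) for _ in range(100)]

log_values = [math.log(c) for c in copy_numbers]  # log-transform first
weights, means, sds = fit_two_gaussians(log_values)
```

The abundance threshold then corresponds to the log copy number at which `is_high` switches from `False` to `True`.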
All statistical tests were performed, and most figures prepared, with the R software package version 2.0 and PROMPT. To compare the distributions of two unpaired samples with non-Gaussian or unknown distributions, the rank-sum Mann-Whitney (MW) test and the two-sample Kolmogorov-Smirnov (KS) test were applied using the significance threshold α = 0.05. The null hypothesis of the Mann-Whitney test is that the two abundance distributions are identical, so that neither sample tends to yield larger values than the other. The null hypothesis of the Kolmogorov-Smirnov test is that the values of the two samples are drawn from the same continuous distribution. Both tests have the advantage that they make no assumptions about the distribution of the data. To ensure that our tests are not biased by small sample sizes when comparing essential genes with their counterparts, the test results were verified by additional random sampling, whereby each of the applied tests was repeated 10^5 times with a randomly drawn sample of the associated basic population. The p-value of the actual test was then compared with the p-value distribution of the random samples (data not shown). An observed p-value that lies within the 5% quantile of this distribution indicates a reliable test outcome, independent of sampling bias. Descriptive boxplot statistics such as median, quartiles and outliers were generated with R. According to the canonical statistical definition, values greater than the third quartile plus the interquartile range (IQR) were considered outliers. The IQR is defined as the third quartile minus the first quartile. Relationships between variables were analyzed using least squares regression, loess estimation and the Pearson or Spearman rank correlation methods implemented in R with default parameters.
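For readers unfamiliar with the two tests, the following plain Python sketch computes only their test statistics: the Mann-Whitney U statistic and the Kolmogorov-Smirnov D statistic. It is an illustration with invented numbers, not the R code used in this study, and it deliberately leaves the p-value calculation to a statistics package.

```python
from bisect import bisect_right


def mann_whitney_u(a, b):
    """U statistic of sample a: number of pairs (x in a, y in b) with x > y.

    Ties contribute 0.5 each.
    """
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u


def ks_statistic(a, b):
    """Largest absolute difference between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    for v in a_sorted + b_sorted:
        fa = bisect_right(a_sorted, v) / len(a_sorted)
        fb = bisect_right(b_sorted, v) / len(b_sorted)
        d = max(d, abs(fa - fb))
    return d


# Invented values for illustration only.
sample_a = [1, 3, 5]
sample_b = [2, 4, 6]

u_a = mann_whitney_u(sample_a, sample_b)
u_b = mann_whitney_u(sample_b, sample_a)
d = ks_statistic(sample_a, sample_b)
```

The two U statistics always sum to the product of the sample sizes, which is a convenient sanity check.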
A set of known E. coli operons was obtained from RegulonDB. For all operons with abundance information available for at least three proteins, the variance of the natural logarithm of the emPAI values was calculated. This variance indicates how similar the abundances of the proteins within each operon are.
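A minimal Python sketch of this per-operon calculation is given below. The operon and gene names and the emPAI values are invented, and the use of the sample variance (division by n - 1) is an assumption made here, since the text does not state which variance estimator was used.

```python
import math
from statistics import variance


def operon_log_variances(operons, empai, min_proteins=3):
    """Variance of ln(emPAI) per operon.

    Only operons with abundance values for at least `min_proteins`
    proteins are reported.
    """
    result = {}
    for operon, genes in operons.items():
        logs = [math.log(empai[g]) for g in genes if g in empai]
        if len(logs) >= min_proteins:
            result[operon] = variance(logs)  # sample variance (assumption)
    return result


# Hypothetical operons and emPAI values, for illustration only.
operons = {
    "opA": ["g1", "g2", "g3"],
    "opB": ["g4", "g5", "g6", "g7"],  # g7 has no abundance value
    "opC": ["g8", "g9"],              # too few proteins, skipped
}
empai = {
    "g1": 10.0, "g2": 10.0, "g3": 10.0,
    "g4": 1.0, "g5": math.e, "g6": math.e ** 2,
    "g8": 5.0, "g9": 7.0,
}

variances = operon_log_variances(operons, empai)
```

A variance close to zero, as for the toy operon `opA`, means that the proteins of the operon have nearly identical abundance.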
Functional roles of gene products were described in terms of the manually curated hierarchical functional catalog (FUNCAT). In this catalog each of the 16 main classes (e.g., metabolism, energy) may contain up to six subclasses. An essential feature of FUNCAT is its multidimensionality, meaning that any protein can be assigned to multiple categories. Carefully verified manual assignments of E. coli gene products to functional categories were obtained from Biomax Informatics AG, Martinsried, Germany. Likewise, the SCOP database provides a hierarchical classification of protein structural domains. SCOP fold assignments to E. coli gene products were based on a BLAST E-value threshold of 0.001. In this work both FUNCAT and SCOP designators were truncated to include only the two upper levels of the hierarchy. Proteins assigned to the same SCOP fold were grouped, and the average emPAI value for each group was calculated. To limit the influence of individual outliers with very high or very low expression levels, only groups with 10 or more proteins were considered. The EC Enzyme Nomenclature information was taken from the Swiss-Prot protein descriptions.
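The truncation and grouping steps can be sketched in Python as follows. The sketch assumes dot-separated hierarchical identifiers and a single fold per protein, and all protein names, identifiers and emPAI values are invented for illustration.

```python
from collections import defaultdict


def truncate_levels(identifier, levels=2):
    """Keep only the upper `levels` of a dot-separated hierarchical id."""
    return ".".join(identifier.split(".")[:levels])


def mean_empai_by_fold(assignments, empai, min_group_size=10):
    """Average emPAI per truncated fold id, skipping small groups."""
    groups = defaultdict(list)
    for protein, fold_id in assignments.items():
        if protein in empai:
            groups[truncate_levels(fold_id)].append(empai[protein])
    return {
        fold: sum(values) / len(values)
        for fold, values in groups.items()
        if len(values) >= min_group_size
    }


# Hypothetical assignments: ten proteins in fold "a.1", three in fold "b.2".
assignments = {f"p{i}": f"a.1.{i % 3 + 1}.1" for i in range(10)}
assignments.update({f"q{i}": "b.2.1.1" for i in range(3)})
empai = {f"p{i}": float(i + 1) for i in range(10)}
empai.update({f"q{i}": 100.0 for i in range(3)})

fold_means = mean_empai_by_fold(assignments, empai)
```

Here the small group `b.2` is dropped despite its high values, which is the purpose of the minimum group size.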
Disorder predictions were taken from our PEDANT database, where they are calculated with the software GlobPlot. GlobPlot utilizes the statistics of proteins known to have unstructured regions [67, 68]. The number of alternating hydrophobic/hydrophilic stretches was computed as described. The residues A, C, F, G, I, L, M, P, V, W and Y were considered to be hydrophobic and H, Q, N, S, T, K, R, D and E were considered to be hydrophilic in this study. The hydrophobicity of a protein was defined as $H = \frac{1}{n}\sum_{i=1}^{n} H_i$, with $H_i$ denoting the hydrophobicity value of the amino acid at position $i$ of a protein of $n$ amino acids. Hydrophobicity values were calculated using the Kyte-Doolittle scale.
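As an illustration, the Python sketch below encodes the residue classification given above and computes a protein-level hydrophobicity, assuming that this value is the mean of the per-residue Kyte-Doolittle values. The function names and the example sequences are hypothetical, and the counting of alternating stretches is not reproduced because its exact procedure is not described here.

```python
# Kyte-Doolittle hydropathy values per amino acid (one-letter code).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Residue classes as listed in the text.
HYDROPHOBIC = set("ACFGILMPVWY")
HYDROPHILIC = set("HQNSTKRDE")


def mean_hydrophobicity(sequence):
    """Mean Kyte-Doolittle value over all residues (assumed definition)."""
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)


def binary_pattern(sequence):
    """Encode each residue as '1' (hydrophobic) or '0' (hydrophilic)."""
    return "".join("1" if aa in HYDROPHOBIC else "0" for aa in sequence)


# Invented toy sequences, for illustration only.
h_poly_ala = mean_hydrophobicity("AAAA")
h_balanced = mean_hydrophobicity("IR")
pattern = binary_pattern("AKLE")
```

The binary pattern is one possible starting point for counting alternating hydrophobic and hydrophilic stretches.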
(liquid chromatography coupled to) tandem mass spectrometry
sodium dodecylsulfate polyacrylamide gel electrophoresis
strong cation exchange
strong anion exchange
C18 reversed phase chromatography with di-n-butylamine acetate as a mobile phase additive
Ion pair chromatography
collision induced dissociation
stable isotope labeling by amino acids in cell culture
We thank Philip Wong for helpful discussions about aggregation and designability, Tobias Mayer for E. coli sample preparation, and Biomax Informatics AG for providing manual annotation of the E. coli genome. We also thank Naoyuki Sugiyama, Takeshi Masuda, Kenji Nakahigashi and other members of Keio University for providing E. coli BW25113 cells with their quantitative proteome data. Y.I. thanks Eisai Co., Ltd. for the opportunity to stay at CEBI.
This work was supported by the Integrated Project "Interaction Proteome" of the European Commission and the program "Bioinformatics Initiative Munich" by Deutsche Forschungsgemeinschaft.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.