How to find soluble proteins: a comprehensive analysis of alpha/beta hydrolases for recombinant expression in E. coli

Background In screening of libraries derived by expression cloning, expression of active proteins in E. coli can be limited by formation of inclusion bodies. In these cases it would be desirable to enrich gene libraries for coding sequences with soluble gene products in E. coli and thus to improve the efficiency of screening. Previously Wilkinson and Harrison showed that solubility can be predicted from amino acid composition (Biotechnology 1991, 9(5):443–448). We have applied this analysis to members of the alpha/beta hydrolase fold family to predict their solubility in E. coli. alpha/beta hydrolases are a highly diverse family with more than 1800 proteins which have been grouped into homologous families and superfamilies. Results The predicted solubility in E. coli depends on hydrolase size, phylogenetic origin of the host organism, the homologous family and the superfamily, to which the hydrolase belongs. In general small hydrolases are predicted to be more soluble than large hydrolases, and eukaryotic hydrolases are predicted to be less soluble in E. coli than prokaryotic ones. However, combining phylogenetic origin and size leads to more complex conclusions. Hydrolases from prokaryotic, fungal and metazoan origin are predicted to be most soluble if they are of small, medium and large size, respectively. We observed large variations of predicted solubility between hydrolases from different homologous families and from different taxa. Conclusion A comprehensive analysis of all alpha/beta hydrolase sequences allows more efficient screenings for new soluble alpha/beta hydrolases by the use of libraries which contain more soluble gene products. Screening of hydrolases from families whose members are hard to express as soluble proteins in E. coli should first be done in coding sequences of organisms from phylogenetic groups with the highest average of predicted solubility for proteins of this family. 
The tools developed here can be used to identify attractive target genes for expression using protein sequences published in databases. This analysis also directs the design of degenerate, family-specific primers to amplify new members from homologous families or superfamilies with a high probability of soluble alpha/beta hydrolases.


Background
It was observed that screening of libraries derived by expression cloning for new gene products with a given catalytic activity can be limited by formation of inclusion bodies. In these cases it would be desirable to construct libraries with a higher fraction of soluble gene products. Enrichment of soluble proteins could be achieved either by limiting the screening to coding sequences from phylogenetic groups with mainly soluble proteins or by the use of degenerate primers which are specific for homologous families with mainly soluble proteins.
Alternatively, solubility can be improved by the use of fusion proteins [2] like NusA, MBP, Thioredoxin, GrpE, BFR, GST, DsbA [3], the N-terminal domain of IF2 [4] or phage coat protein III [5]. In addition, fusion proteins allow affinity chromatography and thus simplify purification [6]. However, in some cases the fusion partner has to be removed after protein purification [7], which constitutes an additional step in protein preparation. Other strategies try to improve in vivo solubility of a recombinant protein by protein engineering using strategies like molecular evolution [8,9] or rational protein design. Examples of protein design are the insertion of positively charged residues into hydrophobic patches on the surface [10], exchange of phenylalanines by serines [11] or asparagine residues by aspartic acids [12]. It has been shown that single residues can have a major impact on solubility [13][14][15][16]. The expression system is also important for in vivo solubility [2,17]. Other factors are cultivation and disruption conditions [18][19][20], rate of protein synthesis [21], fermentation temperature [20,22] and the amount of helper protein [12][23][24][25][26]. Fusion proteins have already been successfully used in high-throughput expression studies to improve solubility [8][27][28][29][30].
However, engineering approaches are limited to specific proteins or growth conditions and are hardly applicable to high throughput expression. Therefore, in projects where libraries are screened for activity, solubility is always an implicit criterion. Previously Harrison et al. introduced a two-parameter model based on an analysis of 81 proteins for which experimental data on solubility exists [1,3]. The model proposes that solubility of proteins in E. coli at physiological conditions depends mainly on protein charge and the relative number of turn-forming residues. Interestingly, the model holds for a broad variety of proteins. We applied this model to a comprehensive analysis of the Lipase Engineering Database (LED) [31,32] which includes more than 1800 α/β-hydrolases. α/β-hydrolases share the same fold but are highly diverse in sequence. They are ubiquitous and include cellular and secreted proteins from a wide range of organisms.
The aim of this study was to find correlations between predicted solubility in E. coli and protein size or phylogenetic origin. Homologous families and superfamilies were analysed for the predicted solubility of their members.

Kingdoms of life
According to Harrison et al. [1,3] the canonical variable CV-CV' predicts from the protein sequence whether a recombinantly expressed protein is soluble in the cytoplasm of E. coli (CV-CV' < 0) or will form inclusion bodies (CV-CV' > 0). For CV-CV' = 0 the probability of solubility is 0.5. The probability of solubility or insolubility rises with the absolute value of CV-CV'. An analysis of the Lipase Engineering Database indicates that most of the hydrolases are predicted to be insoluble in E. coli, with a major peak at CV-CV' = 0.8 and a minor peak at CV-CV' = 0.2 (Figure 1). A separate analysis of hydrolases of eukaryotic and prokaryotic origin (679 and 686 hydrolases, respectively) demonstrates that these two peaks are formed predominantly by the hydrolases from each of the two kingdoms of life. The distribution of CV-CV' of bacterial hydrolases is characterized by an average of 0.42 and a first quartile of -0.18. Because the first quartile indicates the minimum solubility of the 25 % most soluble hydrolases, more than 25 % of all bacterial hydrolases are predicted to be soluble. In contrast, for eukaryotic hydrolases the average of CV-CV' is 0.94; the first quartile is 0.58. Thus, only a small fraction of eukaryotic hydrolases is predicted to be soluble in E. coli. α/β-hydrolases from archaea were not investigated because they are represented by only 23 hydrolases.
Thus, in general, bacterial α/β-hydrolases are predicted to be more soluble in E. coli than eukaryotic α/β-hydrolases.

Protein size
The family of α/β-hydrolases falls into three major groups of protein size: small (150-380 amino acids), medium-sized (380-520 amino acids) and large hydrolases (more than 520 amino acids) (Figure 2). Large hydrolases are mainly of eukaryotic origin while small hydrolases are mainly of bacterial origin. Correlating sequence length and predicted solubility in E. coli demonstrates that large hydrolases are predicted to be less soluble in E. coli than smaller ones (Figure 3). Hydrolases predicted to be soluble in E. coli (CV-CV' < 0) have sequence lengths between 200 and 400 amino acids; most hydrolases of more than 400 amino acids are predicted to be insoluble in E. coli (CV-CV' > 0). A high fraction of bacterial and archaean small hydrolases is predicted to be soluble in E. coli, while eukaryotic small hydrolases are predicted to be mainly insoluble. The two outliers with CV-CV' < -1 and a sequence length larger than 650 are a lipase from Staphylococcus xylosus (AAG35726) and CG6296 from Drosophila melanogaster (AAF56648). Both are putative proteins and have not yet been expressed in E. coli. Thus, there are two general observations: (1) small α/β-hydrolases are predicted to be more soluble in E. coli than large α/β-hydrolases and (2) eukaryotic α/β-hydrolases are larger than prokaryotic α/β-hydrolases.

Analysis of groups formed by protein size
To distinguish the effects of hydrolase size and phylogenetic origin on solubility, small, medium-sized and large α/β-hydrolases were investigated separately, and hydrolases were grouped by phylogeny of their origin (Tables 1, 2, 3).
Small hydrolases have an average CV-CV' of 0.48 and a first quartile of -0.17; thus more than 25 % of small hydrolases are predicted to be soluble in E. coli. All taxa have a positive average of CV-CV' (Table 1), indicating that less than 50 % of the hydrolases in each taxon are predicted to be soluble. Hydrolases from eukaryotic taxa are on average predicted to be highly insoluble. All prokaryotic taxa and plants have negative first quartiles of CV-CV'; thus more than 25 % of their hydrolases are predicted to be soluble. All other eukaryotic taxa have positive first quartiles of CV-CV'; thus most of their hydrolases are predicted to be insoluble. The taxa containing hydrolases with the highest predicted solubility average are from bacteria; the taxa containing hydrolases with the lowest predicted solubility average are from eukaryota.
For medium-sized hydrolases (Table 2), the taxa containing hydrolases with the highest predicted solubility average are from bacteria and fungi. The taxa containing hydrolases with the lowest predicted solubility average are from metazoa and plants.
Large hydrolases have an average CV-CV' of 0.93 and a first quartile of 0.61; they are in general predicted to be highly insoluble. All taxa have positive averages and first quartiles of CV-CV', and thus most of their hydrolases are predicted to be insoluble (Table 3). The taxa containing hydrolases with the highest predicted solubility average are from metazoa; the taxa containing hydrolases with the lowest predicted solubility average are from bacteria and fungi.
In general, CV-CV' of bacterial hydrolases is much lower if the hydrolase is small, while metazoan hydrolases have a lower average and first quartile of CV-CV' if the hydrolase is large. Though large metazoan hydrolases have a higher probability of solubility than small metazoan hydrolases, few large hydrolases are predicted to be soluble in E. coli (Figure 3). This is consistent with the result from Table 3 that in the analysis of large hydrolases no taxon with a negative first quartile of CV-CV' could be found.
Thus there are several conclusions. Large α/β-hydrolases from bacteria are predicted to be less soluble than smaller ones. Small bacterial α/β-hydrolases are predicted to be more soluble than both small and large eukaryotic α/β-hydrolases. But large hydrolases from metazoa are predicted to be more soluble than large hydrolases from bacteria and small hydrolases from metazoa. So there seems to be a fundamental difference between metazoa and bacteria. Fungi, especially ascomycetes, behave differently. Their α/β-hydrolases with the highest predicted solubility are mainly medium-sized.
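The grouping used in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout `(taxon, length, cv_minus_cv_prime)` and the function names are hypothetical, and because the size ranges given in the text share their boundaries (380 and 520), the sketch arbitrarily assigns a boundary length to the smaller class.

```python
from collections import defaultdict


def size_class(length):
    """Assign a size class using the ranges quoted in the text.

    ASSUMPTION: boundary lengths (380, 520) go to the smaller class.
    """
    if length <= 380:
        return "small"
    if length <= 520:
        return "medium"
    return "large"


def group_by_size_and_taxon(records):
    """Collect CV-CV' values per (size class, taxon).

    records: iterable of (taxon, length, cv_minus_cv_prime) tuples
    (a hypothetical layout chosen for this sketch).
    """
    groups = defaultdict(list)
    for taxon, length, delta in records:
        groups[(size_class(length), taxon)].append(delta)
    return dict(groups)


def group_means(groups):
    """Average CV-CV' for every group."""
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}


# Made-up values, for illustration only (not data from the study).
example = [
    ("bacteria", 300, -0.2),
    ("bacteria", 310, 0.4),
    ("fungi", 450, 0.1),
    ("metazoa", 600, 0.5),
]
print(group_means(group_by_size_and_taxon(example)))
```

Per-group first quartiles can be added in the same way once the CV-CV' values are collected per key.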

Analysis of solubility by genera
Of the 257 genera represented in the database most include only a few hydrolases. Therefore the 29 genera with at least ten hydrolases were analyzed. Hydrolases from bacterial genera show a wide range of averages of predicted solubility. Interestingly, the bacterial genera with the highest and the lowest average of CV-CV' (Mycobacterium and Rhodococcus, respectively) are both from actinobacteria. The third large genus of actinobacteria, Streptomyces, is predicted to include more than 25 % soluble hydrolases, as shown by a negative first quartile.
Thus the averages of different genera of a taxon may differ completely in predicted solubility. Therefore a strategy to identify α/β-hydrolases from actinobacteria that are soluble in E. coli should focus on proteins from Rhodococcus and Streptomyces, but not from Mycobacterium.

Analysis of solubility by sequence similarity
Hydrolases with sequence similarity have been assigned to superfamilies which were analysed (Table B in the supplementary file 1; supplementary file 2). In general, the average of CV-CV' depends on sequence length. However, the superfamily with the lowest sequence length in this table (Table B in the supplementary file 1), Bacillus lipase, has the lowest predicted solubility.
Figure 3: Correlation between CV-CV' and the length of protein sequences in bacterial (blue), eukaryotic (green), and archaean (red) α/β-hydrolases.
Proteins from cytosolic hydrolases, a large superfamily that contains mainly epoxide hydrolases and haloalkane dehalogenases, are in general predicted to be more soluble than proteins from most other superfamilies. The average of CV-CV' is 0.29 and the first quartile is -0.22. This superfamily was chosen for a more detailed analysis of homologous families.

The solubility model
The dataset of the statistical solubility model comprised 81 highly diverse proteins from a wide range of organisms for which solubility data exist [1]. The published prediction accuracy of the five-parameter model was 76 % for soluble proteins and 91 % for insoluble proteins [1]; the overall accuracy for solubility prediction of proteins from the dataset was 88 % [1]. It has been discovered that only two out of the five parameters are critical for distinguishing between soluble and insoluble proteins [3]. Therefore, CV-CV', the indicator of predicted solubility, was derived from two terms: the total charge, calculated from the relative numbers of arginines, lysines, aspartic acids and glutamic acids, and the relative number of turn-forming residues, calculated from the relative number of asparagines, glycines, prolines and serines. Those two parameters show a level of significance of 100 % [1].
Most of the variation of CV-CV' among α/β-hydrolases is caused by the charge term. A high negative or positive charge results in the best predicted solubility. However, the pK value of a titratable group is highly dependent on the protein's structure. Short- and long-range interactions can lead to pK shifts of more than two units, changing the total protein charge by more than five unit charges [33,34].
An evaluation of nine frequently used fusion proteins used to improve solubility in E. coli [3][4][5] showed that seven proteins indeed are predicted to be soluble in E. coli (data not shown). The exceptions are MBP which has a CV-CV' of 0.23, and phage coat protein III which has a CV-CV' of 0.62 (data not shown).
To test the model in the prediction of solubility of hydrolases, 35 hydrolases from the PDB which were annotated as expressed in E. coli were examined (Table D in the supplementary file 1). Because good solubility is a prerequisite for successful crystallization, this group of proteins served as a positive control for the predictive value of our method. 24 of 35 hydrolases are predicted to be soluble in E. coli (CV-CV' < 0). For all 35 hydrolases, the average of CV-CV' is -0.23 and the first quartile is -0.80. Thus, the predicted average solubility of this group of proteins is much higher than that of any single protein family in the database, including proteins from enterobacteria and the genus Escherichia, which have an average CV-CV' of 0.24 and 0.10, respectively (data not shown). This strongly supports the reliability of the method.
In addition, the observation that substitution of asparagines by aspartic acids in DsbA-IGFBP-3 fusion proteins improved solubility [12] is consistent with the solubility formula because the fusion protein is already predicted to be negatively charged (data not shown). Thus, increasing the negative charge increases predicted solubility. Similarly, incorporation of solvent-exposed positively charged amino acids improved solubility of consensus ankyrin repeat proteins [10]. Here the net charge proposed by the solubility formula was zero (data not shown), so solubility is predicted to be increased by insertion of positively or negatively charged amino acids.
However, the solubility formula depends exclusively on sequence information and neglects the structural context. Therefore, in some cases this simple model resulted in wrong predictions. In the model, substitution of multiple phenylalanine residues by serine [11] led to a predicted increase of the ratio of turn-forming residues in the formula and thus to a lowered predicted solubility in E. coli. Instead, an increase in solubility was observed experimentally [11], which can be explained by the structural context: replacement of solvent-accessible phenylalanines increased the polarity of the protein's surface and thus its solubility. The fact that single residues can have a huge impact on in vivo solubility [13][14][15][16] is not always fully explainable by the solubility formula. Therefore, for the design of proteins with higher solubility the formula should be combined with a careful investigation of the structural context.
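The direction of these substitution effects can be illustrated with a small Python sketch that computes only the two composition terms of the model (fraction of turn-forming residues and net charge per residue), not the weighted canonical variable. The toy sequence and the helper name `composition_terms` are hypothetical and chosen purely for illustration; they are not sequences from the cited studies.

```python
def composition_terms(seq):
    """Return (turn fraction, net charge per residue) for a protein sequence.

    Turn-forming residues are N, G, P, S; charge is (R + K - D - E) / n,
    following the residue sets named in the text.
    """
    seq = seq.upper()
    n = len(seq)
    turn = sum(seq.count(aa) for aa in "NGPS") / n
    charge = (seq.count("R") + seq.count("K")
              - seq.count("D") - seq.count("E")) / n
    return turn, charge


# Hypothetical 18-residue toy sequence with a net negative charge.
toy = "MKDEFNAFDLNEEAFLNG"

base = composition_terms(toy)
# Asparagine -> aspartic acid: fewer turn-forming residues, more negative charge.
asn_to_asp = composition_terms(toy.replace("N", "D"))
# Phenylalanine -> serine: more turn-forming residues, charge unchanged.
phe_to_ser = composition_terms(toy.replace("F", "S"))

print("original  :", base)
print("N -> D    :", asn_to_asp)
print("F -> S    :", phe_to_ser)
```

In this toy case the N to D exchange moves both terms in the direction the model associates with higher predicted solubility, while the F to S exchange raises only the turn fraction, which is why the formula predicts lower solubility for a change that was found experimentally to help.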

Solubility of α/β-hydrolases
Protein size was not included as a parameter in the solubility model, because it showed no significant difference between soluble and insoluble proteins of the dataset [1]. However, the analysis of groups formed by α/β-hydrolase size revealed that both protein size and phylogeny of organisms have a major impact on predicted solubility. Statements like 'Small hydrolases are predicted to be more soluble than large hydrolases', 'Bacterial hydrolases are predicted to be more soluble than eukaryotic hydrolases' are generally true, but simplifications.
For small to medium-sized hydrolases, bacterial hydrolases are predicted to be more soluble than eukaryotic hydrolases; archaeal hydrolases lie somewhere in between. The relatively high predicted solubility of medium-sized fungal hydrolases is in contrast to the low predicted solubility of small and large fungal hydrolases. Hydrolases from plants are predicted to be much more soluble when they are small than when they are of medium size. However, the significance of this statement is relatively low because there are only eight medium-sized hydrolases from plants.
Though large hydrolases are mainly predicted to be insoluble, the CV-CV' values are much lower for metazoan than for bacterial and fungal hydrolases. So here the situation is opposite to that of small and medium-sized hydrolases. Additionally, large metazoan hydrolases are predicted to be more soluble in E. coli than small and medium-sized metazoan hydrolases.
These results suggest that, in problematic cases where soluble hydrolases are very rare, it makes sense to screen for large hydrolases in coding sequences of metazoa, for small hydrolases in bacterial or archaean coding sequences, and for medium-sized hydrolases in actinobacterial or ascomycetic coding sequences. Secreted and eukaryotic hydrolases generally have a low predicted solubility in E. coli.
Like the data about taxa, the family information could be used to search efficiently for new soluble hydrolases. If a specific catalytic activity is observed in different homologous families, or a soluble member of a specific superfamily is sought for a structural genomics project, family-specific degenerate primers could preferably be designed for families which are predicted to include mainly soluble hydrolases.
If soluble hydrolases of a given size are rarely found, screening of libraries could be limited to coding sequences from taxa where soluble hydrolases are expected.

Conclusion
General rules for the relationship between predicted solubility in E. coli, protein size and phylogenetic origin are:
1) Bacterial hydrolases are predicted to be more soluble in E. coli than eukaryotic hydrolases.
2) Small hydrolases are predicted to be more soluble in E. coli than large hydrolases.
3) In one taxon huge differences of predicted solubility between genera can exist.
4) In one superfamily huge differences of predicted solubility between homologous families can exist.
When taking into account the groups formed by protein size there are three additional rules:
5) Small bacterial hydrolases are predicted to be more soluble than small and large hydrolases from eukaryotes and large bacterial hydrolases.
6) Large metazoan hydrolases are predicted to be more soluble than large hydrolases from bacteria and small hydrolases from metazoa.
7) Fungal medium-sized hydrolases are predicted to be more soluble than small or large hydrolases from fungi.
When using the family characteristics, family-specific primers could be designed to amplify members from specific homologous families with high averages of predicted solubility in E. coli.

The database
Sequence data for analysis was derived from the Lipase Engineering Database (LED) [31,32] which integrates sequence, structure and annotation information of α/β-hydrolases. The database comprises 1820 hydrolases which are grouped into 149 homologous families and 45 superfamilies.
To remove fragments, only hydrolases larger than 170 amino acids were included in the analysis.

The solubility model
A two-parameter statistical model by Harrison et al. [1,3] was used to predict in vivo solubility of recombinant proteins in E. coli. The main parameters for solubility in E. coli are the relative number of turn-forming residues (asparagine, glycine, proline and serine) and the absolute value of the charge per residue, which is determined by the fraction of positively and negatively charged amino acids (arginine, lysine, aspartic acid, glutamic acid). These values have been combined into a canonical variable CV, where N, G, P, S, R, K, D, E, are the numbers of asparagines, glycines, prolines, serines, arginines, lysines, aspartic acids and glutamic acids, respectively, and n is the total number of residues in the sequence.
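The equation defining CV is not reproduced in this text. For orientation, the form in which the revised two-parameter model is usually quoted is given below. The coefficients 15.43 and -29.56 and the offset 0.03 come from that commonly cited form and are assumptions here; they should be checked against [1,3] before use.

```latex
\mathrm{CV} \;=\; 15.43\,\frac{N + G + P + S}{n}
\;-\; 29.56\,\left|\,\frac{(R + K) - (D + E)}{n} - 0.03\,\right|
```

With this sign convention a larger fraction of turn-forming residues raises CV, and a larger absolute charge per residue lowers it.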
To distinguish soluble from insoluble protein a threshold of CV' = 1.71 was introduced. If the difference CV-CV' is smaller than zero, the protein is predicted to be soluble in E. coli. If it is larger than zero the protein is predicted to be insoluble in E. coli. From CV-CV' a probability of solubility is calculated: The dependency of solubility on the parameters relative number of turn forming residues and absolute of charge per residue can be interpreted as follows: the higher the charge, the higher the repulsion between proteins, thus preventing aggregation. Additionally, many turn-forming residues slow down protein folding, resulting in a high concentration of folding intermediates which can form aggregates.
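A minimal Python sketch of this prediction step is shown below. The threshold CV' = 1.71 is taken from the text; the weights 15.43 and -29.56, the charge offset 0.03, and the quadratic probability fit (0.4934, 0.276, 0.0392) are the values commonly quoted for this model and are assumptions here, since they do not appear in this text. The function names are hypothetical.

```python
# ASSUMPTION: coefficients as commonly quoted for the revised two-parameter
# model; verify against refs [1,3]. Only CV_PRIME is stated in the text.
LAMBDA_TURN = 15.43
LAMBDA_CHARGE = -29.56
CHARGE_OFFSET = 0.03
CV_PRIME = 1.71


def cv_minus_cv_prime(seq):
    """Return CV-CV' for a one-letter amino acid sequence."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")
    # Relative number of turn-forming residues (N, G, P, S).
    turn = sum(seq.count(aa) for aa in "NGPS") / n
    # Net charge per residue from R, K (positive) and D, E (negative).
    charge = (seq.count("R") + seq.count("K")
              - seq.count("D") - seq.count("E")) / n
    cv = LAMBDA_TURN * turn + LAMBDA_CHARGE * abs(charge - CHARGE_OFFSET)
    return cv - CV_PRIME


def class_probability(delta):
    """Probability of the predicted class for a given CV-CV'.

    ASSUMPTION: quadratic fit as commonly quoted; it is only meaningful
    for moderate |CV-CV'| (it peaks near |CV-CV'| of about 3.5).
    """
    return 0.4934 + 0.276 * abs(delta) - 0.0392 * delta ** 2


def predict(seq):
    """Return (label, CV-CV', probability) for a sequence."""
    delta = cv_minus_cv_prime(seq)
    label = "soluble" if delta < 0 else "insoluble"
    return label, delta, class_probability(delta)
```

Calling `predict` on a sequence returns the predicted class together with CV-CV' and the associated probability; at CV-CV' = 0 the assumed fit gives about 0.49, in line with the value of 0.5 stated for the threshold.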

Statistical methods
Averages of CV-CV', the relative number of turn-forming residues, charge per residue and length of protein sequence were computed. Additionally, first quartiles of CV-CV' were determined. For determination of first quartiles the values were ordered by size and divided into four groups with the same number of members. The first quartile is the largest value of CV-CV' in the group which contains the 25 % smallest values of CV-CV'. This means that 25 % of all proteins in the distribution are predicted to have at least the solubility that is represented by the value of the first quartile. When used in combination with averages, quartiles give additional information about the shape of the distribution. They make it possible to compare distributions even if their averages are very similar. While the average value of CV-CV' gives an upper limit of solubility for roughly 50 % of all proteins of a distribution, the first quartile indicates the solubility of the 25 % most soluble proteins.
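The quartile definition above can be written down directly. The following Python sketch follows the description in the text (sort, split into four equally sized groups, take the largest value of the lowest group); how a remainder is handled when the number of values is not divisible by four is not stated, so the integer division used here is an assumption, and the function names are hypothetical.

```python
from statistics import mean


def first_quartile(values):
    """Largest value within the lowest quarter of the sorted values."""
    ordered = sorted(values)
    if len(ordered) < 4:
        raise ValueError("need at least four values")
    # ASSUMPTION: a remainder is ignored when len(values) % 4 != 0.
    group_size = len(ordered) // 4
    return ordered[group_size - 1]


def summarize(values):
    """Average and first quartile of a list of CV-CV' values."""
    return {"n": len(values),
            "mean": mean(values),
            "q1": first_quartile(values)}


# Made-up CV-CV' values, for illustration only.
deltas = [0.9, -0.3, 0.5, 1.2, -0.1, 0.7, 0.2, 1.0]
print(summarize(deltas))
```

A negative first quartile, as in this made-up example, means that at least 25 % of the values lie below zero, that is, at least a quarter of the group is predicted to be soluble even though the average is positive.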

Visualisation of distributions
The optimal window size for the visualisation of the distributions d 1 , d 2 and d 3 (Figures 1 and 2) was determined as follows. For each distribution a window size w was determined, where max and min are the largest and the smallest values of the distribution and n is the number of values in the distribution.