A graph-theoretic approach for classification and structure prediction of transmembrane β-barrel proteins
 Van Du T Tran^{1},
 Philippe Chassignet^{1},
 Saad Sheikh^{1} and
 Jean-Marc Steyaert^{1}
https://doi.org/10.1186/1471-2164-13-S2-S5
© Tran et al.; licensee BioMed Central Ltd. 2012
Published: 12 April 2012
Abstract
Background
Transmembrane β-barrel proteins are a special class of transmembrane proteins which play several key roles in the human body and in diseases. Due to experimental difficulties, the number of transmembrane β-barrel proteins with known structures is very small. Over the years, a number of learning-based methods have been introduced for recognition and structure prediction of transmembrane β-barrel proteins. Most of these methods emphasize homology search rather than any biological or chemical basis.
Results
We present a novel graph-theoretic model for classification and structure prediction of transmembrane β-barrel proteins. This model folds proteins based on energy minimization rather than a homology search, avoiding any assumption on the availability of a training dataset. The ab initio model presented in this paper is the first method to allow for permutations in the structure of transmembrane proteins and provides more structural information than any known algorithm. The model is also able to recognize β-barrels by assessing the pseudo free energy. We assess the structure prediction on 41 proteins gathered from existing databases on experimentally validated transmembrane β-barrel proteins. We show that our approach is quite accurate, with over 90% F-score on strands and over 74% F-score on residues. The results are comparable to those of other algorithms, suggesting that our pseudo-energy model is close to the actual physical model. We test our classification approach and show that it is able to reject α-helical bundles with 100% accuracy and β-barrel lipocalins with 97% accuracy.
Conclusions
We show that it is possible to design models for classification and structure prediction for transmembrane β-barrel proteins which do not depend essentially on training sets but on combinatorial properties of the structures to be proved. These models are fairly accurate, robust and can be run very efficiently on PC-like computers. Such models are useful for genome screening.
Background
Transmembrane proteins play several key roles in the human body, including inter-cell communication, transportation of nutrients, and ion transport. They also play key roles in human diseases such as depression, hypertension and cancer, and are thus targeted by a majority of pharmaceuticals being manufactured today. The transmembrane proteins are divided into two main types according to their conformation: α-helical bundles and β-barrels (TMB). The TMB proteins, which are much less abundant than helical bundles, are found in the outer membrane of Gram-negative bacteria, mitochondria and chloroplasts. They perform diverse functions such as porins, passive or active transporters, enzymes, defensive or structural proteins [1]. Thus, the structure of TMB proteins is very important for both biological and medical sciences.
These proteins, which span the membrane entirely, make up 20-30% of identified proteins in most whole genomes. However, due to difficulties in determination of their structures, solved TMB structures constitute only a meagre 2% of the RCSB Protein Data Bank (PDB) [2–5]. This is mainly due to experimental difficulties and complexity of the TMB structure [6]. Consequently, various learning-based techniques have been developed for discriminating TMB proteins from globular and transmembrane α-helical proteins [6–8], and for predicting TMB secondary structures [7–12]. We first discuss these methods and their potential shortcomings in detail, and then proceed with describing our approach.
Ou et al. [10] proposed a method based on radial basis function networks to predict the number of β-strands and membrane spanning regions in β-barrel outer membrane proteins. Randall et al. [9] predicted the TMB secondary structure with a 1D recursive neural network using alignment profiles. Gromiha et al. [7, 8] used the amino acid compositions of both globular and outer membrane proteins (OMPs) to discriminate OMPs and developed a feed-forward neural network-based method to predict the transmembrane segments. Bagos et al. [11] produced a consensus prediction from different methods based on hidden Markov models, neural networks and support vector machines [8, 13–19]. Tractability has been an issue for some of these approaches. In order to overcome this limitation, Waldispühl et al. [12] used a structural model and pairwise inter-strand residue statistical potentials derived from globular proteins to predict the super-secondary structure of TMB proteins. Freeman et al. [6] introduced a statistical approach for recognition of TMB proteins based on known physicochemical properties.
Most of these methods rely on the learning assumptions in the underlying models as well as on the sampling of proteins in their training set. However, the number of TMB proteins known today is tiny. Thus, it is arguable whether these approaches can work well for recognizing and folding TMB proteins that are not homologous to those currently known. It is also important to note that none of these methods allows for permutations in protein structures. TMB structures are not merely a series of β-strands where each is bonded to the preceding and succeeding ones in the primary sequence; they may also contain Greek key or Jelly roll motifs, as in the C-terminal domain of the PapC usher [PDB:3L48]. This level of structure may be described as a permutation on the order of the bonded strands.
In this paper, we present a novel ab initio model for classification and structure prediction of TMB proteins based on minimizing free energy in a graph-theoretic framework. It is able to deal with permuted TMB structures. The prediction accuracy is evaluated on known TMB proteins available in popular protein databases [20], and compared with existing software [9, 10, 12, 21]. Our approach also performs well in structure prediction and the results are comparable to those of the existing algorithms. Ours is the first model that actually gives an insight into the physicochemical model rather than merely classifying or predicting TMB proteins. The results show that our approach is also good at discriminating TMB proteins.
Results and discussion
Folding
Comparison of prediction accuracy on PDBTM40

| Method | Residues: Q_{2} | Residues: Specificity | Residues: Sensitivity | Residues: F-score | Residues: MCC | Strands: Specificity | Strands: Sensitivity | Strands: F-score | Strands: MCC |
|---|---|---|---|---|---|---|---|---|---|
| TMBpro | 81.2 ± 6.1* | 79.3 ± 7.9 | 84.2 ± 11.2 | 0.76 ± 0.1 | 0.61 ± 0.14 | 90.1 ± 15.0 | 94.2 ± 12.5 | 0.93 ± 0.12 | 0.85 ± 0.26 |
| BBP | 79.2 ± 5.4 | 78.4 ± 6.3 | 80.4 ± 9.9 | 0.74 ± 0.1 | 0.57 ± 0.12 | 91.4 ± 12.0 | 91.4 ± 11.3 | 0.92 ± 0.11 | 0.83 ± 0.22 |
The TMBETAPRED-RBF web server predicted non-TMB for 24 out of 41 proteins of PDBTM40, or 58.5%. The structures for correctly identified proteins were completely accurate. This might be because they were included in the training set.
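The per-residue measures in the table can be illustrated with a minimal sketch based on the standard confusion-matrix definitions. Two points are assumptions made here: the function name `residue_metrics` is hypothetical, and the second measure is computed as precision, TP / (TP + FP), which some structure-prediction papers report under the name specificity; the exact definition used for the table is not stated in this section.

```python
import math

def residue_metrics(observed, predicted):
    """Compare two equal-length strings where 'S' marks a strand residue.

    Returns Q2 (fraction of correctly labelled residues), sensitivity,
    precision, F-score and Matthews correlation coefficient (MCC).
    Standard definitions are assumed; they may differ from the paper's.
    """
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    pairs = list(zip(observed, predicted))
    tp = sum(o == "S" and p == "S" for o, p in pairs)
    tn = sum(o != "S" and p != "S" for o, p in pairs)
    fp = sum(o != "S" and p == "S" for o, p in pairs)
    fn = sum(o == "S" and p != "S" for o, p in pairs)

    q2 = (tp + tn) / len(pairs)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f_score = (2 * precision * sensitivity / (precision + sensitivity)
               if precision + sensitivity else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Q2": q2, "sensitivity": sensitivity, "precision": precision,
            "F": f_score, "MCC": mcc}
```

The same function applies at the strand level if each position of the two strings stands for a strand rather than a residue.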
Evaluation of shear numbers
Influence of the filtering threshold
Comparison of prediction accuracy on ECOLI40 with different thresholds

| ρ | Residues: Q_{2} | Residues: Specificity | Residues: Sensitivity | Residues: F-score | Residues: MCC | Strands: Specificity | Strands: Sensitivity | Strands: F-score | Strands: MCC |
|---|---|---|---|---|---|---|---|---|---|
| 2/3 | 80.9 ± 4.8* | 80.4 ± 5.2 | 82.7 ± 8.4 | 0.77 ± 0.04 | 0.61 ± 0.08 | 94.8 ± 5.7 | 93.3 ± 5.9 | 0.94 ± 0.05 | 0.88 ± 0.1 |
| 1/2 | 79.7 ± 6.0 | 78.5 ± 5.1 | 82.4 ± 8.6 | 0.76 ± 0.05 | 0.58 ± 0.11 | 96.1 ± 4.8 | 95.4 ± 5.3 | 0.96 ± 0.05 | 0.91 ± 0.09 |
| 1/3 | 77.7 ± 5.6 | 75.6 ± 6.5 | 81.1 ± 8.6 | 0.74 ± 0.05 | 0.55 ± 0.11 | 91.7 ± 9.2 | 94.9 ± 6.5 | 0.94 ± 0.07 | 0.87 ± 0.07 |
Evaluation on mutated sequences
Permuted structures
For [PDB:3L48], the C-terminal domain of the PapC usher in E. coli, the observed structure topology containing a Greek key motif corresponds to the permutation σ = (1, 4, 3, 2, 5, 6, 7) and is predicted with an accuracy (Q_{2}) of 70.2% at ρ = 0.2.
Classification
100% of the non-redundant set of 177 α-helical transmembrane proteins of length from 140 to 800 residues in PDBTM are rejected, whereas 31 out of 32 non-redundant lipocalins taken from PDB are predicted as non-TMB (the dataset is available at [24]). Though lipocalins are also β-barrels, which reverse the TMB pattern with a hydrophobic core, the environmental effects on both sides of the barrel are still different. Our pseudo-energy model scores such structures unfavorably and discriminates considerably better than learning-based methods such as Freeman-Wimley [6], TMBpro [9], PRED-TMBB [18] and TMBETAPRED-RBF [10], and also better than transFold [12].
Conclusions
We have presented a new pseudo-energy minimization method for the classification and prediction of transmembrane protein super-secondary structure based on a variety of potential structures. Our approach takes into account many physicochemical constraints and minimizes the free energy. It also accounts for permuted structures, thus giving more complete information on the folded structure. Our method is quite accurate, with more than 90% sensitivity and F-score and over 80% MCC on strands, and over 74% accuracy and F-score on residues. The results are comparable to those given by TMBpro and TMBETAPRED-RBF, which are both learning-based methods. Moreover, our results are more consistent and show significantly less variation across different TMB proteins. This is especially interesting given that our algorithm is based mainly on pseudo-energy minimization, and the probabilistic model only plays a very small role. While the model presented here is only for TMB proteins, it can be easily extended to accommodate α-helical bundles. We did not use a more sophisticated statistical model for classifying β-barrel strands because that would risk overfitting and reliance on the training dataset. It is also interesting to note that our approach performs very well for identification of TMB proteins, rejecting all the α-helical bundles. The Freeman and Wimley [6] approach is more accurate on some datasets. However, it risks overfitting and does not predict the structure. Therefore, our approach provides the best overall classification results amongst the methods that try to predict structures. Our model does learn the probabilistic model from a training dataset, but mainly to screen out obvious non-TMB strands. Therefore, there are no concerns about the size of the training data or overfitting.
Even though the results presented in this paper are comparable to other methods, the methodology presented here is novel and gives insight into the actual physicochemical constraints and energy. Moreover, our approach should be able to predict TMB proteins which are significantly different from known proteins. Finally, our approach provides more information than the current approaches by providing the permutations of the strands.
Future work
We are working on energy models for TM α-helical bundles and β-barrels with broken strands, as well as globular β-barrels like lipocalins or membrane targeting proteins (C2 domain) where permuted structures are usually found. Nevertheless, similar to the other methods, we only propose single-domain protein structures.
We are also currently working on refinements in structural constraints and hydrophobicity, which may help to improve the accuracy of our predicted structure. Finally, it will be interesting to investigate more sophisticated statistical models for the initial screening, both to improve the results and understand how effective a mixed approach can be.
Methods
We now present the methods developed for classification and structure prediction of TMB proteins (a preliminary version of this work appeared as a short paper in [25, 26]). TMB proteins are hard to identify; however, it is relatively easy to identify a majority of other proteins which are not TMB. We use physicochemical properties and a simple probabilistic model based on a sliding window to filter out amino acid segments that are obviously not involved in any β-barrel structure as a membrane spanning β-strand. Proteins that are considered to be putative TMB proteins by this initial phase are then further analyzed. Next, we try to fold the given protein, treating it as a TMB protein, using the pseudo-energy minimization model. If the protein cannot be folded into a β-barrel according to the energy minimization framework, it is rejected and classified as a non-TMB protein.
Before presenting the simple model that we used for filtering the transmembrane β-strands, we discuss some physicochemical constraints that a protein must obey to be a TMB protein. We enforce these constraints in both the filtering and folding steps of our algorithm.
Geometric framework for β-barrels
For a regular β-barrel [27–29], the backbone geometry is entirely determined by n, the number of strands composing the barrel, and by S, the shear number, which is defined below.
Definition 1 Shear number of a β-barrel In a regular β-barrel, the shear number S is unambiguously defined as the ordinal distance between an amino acid A and an amino acid B that is located on the same strand as A and linked to A through a path of hydrogen bonds. B is the projection of the "copy" of A after one turn on the first strand of the barrel.
Angle θ, in association with a given membrane thickness, is involved in the energetic rules and restricts the membrane spanning β-strand length. Then, n and S have to be fixed as parameters.
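To make the dependence of the tilt angle on n and S concrete, the sketch below uses the standard relations for a regular barrel, tan θ = S·a / (n·b) and R = b / (2 sin(π/n) cos θ). The spacings a = 3.3 Å (between residues along a strand) and b = 4.4 Å (between adjacent strands) are typical literature values assumed here, and the function name `barrel_geometry` is hypothetical; the formula actually used in this work may be stated differently.

```python
import math

# Assumed typical spacings in angstroms (not taken from this paper).
A_RESIDUE = 3.3  # distance between consecutive residues along a strand
B_STRAND = 4.4   # distance between adjacent strands

def barrel_geometry(n, shear):
    """Return (tilt angle in degrees, barrel radius in angstroms)
    for a regular barrel with n strands and shear number `shear`."""
    tilt = math.atan((shear * A_RESIDUE) / (n * B_STRAND))
    radius = B_STRAND / (2.0 * math.sin(math.pi / n) * math.cos(tilt))
    return math.degrees(tilt), radius

# Example: an 8-stranded barrel with shear number 10.
tilt_deg, radius = barrel_geometry(8, 10)
```

Under these assumptions, a larger shear number at fixed n gives a more tilted strand and a wider barrel.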
Definition 2 Relative shear number Given a shear number S, the relative shears between adjacent strands remain as n - 1 degrees of freedom. As a convention, we consider the relative shears on the extracellular side of the barrel. So, ∀i > 1, s_{i}, the relative shear of strand i + 1 with respect to strand i (strand n + 1 being identified with 1), is measured on strand i as the ordinal distance between the undermost amino acid of strand i and the one that is directly bound to the undermost amino acid of strand i + 1.
We define the shear number, by extension, for the case of a β-sheet (i.e. an open β-barrel) to make our algorithms capable of dealing with the structure of β-sheets.
where s_{i} is the relative shear of strand i + 1 with regard to strand i.
Each β-strand is directed with respect to the sequence order from N-terminal to C-terminal. A strand is said to be upward if it is oriented from the extracellular environment to the periplasmic space, i.e. the N-terminal of the strand is located on the extracellular side and its C-terminal is on the periplasmic side. Otherwise, the strand is said to be downward. The upward/downward orientation of the strand, relative to the barrel axis, defines another degree of freedom.
Finally, considering a β-strand as a ribbon where the amino acids direct their side-chains alternately on both sides, toward the barrel interior (channel) or toward the surrounding lipid (membrane), we will distinguish two ways of facing, neglecting small swivel adjustments. A strand is said to be odd inward if the odd-indexed amino acids face the channel and odd outward if they face the membrane. We have one more degree of freedom.
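The two facing conventions can be illustrated by splitting the residues of a strand according to the parity of their index. This is a toy sketch: the function name `split_by_facing` and the 1-based numbering of residues within the strand are assumptions made for illustration.

```python
def split_by_facing(strand, odd_inward):
    """Split a strand (string of one-letter residues) into the residues
    facing the channel and those facing the membrane.

    Residues are numbered from 1 within the strand, so strand[0] is "odd".
    """
    odd = strand[0::2]   # residues 1, 3, 5, ...
    even = strand[1::2]  # residues 2, 4, 6, ...
    channel, membrane = (odd, even) if odd_inward else (even, odd)
    return channel, membrane

channel, membrane = split_by_facing("AVTLQ", odd_inward=True)
```

Combined with the upward/downward orientation, each candidate segment therefore admits four orientation states.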
The segment is then considered as a candidate for conformation τ if $\tilde{I}\left(\tau : \bar{\tau};\; r_{1} r_{2} \dots r_{p}\right) > 0.$
The non-redundant training set PDBTM40 of 41 TMB proteins is used to learn this probabilistic model. Due to the small size of the training set, we apply the filter with a relatively low threshold of $\rho = \frac{2}{3}$ to avoid overfitting. This ensures that, on average, each block r is accepted in conformation τ if the propensity for r to be in τ (i.e. $f_{\tau,r}/f_{\tau,\cdot}$) is at most 1.5 times less than its propensity to be in $\bar{\tau}$ (i.e. $f_{\bar{\tau},r}/f_{\bar{\tau},\cdot}$). Only substrings that pass these very stringent criteria are considered to be putative strands.
Now we present a graph-theoretic energy minimization model for recognizing and folding TMB proteins.
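One way to read the acceptance rule is as a summed log-odds score that must be positive once the threshold ρ is discounted from each block. The sketch below follows that reading only; the exact form of the information measure is not reproduced in this section, so the scoring formula, the function name `passes_filter` and the toy frequencies are all assumptions.

```python
import math

def passes_filter(blocks, freq_tau, freq_not_tau, rho=2.0 / 3.0):
    """Accept a segment if, on (geometric) average, the propensity of its
    blocks for conformation tau is at least rho times their propensity
    for the complementary conformation.

    freq_tau[r] and freq_not_tau[r] are relative frequencies of block r
    in each conformation; both are assumed to be strictly positive.
    """
    score = 0.0
    for r in blocks:
        ratio = freq_tau[r] / freq_not_tau[r]
        score += math.log(ratio) - math.log(rho)
    return score > 0.0

# Toy frequencies for two residue types (illustrative values only).
f_strand = {"V": 0.10, "D": 0.02}
f_other = {"V": 0.06, "D": 0.08}
accepted = passes_filter("VVD", f_strand, f_other)
```

With ρ = 2/3, a block is tolerated on average as long as it is no more than 1.5 times less frequent in the target conformation than in the complementary one.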
Definition of the graph structure
Edges
for the lexicographic order, and this ensures the DAG structure.
The set E also contains edges of the form (⊤, v) that define the subset of starting vertices, i.e. those with leading substrings satisfying specific constraints. Similarly, E contains edges of the form (v, ⊥) that define the subset of ending vertices, with a satisfactory trailing substring. Again, the length constraints applied to the substrings associated to edges imply that |E|, the number of edges, is $\mathcal{O}\left(|\mathbf{V}|\right)$ or $\mathcal{O}\left(N\right)$.
Figure 9 gives a small example of such a graph (to simplify, only one orientation has been considered). An edge like (v_{1}, v_{2}) is forbidden, since the two corresponding substrings overlap. Edges like (v_{2}, v_{3}) or (v_{2}, v_{6}) are also forbidden, since the inserted substrings are respectively too short for a turn or too long for a loop.
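The edge rules described above (no overlap, and an inserted substring that is neither too short nor too long) can be sketched as follows. As in the figure, only one orientation is considered. The representation of a vertex as a half-open interval `(start, end)` and the single pair of bounds `min_gap` / `max_gap` are simplifying assumptions; separate bounds for turns and loops would be needed in a fuller model.

```python
def build_edges(vertices, min_gap=2, max_gap=10):
    """Return the allowed edges between candidate strand substrings.

    Each vertex is a half-open interval (start, end) on the sequence.
    An edge (v, w) is kept when w begins after v ends and the number of
    residues in between lies in [min_gap, max_gap]. Since min_gap > 0,
    every edge goes forward along the sequence, so the graph is a DAG.
    """
    edges = []
    for v in vertices:
        for w in vertices:
            gap = w[0] - v[1]  # residues strictly between v and w
            if min_gap <= gap <= max_gap:
                edges.append((v, w))
    return edges

candidates = [(0, 8), (5, 13), (14, 22), (28, 36)]
edges = build_edges(candidates)
```

Because the gap is bounded, each vertex has a bounded number of successors, which is why the number of edges stays linear in the number of vertices.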
Energy attributes
The attributes that complete the definition of the graph G are pseudo-energy functions defined as follows:

$\forall v\in {\mathbf{V}}^{\mathsf{\text{*}}},{\mathcal{E}}_{\mathsf{\text{intr}}}\left(v\right)$ represents the intrinsic energy of the given strand in the given orientation. This term is the sum of both the internal energy of the substructure, i.e. the interactions between its own amino acids, and the interaction energy with the environment (e.g. membrane and channel) apart from the rest of the considered protein.
Note that ${\mathcal{E}}_{\mathsf{\text{intr}}}\left(\top \right)={\mathcal{E}}_{\mathsf{\text{intr}}}\left(\perp \right)=0.$

$\forall (v,w) \in \mathbf{V}^{*} \times \mathbf{V}^{*}$, $\mathcal{E}_{\text{adj}}(v,w,s)$ represents the interaction energy of the pair (v, w) when the two corresponding strands are placed side by side along the barrel, with respect to the respective orientation parameters associated to the vertices and according to the relative shear s. The energy will take into account the number of contacts and different side-chain interactions such as the packing of hydrophobic cores and bonding abilities. Then, $\forall (v,w) \in \mathbf{V}^{*} \times \mathbf{V}^{*}$, $\mathcal{E}_{\text{adj}}(v,w) = \min_{s} \mathcal{E}_{\text{adj}}(v,w,s)$ is the interaction energy of the pair (v, w) for an optimal relative shear. It is further assumed that $\mathcal{E}_{\text{adj}}$ is defined over a superset of E, since we will consider the case where two adjacent strands are not consecutive along the sequence.
We also introduce the particular values: $\mathcal{E}_{\text{adj}}(\top, v) = \mathcal{E}_{\text{adj}}(v, \perp) = 0, \forall v \in \mathbf{V}.$

An associated function s_{adj} is defined such that:

$\forall (v,w) \in \mathbf{V}^{*} \times \mathbf{V}^{*}$, $\mathcal{E}_{\text{adj}}\left(v,w,s_{\text{adj}}(v,w)\right) = \mathcal{E}_{\text{adj}}(v,w)$, which is a relative shear that leads to the optimal interaction energy.
An arising question is why the orientation degrees of freedom are described as a multiplicity of nodes but the relative shear degrees of freedom are considered when calculating the ${\mathcal{E}}_{\mathsf{\text{adj}}}$ terms. A first answer comes from the fact that wrong orientations are rather absolute and will result in pruning the sets E and V while the shear parameters are not so discriminative. The main reason is that we will consider "floating" parts in which adjacencies are already set, while a relative shear between any two parts is not yet known. In such a situation, attaching the relative shears to node pairs allows a significant factorization.

∀(v, w) ∈ E, ∀t ∈ {1, 2, . . . , n - 1} and ∀s, a relative shear, $\mathcal{E}_{\text{loop}}(v,w,t,s)$ is related to the intrinsic energy of the turn/loop between the strands v and w (consecutive along the sequence) when they are placed at a distance t along the barrel with a relative shear s. The distance t = 1 corresponds to the case where the strands are placed consecutively on the barrel, while an integer value t > 1 corresponds to the case where t - 1 other strands are interleaved.
To simplify, we will also use ${\mathcal{E}}_{\mathsf{\text{loop}}}\left(\top ,v\right)$ or ${\mathcal{E}}_{\mathsf{\text{loop}}}\left(v,\perp \right)$ for denoting the intrinsic energy of the outer fragment attached respectively to a starting or an ending vertex v. As such a fragment has a free side, the position parameters may be dropped.
Then, in the usual case of two β-strands that fold as a hairpin, the related energy is considered to be $\mathcal{E}_{\text{adj}}(v,w) + \mathcal{E}_{\text{loop}}\left(v,w,1,s_{\text{adj}}(v,w)\right)$. Turns and loops are assumed to be relatively flexible, so, when a fold is feasible, $\mathcal{E}_{\text{loop}}$ is weak compared to $\mathcal{E}_{\text{adj}}$ and the relative placement of the two β-strands is enforced to be close to s_{adj}. Nevertheless, $\mathcal{E}_{\text{loop}}$ will result in a strong penalty in the case of an unfeasible turn or loop, for example a loop with a majority of hydrophobic residues.
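The relation between the adjacency energy, its optimal relative shear and the hairpin energy can be sketched as below. The energy functions are passed in as callables because their actual definitions are not reproduced here; the toy functions in the example, the candidate shear range and the names `optimal_adjacency` and `hairpin_energy` are assumptions for illustration.

```python
def optimal_adjacency(e_adj, v, w, shears):
    """Return (E_adj(v, w), s_adj(v, w)): the minimum of e_adj(v, w, s)
    over the candidate relative shears, and a shear achieving it."""
    best_s = min(shears, key=lambda s: e_adj(v, w, s))
    return e_adj(v, w, best_s), best_s

def hairpin_energy(e_adj, e_loop, v, w, shears):
    """Energy of two strands folded as a hairpin (distance t = 1),
    with the loop evaluated at the optimal adjacency shear."""
    energy, s_adj = optimal_adjacency(e_adj, v, w, shears)
    return energy + e_loop(v, w, 1, s_adj)

# Toy energy functions (illustrative only).
toy_adj = lambda v, w, s: (s - 1) ** 2 - 5.0
toy_loop = lambda v, w, t, s: 0.5 * t + 0.1 * s
total = hairpin_energy(toy_adj, toy_loop, "v", "w", range(-2, 4))
```

Storing the arg-min shear alongside the minimum energy is what allows the loop term to be evaluated later without re-optimizing the pair.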
Protein folding problem
such that $\sum _{\left(v,w\right)\in \mathcal{P}}{s}_{\mathsf{\text{adj}}}\left(v,\phantom{\rule{2.77695pt}{0ex}}w\right)=S.$
Solving as the longest path problem
Since the graph is a DAG, the longest path problem is solved with a well-known dynamic programming scheme [34] of complexity $\mathcal{O}\left(|\mathbf{V}|\right)$ in space and $\mathcal{O}\left(|\mathbf{V}| + |\mathbf{E}|\right)$ in time, that is also $\mathcal{O}\left(N\right)$ for both, from the structural constraints that relate $|\mathbf{V}|$, $|\mathbf{E}|$ and N. The objective is the computation of $\mathbf{C}_{\perp}^{S}$ and the optimal structure is then reconstructed by a usual traceback post-processing. Note that, for each path, we only have to consider its last vertex, so we have to track single-index states.
The goal is to calculate $\max_{v,\,(\top,v)\in\mathbf{E}} \mathbf{C}_{(v,\perp)}^{S}.$ Thus the scheme is of complexity $\mathcal{O}\left(|\mathbf{V}|^{2}\right)$ in space and $\mathcal{O}\left(|\mathbf{V}| \cdot |\mathbf{E}|\right)$ in time, that is also $\mathcal{O}\left(N^{2}\right)$ for both, from the structural constraints. This may produce paths of any length and the constraint of n strands is applied as a cut in the recurrence.
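The core of such a scheme, a best-scoring path with exactly n vertices followed by a traceback, can be sketched as follows. This is a simplified illustration, not the recurrence of the paper: it omits the shear constraint S, the special vertices ⊤ and ⊥ and the closure of the barrel, and the names `best_path`, `e_vertex` and `e_edge` are hypothetical.

```python
def best_path(vertices, successors, n, e_vertex, e_edge):
    """Find a path of exactly n vertices maximizing the sum of vertex and
    edge scores in a DAG. Returns (score, path) or None if no such path.

    best[k][v] = (score, predecessor) of the best path with k vertices
    that ends in v; indexing by level k applies the n-strand cut.
    """
    best = [dict() for _ in range(n + 1)]
    for v in vertices:
        best[1][v] = (e_vertex(v), None)
    for k in range(1, n):
        for v, (score_v, _) in list(best[k].items()):
            for w in successors.get(v, ()):
                cand = score_v + e_edge(v, w) + e_vertex(w)
                if w not in best[k + 1] or cand > best[k + 1][w][0]:
                    best[k + 1][w] = (cand, v)
    if not best[n]:
        return None
    end = max(best[n], key=lambda v: best[n][v][0])
    # Traceback from the last vertex to the first.
    path = [end]
    for k in range(n, 1, -1):
        path.append(best[k][path[-1]][1])
    path.reverse()
    return best[n][end][0], path

# Toy graph with illustrative scores.
succ = {"a": ["b", "c"], "b": ["c", "d"], "c": ["d"]}
v_score = {"a": 1, "b": 5, "c": 2, "d": 1}
e_score = {("a", "c"): 3, ("b", "d"): 2}
result = best_path(["a", "b", "c", "d"], succ, 3,
                   v_score.get, lambda v, w: e_score.get((v, w), 0))
```

Tracking the first vertex in addition to the last one, as needed to close the barrel, multiplies the number of states by the number of vertices, which is consistent with the quadratic bounds quoted above.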
Generalization
In a more general case, we consider permutations to deal with the fact that the arrangements of the strands along the barrel do not necessarily follow their order along the sequence. This usually occurs with Greek key motifs or more rarely with Jelly roll motifs. Hence, the protein folding problem becomes finding the longest path in a graph with respect to a given permutation σ, i.e. the vertices of the path $\mathcal{P}$, seen on a circle as in Figure 10, are permuted according to σ.
Let σ be a circular permutation of {1, 2, . . . , n}. When 1, 2, . . . , n are numbering the positions along the barrel, values σ (1), σ (2), . . . , σ(n) will give the respective ranks of the strands in the sequence order. A position of reference along the barrel is fixed by setting σ(1) = 1. Figure 10 shows a first example of a structure with a Greek key motif, which is described by the permutation σ = (1, 2, 3, 6, 5, 4).
The dynamic programming scheme now consists in building a barrel, by adding a next strand, taken in sequence with respect to the graph edges, but that is inserted at the position defined by the given permutation. Useful values are the ranks (in the sequence order) of the two strands between which a given one will be inserted. For instance, with the current example, the 5^{th} strand will be inserted between the 2^{nd} and the 4^{th} strands.
Let now k denote the level of construction (1 ≤ k ≤ n), that is the number of strands already placed.
With the current example, we get (see Figure 11):
left_{1} = 6 left_{2} = 1 left_{3} = 4 right_{1} = 2 right_{2} = 5 right_{3} = 6
left_{4} = 5 left_{5} = 2 left_{6} = 3 right_{4} = 3 right_{5} = 4 right_{6} = 1
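The listed values are consistent with reading left_{k} and right_{k} as the ranks of the two cyclic neighbours of the k^{th} strand along the barrel. The sketch below computes them under that reading. The permutation σ = (1, 2, 5, 4, 3, 6) used in the example is inferred from the values above rather than quoted from the text, and the function name `barrel_neighbours` is hypothetical.

```python
def barrel_neighbours(sigma):
    """Given sigma, where sigma[p] is the sequence rank of the strand at
    barrel position p + 1, return two dicts mapping each strand rank to
    the rank of its left and right neighbour along the (circular) barrel."""
    n = len(sigma)
    left, right = {}, {}
    for p, rank in enumerate(sigma):
        left[rank] = sigma[(p - 1) % n]
        right[rank] = sigma[(p + 1) % n]
    return left, right

left, right = barrel_neighbours((1, 2, 5, 4, 3, 6))
# left[5] == 2 and right[5] == 4: the 5th strand sits between the 2nd and 4th.
```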
An important piece of information to store for the dynamic programming scheme is the set of "active" indices, i.e. ranks of the strands (in the sequence order) that are not definitively bonded on both sides, along the barrel, and also not linked along the sequence and thus have to be kept as degrees of freedom. So, in the current example (see Figure 12), we have to keep in memory as many solutions (to subproblems) as valid instances of the 2^{nd} and 4^{th} strands, until an optimal choice for these is recorded as a solution for each instance of the 5^{th} strand. At that time, any instance as the 5^{th} strand is kept as a candidate for a link with the 6^{th}, by a turn or loop, while the different instances as the 3^{rd} and 1^{st} are kept for proceeding to an insertion in between.
where the case n - 1 is intended for the adjacency that will close the barrel.
With the current example of Figures 11 and 12, we get:
conf_{1} = {1} conf_{2} = {1, 2} conf_{3} = {1, 2, 3}
conf_{4} = {1, 2, 3, 4} conf_{5} = {1, 3, 5} conf_{6} = {6}
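The sets above can be reproduced with a simple rule: at level k, a strand i ≤ k is active if it is the last one placed (i = k) or if at least one of its neighbours along the barrel has a rank greater than k. This rule is an assumption reverse-engineered from the listed values, not the formal definition, and the permutation σ = (1, 2, 5, 4, 3, 6) is likewise inferred from the example; the function name `active_indices` is hypothetical.

```python
def active_indices(sigma):
    """Return {k: conf_k} for a permutation sigma, where sigma[p] is the
    sequence rank of the strand at barrel position p + 1.

    Assumed rule: strand i <= k is active at level k if i == k or if one
    of its two cyclic neighbours on the barrel is not placed yet.
    """
    n = len(sigma)
    pos = {rank: p for p, rank in enumerate(sigma)}
    conf = {}
    for k in range(1, n + 1):
        active = set()
        for i in range(1, k + 1):
            neighbours = (sigma[(pos[i] - 1) % n], sigma[(pos[i] + 1) % n])
            if i == k or any(j > k for j in neighbours):
                active.add(i)
        conf[k] = active
    return conf

conf = active_indices((1, 2, 5, 4, 3, 6))
exponent = max(len(s) for s in conf.values())  # size of the largest set
```

The size of the largest set gives the exponent of N in the complexity bound for the corresponding permutation.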
Thus, for this example, the maximal complexity in space, $\mathcal{O}\left({N}^{4}\right)$, is reached for the set of solutions to the subproblem with 4 strands. Then looping over this set, for computing the set of solutions to the subproblem with 5 strands, will also cost $\mathcal{O}\left({N}^{4}\right)$ in time, since the choice for the 5^{th} strand is bounded by the structural constraints embedded as edges in the graph. This differs from most dynamic programming schemes, where the complexity in time carries an additional $\mathcal{O}\left(N\right)$ factor compared to the complexity in space. As another example, in the case of Figure 10, we obtain the complexity $\mathcal{O}\left({N}^{2}\right)$ in both time and space, which is similar to the case where σ is an identity permutation.
Now we have to decide at which minimal level k each term ${\mathcal{E}}_{\mathsf{\text{adj}}}$ or ${\mathcal{E}}_{\mathsf{\text{loop}}}$ is determined and can be integrated in the dynamic programming scheme. For the ${\mathcal{E}}_{\mathsf{\text{adj}}}$ terms, it is simply asserted that the previous or the next strand along the barrel is already placed when left_{ k } < k or right_{ k } < k, respectively.
For the $\mathcal{E}_{\text{loop}}$ terms, the problem is to wait until the relative shear between the two ends of a turn or loop is resolved by the intervening adjacencies. So, in the given example, the energy of the loop between the 2^{nd} and 3^{rd} strands can only be evaluated when the 5^{th} strand has been laid and the optimal relative shear $s_{\text{adj}}^{*}(v_{2},v_{3}) = s_{\text{adj}}(v_{2},v_{5}) + s_{\text{adj}}(v_{5},v_{4}) + s_{\text{adj}}(v_{4},v_{3})$ is known.
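The composition of relative shears along the barrel can be sketched as a walk between the barrel positions of two strands that are consecutive in the sequence. The permutation σ = (1, 2, 5, 4, 3, 6) is inferred from the example, and two further points are assumptions: the walk never wraps around the barrel, and a walk toward lower positions queries `s_adj` on pairs in walking order. The function name `composed_shear` is hypothetical.

```python
def composed_shear(sigma, i, s_adj):
    """Sum the pairwise optimal shears s_adj(u, v) over the consecutive
    barrel neighbours lying between strand i and strand i + 1.

    sigma[p] is the sequence rank of the strand at barrel position p + 1.
    """
    pos = {rank: p for p, rank in enumerate(sigma)}
    a, b = pos[i], pos[i + 1]
    step = 1 if a < b else -1
    walk = [sigma[p] for p in range(a, b + step, step)]
    return sum(s_adj(u, v) for u, v in zip(walk, walk[1:]))

# Toy shears for the three adjacencies between strands 2 and 3.
table = {(2, 5): 1, (5, 4): 2, (4, 3): -1}
total = composed_shear((1, 2, 5, 4, 3, 6), 2, lambda u, v: table[(u, v)])
```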
then let $\mathcal{A}_{k}^{*}$ denote the equivalence relation defined by the transitive closure of $\mathcal{A}_{k}$ and let $\mathbf{A}_{k} = \left\{\, i < k \mid i \,\mathcal{A}_{k}^{*}\, (i+1) \,\right\}.$
Thus, i ∈ A_{ k } means that the i^{th} and (i + 1)^{st} strands are geometrically linked by adjacencies when the k^{th} substructure is laid and we can compute by composition an optimal relative shear ${s}_{\mathsf{\text{adj}}}^{*}$.
We will now focus on the set δA_{k} = A_{k} - A_{k-1}, ∀k > 1.
Definition 11 Let $\mathbf{T}_{k} \subset {\mathbf{V}^{*}}^{|\mathbf{conf}_{k}|}$ denote the set of all tuples of |conf_{k}| vertices such that there is at least one path (of k edges) starting from ⊤ and passing through these vertices in order.
For any instance z ∈ T_{ k } of such a tuple and, ∀i ∈ conf_{ k }, let z[i] denote the i^{th} vertex of a corresponding path.
This notation (not to be confused with z_{i}, the i^{th} component of tuple z) is not ambiguous since, by definition, the vertex z[i] is common to any path associated to z. In particular, z[k] is the last vertex of any path associated to z.
Note that, from proposition 7, ∀y ∈ T_{k-1}, if left_{k} < k then the vertex y[left_{k}] is defined (and the same holds for right_{k}). We can check that each $\mathcal{E}_{\text{adj}}$ term is finally counted exactly once in the sum, at the level corresponding to the position of its further vertex in the sequence order. The optimum is found at k = n and h = S.
Corollary 13 The complexities are $\mathcal{O}\left(N^{\max_{k} |\mathbf{conf}_{k}|}\right)$ in space and time.
Hence, max_{k} |conf_{k}| ≤ 1 + (2n - 2) / 3. For a permutation that only differs from the identity permutation by disjoint Greek key motifs [35], i.e. $\sigma = \left(1, 2, \dots, i_{1}, \mathcal{G}_{1}, i_{1}+5, \dots, i_{2}, \mathcal{G}_{2}, i_{2}+5, \dots, \mathcal{G}_{j}, \dots, n\right)$ where $\mathcal{G}_{j} = i_{j}+3,\ i_{j}+2,\ i_{j}+1,\ i_{j}+4$ or $\mathcal{G}_{j} = i_{j}+1,\ i_{j}+4,\ i_{j}+3,\ i_{j}+2$, it is easy to prove that max_{k} |conf_{k}| ≤ 4 by a discrete analysis on different configurations. The complexities are thus at most $\mathcal{O}\left({N}^{4}\right)$ for such a permutation.
In short, it is possible to compute the optimum in $\mathcal{O}\left({N}^{2}\right)$ running time for structures corresponding to the identity permutation and from $\mathcal{O}\left({N}^{2}\right)$ (for instance, the example of Figure 10) to $\mathcal{O}\left({N}^{4}\right)$ (for instance, the example of Figure 11) for structures containing disjoint Greek key motifs, where N is the input sequence length. These computation costs might be further improved by a tree decomposition-based algorithm that we are currently working on.
Implementation details
The number of strands n and the shear number S determine the geometry of the barrel, particularly the membrane spanning part of the segments, and are thus involved in the computation of energy terms. If these values are known, the algorithm can enforce them and fold the protein accordingly. The values for n, which are usually even, are governed by the length of the sequence, the thickness of the membrane and the length of turns or loops, and vary between 8 and 22 [1]. The values for S are even and lie between n and 2n [28, 29]. The problem is then solved by constrained dynamic programming with the constraints of given n and S. A small number of couples (n, S) have to be explored and our algorithm is fast enough for that.
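The set of couples (n, S) to explore follows directly from the ranges given above. The sketch below enumerates them, assuming n is restricted to even values (the text says "usually even") and that both bounds on S are inclusive; the function name `candidate_geometries` is hypothetical.

```python
def candidate_geometries(n_min=8, n_max=22):
    """Enumerate (n, S) couples with n even in [n_min, n_max]
    and S even in [n, 2n]."""
    return [(n, s)
            for n in range(n_min, n_max + 1, 2)
            for s in range(n, 2 * n + 1, 2)]

couples = candidate_geometries()
print(len(couples))  # 68 couples under these assumptions
```

Under these assumptions the folding step is run at most a few dozen times per sequence, which is in line with the remark that only a small number of couples need to be explored.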
Sidechain interactions between contiguous residues along a segment on the same side and interactions with the environment of channel or bilayer define the intrinsic energy of the corresponding vertex. The pairing energy of two adjacent segments in the barrel is computed by optimizing the relative positions between constituent amino acids. These energies involve hydrogen bonds in main chains, electrostatic interactions between sidechains, hydrophobic effect as well as environmental effect. More specifically, the extracellular and intracellular environments with distinct hydrophobicity indices can have significantly different hydrophobic effects. In addition, the membrane thickness gives constraints on segment size and helps identify the interactions inside or outside the membrane region. We use here by default a parameter of 3 nm for the membrane thickness, thus 8 residues thick [36, 37]. The features on size, polarity [38], and flexibility [39] of turns and loops are also taken into consideration, i.e. turns and loops satisfy threshold constraints on their polarity and flexibility indices and their length. Their energies are approximated by hydrophobicity [31].
We use the Dunbrack backbone-dependent rotamer library [40] and the partial charges from the GROMOS force field [41] to compute pairwise interaction energies. The hydrophobic interaction between two side-chains u and v is assessed by the amount of contact between their nonpolar groups, averaged over all rotamer pairs of the two side-chains: $e_{uv} = \langle e_{uv,\mathrm{rotamers}} \rangle$. In the electrostatic interaction, each side-chain acts as a group of partial charges. The main-chain hydrogen bond is measured by the electrostatic potential energy between the peptide CO and NH groups.
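The averaging over rotamer pairs can be sketched as follows, here with a plain Coulomb sum between two groups of partial charges as the pair energy. This is a sketch under stated assumptions: the unweighted mean, the vacuum Coulomb form without dielectric screening, the data layout and all function names are illustrative and are not taken from our implementation; real charges and coordinates would come from the force field [41] and the rotamer library [40].

```python
import math
from itertools import product
from typing import Callable, Sequence, Tuple

# An atom is (partial charge in e, (x, y, z) position in nm); a rotamer is a list of atoms.
Atom = Tuple[float, Tuple[float, float, float]]
Rotamer = Sequence[Atom]

COULOMB_K = 138.935  # electric conversion factor in kJ mol^-1 nm e^-2


def coulomb(group_a: Rotamer, group_b: Rotamer) -> float:
    """Coulomb energy between two groups of point charges (no dielectric screening)."""
    return sum(
        COULOMB_K * qa * qb / math.dist(pa, pb)
        for (qa, pa), (qb, pb) in product(group_a, group_b)
    )


def rotamer_average(
    rotamers_u: Sequence[Rotamer],
    rotamers_v: Sequence[Rotamer],
    pair_energy: Callable[[Rotamer, Rotamer], float] = coulomb,
) -> float:
    """Unweighted mean of the pair energy over all rotamer pairs of u and v."""
    pairs = list(product(rotamers_u, rotamers_v))
    return sum(pair_energy(ru, rv) for ru, rv in pairs) / len(pairs)


# Toy input: one rotamer for u, two for v (made-up charges and positions).
u = [[(1.0, (0.0, 0.0, 0.0))]]
v = [[(-1.0, (1.0, 0.0, 0.0))], [(-1.0, (2.0, 0.0, 0.0))]]
print(rotamer_average(u, v))
```

A probability-weighted mean using the rotamer frequencies of the library would be a natural variant; the unweighted form is used here only to keep the sketch short.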
The probabilistic model and the constraints on hydrophobicity help discard unlikely membrane-spanning β-strands. A threshold on the overall energy can also be applied to enhance the discrimination. We studied the per-strand energy value for a variety of TMB proteins, including the training dataset and other TMB proteins. Even though this value is always higher than 0.9 for these proteins, we chose 0.85 as the threshold to avoid overfitting. Note that this threshold does not affect the prediction results and is only used for classification.
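The classification rule can be sketched as a simple comparison against the 0.85 threshold. How the per-strand value is derived from the overall energy is not detailed in this section, so dividing a total score by the number of strands, and the function names, are assumptions made for illustration.

```python
def per_strand_value(total_score: float, n_strands: int) -> float:
    """Per-strand energy value, assumed here to be the total score divided by n."""
    if n_strands <= 0:
        raise ValueError("n_strands must be positive")
    return total_score / n_strands


def is_tmb(total_score: float, n_strands: int, threshold: float = 0.85) -> bool:
    """Accept a sequence as a TMB candidate if its per-strand value reaches the threshold."""
    return per_strand_value(total_score, n_strands) >= threshold


print(is_tmb(total_score=7.6, n_strands=8))  # 0.95 per strand
print(is_tmb(total_score=6.0, n_strands=8))  # 0.75 per strand
```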
Experimental setup
Software
We compare our folding prediction accuracy to TMBpro [9] and TMBETAPRED-RBF [10]. We compare our classification results to Freeman et al. [6], TMBETAPRED-RBF [10], PRED-TMBB [18] and transFold [12]. The TMBpro and TMBETAPRED-RBF results were obtained from their web servers.
Datasets
We used TMB proteins from the PDBTM database [20] to train and test our approaches.

Folding: We used CD-HIT [42] to reduce the redundancy among proteins. A threshold of 40% similarity was applied to reduce the dataset, resulting in 49 sequences (PDBTM40). We retain only the monomeric barrels, i.e. the sequences that form a unique complete barrel. After this filtering, PDBTM40 contains 41 sequences [PDB: 1OH2_Q, 3A2R_X, 3AEH_A, 3BRZ_A, 3CSL_A, 2R4P_A, 3DWO_X, 2FGQ_X, 3EFM_A, 3EMN_X, 2ERV_A, 2IWW_A, 2F1T_A, 1FEP_A, 3FHH_A, 3FID_A, 1ILZ_A, 1BY3_A, 2GSK_A, 1BH3_A, 2HDF_A, 2J1N_A, 2IAH_A, 3JTY_A, 1BXW_A, 2VDF_A, 1PNZ_A, 3GP6_A, 1AF6_A, 3NJT_A, 2O4V_A, 2ODJ_A, 1QJ8_A, 1P4T_A, 2POR_ , 1TLW_A, 1UXF_A, 1UYN_X, 2WJQ_A, 2X4M_A, 1XKW_A]. It is important to note that both TMBpro and our method use the entire dataset for training. While this may result in overfitting for a learning-based approach, the effect on our approach should be very small.

Classification: We used a set of 177 α-helical transmembrane proteins with lengths between 140 and 800 residues, at 40% redundancy reduction, from PDBTM, and 32 non-redundant lipocalins taken from PDB.
Declarations
Acknowledgements
The authors would like to thank all the INRIA AMIB Team members, especially Mireille Régnier, Yann Ponty, Julie Bernauer and Balaji Raman.
This article has been published as part of BMC Genomics Volume 13 Supplement 2, 2012: Selected articles from the First IEEE International Conference on Computational Advances in Bio and medical Sciences (ICCABS 2011): Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/13/S2.
References
1. Tamm LK, Hong H, Liang B: Folding and assembly of β-barrel membrane proteins. Biochim Biophys Acta. 2004, 1666: 250-263. 10.1016/j.bbamem.2004.06.011.
2. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat T, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Res. 2000, 28: 235-242. 10.1093/nar/28.1.235.
3. Arora A, Tamm LK: Biophysical approaches to membrane protein structure determination. Curr Opin Struct Biol. 2001, 11: 540-547. 10.1016/S0959-440X(00)00246-3.
4. Casadio R, Fariselli P, Martelli PL: In silico prediction of the structure of membrane proteins: Is it feasible?. Brief Bioinform. 2003, 4 (4): 341-348. 10.1093/bib/4.4.341.
5. Taylor PD, Toseland CP, Attwood TK, Flower DR: Beta-barrel transmembrane proteins: Enhanced prediction using a Bayesian approach. Bioinformation. 2006, 1 (6): 231-233.
6. Freeman TCJ, Wimley WC: A highly accurate statistical approach for the prediction of transmembrane beta-barrels. Bioinformatics. 2010, 26 (16): 1965-74. 10.1093/bioinformatics/btq308.
7. Gromiha M, Ahmad S, Suwa M: Neural network-based prediction of transmembrane β-strand segments in outer membrane proteins. J Comput Chem. 2004, 25: 762-767. 10.1002/jcc.10386.
8. Gromiha MM, Ahmad S, Suwa M: TMBETA-NET: discrimination and prediction of membrane spanning β-strands in outer membrane proteins. Nucleic Acids Res. 2005, 33: W164-W167. 10.1093/nar/gki367.
9. Randall A, Cheng J, Sweredoski M, Baldi P: TMBpro: secondary structure, β-contact and tertiary structure prediction of transmembrane β-barrel proteins. Bioinformatics. 2008, 24: 513-520. 10.1093/bioinformatics/btm548.
10. Ou YY, Chen SA, Gromiha MM: Prediction of membrane spanning segments and topology in β-barrel membrane proteins at better accuracy. J Comput Chem. 2010, 31: 217-223. 10.1002/jcc.21281.
11. Bagos P, Liakopoulos T, Hamodrakas S: Evaluation of methods for predicting the topology of beta-barrel outer membrane proteins and a consensus prediction method. BMC Bioinformatics. 2005, 6: 7. 10.1186/1471-2105-6-7.
12. Waldispühl J, Berger B, Clote P, Steyaert JM: Predicting transmembrane β-barrels and inter-strand residue interactions from sequence. Proteins. 2006, 65: 61-74. 10.1002/prot.21046.
13. McGuffin LJ, Bryson K, Jones DT: The PSIPRED protein structure prediction server. Bioinformatics. 2000, 16: 404-405. 10.1093/bioinformatics/16.4.404.
14. Jacoboni I, Martelli PL, Fariselli P, Pinto VD, Casadio R: Prediction of the transmembrane regions of β-barrel membrane proteins with a neural network-based predictor. Protein Sci. 2001, 10: 779-787. 10.1110/ps.37201.
15. Martelli P, Fariselli P, Krogh A, Casadio R: A sequence-profile-based HMM for predicting and discriminating β-barrel membrane proteins. Bioinformatics. 2002, 18 (Suppl 1): S46-S53. 10.1093/bioinformatics/18.suppl_1.S46.
16. Ahn C, Yoo S, Park H: Prediction for beta-barrel transmembrane protein region using HMM. KISS. 2003, 30 (2): 802-804.
17. Bigelow HR, Petrey DS, Liu J, Przybylski D, Rost B: Predicting transmembrane beta-barrels in proteomes. Nucleic Acids Res. 2004, 32: 2566-2577. 10.1093/nar/gkh580.
18. Bagos PG, Liakopoulos TD, Spyropoulos IC, Hamodrakas SJ: PRED-TMBB: a web server for predicting the topology of β-barrel outer membrane proteins. Nucleic Acids Res. 2004, 32: W400-W404. 10.1093/nar/gkh417.
19. Natt NK, Kaur H, Raghava G: Prediction of transmembrane regions of β-barrel proteins using ANN- and SVM-based methods. Proteins. 2004, 56: 11-18. 10.1002/prot.20092.
20. Tusnády GE, Dosztányi Z, Simon I: PDB_TM: selection and membrane localization of transmembrane proteins in the Protein Data Bank. Nucleic Acids Res. 2005, 33: D275-D278.
21. Bagos P, Liakopoulos T, Spyropoulos I, Hamodrakas S: A Hidden Markov Model method, capable of predicting and discriminating β-barrel outer membrane proteins. BMC Bioinformatics. 2004, 5: 29. 10.1186/1471-2105-5-29.
22. Dayhoff MO, Schwartz RM, Orcutt CB: A model of evolutionary change in proteins. Atlas of Protein Sequence and Structure. 1978, 5 (Suppl 3): 345-352.
23. Koebnik R, Krämer L: Membrane assembly of circularly permuted variants of the E. coli outer membrane protein OmpA. J Mol Biol. 1995, 250: 617-626. 10.1006/jmbi.1995.0403.
24. BetaBarrel Predictor Web Server. [http://www.lix.polytechnique.fr/Labo/VanDu.Tran/bbp/]
25. Tran VD, Chassignet P, Steyaert JM: Prediction of permuted super-secondary structures in beta-barrel proteins. Proceedings of the 2011 ACM Symposium on Applied Computing SAC'11, ACM Digital Library. 2011, Taichung, Taiwan, 110-111.
26. Tran VD, Chassignet P, Sheikh S, Steyaert JM: Energy-based classification and structure prediction of transmembrane beta-barrel proteins. Proceedings of the 2011 IEEE International Conference on Computational Advances in Bio and Medical Sciences (ICCABS), IEEE Xplore. 2011, Orlando, FL, USA, 159-164.
27. Marsh D: Infrared dichroism of twisted beta-sheet barrels. The structure of E. coli outer membrane proteins. J Mol Biol. 2000, 297: 803-808. 10.1006/jmbi.2000.3557.
28. Murzin AG, Lesk AM, Chothia C: Principles determining the structure of β-sheet barrels in proteins I. A theoretical analysis. J Mol Biol. 1994, 236: 1369-1381. 10.1016/0022-2836(94)90064-7.
29. Murzin AG, Lesk AM, Chothia C: Principles determining the structure of β-sheet barrels in proteins II. The observed structures. J Mol Biol. 1994, 236: 1382-1400. 10.1016/0022-2836(94)90065-5.
30. Chou KC, Carlacci L, Maggiora GM: Conformational and geometrical properties of idealized beta-barrels in proteins. J Mol Biol. 1990, 213: 315-326. 10.1016/S0022-2836(05)80193-7.
31. Kyte J, Doolittle RF: A simple method for displaying the hydropathic character of a protein. J Mol Biol. 1982, 157: 105-132. 10.1016/0022-2836(82)90515-0.
32. Fano R: Transmission of Information. 1961, Wiley, New York.
33. Gibrat JF, Garnier J, Robson B: Further developments of protein secondary structure prediction using information theory. New parameters and consideration of residue pairs. J Mol Biol. 1987, 198: 425-443. 10.1016/0022-2836(87)90292-0.
34. Cormen TH, Leiserson CE, Rivest RL, Stein C: Introduction to Algorithms. 2009, MIT Press, 3rd edition.
35. Zhang C, Kim SH: A comprehensive analysis of the Greek key motifs in protein β-barrels and β-sandwiches. Proteins. 2000, 40: 409-419. 10.1002/1097-0134(20000815)40:3<409::AID-PROT60>3.0.CO;2-6.
36. Lewis BA, Engelman DM: Lipid bilayer thickness varies linearly with acyl chain length in fluid phosphatidylcholine vesicles. J Mol Biol. 1983, 166 (2): 211-217. 10.1016/S0022-2836(83)80007-2.
37. Rawicz W, Olbrich K, McIntosh T, Needham D, Evans E: Effect of Chain Length and Unsaturation on Elasticity of Lipid Bilayers. Biophys J. 2000, 79: 328-339. 10.1016/S0006-3495(00)76295-3.
38. Grantham R: Amino Acid Difference Formula to Help Explain Protein Evolution. Science. 1974, 185: 862-864. 10.1126/science.185.4154.862.
39. Bhaskaran R, Ponnuswamy P: Amino acid scale: average flexibility index. Int J Pept Protein Res. 1988, 32: 242-255.
40. Dunbrack RL, Cohen FE: Bayesian statistical analysis of protein side-chain rotamer preferences. Protein Sci. 1997, 6 (8): 1661-1681. 10.1002/pro.5560060807.
41. van Gunsteren WF, Billeter SR, Eising AA, Hünenberger PH, Krüger P, Mark AE, Scott WRP, Tironi IG: Biomolecular simulation: the GROMOS96 manual and user guide. vdf Hochschulverlag AG an der ETH Zürich and BIOMOS b.v.: Zürich, Groningen. 1996.
42. Li W, Godzik A: Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006, 22 (13): 1658-1659. 10.1093/bioinformatics/btl158.
43. Liu WM: Shear numbers of protein β-barrels: definition, refinements and statistics. J Mol Biol. 1998, 275: 541-545. 10.1006/jmbi.1997.1501.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.