A graph-theoretic approach for classification and structure prediction of transmembrane β-barrel proteins

Background Transmembrane β-barrel proteins are a special class of transmembrane proteins which play several key roles in human body and diseases. Due to experimental difficulties, the number of transmembrane β-barrel proteins with known structures is very small. Over the years, a number of learning-based methods have been introduced for recognition and structure prediction of transmembrane β-barrel proteins. Most of these methods emphasize on homology search rather than any biological or chemical basis. Results We present a novel graph-theoretic model for classification and structure prediction of transmembrane β-barrel proteins. This model folds proteins based on energy minimization rather than a homology search, avoiding any assumption on availability of training dataset. The ab initio model presented in this paper is the first method to allow for permutations in the structure of transmembrane proteins and provides more structural information than any known algorithm. The model is also able to recognize β-barrels by assessing the pseudo free energy. We assess the structure prediction on 41 proteins gathered from existing databases on experimentally validated transmembrane β-barrel proteins. We show that our approach is quite accurate with over 90% F-score on strands and over 74% F-score on residues. The results are comparable to other algorithms suggesting that our pseudo-energy model is close to the actual physical model. We test our classification approach and show that it is able to reject α-helical bundles with 100% accuracy and β-barrel lipocalins with 97% accuracy. Conclusions We show that it is possible to design models for classification and structure prediction for transmembrane β-barrel proteins which do not depend essentially on training sets but on combinatorial properties of the structures to be proved. These models are fairly accurate, robust and can be run very efficiently on PC-like computers. Such models are useful for the genome screening.


Background
Transmembrane proteins play several key roles in the human body including inter-cell communication, transportation of nutrients, and ion transport. They also play key roles in human diseases like depression, hypertension, cancer, thus are targeted by a majority of pharmaceuticals being manufactured today. The transmembrane proteins are divided into two main types according to their conformation: a-helical bundles and b-barrels (TMB). The TMB proteins, which are much less abundant than helical bundles, are found in the outer membrane of Gram-negative bacteria, mitochondria and chloroplasts. They perform diverse functions such as porins, passive or active transporters, enzymes, defensive or structural proteins [1]. Thus, the structure of TMB proteins is very important for both biological and medical sciences.
These proteins, which span the membrane entirely, make up 20 -30% of identified proteins in most whole genomes. However, due to difficulties in determination of their structures, solved TMB structures constitute only a meagre 2% of the RCSB Protein Data Bank (PDB) [2][3][4][5]. This is mainly due to experimental difficulties and complexity of the TMB structure [6]. Consequently, various learning-based techniques have been developed for discriminating TMB proteins from globular and transmembrane a-helical proteins [6][7][8], and for predicting TMB secondary structures [7][8][9][10][11][12]. We first discuss these methods and their potential shortcomings in detail, and then proceed with describing our approach.
Ou et al. [10] proposed a method based on radial basis function networks to predict the number of b-strands and membrane spanning regions in b-barrel outer membrane proteins. Randall et al. [9] tried to predict the TMB secondary structure with 1D recursive neural network using alignment profiles. Gromiha et al. [7,8] used the amino acid compositions of both globular and outer membrane proteins (OMPs) to discriminate OMPs and developed a feed forward neural network-based method to predict the transmembrane segments. Bagos et al. [11] produced a consensus prediction from different methods based on hidden Markov models, neural networks and support vector machines [8,[13][14][15][16][17][18][19]. Tractability has been an issue for some of these approaches. In order to overcome this limitation, Waldispühl et al. [12] used a structural model and pairwise interstrand residue statistical potentials derived from globular proteins to predict the supersecondary structure of TMB proteins. Freeman et al. [6] have introduced a statistical approach for recognition of TMB proteins based on known physicochemical properties.
Most of these rely on the learning assumptions in the underlying models as well as the sampling of proteins in their training set. However, the number of TMB proteins known today is tiny. Thus, it is arguable whether these approaches can work well for recognizing and folding TMB proteins which are not homologous to those currently known. It is also important to note that none of these methods allow for permutations in protein structures. The TMB structures are not merely a series of b-strands where each is bonded to the preceding and succeeding ones in the primary sequence, but they may contain Greek key or Jelly roll motifs as well, for instance, the C-terminal domain of the PapC usher [PDB:3L48]. This level of structure may be described as a permutation on the order of the bonded strands.
In this paper, we present a novel ab initio model for classification and structure prediction of TMB proteins based on minimizing free energy in a graph-theoretic framework. It is able to deal with permuted TMB structures. The prediction accuracy is evaluated on known TMB proteins available in popular protein databases [20], and compared with existing software [9,10,12,21].
Our approach also performs well in structure prediction and the results are comparable to those of the existing algorithms. Ours is the first model that actually gives an insight into the physicochemical model rather than merely classifying or predicting TMB proteins. The results show that our approach is also good at discriminating TMB proteins.

Folding
The folding prediction results are presented in Table 1 and Figure 1. Figure 1 plots the Matthews Correlation Coefficient for our approach BBP (Beta-Barrel Predictor) and TMBpro for different proteins along the x-axis. The results of our approach are comparable to those of TMBpro but more consistent as we do not rely on training for folding. We note that, in the cases the program predicts an optimal structure with a wrong number of strands, the optimal energy is really close to the energy of the topologically right structure.
The TMBETAPRED-RBF web-server predicted non-TMB for 24 over 41 proteins of PDBTM40, or 58.5%. The structures for correctly identified proteins were completely accurate. This might be because they were included in the training set.

Evaluation of shear numbers
We study the energy distribution of 17 TMB structures (ECOLI40) in E. coli taken from PDBTM40 (including [PDB: 1AF6_A, 1BXW_A, 1BY3_A, 1FEP_A, 1ILZ_A,  1PNZ_A, 1QJ8_A, 1TLW_A, 2F1T_A, 2GSK_A,  2HDF_A, 2IWW_A, 2J1N_A, 2R4P_A, 2WJQ_A,  3AEH_A, 3GP6_A]) with regards to the slant angle, hence the shear number (see Figure 2). Most optimal structures incline with an angle of 41°-49°, as observed in databases. This suggests that our model performs well the physicochemical properties of TMB structures. It should be also noted that there is no natural way to define the shear number a priori.

Influence of the filtering threshold
We apply the filtering thresholds ρ = 1 3 , 1 2 and 2 3 on ECOLI40. These thresholds ensure that on average, considering 3-residue blocks as subunits, each segment is accepted as a b-strand if its propensity to be b-strand is at most 3, 2, 1.5 times, respectively, less than its propensity to be other structure (a-helices or turns/loops). The observed minor difference in accuracy with such considerably distinguished thresholds reinforces the fair independence of our approach from the training data. The results in Table 2 show the strong predicting ability of BBP from a poor known database. The lower the parameter r, the more independent to the training the predictor. This reduced the prediction performance of the

Evaluation on mutated sequences
We generate the mutated sequences from ECOLI40 by substituting the amino acids at turns or loops using the PAM250 substitution matrix [22]. Each sequence in ECOLI40 is mutated up to 5% of amino acids into 10 new sequences. Figures 3 and 4 show the Matthews Correlation Coefficient and F-score for residues and bstrands. We observe from these results the stability of our predictions. It also suggests that the TMB proteins are stable against these mutations at their turns and loops. The difference in structures of those mutated proteins may merely come from the shift of membrane spanning b-strands when their two extremities are mutated.

Permuted structures
For [PDB:3L48], the C-terminal domain of the PapC usher in E. coli, the observed structure topology containing a Greek key motif corresponds to the permutation s = (1, 4, 3, 2, 5, 6, 7) and is predicted with an accuracy (Q 2 ) of 70.2% at r = 0.2. Following the experimental observations that were published previously on the efficiency of the in vivo membrane assembly of OmpA variants [23], we test our algorithm with different given permutations. OmpA   Figure 5). The pseudo-energy 10.21 of the observed permutation is found in the lowest energy zone. 41 permuted structures, or 0.81%, reach an energy of (10.21 ± 0.3). A ratio of about 1.31% is found in the case of OmpX [PDB:1QJ8] (see Figure 6). These results are not surprising since a protein may be folded into more than one spatial conformation. In both cases, a Poisson-like distribution is found. This observation may help to discriminate most of infeasible conformations with the use of a threshold on the global energy. Hence, the method is expected to rapidly find a small set containing the right structure within a threshold of, for instance, 2% from the lowest energy and with structural feasibility conditions on permutations. This set might be much smaller be refining the biologically plausible permutations. Other proposed solutions in this set may be the candidates for in vivo and in vitro studies.

Classification
100% of the non-redundant set of 177 a-helical transmembrane proteins of length from 140 to 800 residues in PDBTM are rejected, whereas 31 out of 32 nonredundant lipocalins taken from PDB are predicted as non-TMB (the dataset is available at [24]). Though lipocalins are also b-barrels which reverse the TMB pattern with a hydrophobic core, the environmental effects on both sides of the barrel are still different. Our pseudoenergy model yields unfavorably on such structures and discriminates considerably better than the learning- based methods like Freeman-Wimley [6], TMBpro [9], PRED-TMBB [18] and TMBETAPRED-RBF [10], but also of transFold [12].

Conclusions
We have presented a new pseudo-energy minimization method for the classification and prediction of transmembrane protein super-secondary structure based on a variety of potential structures. Our approach takes into account many physicochemical constraints and minimizes the free energy. It also accounts for permuted structures, thus giving more complete information on the folded structure. Our method is quite accurate with more than 90% sensitivity and F-score, over 80% M.C.C.  score on strands; and over 74% accuracy and F-score on residues. The results are comparable to those given by TMBpro and TMBETAPRED-RBF, which are both learning based methods. Moreover, our results are more consistent and have a significantly less variation across different TMB proteins. This is especially interesting given that our algorithm is based mainly on pseudoenergy minimizations, and the probabilistic model only plays a very small role. While the model presented here is only for TMB proteins, it can be easily extended to accommodate a-helical bundles. We did not use a more sophisticated statistical model for classifying b-barrel strands because that would risk overfitting and reliance on the training dataset. It is also interesting to note that our approach performs very well for identification of TMB proteins, rejecting all the a-helical bundles. The Freeman and Wimley [6] approach is more accurate on some datasets. However, it risks overfitting and does not predict the structure. Therefore, our approach provides the best overall classification results amongst the methods that try to predict structures. Our model does learn the probabilistic model from training dataset, but it is mainly to screen out obvious non-TMB strands. Therefore, there are no concerns about the size of the training data or overfitting. Even though the results presented in this paper are comparable to other methods, the methodology presented here is novel and gives insight into the actual physicochemical constraints and energy. Moreover, our approach should be able to predict TMB proteins which are significantly different from known proteins. Finally, our approach provides more information than the current approaches by providing the permutations of the strands.

Future work
We are working on energy models for TM a-helical bundles and b-barrels with broken strands, as well as globular b-barrels like lipocalins or membrane targeting proteins (C2 domain) where permuted structures are usually found. Nevertheless, similar to the other methods, we only propose single-domain protein structures.
We are also currently working on refinements in structural constraints and hydrophobicity, which may help to improve the accuracy of our predicted structure. Finally, it will be interesting to investigate more sophisticated statistical models for the initial screening, both to improve the results and understand how effective a mixed approach can be.

Methods
We now present the methods developed for classification and structure prediction of TMB proteins (a preliminary version of this work appeared as a short paper in [25,26]). TMB proteins are hard to identify, however, it is relatively easy to identify a majority of other proteins which are not TMB. We use physicochemical properties and a simple probabilistic model based on a sliding window for filtering amino acid segments that are obviously not involved in any b-barrel structures as a membrane spanning b-strand. Proteins that are considered to be putative TMB proteins by this initial phase are then further analyzed. Next, we try to fold the given protein, treating it as a TMB protein, using the pseudoenergy minimization model. If the protein cannot be folded into b-barrels according to the energy minimization framework, the protein is rejected and classified as a non-TMB protein.
Before presenting the simple model that we used for filtering the transmembrane b-strands, we discuss some physicochemical constraints that a protein must obey to be a TMB protein. We enforce these constraints in both the filtering and folding steps of our algorithm.

Geometric framework for b-barrels
For a regular b -barrel [27][28][29], the backbone geometry is entirely determined by n, the number of strands composing the barrel, and by S, the shear number, which is defined below. Definition 1 Shear number of a b-barrel In a regular bbarrel, the shear number S is unambiguously defined as the ordinal distance between an amino acid A and an amino acid B that is located on the same strand as A and linked to A through a path of hydrogen bonds. B is the projection of the "copy" of A after one turn on the first strand of the barrel.
Structural constants are h (≈ 3.3Å), the jump per amino acid along a strand, and d (≈ 4.4Å), the mean distance between adjacent strands, given respectively by the peptide bond and hydrogen bond geometries. The other geometric characteristics, such as θ, the slant angle of the strands relative to the z barrel axis, are given from n, S, h and d [30]: Angle θ, in association with a given membrane thickness, is involved in the energetic rules and restricts the membrane spanning b-strand length. Then, n and S have to be fixed as parameters.
Definition 2 Relative shear number Given a shear number S, the relative shears between adjacent strands remain as n -1 degrees of freedom. As a convention, we consider the relative shears on the extracellular side of the barrel. So, ∀i >1, s i , the relative shear of strand i + 1 with respect to strand i (strand n + 1 being identified with 1), is measured on strand i as the ordinal distance between the undermost amino acid of strand i and the one that is directly bound to the undermost amino acid of strand i + 1.
On the example of Figure 7, the sequence of relative shears (s i ) is (1 1 1 2 1 1 1 2). The sum of consecutive relative shears naturally defines the shear between two extreme strands, thus we have the constraint for the bbarrel, where the two extreme strands are strand 1, for instance, and itself after a round on the barrel: We define the shear number, by extension, for the case of a b-sheet (i.e. an open b-barrel) to make our algorithms capable of dealing with the structure of b-sheets.
Definition 3 Shear number of a b-sheet The shear number of a n-strand b-sheet is defined as the sum of relative shears on consecutive pairs of adjacent strands: Each b-strand is directed with respect to the sequence order from N-terminal to C-terminal. A strand is said to be upward if it is oriented from the extracellular environment to the periplasmic space, i.e. the N-terminal of the strand is located on the extracellular side and its Cterminal is on the periplasmic side. Inversely, the strand is said to be downward. The upward/downward orientation of the strand, relatively to the barrel axis, defines another degree of freedom.
Finally, considering a b-strand as a ribbon where the amino acids direct their side-chains alternatively on both sides, toward the barrel interior (channel) or toward the surrounding lipid (membrane), we will distinguish two ways of facing, neglecting small swivel adjustments. A strand is said to be odd inward if the odd indexed amino acids face to the channel and odd outward if those face to the membrane. We have one more degree of freedom.
Physicochemical constraints. On the amphipathic bstrand of TMB proteins, the side-chains of amino acids are directed towards the membrane and the channel alternatively. Hydrophilic and polar side-chains orient towards the aqueous interior while hydrophobic ones contact the hydrophobic bilayer [1]. We use the Kyte-Doolittle scale [31] to measure the hydrophobicity H(r) of each amino acid r. In this scale, a higher value represents higher hydrophobicity, and vice versa. The necessary condition for a segment r i .... r j to be a potential membrane spanning b-strand is that one side is hydrophobic and the other side is hydrophilic. Formally, we define as the average hydrophobicity on the respective even and odd numbered sides. Hence, the constraints are necessary for a segment of j -i + 1 consecutive amino acids r i .... r j to be a potential membrane spanning b-strand, where ζis a lower bound for the hydrophobic side and ζ + is an upper bound for the hydrophilic side. We use the values ζ -= -1 and ζ + = 1, which were obtained through an statistical data analysis Figure 7 The schematic planar view of a 8-b -strand barrel (strand 1 is duplicated for clarity). Thick lines represent the peptide bonds between consecutive amino acids along their strand. Thin lines represent the hydrogen bonds between the amino acids in adjacent strands. In this example, the shear number is S = 10, which is the ordinal distance between amino acids A and B. We note that all known b -barrels have a positive shear number [43] and are slanted "to the right", as illustrated here. on known TMB structures (see Figure 8). Then, with respect to the TMB structure, the segment r i ....r j is defined as odd inward oriented if H o i,j < H e i,j and odd outward oriented if H e i,j < H 0 i,j . Classification filtering. In order to identify substrings as potential membrane spanning b-strands (the vertices) or turns/loops (the edges), we introduce a simple probabilistic model that acts as a primary filter. We use a sliding window (segment) as a sequence of consecutive l-residue subsegments (or blocks) (l = 3 in our implementation). Let r denote the occurrence of a given block (r = r 1 r 2 . . . r l ) and let τ be the event that a block is found in a given conformation (b-strand or turn/ loop). The information that τ gets from r is defined as: where f τ,r represents the frequency observed in the training dataset for a block r to be found in conformation τ and we denote for short [32]: Thus, I(τ ; r) measures the influence of r on the occurrence of τ. If I(τ ; r) = 0, there is no influence; whereas I (τ ; r) >0 indicates that r is favorable to the occurrence of τ and vice versa. Formally, the preference of r in favor of τ as opposed toτ , any conformation different from τ [33], is: fτ ,r fτ , .
A simple measure is associated to each segment r 1 r 2 . . . r p that helps determine if it is likely a b-strand or a turn/loop. It is defined as the sum of informations on all the l-residue blocks: The segment is then considered as a candidate for conformation τ ifĨ(τ :τ ; r 1 r 2 . . . r p ) > 0.
The non-redundant training set PDBTM40 of 41 TMB proteins is used to learn this probabilistic model. Due to the small size of the training set, we apply the filter with a relatively low threshold at ρ = 2 3 to avoid overfitting. This ensures that on average, each block r is accepted in conformation τ if the propensity for τ to be in τ (i.e. f τ ,r /f τ , ·) is at most 1.5 times less than the propensity to be inτ (i.e.fτ ,r /fτ , ·) . Only substrings that pass these very stringent criteria are considered to be putative strands. Now we present a graph-theoretic energy minimization model for recognizing and folding TMB proteins.

Definition of the graph structure
Dynamic programming approach. Let S be the sequence of the N amino acids constituting the primary structure of a given protein. We will consider G(V, E, E intr , E adj , E loop ) , the weighted directed acyclic graph (DAG) [34] built from S as follows: Vertices Let V = V* ∪ {⊤, ⊥} be the set of vertices. Each vertex of V* represents a candidate secondary structure item as a bstrand associated with a given set of parameters. It corresponds to a contiguous part (a substring, defined by its starting and ending indices 1 ≤ v < k ≤ N) of S that satisfies given conformational constraints (such as length, propensity to be a b-strand, . . .). The associated parameters provide information about the discretized spatial laying of this part relatively to the whole structure. So, combining the upward/downward and inward/ outward degrees of freedom previously introduced, we consider 4 different orientations for each given candidate b-strand. We could also consider the different instances of relative shear to multiply the number of vertices, but we do not for reasons to be clarified later. A canonical order is defined on V* as the lexicographic order on tuples formed by the respective starting/ending indices in S and the associated parameters. The length constraint implies that the number of candidate substrings and thus |V|, the number of vertices, are bounded above by kN for a small value k. To simplify further definitions, a dummy vertex ⊤ will be used to represent an empty substring at the start of S and, similarly, ⊥ will represent an empty substring at the end of the sequence. To extend the order on all of the vertices, we set ⊤ <v < ⊥, ∀v V* (see Figure 9).

Edges
Let E ⊂ V × V be the set of directed edges. Intuitively, an edge corresponds to a turn or a loop that connects two consecutive b-strands. To be more precise, ∀v, w V*, with ν v , v , ν w , w denoting their respective starting and ending indices, (v, w) is an edge, if v <ν w -2 and the substring of amino acids from v + 1 to ν w -1 satisfies the constraints that allow to form a turn or a loop (such as conditions on length, flexibility, propensity, . . .) also depending on the relative laying of the two substructures. We have the elementary property: for the lexicographic order, and this ensures the DAG structure.
The set E also contains edges of the form (⊤, v) that define the subset of starting vertices -the leading substrings satisfying specific constraints. Similarly, E contains edges of the form (v, ⊥) that define the subset of ending vertices, with a satisfactory trailing substring. Again, the length constraints applied to the substrings associated to edges imply that |E|, the number of edges, is O(|V |) or O(N) . Figure 9 gives a small example of such a graph (to simplify, only one orientation has been considered). An edge like (v 1 , v 2 ) is forbidden, since the two corresponding substrings overlap. Edges like (v 2 , v 3 ) or (v 2 , v 6 ) are also forbidden, since the inserted substrings are respectively too short for a turn or too long for a loop.

Energy attributes
The attributes that complete the definition of the graph G are pseudo-energy functions defined as follows: • ∀v ∈ V * , E intr (v) represents the intrinsic energy of the given strand in the given orientation. This term is the sum of both the internal energy of the substructure, i.e. the interactions between its own amino acids, and the interaction energy with the environment (e.g. membrane and channel) apart from the rest of the considered protein.
• ∀(v, w) ∈ V * ×V * , E adj (v, w, s) represents the interaction energy of the pair (v, w) when the two corresponding strands are placed side by side along the barrel, with respect to the respective orientation parameters associated to the vertices and accordingly to the relative shear s. The energy will take into account the number of contacts and different sidechain interactions such as the packing of hydrophobic cores and bonding abilities. Then, , s)is the interaction energy of the pair (v, w) for an optimal relative shear. It is further assumed that E adj is defined over a superset of E, since we will consider the case where two adjacent strands are not consecutive along the sequence.
We also introduce the particular values: • An associated function s adj is defined such that: which is a relative shear that leads to the optimal interaction energy.
An arising question is why the orientation degrees of freedom are described as a multiplicity of nodes but the relative shear degrees of freedom are considered when calculating the E adj terms. A first answer comes from the fact that wrong orientations are rather absolute and will result in pruning the sets E and V while the shear parameters are not so discriminative. The main reason is that we will consider "floating" parts in which adjacencies are already set, while a relative shear between any two parts is not yet known. In such a situation, attaching the relative shears to node pairs allows a significant factorization.
• ∀(v, w) E,∀t {1, 2, . . . , n -1} and ∀ s -a relative shear, E loop (v, w, t, s) is related to the intrinsic energy of the turn/loop between the strands v and w (consecutive along the sequence) when they are placed at a distance t along the barrel with a relative shear s. The distance t = 1 corresponds to the case where the strands are placed consecutively on the barrel, while an integer value t >1 will correspond to the case where t -1 other strands are interleaf. To simplify, we will also use E loop ( , v) or E loop (v, ⊥) for denoting the intrinsic energy of the outer fragment attached respectively to a starting or an ending vertex v. As such a fragment has a free side, the position parameters may be dropped.
Then, in the usual case of two b-strands that fold as a hairpin, the related energy is considered to be E adj (v, w) + E loop (v, w, 1, s adj (v, w)) . It is supposed a relative flexibility for turns and loops, so, when a fold is feasible, E loop is weak compared to E adj and the relative placement of the two b-strands is enforced to be close to s adj . Nevertheless, E loop will result in a strong penalty in the case of an unfeasible turn or loop, for example a loop with a majority of hydrophobic residues.

Protein folding problem
Given a graph G(V, E, E intr , E adj , E loop ) defined as above, two integers n, S, and a permutation σ as 3 parameters, we look for the path P in G that maximizes the following objective function: Such a path P whose vertices are arranged onto a circle is called a circle-attached path. The adjacent vertices in the path are not necessarily successive on the circle. This order of succession is determined by the given permutation σ (see Figure 10).

Solving as the longest path problem
We will first consider an open structure, as a b-sheet, where the adjacency of strands follows their natural order along the amino acid sequence, i.e. s is an identity permutation. We involve here the constraint ∑ 1<i≤n s i = S. Hence, solving such a structure will result in finding a path P in G whose overall "energy" is given by the sum: Aiming at minimizing E , the protein folding problem will turn into finding the path from ⊤ to ⊥ that maximizes the criterion C = −E . Let C h v be the maximum value for C over all the paths from ⊤ to v, with a shear number of h of the corresponding b-sheet, then C 0 = 0 and ∀v V\{⊤}, ∀h, C h v is defined as: Since the graph is a DAG, the longest path problem is solved with a well known dynamic programming scheme [34] of complexity O(|V|) in space and O(|V | + |E |) in time, that is also O(N) for both, from the structural constraints that relate |V |, |E | and N. The objective is the computation of C S ⊥ and the optimal structure is then reconstructed by a usual traceback post-processing. Note that, for each path, we only have to consider its last vertex, so, we have to track single index states.
For a barrel secondary structure, we have to consider a closing spatial adjacency between the last and the first strands. s is still an identity permutation. The constraint on the shear number becomes ∑ 1<i≤n+1 s i = S. The dynamic programming scheme is almost the same as previously, except that we also have to keep track of the first vertex of any path. So, ∀v V*, such that (⊤, v) and a special closing step is needed: The goal is to calculate max v,( ,v)∈E C S (v,⊥) . Thus the scheme is of complexity O(|V| 2 ) in space and O(|V | · |E |) in time, that is also O(N 2 ) for both, from the structural constraints. This may produce paths of any length and the constraint of n strands is applied as a cut in the recurrence.

Generalization
In a more general case, we consider permutations to deal with the fact that the arrangements of the strands along the barrel do not necessarily follow their order along the sequence. This usually occurs with Greek key motifs or more rarely with Jelly roll motifs. Hence, the protein folding problem becomes finding the longest path P in a graph with respect to a given permutation σ, i.e. the vertices of P , seen on a circle as in Figure 10 are permuted according to σ.
Let s be a circular permutation of {1, 2, . . . , n}. When 1, 2, . . . , n are numbering the positions along the barrel, values s (1), s (2), . . . , s(n) will give the respective ranks of the strands in the sequence order. A position of reference along the barrel is fixed by setting s(1) = 1. Figure 10 shows a first example of a structure with a Greek key motif, which is described by the permutation s = (1, 2, 3, 6, 5, 4).
Hereafter, we will illustrate the presentation of our algorithm by following the example s = (1, 2, 5, 4, 3, 6), which is a bit trickier situation. This example is now said the current example. The corresponding structure and the dynamic programming process are illustrated in Figures 11 and 12.
The dynamic programming scheme now consists in building a barrel, by adding a next strand, taken in sequence with respect to the graph edges, but that is inserted at the position defined by the given permutation. Useful values are the ranks (in the sequence order) of the two strands between which a given one will be inserted. For instance, with the current example, the 5 th strand will be inserted between the 2 nd and the 4 th strands.
Let now k denote the level of construction (1 ≤ k ≤ n), that is the number of strands already placed.
Proposition 4 The k th strand (in the sequence order) is inserted between the two strands whose ranks (in the sequence order) are left k and right k , defined as: With the current example, we get (see Figure 11): left 1 = 6 left 2 = 1 left 3 = 4 right 1 = 2 right 2 = 5 right 3 = 6 left 4 = 5 left 5 = 2 left 6 = 3 right 4 = 3 right 5 = 4 right 6 = 1 An important piece of information to store for the dynamic programming scheme is the set of "active" indices, i.e. ranks of the strands (in the sequence order) that are not definitively bonded on both sides, along the barrel, and also not linked along the sequence and thus have to be kept as degrees of freedom. So, in the current example (see Figure 12), we have to keep in memory as many solutions (to subproblems) as valid instances of the 2 nd and 4 th strands, until an optimal choice for these is recorded as a solution for each instance of the 5 th strand. At that time, any instance as the 5 th strand is kept as a candidate for a link with the 6 th , by a turn or loop, while the different instances as the 3 rd and 1 st are kept for proceeding to an insertion in between.
Definition 5 Two ranks i and j, which refer to the sequence order, are said "adjacent" if: where the case n -1 is intended for the adjacency that will close the barrel.
Proposition 6 The set of "active" indices (in the sequence) at level k is defined by: With the current example of Figures 11 and 12, we get: Thus, for this example, the maximal complexity in space, O(N 4 ), is reached for the set of solutions to the subproblem with 4 strands. Then looping over this set, for computing the set of solutions to the subproblem with 5 strands, will also cost O(N 4 ) in time, since the choice for the 5 th strand is bounded by the structural constraints embedded as edges in the graph. It is a difference with most of the dynamic programming schemes where the complexity in time is expressed with an additional O(N) factor compared the complexity in space. As an other example, in the case of Figure 10, we obtain the complexity O(N 2 ) in both time and space, Figure 11 A permuted b-barrel with a Greek key motif 5436, s = 1 2 5 4 3 6. Left part is a "flattened" side view, stands 6 and 1 being also adjacent. Right part is a schematic view from above, looking down along the axis of the barrel.
which is similar to the case where s is an identity permutation.
Now we have to decide at which minimal level k each term E adj or E loop is determined and can be integrated in the dynamic programming scheme. For the E adj terms, it is simply asserted that the previous or the next strand along the barrel is already placed when left k < k or right k < k, respectively.
Proposition 7 For all k, we have: This results from the definition of the "active" indices of conf k -1 . To simplify the further energy expression, we use the following notation for an "ifelse" function: For the E loop terms, the problem is to wait until the relative shear between the two ends of a turn or loop is solved by the interleaf adjacencies. So, in the given example, the energy of the loop between the 2 nd and 3 rd strands can only be evaluated when the 5 th strand has been laid and the optimal relative shear Definition 8 Let A k be the relation on positive integers, defined as: ∀i, j, iA k j ⇔ ⎡ ⎣ i = j or i ≤ k and j ≤ k and i, j are adjacent then let A * k denote the equivalence relation defined by the transitive closure of A k and let A k = {i < k|iA * k (i + 1)}. Thus, i A k means that the i th and (i + 1) st strands are geometrically linked by adjacencies when the k th substructure is laid and we can compute by composition an optimal relative shear s * adj .
We will now focus on the set δ A k = A k -A k -1 , ∀ k > 1. Proposition 9 For all k, we have: Proposition 10 For all i < k -1, Definition 11 Let T k ⊂ V * |confk| denote the set of all tuples of |conf k | vertices such that there is at least one path (of k edges) starting from ⊤ and passing through these vertices in order.
For any instance z T k of such a tuple and, ∀i conf k , let z[i] denote the i th vertex of a corresponding path.
This notation (not to be confused with z i, the i th component of tuple z) is not ambiguous since, from definition, the vertex z[i] is in common to any path associated to z. Particularly, z[k] is the last vertex of any path associated to z.
Proposition 12 For all z Î T k , the set of tuples corresponding to paths of length k -1 that can be extended to a path corresponding to z is defined as: Let C h k,z be the maximum value for C over all paths starting from ⊤ and leading in order through the vertices of a given tuple z T k with a shear number of h of the corresponding b -barrel. The general recurrence relation is: ∀z T k,  Figure 12 Successive steps of the dynamic programming scheme for a Greek key motif 5436, s = 1 2 5 4 3 6. Arrows show the "waiting" constraints. As all the (loop or adjacency) constraints have not been satisfied by determining the linked segments, a segment should be considered as a variable with the memory of a partial solution for each possible instance. The corresponding ranks are said "active" and are listed in the conf set.
Note that, from proposition 7, ∀y T k -1 , if left k <k then the vertex y[left k ] is defined (and the same is worth for right k ). We can check that each E adj term is finally counted exactly once in the sum, at the level corresponding to the position of its further vertex in the sequence order. The optimum is found at k = n and h = S.
Corollary 13 The complexities are O (N max k ||conf k || )in space and time.
For any permutation, we have Hence, max k ||conf k || ≤ 1 + (2n -2) /3. For a permutation that only differs from the identity permutation by disjoint Greek key motifs [35], i.e. where G j = i j + 3, i j + 2, i j + 1, i j + 4 or G j = i j + 1, i j + 4, i j + 3, i j + 2, it is easy to prove that max k ||conf k || ≤ 4 by a discrete analysis on different configurations. The complexities are thus at most O(N 4 ) for such a permutation. In short, it is possible to compute the optimum in O(N 2 ) running time for structures corresponding to the identity permutation and from O(N 2 ) (for instance, example of Figure 10) to O(N 4 ) (for instance, example of Figure 11) for structures containing disjoint Greek key motifs, where N is the input sequence length. These computation costs might be further improved by a tree decomposition-based algorithm that we are currently working on.

Implementation details
The number of strands n and the shear number S determine the geometry of the barrel, particularly the membrane spanning part of the segments, and are thus involved in the computation of energy terms. If known, the algorithm can enforce these value and fold the protein accordingly. The values for n, which are usually even, are governed by the consideration on the length of the sequence, the thickness of membrane and the length of turns or loops and vary between 8 and 22 [1]. The values for S, are even and included between n and 2n [28,29]. The problem is then solved by the constrain dynamic programming with the constraints of given n and S. A small number of couples (n, S) have to be explored and our algorithm is fast enough for that.
Side-chain interactions between contiguous residues along a segment on the same side and interactions with the environment of channel or bilayer define the intrinsic energy of the corresponding vertex. The pairing energy of two adjacent segments in the barrel is computed by optimizing the relative positions between constituent amino acids. These energies involve hydrogen bonds in main chains, electrostatic interactions between side-chains, hydrophobic effect as well as environmental effect. More specifically, the extracellular and intracellular environments with distinct hydrophobicity indices can have significantly different hydrophobic effects. In addition, the membrane thickness gives constraints on segment size and helps identify the interactions inside or outside the membrane region. We use here by default a parameter of 3 nm for the membrane thickness, thus 8 residues thick [36,37]. The features on size, polarity [38], and flexibility [39] of turns and loops are also taken into consideration, i.e. turns and loops satisfy threshold constraints on their polarity and flexibility indices and their length. Their energies are approximated by hydrophobicity [31].
We use the Dunbrack backbone-dependent rotamer library [40] and the partial charges from GROMOS force field [41] to compute pairwise interaction energies. The hydrophobic interaction between two side-chains u, v is assessed by the amount of contacts between nonpolar groups, calculated by taking the average on all rotamer pairs of the two side-chains e uv =<e uv|rotamers >. Each side-chain plays a role of a group of partial charges in the electrostatic interaction. The main-chain hydrogen bond is measured by the electrostatic potential energy between peptide CO and NH groups.
The probabilistic model and the constraints on hydrophobicity help discard the unlikely membrane spanning b-strands. A threshold on overall energy can also be involved to enhance the discrimination. We studied the per-strand energy value for a variety of TMB proteins including the training dataset and other TMB proteins. Even though this value is always higher than 0.9 for these proteins, we chose 0.85 as a threshold to avoid overfitting. Note that this does not affect the prediction results, and is only used for classification.

Datasets
We used TMB proteins from the PDBTM database [20] to train and test our approaches.
• Folding: We used CD-HIT [42] to constrain the redundancy in proteins. A threshold of 40% similarity was applied to reduce the dataset, resulting in 49 sequences (PDBTM40). We retain only the monomeric barrels, i.e. the sequences that form a unique