### Conservation of histones between Human and Arabidopsis

Histone H3 is highly conserved between Human and Arabidopsis with 97.8% identity, different only in the positions of 31, 87 and 90 amino acids, which is even with the similar identity level as that among the different isoforms within Arabidopsis. Histone H4 is also highly conserved between Human and Arabidopsis with 98.0% identity, different in the sites of 60 and 72. The identity of Histone H2B is 80.0% between human and Arabidopsis, but less conserved in the N- terminus, where the known acetylated site is localized. The acetylation information of histone H2B.C of human is used in a direct conservation-mapping mode in the prediction. Because of the lack of data about acetylation site and much more variants in histone H2A, the acetylation pattern of which is un-predictable by sequence conservation directly.

### Human acetylation based prediction

#### Known acetylation comparison between Human and Arabidopsis

In Arabidopsis, there are 13, 12, 12, 22, 16, 21 lysines in Histone H2A.1, H2A.2, H2A.3, H2A.4, H2A.5 and H2A.6, respectively, only two sites, K5 and K144 in H2A, have been verified for acetylation[8]. For histone H2B, there are 11 variants, but only H2B.10, H2B.6, H2B.7 have the acetylation information. Therefore, we considered those three variants in our prediction according to the filter principle (see methods). There are 29, 32 and 32 lysines in H2B.10, H2B.6, H2B.7, respectively, seven, six and six lysines of them were verified to be acetylatable. Similarly, both of histone H3.2 and H3.3 contain 14 lysines, seven lysines in each of them have shown to be acetylable. The least lysines exist in histone H4, and 50 percent of the total 10 lysines have been identified to be acetylated by our data collection (see methods). Meanwhile, text mining shows that only a few lysines (1/16 and 1/5) in H2A and H2B have been detected to be acetylated. Compared with human histone acetylation, identified acetylatable lysines of H2B are significantly less in Arabidopsis by Fisher Exact Test with p value less than 0.01.

#### Training set insufficient in Arabidopsis

The key hypothesis of flanking peptide based prediction methods is the structure recognization. Here, such type of methods focuses on short stretches of amino acid N- and C-terminus, because structural studies of published KAT domains coupled with peptide substrates typically do not exceed 14-20 aa in length [4, 10]. Thus, the acetylatability of adjacent lysines is very interesting, and obviously has an effect on the peptide-based methods. Here, we take three as a distance to define the adjacent lysines. In the human training dataset (HTDS), there are 9 adjacent lysine pairs and eight of them have the same acetylatability with 89 identity, while only 53.0% identity remains in Arabidopsis training dataset (ATDS). Fisher's Exact Test shows that the relationship between adjacency and acetylability is different between the two training sets. By accepting the key hypothesis, the ATDS seems to be insufficient. Therefore, the highly conserved and well explored human acetylation data are adopted for the prediction in Arabidopsis.

### Human acetylation based prediction

#### Length normalization provide slightly data mining effect

The prediction method used here is mainly based on sequence similarity, as described in PreMod [6]. To construct a more stringent and theorizational model, several critical factors, including peptide length, lysine position and histone correlation, were considered.

To be consistent with the key hypothesis that acetylation is based on the structure reorganization of the lysine's flanking peptide, raw distance is defined as the local alignment score with BLOSUM 62 matrix and Smith-Waterman algorithm[11], and length normalized distance is equal to raw distance divided by the minimized length of the pair-wise alignment peptides. The hierarchical cluster dendrogram was constructed based on raw distance and length normalized distance. Cutting the dendrogram at a given height will split the tree into seperated classes (sub-classes) at a selected distance, the greater the number of the height, the less the amount of the seperated classes. Because the neighbour lysines are always with the same acetylation status, in such case, the number of the corresponding peptides was reduced to one according to PreMod. For the semi-supervised dataset with unobserved acetylation site labeled as N class, 'complete' hierarchical method and information gain ratio [12], are applied to construct and estimate the performance to balance the effect of minimizing sub-classes and maximizing accuracy, where 'complete' hierarchical method can minimize sub-classes and information gain ratio can maximize accuracy[12]. Comparing to the raw distance, length normalized distance reduces the sub-classes from 25 to 24 slightly, with the same information gain ratio 0.589 at the heights of 0.381 and 0.364 respectively, based on the hierarchical cluster dendrogram. As a visual data mining method, hierarchical cluster was used to summarize some chaos single sample into understandable patterns with different sub-classes. Whereas, to test whether such a method can cluster samples into simpler sub-classes, permutation method was employed to construct a significant level of random acetylation status labels. By 10000 permutations, p values of the Null hypothesis, that the number of sub-classes can be less than 25 or 24 by chance are 0.030 or 0.032, respectively. Furthermore, a 1000 time-permutation on normalization was also performed to test the distance, the length normalization effect is less significant with permutation p value as 0.27. The result showed that hierarchical cluster can group the peptides efficiently, but length normalization had less effect on the cluster result.

#### Human peptide acetylation based prediction performance

Two level prediction cutoffs were defined here. Because the raw tree with human peptide acetylation was built by complete hierarchical cluster method, adding any given different samples to the tree is unable to change the between-subclass relationships of the previous samples at any given height. The hierarchical structure was maintained with the high cutoff 0.364, and only the sites that are the children of the known human subclasses in the full cluster tree were considered to be predictable in acetylation prediction in Arabidopsis. Restricted to this cutoff, only 33 highly conserved sites in Arabidopsis were predicted, of which 12 sites were experimentally verified to be acetylable. Because of the lack of rigorous evidence that a site is un-acetylable, especially for the ATDS, no others but the false negative sample can be counted reasonably. Thus, only false negative rate and sensitivity are calculated for the prediction performance here, with 0% and 100% in the prediction results with high cutoff, respectively.

#### Proof

Given a tree *T* = *(V,E)*, in the complete linkage method, *D(r,s)* is computed as *D(r,s)* = Max { *d(i,j)* : Where object *i* is in cluster *r* ⊆ *V* and object *j* is cluster *s* ⊆ *V* }, At each stage of hierarchical clustering, the clusters *r* and *s*, for which *D(r,s)* is minimum, are merged.

1. Case 1(within-subclass): if Vector *i* and *j* are in a same sub-class at a height h, adding a any new vertex *k* will change the value of *d(i,j)* when *d(i,k)* <*d(i,j)* <*d(k,j),* then *d(i,j)* will be increased by *d(k,j)-d(i,j),* and the given height *h* may divide *i* and *j* to different sub-class.

2. Case 2 (between-subclass): if vertex *i* and *j* belongs to different sub-class *r* and s at a height *h,* respectively, without loss of generality, a new vector *k* is added to sub-class *r*. Then, the distance between the new set {*r, k*} and *s* is no less than distance between *r* and *s*, *D*({*r*, *k*}, *s*) ≥ *D*(*r*, *s*)) Thus, *i* and *j* will also be cut to the different subtree at the height *h*.

In all, adding new sample will not change the existed subclass relationship, because the monotone increases cluster distance definition.

To utilize the constructed hierarchical cluster tree and perform an overall prediction, we need to transform it to a classifier with features and a decision function, and fit another cutoff. According to the above hierarchical cluster results, 12 sub-classes with at least 2 peptides with the same label, 12 features were set to be the *D*(*p*_{
i
}, *s*′_{
j
}), where the function D is defined as complete linkage methods, *s*′_{
j
}) represents for *j* ∈ {1,...,12} th of the 12 sub-classes, and decision function is defined as

*x* = *αmin*(*D*(*x*, *s*′_{
j
}) – *βmin*(*D*(*x*, *s*′_{
i
}) – *I*

where *α, β* represent for the weight of sub-class distances, and can be treated as the confidential level of the acetylability, *i* ⊂ {Acetylable subclasse} and *j* ⊂ {Un-acacylabl subclasse}, *I* for the intercept.

With the max-min distance based classifier, the 266 peptides, generated by histone filtering and 12 flanking residues window sliding processes from both human and Arabidopsis, were predicted. Among them, eighty-one have been experimentally identified as acetylation sites. Among all the remained 185 unknown lysines, almost 70% lysines were predicted to be acetylable. Of the 185, nineteen have been regarded as unacetylated based on the information we collected. For the 19 negative labeled in Arabidopsis, H3K56 and H3K79 were relabeled as acetylable-positive, and almost 50 percent unaceletalyted sites in human histones were also relabeled to be positive ones. Sensitivity and false negative rate of our prediction were estimated to be 98.4% and 1.16% based on the 81 peptides, respectively. Although the flexible cutoff may contain some false positive acetylation site, the high sensitivity should be more attractive to make the prediction result to be a useful candidate set.

### Peptide based feature conservation between histones

#### Peptide based feature conservation between Histones

In the resulted hierarchical cluster tree, all the clustered peptides happened to belong to the same histone, we chose the information gain as a statistics index, which is a monotone function to the chaos of the sub-classes. A permutation of the information gain against the null hypothesis is given by p value less than 10e - 5, where the null hypothesis is that all the histone types are random to be categorized in the different sub-classes and a highly significant correlation between histone types and the sub-classes.

### Structure factors contribution in acetylation prediction

#### 3D location of lysines are related to acetylability

According to the constructed hierarchical cluster, in which the adjacent lysines are combined into one peptide, the clustered lysine shows not only the similarity in acetylability, but also closeness in protein location, which are not caused by the overlapping area of peptide. As in the N-terminus of the H3.1, all the lysines, H3K4, H3K9, H3K14, H3K18, H3K23 and H3K27, are acetylable and clustered together with the similar peptide pattern (Figure 1, B). To confirm the result, a permutation test was preformed against the distance of lysine location, where the null hypothesis is that all the location distances are no larger than the observed ones with the different sub-class, the test showed that a high significance of the adjacency is resulted by p value less than 10e- 5. Although the extreme N-terminuses of the core histones are not available in many crystal structure data, obviously, acetylability is associated with nucleosome location of lysines in 3D viewer (Figure 2).

Intuitionally, in the vertical viewer of the nucleosome 3D structure (2CV5) obtained from PDB [13], H3K37, H3K56, H4K31, H4K44, H4K59 (group1, Figure 2 A) are acetylable closer to the boundary regions of the DNA-binding of a nucleosome, which obviously contradistinguish to the H3K115 and H3K122 (group 2, Figure 2 A) which are closer to the middle regions of the DNA- binding of a nucleosome. In the horizontal viewer of the 3D structure, the lysines, H2BK40, H2BK43, H2BK105, H2BK113, H2BK117 and H2K122 (group 3, Figure 2 B), are obviously farther from the DNA than the acelytable sites, H4K77 and H4K79 (group 4, Figure 2 B) in the two double-strand region of nucleosome, while the third spacial stand-alone group, the pair of H4K91(group 5, Figure 2 C) are in the center of the nucleosome, which region is lack of un- acetylable lysines. Similarly, the pair of H2BK31, which are in the center of two double-strand regions of the nucleosome, have not been evaluated and therefore was not defined to be a feature group in the following prediction.

#### Other structure factors contribution in acetylation prediction

Many other structure factors have been considered to be possible features in acetylation prediction (additional file 1), but only a few have been selected. A typical type of factor is accessible surface area (ASA), which contains Percent ASA, Residue ASA, Sidechain ASA and Atom ASA. We tested the correlation coefficient between all of these values and acetylation statues by Wilcoxon Rank Sum and Signed Rank Test. Consistent with previously published results, none of them are significant at 95 percent confidential level.

### Peptide sequence and 3D structure integrated classifier

#### Construction of acetylability associated location region

To identify the distance between a lysine and acetylability associated location region, the surface of the region should be constructed properly. Here, the surface was chosen to have a shape of ellipsoid, which is fitted by the 3D location of ε-N (additional file 2),

*x*^{T}*Ax* = 1

where *A* is a symmetric positive definite matrix and *x* is a vector. In such case, the eigenvectors of *A* define the principal directions of the ellipsoid and the inverse of the square root of the eigenvalues is the corresponding equatorial radii, and then the distance was calculated numerically by the simulated points of the surface.

In an Arabidopsis nucleosome, the distance between any two lysines in all histones is calculated by mapping the short peptides to protein sequence of the 2CV5 crystal structure [13].

#### Classifier construction and performance

Because of the semi-reliability of the un-acetyability, it is doubtful to train a classifier by both positive and negative sets, which are supposed to be totally correct. Under the semi-supervised situation, we adopted a hierarchy process to predict the acetyablility of Arabidopsis histones. The first hierarchy decision was done by the higher level cutoff based on peptide similarity. The lysines not contained in the predicted result were predicted by the lower level peptide-based cutoff, in which the false negative rate is shown to be 1.16%. Then the structure location was adopted to adjust the positive lysines generated in the second hierarchy. The decision strategy in the third hierarchy was the same as that used in the above peptide-based method and the distance was defined by ε-N spacial location.

Through such hierarchy process, the performances for histone acetylation prediction in Arabidopsis were 0.73, 1, 0.26 and 0 for specificity, sensitivity, false negative rate and false positive rate, respectively. Comparing to the peptide-based prediction, the previous mispredicted acetylable lysines were corrected by the structure location. Because of the existence of non- observed acetylable site, the specificity is relative lower and less reliable. There are 227 Arabidopsis peptides in the 266 selected peptides from both human and Arabidopsis, and all the 227 Arabidopsis lysine peptides were predicted based on the model generated from the rest known human histone lysines' acetylation patterns. For all the 227 Arabidopsis lysines used in prediction, there are 72.7 percent lysines are acetylable (additional file 3), which indicate that many currently unobserved acetylated status and patterns could happen under those untested special tempo-spatial conditions.