Significant distinct branches of hierarchical trees: a framework for statistical analysis and applications to biological data

One of the most common goals of hierarchical clustering is finding those branches of a tree that form quantifiably distinct data subtypes. Achieving this goal in a statistically meaningful way requires (a) a measure of distinctness of a branch and (b) a test to determine the significance of the observed measure, applicable to all branches and across multiple scales of dissimilarity. We formulate a method termed Tree Branches Evaluated Statistically for Tightness (TBEST) for identifying significantly distinct tree branches in hierarchical clusters. For each branch of the tree a measure of distinctness, or tightness, is defined as a rational function of heights, both of the branch and of its parent. A statistical procedure is then developed to determine the significance of the observed values of tightness. We test TBEST as a tool for tree-based data partitioning by applying it to five benchmark datasets, one of them synthetic and the other four each from a different area of biology. For each dataset there is a well-defined partition of the data into classes. In all test cases TBEST performs on par with or better than the existing techniques. Based on our benchmark analysis, TBEST is a tool of choice for detection of significantly distinct branches in hierarchical trees grown from biological data. An R language implementation of the method is available from the Comprehensive R Archive Network: http://www.cran.r-project.org/web/packages/TBEST/index.html.


Background
Hierarchical clustering (HC) is widely used as a method of partitioning data and of identifying meaningful data subsets. Most commonly an application consists of visual examination of the dendrogram and intuitive identification of sub-trees that appear clearly distinct from the rest of the tree. Obviously, results of such qualitative analysis and conclusions from it may be observer-dependent. Quantifying the interpretation of hierarchical trees and introducing mathematically and statistically well-defined criteria for distinctness of sub-trees would therefore be highly beneficial and is the focus of this work.
The need for such quantification was recognized some time ago, and methods have been designed for (a) identifying distinct data subsets while (b) making use of hierarchical tree organization of the data. These methods fall into two categories, depending on whether or not they employ statistical analysis. The simplest approach that does not rely on statistical analysis is a static tree cut, wherein the tree is cut into branches at a given height. This procedure is guaranteed to produce a partition of the data, but provides no way to choose the height at which to cut. Moreover, some partitions cannot be produced by a static cut. Dynamic Tree Cut, or DTC in the following [1], is a more sophisticated recipe, capable of generating partitions not achievable by a static cut. However, DTC partitions depend on the minimal allowed number of leaves in a branch, a user-defined parameter that cannot be determined by the method itself.
In addition, there are methods for choosing a tree partition from considerations of branch distinctness and its statistical significance. Sigclust, or SC in the following [2], is a parametric approach wherein a two-way split of the data is deemed significant if the null hypothesis that the data are drawn from a single multivariate normal distribution is rejected. The method is designed to work in the asymptotic regime, where the dimensionality of the objects being clustered far exceeds the number of the objects. In application to trees SC works in a top-down fashion, by first examining the split at the root node and proceeding from a parent node to its daughter nodes only if the split at the parent node has been found significant. Unlike SC, the sum of the branch lengths method, or SLB in the following [3], is designed specifically for hierarchical trees and utilizes a measure of distinction between two nodes joined at a parent node that is linearly related to the heights of the two daughter nodes and that of the parent. Similarly to SC, SLB adopts a top-down scheme.
A method introduced here is termed Tree Branches Evaluated Statistically for Tightness (TBEST) and shares features with the existing approaches. Like SC and SLB, TBEST employs statistical analysis to identify significantly distinct branches of a hierarchical tree. Similarly to DTC and SLB, it uses tree node heights to assess the distinctness of a tree branch. As with the other three methods, partitions generated by TBEST are not necessarily accessible by a static cut.
At the same time, TBEST differs from the existing designs in several aspects, two of which are critical. First, unlike DTC, SC and SLB, it examines all the tree nodes simultaneously for distinctness. Secondly, unlike SLB, it combines node heights non-linearly to construct a statistic for distinctness that is better able to handle a tree in which distinct branches of approximately equal numbers of leaves occur at different heights. The key properties of all four methods are summarized in Table 1.
In the remainder of this work we formulate TBEST and systematically compare its performance to that of DTC, SC and SLB on a number of benchmark datasets originating from a variety of biological sources. In all cases we find that TBEST performs as well as or better than the three published methods. We conclude by discussing generalizations of TBEST and its relation to other aspects of cluster analysis.

Methods
Consider a set of objects with pair-wise relations given by a dissimilarity matrix. Given a linkage rule, a hierarchical tree can be grown for the set. We will only consider inversion-free linkage rules here. The tree is specified, in addition to its branching structure, by the heights of its nodes. The height of a node quantifies the dissimilarity within the data subset defined by the node. We wish to construct, for each node of the tree, a measure of how distinct the data subset corresponding to the node is from the rest of the data set. The special case of the objects being points in a Euclidean space, with the dissimilarities defined as distances between the points, may be used for guidance in this construction. The node height then quantifies the linear extent of the data subset defined by the node. Accordingly, it has been proposed [3] to make the measure of distinctness of a node n linear in the difference in heights between the parent P(n) of n and n itself. An example of a one-dimensional dataset, tabulated in Additional file 1 and shown in Figure 1, illustrates a difficulty with such a construction. Both the subset shown in blue and the subset shown in green are clearly distinct from the rest of the data, but the difference in heights between the blue node and its parent is not as great as that between the green node and its parent. Thus, based on the parent-to-child difference in heights, one would conclude, counter-intuitively, that the blue subset is not nearly as distinct as the green subset. A measure in better agreement with intuition is the relative difference of heights:

$$S(n) = \frac{h(P(n)) - h(n)}{h(P(n))} \qquad (1)$$

where h(n) is the height of node n. In the following we refer to S(n) as the tightness of node n. In the absence of inversions, the tightness of any node is a number between 0 and 1. In particular, S(n) = 1 identically if n is a leaf. The two subsets highlighted in Figure 1 are nearly equally tight by this measure, despite the disparity in their heights.
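As a minimal illustration of the tightness measure, the Python sketch below computes S(n) as the relative height difference (h(P(n)) - h(n)) / h(P(n)) on a small hand-built tree. The tree, its node names and its heights are hypothetical, chosen only to mimic the situation described above, in which a low-lying branch and a high-lying branch differ greatly in absolute height gap but are almost equally tight. This is not code from the TBEST package.

```python
def tightness(heights, parent):
    """Return S(n) = (h(P(n)) - h(n)) / h(P(n)) for every non-root node.

    heights: dict mapping node -> height (leaves have height 0)
    parent:  dict mapping node -> its parent (the root has no entry)
    Assumes an inversion-free tree, so that h(P(n)) >= h(n) and h(P(n)) > 0.
    """
    return {n: (heights[p] - heights[n]) / heights[p]
            for n, p in parent.items()}


# Hypothetical toy tree: leaves a..e, internal nodes g, u, v and the root.
heights = {"a": 0.0, "b": 0.0, "c": 0.0, "d": 0.0, "e": 0.0,
           "g": 1.0, "u": 10.0, "v": 8.0, "root": 100.0}
parent = {"a": "g", "b": "g", "g": "u", "c": "u",
          "d": "v", "e": "v", "u": "root", "v": "root"}

S = tightness(heights, parent)

# Absolute height gaps differ by an order of magnitude ...
gap_g = heights[parent["g"]] - heights["g"]   # 10 - 1  = 9
gap_v = heights[parent["v"]] - heights["v"]   # 100 - 8 = 92
# ... while the tightness values S["g"] and S["v"] are close to each other.
print(gap_g, gap_v, S["g"], S["v"])
```

In this toy tree a measure linear in the height gap would rank node g far below node v, whereas the relative measure treats the two as comparably distinct.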
To enable statistical analysis of tightness, a null distribution of S(n) is required, for making comparisons with the observed S(n). This null distribution is obtained by randomizing the dataset from which trees are grown. How such randomization is to be performed depends on the type of the data and on the broader context of the study and cannot be specified in general. For example, if the data matrix represents gene expression, with genes as rows and observations as columns, it may be appropriate to randomize the data by permuting values independently within each row. However, in other situations a more restrictive randomization should be adopted. For example, the elements of a binary data matrix may represent the mutation status at a set of genomic positions (rows) in a collection of genomes (columns). The investigator may wish to randomize the data while preserving both the site mutation frequencies (row sums) and the overall mutation burden within each genome (column sums).
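The simplest randomization mentioned above, permuting values independently within each row, can be sketched as follows. The function name `permute_within_rows` and the small matrix are illustrative assumptions; the more restrictive scheme that preserves both row and column sums of a binary matrix is not covered by this sketch.

```python
import random


def permute_within_rows(matrix, rng):
    """Return a copy of `matrix` with the values of each row shuffled independently.

    Each row keeps its own multiset of values, so per-row (e.g. per-gene)
    distributions are preserved while associations between columns are destroyed.
    """
    randomized = []
    for row in matrix:
        shuffled = list(row)      # copy, so the input is left untouched
        rng.shuffle(shuffled)
        randomized.append(shuffled)
    return randomized


rng = random.Random(0)            # fixed seed only for reproducibility
expression = [[1.0, 2.0, 3.0, 4.0],
              [10.0, 20.0, 30.0, 40.0]]
null_expression = permute_within_rows(expression, rng)
```

Repeating the call with the same `rng` yields successive independent randomizations, each of which would be clustered to obtain one tree for the null sample.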
Here we design a general procedure for constructing the null distribution of tightness for any given data randomization scheme. To guide this design, we generated distributions of tightness in trees grown from randomized data for multiple combinations of datasets, definitions of dissimilarity, linkage rules and randomization methods, as listed in Table 2. As Figure 2 and Additional file 2: Figure S1 illustrate, the shapes of these distributions generally depend on the number of leaves and, in most cases examined, the peak of the distribution occurs at higher tightness for smaller number of leaves. The identity S(n) = 1 for single-leaf nodes is consistent with this observation. We therefore conclude that, for a given observed value of tightness, the appropriate null distribution should be sampled by repeated randomization of the data, growing a tree for each randomization, selecting among its nodes the ones with the numbers of leaves matching the observation, and determining the tightness of these nodes. However, it is not guaranteed that, in any tree grown from randomized data, there will be a unique node with a number of leaves exactly equal to that of the observed node. To resolve this difficulty conservatively, we adopt the following procedure. If, for a given data randomization, the tree contains nodes with the number of leaves exactly as observed, the highest S(n) computed for these nodes is added to the sample. Otherwise we consider all the nodes with the number of leaves nearest the observed one from above and all those with the number of leaves nearest the observed one from below, and add to the sample the highest S(n) of any of these nodes.
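The node-selection rule just described can be sketched in Python as below. The sketch assumes that each tree grown from randomized data has already been summarized as a list of (number of leaves, tightness) pairs, one per non-root node; this representation and the function name `null_sample_value` are assumptions made for illustration, not part of the published implementation.

```python
def null_sample_value(nodes, observed_leaves):
    """Pick the tightness value that one randomized tree contributes to the null sample.

    nodes: list of (n_leaves, tightness) pairs for the non-root nodes of the tree
    observed_leaves: number of leaves of the observed node under test
    """
    if not nodes:
        raise ValueError("the randomized tree has no candidate nodes")

    # Case 1: nodes with exactly the observed number of leaves exist.
    exact = [s for k, s in nodes if k == observed_leaves]
    if exact:
        return max(exact)

    # Case 2: use the nearest leaf counts from above and from below.
    above = [k for k, _ in nodes if k > observed_leaves]
    below = [k for k, _ in nodes if k < observed_leaves]
    targets = set()
    if above:
        targets.add(min(above))
    if below:
        targets.add(max(below))
    return max(s for k, s in nodes if k in targets)


# Hypothetical summary of one randomized tree.
nodes = [(2, 0.30), (2, 0.55), (3, 0.20), (6, 0.40), (6, 0.10)]
print(null_sample_value(nodes, 2))   # exact match on leaf count
print(null_sample_value(nodes, 4))   # falls back to 3-leaf and 6-leaf nodes
```

Taking the maximum in both cases is what makes the rule conservative: the null sample is biased toward higher tightness, so observed values are less likely to be declared significant.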
With the sampling procedure specified, tests for statistical significance of tightness can be conducted for all the internal nodes of the observed tree, excluding the root, since the latter has no parent. The number of tests is therefore two less than the number of leaves. Due to this multiplicity of tests, higher levels of significance are required for rejection of the null hypotheses for trees with larger numbers of leaves. A straightforward way to handle this requirement would be to increase the size of the sample from the null distribution by performing more randomizations. However, for trees with large numbers of leaves this simple-minded approach may be rendered impractical by computational cost. Instead, higher levels of significance may be accessed by using the extreme-value theory (EVT) to approximate the tail of the null distribution, thereby permitting considerable economy of computational effort [4]. We have used the EVT-based method alongside the more costly purely empirical computation of significance in our benchmark studies reported in the following, and found the two approaches to be in good agreement, as shown in Additional file 2: Figure S2. The p-values displayed in the following were computed by applying a multiple-hypotheses correction of the form $p = 1 - (1 - p_e)^{N-2}$, where $p_e$ is the empirical p-value and N is the number of leaves.
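The multiple-hypotheses correction, read as p = 1 - (1 - p_e)^(N-2), has the same form as a Šidák-type correction over N - 2 tests. A minimal Python sketch follows; the function name and the example values are illustrative assumptions.

```python
def corrected_p(p_empirical, n_leaves):
    """Apply p = 1 - (1 - p_e)**(N - 2), where N is the number of leaves.

    N - 2 is the number of internal non-root nodes, i.e. the number of tests.
    """
    if n_leaves < 3:
        raise ValueError("a tree needs at least 3 leaves to have a testable node")
    n_tests = n_leaves - 2
    return 1.0 - (1.0 - p_empirical) ** n_tests


# Example: an empirical p-value of 0.001 on a tree with 60 leaves (58 tests).
p = corrected_p(0.001, 60)
print(p)
```

For small p_e the corrected value is close to (N - 2) * p_e, which shows why trees with many leaves require either many randomizations or a tail approximation to reach significance.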

Results
We evaluated the performance of TBEST in comparison to three published methods of identifying distinct subsets of observations, namely, DTC, SC and SLB. Of the five datasets used in the evaluation one is synthetic, generated to simulate a set of gene expression profiles. The remaining four datasets share two common features: they originate in biological experiments and in each case there is an independently known, biologically meaningful partition of observations into types. We call this known partition "truth", and the corresponding types the true types, henceforth. The essential properties of the benchmark datasets are summarized in Table 3.
To better judge the performance of TBEST in comparison to the other three algorithms, we considered, for each dataset, more than one combination of dissimilarity and linkage methods used for hierarchical clustering. With the exception of the third benchmark case, randomization of the input data, as required for both TBEST and SLB, consisted of randomly permuting the observed values, independently for each variable. The degree of agreement between a computed partition of the data and the truth is quantified in terms of corrected-for-chance Rand index, or cRI in the following [5]. It should be noted that the subsets of the data identified as distinct by TBEST and the other three techniques by necessity correspond each to a branch of a tree. This, however, is not necessarily the case for the true types, some of which do not correspond to a single branch. As a result, a perfect match between any computed partition and the truth may not be possible, and the maximal attainable value of cRI may be below 1. For this reason, to evaluate the performance of TBEST and the published methods across benchmark datasets, we also identify, for each tree considered, a partition into branches that best matches the truth and determine cRI between that partition and the computed partitions for each of the methods.
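For readers unfamiliar with the corrected-for-chance Rand index, the sketch below computes it from two label vectors, assuming the commonly used Hubert-Arabie form of the correction (the text above does not spell out the formula, so this choice is an assumption). The function name and the example labels are hypothetical.

```python
from collections import Counter
from math import comb


def corrected_rand_index(labels_a, labels_b):
    """Corrected-for-chance Rand index between two partitions of the same items."""
    n = len(labels_a)
    if n != len(labels_b) or n < 2:
        raise ValueError("need two label vectors of equal length >= 2")

    # Number of item pairs placed together in both, in A, and in B.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_a * together_b / comb(n, 2)   # chance-level agreement
    maximum = (together_a + together_b) / 2
    if maximum == expected:       # degenerate case, e.g. both partitions trivial
        return 1.0
    return (together_both - expected) / (maximum - expected)


truth = [0, 0, 1, 1]
computed = [0, 0, 1, 2]           # second true type split in two
cri = corrected_rand_index(truth, computed)
print(cri)
```

The index equals 1 for identical partitions regardless of how the labels are named, and is close to 0 when the agreement is no better than expected by chance.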

Simulated6
The data are a sample of size 60 in 600 dimensions [6]. The true partition of the data is into six subtypes, with the sizes of 8, 12, 10, 15, 5, and 10. Each of the 600 variables represents a simulation of a gene expression. For 300 of these genes the values are sampled from the same distribution for all subtypes. The remaining 300 genes fall into six non-overlapping subsets of equal size. Each subset corresponds to exactly one subtype, and for that subtype only the genes in the subset are sampled from a distribution that differs from the background.
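The layout of such a dataset can be sketched as follows. The subtype sizes and gene counts are taken from the description above, but the distributions are not specified there: the standard normal background and the additive shift of 2.0 for subtype-specific genes are assumptions made purely for illustration, as are all names in the code. This is not the generator used in [6].

```python
import random

SUBTYPE_SIZES = [8, 12, 10, 15, 5, 10]      # six subtypes, 60 samples in total
N_GENES = 600
N_BACKGROUND = 300                           # genes carrying no subtype signal
BLOCK = (N_GENES - N_BACKGROUND) // len(SUBTYPE_SIZES)   # 50 genes per subtype


def simulate(rng, shift=2.0):
    """Return (data, labels): 60 profiles of 600 values and their subtype labels.

    ASSUMPTION: background values are N(0, 1); the 50 genes assigned to a
    subtype are shifted by `shift` in samples of that subtype only.
    """
    data, labels = [], []
    for subtype, size in enumerate(SUBTYPE_SIZES):
        start = N_BACKGROUND + subtype * BLOCK
        for _ in range(size):
            profile = [rng.gauss(0.0, 1.0) for _ in range(N_GENES)]
            for g in range(start, start + BLOCK):
                profile[g] += shift
            data.append(profile)
            labels.append(subtype)
    return data, labels


data, labels = simulate(random.Random(1))
```

Each subtype is thus marked by its own block of 50 genes, while the first 300 genes are statistically identical across all samples.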
The comparison between the four algorithms is displayed graphically in Figure 3. For both combinations of dissimilarity and linkage only TBEST and DTC match the truth exactly, while the other two methods either fail to partition the set or do so incompletely. We note that the combination of Euclidean dissimilarity and complete linkage results in a particularly challenging tree, which cannot be partitioned correctly by a static cut.

Leukemia
The

Discussion and Conclusions
As our test results demonstrate, the performance of TBEST as a tool for data partitioning is equal or superior to that of similar published methods in a variety of biology-related settings. This is true in particular for datasets with underlying tree-like organization, such as sets of genomic profiles of individual cancer cells, of the same type as our second benchmark case above. In a work presently in progress we are applying TBEST systematically to a number of datasets of a similar nature. But TBEST also performs well on datasets with no underlying hierarchical structure, such as Simulated6 or Leukemia above. In total, TBEST was able to recover the true partition of the data on par with or better than the published methods in ten out of eleven test cases considered here. We further note that in all but one of the cases considered the optimal partition of the data by TBEST was also the most significant nontrivial partition. This was not the case for the other significance-based methods included in the comparison.
TBEST can both be applied and formulated more broadly. The applicability of TBEST is not limited to data partitioning, which has been our focus here. TBEST can be used for finding all significantly distinct branches of a hierarchical tree, regardless of whether these form a full partition. Further, alternatives to the test statistic of Equation 1 can easily be devised. For example, for any non-leaf node n we can where c_1(n), c_2(n) are the two children of n. While the discussion of these extensions

Competing interests
The authors declare that they have no competing interests.

Tables
Table 1 - Properties of TBEST and of the three published methods.

Additional files
Additional file 1 - Dataset displayed in Figure 1
An Excel file containing a set of 280 positive real values sampled from a mixture of three normal components: $N(0.5, 0.4^2)$, $N(11, 1^2)$ and $N(5, 2^2)$.

Additional file 2 - Figures S1 and S2.
A PDF file containing Figure S1, an 11-panel figure illustrating null distribution of tightness and Figure S2, a comparison of empirical p-value estimates for tightness to EVT-based estimates.