Significant distinct branches of hierarchical trees: a framework for statistical analysis and applications to biological data
© Sun and Krasnitz; licensee BioMed Central Ltd. 2014
Received: 14 October 2014
Accepted: 31 October 2014
Published: 19 November 2014
One of the most common goals of hierarchical clustering is finding those branches of a tree that form quantifiably distinct data subtypes. Achieving this goal in a statistically meaningful way requires (a) a measure of distinctness of a branch and (b) a test to determine the significance of the observed measure, applicable to all branches and across multiple scales of dissimilarity.
We formulate a method termed Tree Branches Evaluated Statistically for Tightness (TBEST) for identifying significantly distinct tree branches in hierarchical clusters. For each branch of the tree a measure of distinctness, or tightness, is defined as a rational function of heights, both of the branch and of its parent. A statistical procedure is then developed to determine the significance of the observed values of tightness. We test TBEST as a tool for tree-based data partitioning by applying it to five benchmark datasets, one of them synthetic and the other four each from a different area of biology. For each dataset there is a well-defined partition of the data into classes. In all test cases TBEST performs on par with or better than the existing techniques.
Based on our benchmark analysis, TBEST is a tool of choice for detection of significantly distinct branches in hierarchical trees grown from biological data. An R language implementation of the method is available from the Comprehensive R Archive Network: http://www.cran.r-project.org/web/packages/TBEST/index.html.
Hierarchical clustering (HC) is widely used as a method of partitioning data and of identifying meaningful data subsets across multiple areas of biology including, prominently, high-throughput genomics. Most commonly an application consists of visual examination of the dendrogram and intuitive identification of sub-trees that appear clearly distinct from the rest of the tree. Obviously, results of such qualitative analysis and conclusions from it may be observer-dependent. Quantifying the interpretation of hierarchical trees and introducing mathematically and statistically well-defined criteria for distinctness of sub-trees would therefore be highly beneficial and is the focus of this work.
The need for such quantification was recognized some time ago, and methods have been designed for (a) identifying distinct data subsets while (b) making use of hierarchical tree organization of the data. These methods fall into two categories, depending on whether or not they employ statistical analysis. The simplest approach that does not rely on statistical analysis is a static tree cut, wherein the tree is cut into branches at a given height. This procedure is guaranteed to produce a partition of the data, but provides no way to choose the height at which to cut. Moreover, some partitions cannot be produced by a static cut. Dynamic Tree Cut, or DTC in the following, is a more sophisticated recipe, capable of generating partitions not achievable by a static cut. However, DTC partitions depend on the minimal allowed number of leaves in a branch, a user-defined parameter that cannot be determined by the method itself.
In addition, there are methods for choosing a tree partition from considerations of branch distinctness and its statistical significance. Sigclust, or SC in the following, is a parametric approach wherein a two-way split of the data is deemed significant if the null hypothesis that the data are drawn from a single multivariate normal distribution is rejected. The method is designed to work in the asymptotic regime, where the dimensionality of the objects being clustered far exceeds the number of the objects. In application to trees SC works in a top-down fashion, by first examining the split at the root node and proceeding from a parent node to its daughter nodes only if the split at the parent node has been found significant. Unlike SC, the sum of the branch lengths method, or SLB in the following, is designed specifically for hierarchical trees and utilizes a measure of distinction between two nodes joined at a parent node that is linearly related to the heights of the two daughter nodes and that of the parent. Similarly to SC, SLB adopts a top-down scheme.
The method introduced here is termed Tree Branches Evaluated Statistically for Tightness (TBEST) and shares features with the existing approaches. Like SC and SLB, TBEST employs statistical analysis to identify significantly distinct branches of a hierarchical tree. Similarly to DTC and SLB, it uses tree node heights to assess the distinctness of a tree branch. As with the other three methods, partitions generated by TBEST are not necessarily accessible by a static cut.
Table: Properties of TBEST and of the three published methods. Of the table body, only the row "Order of examining the tree" is partly recoverable, with the entries "all internal nodes in parallel" and "top down and bottom up".
In the remainder of this work we formulate TBEST and systematically compare its performance to that of DTC, SC and SLB on a number of benchmark datasets originating from a variety of biological sources. In all cases we find that TBEST performs as well as or better than the three published methods. We conclude by discussing generalizations of TBEST and its relation to other aspects of cluster analysis.
where h(n) is the height of node n. In the following we refer to S(n) as the tightness of node n. In the absence of inversions, the tightness of any node is a number between 0 and 1. In particular, S(n) = 1 identically if n is a leaf. The two subsets highlighted in Figure 1 are nearly equally tight by this measure, despite the disparity in their heights.
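As an illustration of how tightness might be computed from node heights, the following Python sketch assumes the form S(n) = 1 - h(n)/h(parent of n). This form is consistent with the properties stated in the text (a rational function of the heights of the branch and of its parent, equal to 1 for leaves, and between 0 and 1 in the absence of inversions), but it is an assumption rather than a transcription of the definition. The tree encoding (the `parent` and `height` dictionaries) and all node names are hypothetical.

```python
def tightness(node, parent, height):
    """Tightness under the ASSUMED form S(n) = 1 - h(n) / h(parent(n))."""
    if node not in parent:
        raise ValueError("the root has no parent, so its tightness is not defined here")
    return 1.0 - height[node] / height[parent[node]]


# Hypothetical toy tree. Leaves have height 0; internal nodes have positive heights.
parent = {
    "a": "small", "b": "small",   # leaves of a low-lying branch
    "small": "mid", "e": "mid",
    "c": "big", "d": "big",       # leaves of a high-lying branch
    "mid": "root", "big": "root",
}
height = {
    "a": 0.0, "b": 0.0, "c": 0.0, "d": 0.0, "e": 0.0,
    "small": 0.1, "mid": 1.0, "big": 5.0, "root": 50.0,
}

s_small = tightness("small", parent, height)  # 1 - 0.1 / 1.0
s_big = tightness("big", parent, height)      # 1 - 5.0 / 50.0
s_leaf = tightness("a", parent, height)       # leaves have h = 0, hence S = 1
```

In this toy tree the branches `small` and `big` differ fifty-fold in height, yet both have tightness of approximately 0.9, which mirrors the observation that branches at very different heights can be nearly equally tight.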
To enable statistical analysis of tightness, a null distribution of S(n) is required, for making comparisons with the observed S(n). This null distribution is obtained by randomizing the dataset from which trees are grown. How such randomization is to be performed depends on the type of the data and on the broader context of the study and cannot be specified in general. For example, if the data matrix represents gene expression, with genes as rows and observations as columns, it may be appropriate to randomize the data by permuting values independently within each row. However, in other situations a more restrictive randomization should be adopted. For example, the elements of a binary data matrix may represent the mutation status at a set of genomic positions (rows) in a collection of genomes (columns). The investigator may wish to randomize the data while preserving both the site mutation frequencies (row sums) and the overall mutation burden within each genome (column sums).
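The first randomization scheme mentioned above, permuting values independently within each row, can be sketched in Python as follows. The function name `permute_within_rows` and the list-of-lists matrix layout are illustrative assumptions. The sketch does not cover the more restrictive scheme that preserves both row and column sums.

```python
import random


def permute_within_rows(matrix, rng=None):
    """Return a copy of `matrix` with the values of each row shuffled independently.

    Row contents (and hence row sums) are preserved; column sums generally are not.
    """
    rng = rng or random.Random()
    randomized = []
    for row in matrix:
        shuffled = list(row)   # copy, so the input matrix is left untouched
        rng.shuffle(shuffled)  # independent permutation for this row
        randomized.append(shuffled)
    return randomized


# Hypothetical expression matrix: genes as rows, observations as columns.
expression = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
]
randomized = permute_within_rows(expression, random.Random(0))
```

Repeating this step many times, and growing a tree from each randomized matrix, would yield a sample from which a null distribution of tightness can be estimated.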
Table: Combinations of data sets, dissimilarity, linkage and randomization methods, used for testing TBEST. The data set names and linkage methods are not recoverable. The surviving entries, in their original order, give the data permutation method followed by the dissimilarity or dissimilarities listed with it:
- Independently for each coordinate (column): (1 - Pearson correlation)
- Independently for each gene (column): (1 - Pearson correlation)
- Independently for each chromosome; identically for all cores (columns) in a chromosome: (1 - Pearson correlation), listed twice
- Independently for each protein (column): (1 - Pearson correlation), (1 - Spearman correlation)
- Independently for each surface marker (column): (1 - Kendall correlation)
where $p_e$ is the empirical p-value and N is the number of leaves.
We use TBEST in the following to identify the most detailed significant partitions of the data into branches of a given hierarchical tree. We define a partition to be significant with a threshold α if (a) every part is a branch and (b) for every part at least one of the children of its parent node is tight with a p-value p < α. Among the significant partitions with a threshold α we find the most detailed, i.e., the one with the highest number of parts.
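The search for the most detailed significant partition can be sketched as a recursion over the tree. The Python code below follows one literal reading of conditions (a) and (b) and is not taken from the TBEST package. The function name, the dictionary-based tree encoding, the treatment of nodes without a p-value as never significant, and the fallback to the whole tree as a single part are all assumptions made for illustration.

```python
def most_detailed_partition(children, pvalue, root, alpha):
    """Return the roots of the branches forming the partition with the most parts.

    children: dict mapping each internal node to its pair of children
    pvalue:   dict mapping nodes to tightness p-values; nodes absent from it
              (leaves, in this sketch) are treated as never significant
    """
    def is_tight(node):
        return pvalue.get(node, 1.0) < alpha

    def refine(node):
        # Best partition of the leaves under `node` into proper sub-branches,
        # or None if no such partition satisfies condition (b).
        if node not in children:
            return None  # a leaf cannot be subdivided
        kids = children[node]
        # A child of `node` may be a part only if at least one child is tight.
        split_allowed = any(is_tight(c) for c in kids)
        parts = []
        for child in kids:
            finer = refine(child)
            if finer is not None:
                parts.extend(finer)  # has at least two parts, so it is preferred
            elif split_allowed:
                parts.append(child)  # the child itself becomes a part
            else:
                return None
        return parts

    best = refine(root)
    # Assumed fallback: with no admissible split the whole tree is one part.
    return best if best is not None else [root]


# Hypothetical four-leaf tree with p-values for its two internal branches.
children = {"root": ("A", "B"), "A": ("a1", "a2"), "B": ("b1", "b2")}
pvalue = {"A": 0.01, "B": 0.5}
parts_loose = most_detailed_partition(children, pvalue, "root", alpha=0.05)
parts_strict = most_detailed_partition(children, pvalue, "root", alpha=0.001)
```

With α = 0.05 the split at the root is admissible because branch `A` is tight, so the result consists of the two branches `A` and `B`. With α = 0.001 no split is admissible and the whole tree is returned as a single part. Because the two children of a node can be treated independently once the condition at that node is known, choosing the finer option for each child yields the maximal number of parts.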
Availability of supporting data
The dataset tabulated in Additional file 2 was generated by a computer simulation in the course of this work. The remaining five datasets used in this work, as detailed in the Results section, are public, and each was made available with the corresponding publication [5–9]. Our study involved no human participants and required no participant consent.
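Table: Properties of the five benchmark datasets. The columns are: number of leaves, number of variables, and true number of classes. The dataset names and numerical values are not recoverable. The data types of the five datasets, in their original order, are:
- Simulation of gene expression
- mRNA levels from microarray analysis
- DNA copy number analysis, sequencing
- Proteomic analysis, using mass spectrometry
- Flow cytometry analysis of surface markers from fluorescence intensity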
To better judge the performance of TBEST in comparison to the other three algorithms, we considered, for each dataset, more than one combination of dissimilarity and linkage methods used for hierarchical clustering. With the exception of the third benchmark case, randomization of the input data, as required for both TBEST and SLB, consisted of randomly permuting the observed values, independently for each variable.

The degree of agreement between a computed partition of the data and the truth is quantified in terms of the corrected-for-chance Rand index, or cRI in the following. Briefly, the original Rand index is a measure of agreement between two partitions, 1 and 2, of n objects, defined as the number of concordant pairs divided by the number of all pairs: $RI = N_c / \binom{n}{2}$, where $N_c = \sum_{i<j} \left[ \delta^{(1)}_{ij} \delta^{(2)}_{ij} + (1 - \delta^{(1)}_{ij})(1 - \delta^{(2)}_{ij}) \right]$ and $\binom{n}{2}$ is the binomial coefficient. Here $\delta^{(k)}_{ij} = 1$ if objects i and j belong to the same part in partition k and $\delta^{(k)}_{ij} = 0$ otherwise. cRI corrects this definition by accounting for random concordance: $cRI = \left( N_c - E(N_c) \right) / \left( \binom{n}{2} - E(N_c) \right)$, where the expectation $E(N_c)$ is computed under the assumption that, for each partition, the assignment of objects to parts is random, while the sizes of all parts are given and fixed.

It should be noted that the subsets of the data identified as distinct by TBEST and the other three techniques by necessity correspond each to a branch of a tree. This, however, is not necessarily the case for the true types, some of which do not correspond to a single branch. As a result, a perfect match between any computed partition and the truth may not be possible, and the maximal attainable value of cRI may be below 1. For this reason, to evaluate the performance of TBEST and the published methods across benchmark datasets, we also identify, for each tree considered, a partition into branches that best matches the truth and determine cRI between that partition and the computed partitions for each of the methods.
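The Rand index and its corrected-for-chance version described above can be computed from part sizes alone, without enumerating pairs. The following Python sketch uses the standard contingency-table form of the index of Hubert and Arabie; the function name `rand_indices` and the label-list input format are illustrative assumptions, as is the convention of returning cRI = 1 in the degenerate case where the denominator vanishes.

```python
from collections import Counter
from math import comb


def rand_indices(labels1, labels2):
    """Return (RI, cRI) for two partitions given as equal-length label lists."""
    n = len(labels1)
    if n != len(labels2) or n < 2:
        raise ValueError("need two label lists of equal length, with n >= 2")
    total = comb(n, 2)  # number of all pairs
    # Pairs placed together in both partitions, in partition 1, and in partition 2.
    both = sum(comb(c, 2) for c in Counter(zip(labels1, labels2)).values())
    same1 = sum(comb(c, 2) for c in Counter(labels1).values())
    same2 = sum(comb(c, 2) for c in Counter(labels2).values())
    # Concordant pairs: together in both partitions or separated in both.
    n_c = total + 2 * both - same1 - same2
    # Expected concordance under random assignment with fixed part sizes.
    expected = total + 2 * same1 * same2 / total - same1 - same2
    ri = n_c / total
    denom = total - expected
    cri = 1.0 if denom == 0 else (n_c - expected) / denom  # assumed convention
    return ri, cri


# Two hypothetical partitions of four objects.
ri_same, cri_same = rand_indices([0, 0, 1, 1], ["x", "x", "y", "y"])
ri_diff, cri_diff = rand_indices([0, 0, 1, 1], [0, 1, 0, 1])
```

For identical partitions both indices equal 1. For the second pair of partitions only two of the six pairs are concordant, so the uncorrected index is 1/3, while the corrected index is negative because the observed concordance falls below the value expected by chance.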
In each of the cases in the following we studied how the most detailed significant partition found by TBEST, and its correspondence to the truth, vary with the significance threshold α. In an analogous fashion, we analysed the most detailed partitions generated by SLB and SC. For DTC, which is not a statistically supported method, we examined the properties of the most detailed partition as a function of the minimal allowed number of leaves in each part.
DTC achieves top performance for the (1 - Pearson correlation) dissimilarity – Ward linkage combination if its minimal allowed number of leaves does not exceed that of the smallest compartment-associated branch of the tree. However, this property is lost for the (1 - Pearson correlation) dissimilarity – average linkage combination, where an additional cluster with two leaves is identified by DTC if the minimal number of leaves is set at or below 2.
Discussion and conclusions
As our test results demonstrate, the performance of TBEST as a tool for data partitioning is equal or superior to that of similar published methods in a variety of biology-related settings. This is true in particular for datasets with underlying tree-like organization, such as sets of genomic profiles of individual cancer cells, of the same type as our third benchmark case above. In a work presently in progress we are applying TBEST systematically to a number of datasets of a similar nature. But TBEST also performs well on datasets with no underlying hierarchical structure, such as Simulated6 or Leukemia above. In total, TBEST was able to recover the true partition of the data on par with or better than the published methods in ten out of eleven combinations of data, dissimilarity and linkage considered here. We further note that for all but one such combination the optimal partition of the data by TBEST also was the most significant partition into more than one part. This was not the case for the other significance-based methods included in the comparison.
To guide the future use of TBEST, we must note the limitations of the method. TBEST is designed specifically for the analysis of hierarchical trees. In this sense, it is not as universal as SC, which is applicable to any partition of the data, including, in the case of hierarchical clustering, to trees with inversions. Further, TBEST never considers single leaves to be significantly distinct, whereas both SLB and DTC (but not SC) are free of this limitation.
where c1(n), c2(n) are the two children of n. While the discussion of these extensions is beyond the scope of this work, an implementation of TBEST as an R language package provides a number of options, both for the definition of tightness and for annotation of significantly distinct branches.
Finally, we note that tightness of tree branches is complementary to another important notion in clustering, namely, cluster stability under re-sampling of the input data. The latter property can be analysed in a number of ways, such as bootstrap analysis of trees [13–15] or methods not directly related to trees [5, 16]. Existing work provides examples where both distinctness and stability under resampling are prerequisites of a meaningful partition. Incorporation of TBEST into such combined analysis will be addressed in the future.
We are grateful to M. Wigler for contributing to the early stages of this work and numerous subsequent discussions; to S. Yoon for reading and commenting on the manuscript; to M. Akerman, B. Meunier and J.F. Hoquette for generously sharing their data with us; to K.A. Schlauch for generously providing software.
Funding: This work was supported by the National Institutes of Health grant NIH/1UO1CA168409-01 and by grant 125217 from the Simons Foundation.
1. Langfelder P, Zhang B, Horvath S: Defining clusters from a hierarchical cluster tree: the dynamic tree cut package for R. Bioinformatics. 2008, 24 (5): 719-720. 10.1093/bioinformatics/btm563.
2. Liu Y, Hayes DN, Nobel A, Marron JS: Statistical significance of clustering for high-dimension, low-sample size data. J Am Stat Assoc. 2008, 103 (483): 1281-1293. 10.1198/016214508000000454.
3. Munneke B, Schlauch KA, Simonsen KL, Beavis WD, Doerge RW: Adding confidence to gene expression clustering. Genetics. 2005, 170 (4): 2003-2011. 10.1534/genetics.104.031500.
4. Knijnenburg TA, Wessels LFA, Reinders MJT, Shmulevich I: Fewer permutations, more accurate P-values. Bioinformatics. 2009, 25 (12): I161-I168. 10.1093/bioinformatics/btp211.
5. Monti S, Tamayo P, Mesirov J, Golub T: Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data. Mach Learn. 2003, 52 (1–2): 91-118.
6. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999, 286 (5439): 531-537. 10.1126/science.286.5439.531.
7. Krasnitz A, Sun G, Andrews P, Wigler M: Target inference from collections of genomic intervals. Proc Natl Acad Sci U S A. 2013, 110 (25): E2271-E2278. 10.1073/pnas.1306909110.
8. Kislinger T, Cox B, Kannan A, Chung C, Hu P, Ignatchenko A, Scott MS, Gramolini AO, Morris Q, Hallett MT, Rossant J, Hughes TR, Frey B, Emili A: Global survey of organ and organelle protein expression in mouse: combined proteomic and transcriptomic profiling. Cell. 2006, 125 (1): 173-186. 10.1016/j.cell.2006.01.044.
9. Diaz-Romero J, Romeo S, Bovee JV, Hogendoorn PC, Heini PF, Mainil-Varlet P: Hierarchical clustering of flow cytometry data for the study of conventional central chondrosarcoma. J Cell Physiol. 2010, 225 (2): 601-611. 10.1002/jcp.22245.
10. Hubert L, Arabie P: Comparing partitions. J Classif. 1985, 2 (2–3): 193-218.
11. Navin N, Kendall J, Troge J, Andrews P, Rodgers L, McIndoo J, Cook K, Stepansky A, Levy D, Esposito D, Muthuswamy L, Krasnitz A, McCombie WR, Hicks J, Wigler M: Tumour evolution inferred by single-cell sequencing. Nature. 2011, 472 (7341): 90-94. 10.1038/nature09807.
12. Sun G, Krasnitz A: TBEST: Tree branches evaluated statistically for tightness. The Comprehensive R Archive Network. 2013, http://cran.r-project.org/web/packages/TBEST/index.html
13. Efron B, Halloran E, Holmes S: Bootstrap confidence levels for phylogenetic trees. Proc Natl Acad Sci U S A. 1996, 93 (14): 7085-7090. 10.1073/pnas.93.14.7085.
14. Felsenstein J: Confidence limits on phylogenies: an approach using the bootstrap. Soc Stud Evol. 1985, 39: 783-791.
15. Shimodaira H: An approximately unbiased test of phylogenetic tree selection. Syst Biol. 2002, 51 (3): 492-508. 10.1080/10635150290069913.
16. Dudoit S, Fridlyand J: A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biol. 2002, 3 (7): RESEARCH0036.
17. Cancer Genome Atlas Research Network: Integrated genomic analyses of ovarian carcinoma. Nature. 2011, 474 (7353): 609-615. 10.1038/nature10166.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.