We first describe the data pre-processing procedure, our HMM model, and the model fitting for a cluster of aCGH samples. We then describe the machine learning algorithm that uses HMM models to cluster tumors. Finally, we introduce our fast implementation of the clustering algorithm and the approach for finding the optimal number of groups.
aCGH data segmentation
Raw data from Affymetrix SNP arrays were preprocessed using the open-source R package aroma.affymetrix with the default parameters [18]. The resulting SNP-level copy numbers were segmented using the circular binary segmentation (CBS) method within aroma.affymetrix [19]. The break points predicted by CBS in each sample were then pooled across all samples, the segments between pairs of adjacent break points were identified, and the mean copy number of each segment was calculated for clustering analysis. Because tumor specimens are often contaminated with adjacent normal tissue, which complicates the analysis and introduces false negatives, it is necessary to first conduct quality control to remove contaminated samples. We developed a machine learning algorithm to identify and filter out contaminated samples based on the segmented copy number data. To conduct the quality control, we first chose two groups of samples: one with the highest number of copy number alteration regions and the other consisting of normal samples. These two groups served as a training dataset for a Random Forest classifier [20] that was used to distinguish tumors from normal samples and to identify signatures of tumor samples. The out-of-bag error rate of this classifier is usually very low (<5%). The trained classifier was then applied to all aCGH samples, and a contamination probability score was calculated for each sample. Samples with a contamination probability above 50% were excluded from further clustering analysis. A sketch of this quality-control step is given below.
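As a hedged illustration of this step, the R sketch below trains a Random Forest on the two chosen sample groups and scores every sample for contamination. The objects seg_cn (a samples-by-segments matrix of segmented mean copy numbers), high_alt, and normals (index vectors for the two training groups) are illustrative placeholders, not names from our implementation.

```r
## Train a tumor-vs-normal classifier on the two reference groups,
## then score all samples for normal contamination.
library(randomForest)

train_x <- rbind(seg_cn[high_alt, ], seg_cn[normals, ])
train_y <- factor(c(rep("tumor",  length(high_alt)),
                    rep("normal", length(normals))))

rf <- randomForest(x = train_x, y = train_y)
print(rf)  # reports the out-of-bag error rate, expected to be low (<5%)

## Contamination probability for every sample; samples above 50%
## are excluded from the clustering analysis.
p_contam <- predict(rf, seg_cn, type = "prob")[, "normal"]
filtered <- seg_cn[p_contam <= 0.5, , drop = FALSE]
```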
HMM model fitting
We use a mixture of HMMs to represent the clusters of tumor samples, in which each cluster or subtype of tumors is modelled by a single HMM. Let t = 1, ..., T denote the DNA segments of the aCGH data. Suppose there are G clusters in total and ng samples in cluster g, where g = 1, ..., G. The HMM for cluster g is illustrated in Figure 1. The hidden state at segment t is denoted xg(t); it is the true copy number of segment t and is unknown. The observed mean copy number for segment t and sample j is denoted ygj(t), where j = 1, ..., ng. The transition probability from the state at segment t-1 to the state at segment t is denoted Pg(t), and the emission probability from state t to observation ygj(t) is denoted Pgj(t).
Based on the distribution of the segmented copy number data, we choose six states, x ∈ {0.1, 1.5, 2, 3, 4, 5}. The transition probabilities between states follow a multinomial distribution that was empirically estimated from the raw data. For example, for the 79 glioma samples discussed in the Results section, the transition probability between identical states is 0.56, and is 0.088 for each of the five other states, so the six probabilities sum to one. The log-normal distribution is widely used to model DNA copy number [21]. We let the emission probability follow a log-normal distribution with the hidden state value being the mean of the underlying normal distribution. The standard deviation of the normal distribution was empirically estimated from the raw data as the mean standard deviation of segment copy number across all samples. The Viterbi algorithm is used to fit this HMM to a cluster of tumor samples. It is an efficient dynamic programming algorithm that estimates the hidden states from segment 1 to T and calculates the delta value, Δg, the log-likelihood of the model fit for cluster g. We have implemented the Viterbi algorithm in both the R and C++ programming environments; a minimal R sketch follows.
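The sketch below is a minimal Viterbi fit for one cluster, not our production code. It assumes Y is a T-by-ng matrix of observed segment mean copy numbers (rows = segments, columns = the cluster's samples) and that sd_log is the empirically estimated standard deviation; both names, the default sd_log value, and the uniform initial state distribution are illustrative assumptions. We read "hidden state value as the mean" on the log scale, i.e. meanlog = log(state).

```r
## Viterbi fit for one cluster: returns the delta value (log-likelihood
## of the best hidden-state path). Transition probabilities use the
## empirical glioma example: 0.56 within a state, 0.088 otherwise.
viterbi_delta <- function(Y, states = c(0.1, 1.5, 2, 3, 4, 5),
                          p_same = 0.56, p_diff = 0.088, sd_log = 0.25) {
  T_seg <- nrow(Y)
  S     <- length(states)
  log_trans <- matrix(log(p_diff), S, S)
  diag(log_trans) <- log(p_same)
  ## Log emission of segment t in state s: product of the log-normal
  ## densities over the cluster's samples (sum on the log scale).
  log_emit <- sapply(seq_len(S), function(s)
    rowSums(dlnorm(Y, meanlog = log(states[s]), sdlog = sd_log, log = TRUE)))
  delta <- matrix(-Inf, T_seg, S)
  delta[1, ] <- log(1 / S) + log_emit[1, ]   # assumed uniform initial state
  for (t in seq_len(T_seg)[-1])
    ## delta[t, s] = max over prior states of (delta[t-1, ] + transition)
    delta[t, ] <- apply(delta[t - 1, ] + log_trans, 2, max) + log_emit[t, ]
  max(delta[T_seg, ])
}
```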
Sample clustering
The optimal clustering is determined by the maximum sum of delta values over all groups. It is computationally infeasible to perform HMM fitting and calculate the delta sum for every possible clustering, because the number of ways to assign n samples to G groups is G^n; for example, for n = 79 samples and G = 3 groups there are 3^79 ≈ 5 × 10^37 possible assignments. We have therefore developed a computationally efficient algorithm for finding the optimal clustering.
The algorithm searches for the optimal clustering as follows. First, for a given number of groups G, we run an NMF algorithm developed earlier to preliminarily cluster the samples; this NMF algorithm has been described previously [16]. Although NMF classifies aCGH data without considering the correlation between markers, it has been shown to discover genomic signatures in various types of tumors [11, 16]. The NMF clustering result therefore serves as the starting point for HMM clustering. Second, we fit an HMM to each cluster from the previous step and calculate the sum of the delta values Δ1, ..., ΔG; this is the log-likelihood of the current optimal clustering. Third, starting from the current optimal clustering, we randomly select two samples and re-assign their group labels. We then perform HMM model fitting and calculate the log-likelihood of the new clustering. If the new log-likelihood is greater than the optimal log-likelihood, we update the optimal clustering with the new one. Fourth and finally, we repeat the previous step, stopping when the optimal log-likelihood remains unchanged for m consecutive random re-assignments. We find that m = n^2 is sufficient for HMMC to converge to the optimal clustering. This clustering workflow is illustrated in Figure 2 and sketched in code below.
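The following R sketch shows the stochastic search under stated assumptions: labels0 is the NMF starting partition (a length-n vector of group labels in 1..G), and fit_delta(Y, labels) is a hypothetical helper that fits one HMM per group and returns the sum of delta values. A fuller implementation would also guard against emptying a group.

```r
## Stochastic search for the optimal clustering, seeded by NMF.
hmmc_search <- function(Y, labels0, G, m = length(labels0)^2) {
  best_lab  <- labels0
  best_ll   <- fit_delta(Y, best_lab)          # log-likelihood of the seed
  unchanged <- 0
  while (unchanged < m) {                      # stop after m non-improvements
    lab  <- best_lab
    pick <- sample(length(lab), 2)             # randomly select two samples
    lab[pick] <- sample(G, 2, replace = TRUE)  # re-assign their group labels
    ll <- fit_delta(Y, lab)
    if (ll > best_ll) {                        # accept only improvements
      best_lab <- lab; best_ll <- ll; unchanged <- 0
    } else unchanged <- unchanged + 1
  }
  list(labels = best_lab, loglik = best_ll)
}
```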
We have used both R and C++ to implement this clustering algorithm. Because step 3 can be performed simultaneously for multiple random clusterings, we implemented a parallel-computing version of HMMC to decrease the computation time. This was accomplished using OpenMP [22] within C++, allowing the random clustering and HMM fitting to be carried out across multiple threads on a multi-core machine. Each thread can update the optimal clustering, and all threads share the same optimum. We have tested this program on a Dell PowerEdge R910 workstation with 32 cores, and parallel computing increased efficiency by an order of magnitude.
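As a rough R analogue of this idea (the actual parallel implementation uses OpenMP threads in C++ with a shared optimum), a batch of random re-assignments can be scored concurrently with parallel::mclapply and only the best kept, re-using the illustrative fit_delta() helper from above.

```r
## Score a batch of random re-assignments in parallel; batch and cores
## are illustrative settings, not values from our implementation.
library(parallel)

parallel_step <- function(Y, best_lab, G, batch = 32, cores = 8) {
  cands <- replicate(batch, {
    lab  <- best_lab
    pick <- sample(length(lab), 2)
    lab[pick] <- sample(G, 2, replace = TRUE)
    lab
  }, simplify = FALSE)
  lls <- unlist(mclapply(cands, function(l) fit_delta(Y, l), mc.cores = cores))
  list(labels = cands[[which.max(lls)]], loglik = max(lls))
}
```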
Model selection
One of the most difficult questions in any clustering problem is determining the number of groups, G. Our HMMC algorithm relies on a given G to find the optimal clustering, so model selection is needed to determine the best number of groups.
We use the Bayesian Information Criterion (BIC), a widespread statistical tool, to conduct model selection. BIC is defined as follows:

BIC = -2 ln(L) + k ln(n),
where L is the likelihood, which measures how well the HMM model approximates the data, k is the number of parameters used in the model, and n is the number of samples. The second term, k ln(n), penalizes the number of parameters used in the model to avoid overfitting; thus the magnitude of the penalty increases as the number of groups G increases. A preferred model is indicated by a lower BIC value, which reflects a relatively higher likelihood together with a relatively lower risk of overfitting.
The sum of the delta values is the log-likelihood of the HMM model fit,

ln(L) = Δ1 + Δ2 + ... + ΔG,
where Δg is the delta value of the gth group. Because the parameters of the transition and emission probabilities are empirically estimated, the number of unknown parameters k in the BIC is equal to the number of hidden state variables. For an aCGH dataset with G groups and T segments, the number of parameters is therefore k = T × G. Thus, the BIC for the HMM is

BIC = -2(Δ1 + ... + ΔG) + T × G × ln(n).
We run HMMC for a range of values of G and select the optimal G as the one with the smallest BIC, as sketched below.
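This sketch applies the formula above over a range of G, using the illustrative hmmc_search() from earlier; nmf_labels() is a hypothetical stand-in for the NMF seeding step, and Y is assumed to be the T-by-n segment-by-sample matrix.

```r
## Model selection: BIC = -2 * sum(Delta_g) + T * G * log(n);
## the G with the smallest BIC is preferred.
select_G <- function(Y, G_range = 2:6) {
  n     <- ncol(Y)
  T_seg <- nrow(Y)
  bic <- sapply(G_range, function(G) {
    fit <- hmmc_search(Y, nmf_labels(Y, G), G)
    -2 * fit$loglik + T_seg * G * log(n)
  })
  G_range[which.min(bic)]
}
```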
Cross-validation for testing stability of HMMC
To assess the stability of the HMMC results, we conduct a cross-validation as described earlier [16]. After completing HMMC for the entire dataset, we perform cross-validation as follows. We randomly leave out 10% of the samples, apply HMMC to the remaining 90%, and compare the clustering results with the original one. Samples assigned to a subgroup different from the original result are counted as errors. We repeat this procedure 100 times and calculate the error rate, which represents the stability of the clustering algorithm with respect to the permutation of samples. A sketch of this procedure follows.
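The sketch below assumes full_labels holds the clustering of the entire dataset and re-uses the illustrative hmmc_search() and nmf_labels() helpers. Because group labels are arbitrary across runs, match_labels() is a hypothetical helper that aligns the new labels to the originals before errors are counted.

```r
## Leave-10%-out stability check repeated 100 times.
cv_error <- function(Y, full_labels, G, reps = 100) {
  n      <- ncol(Y)
  n_keep <- round(0.9 * n)
  errs <- replicate(reps, {
    keep <- sort(sample(n, n_keep))   # randomly leave 10% of samples out
    fit  <- hmmc_search(Y[, keep], nmf_labels(Y[, keep], G), G)
    ## count samples assigned to a subgroup different from the original
    sum(match_labels(fit$labels, full_labels[keep]) != full_labels[keep])
  })
  sum(errs) / (reps * n_keep)         # overall error rate
}
```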