Automatic B cell lymphoma detection using flow cytometry data
© Shih et al; licensee BioMed Central Ltd. 2013
Published: 5 November 2013
Flow cytometry has been widely used for the diagnosis of various hematopoietic diseases. Although there have been advances in the number of biomarkers that can be analyzed simultaneously and technologies that enable fast performance, the diagnostic data are still interpreted by a manual gating strategy. The process is labor-intensive, time-consuming, and subject to human error.
We used 80 sets of flow cytometry data from 44 healthy donors, 21 patients with chronic lymphocytic leukemia (CLL), and 15 patients with follicular lymphoma (FL). Approximately 15% of data from each group were used to build the profiles. Our approach was able to successfully identify 36/37 healthy donor cases, 18/18 CLL cases, and 12/13 FL cases.
This proof-of-concept study demonstrated that an automated diagnosis of CLL and FL can be obtained by examining the cell capture rates of a test case using the computational method based on the multi-profile detection algorithm. The testing phase of our system is efficient and can facilitate diagnosis of B-lymphocyte neoplasms.
Flow cytometry (FC) involves conjugating fluorochromes to antibodies, allowing them to bind different cell biomarkers, and passing the stained cells through the path of a laser where the fluorochromes are excited and fluorescence emission is measured. Forward and side scatter of cells give information about the size and complexity of the cells. FC is a valuable tool in the diagnosis of lymphocytic neoplasms. Most current software supplied by cytometer manufacturers provides a 2-parameter visual representation of the multi-dimensional data. Pathologists must manually select the areas that include the cells of interest and view these cells using two other attributes, a process known as gating. These areas of interest are not fixed due to instrument, operator, and sample differences. The pathologists use the clustering of the cells, the distribution and cell size of a cluster, and the relative location of the clusters to make the selection. The process is tedious, time-consuming, and subject to bias. Thus, there is an urgent need to develop a fast and unbiased diagnostic approach.
Our ultimate goal is to establish an automated process for clustering cells of interest to replace manual gating [2, 3]. Cell populations can be identified in an automated fashion (automated gating) by employing clustering algorithms. The most challenging aspect of the automated process is finding the best clustering algorithm for high-dimensional data sets [4–7]. Many existing dimension-reduction approaches may cause useful information to be lost [8–13]. There have been several attempts to use machine learning techniques to automate the gating process [14–20]. The most commonly used approach is the k-means algorithm, which assigns a cell to its nearest cluster. There are several variants of the k-means algorithm, such as fuzzy k-means, k-medoids, Gath-Geva, and the Gustafson-Kessel algorithm. Other common approaches are hierarchical clustering [23–26] and density-based clustering.
Recently, model-based clustering has been gaining popularity [28–31], including use of the expectation-maximization (EM) algorithm. However, most approaches focus only on the first stage of FC data analysis, which identifies cell populations; some approaches are only semi-automatic, and some target only certain types of lymphocytic neoplasms [34, 35]. This paper proposes a novel 3-dimensional (3-D) 5-parameter model that detects multiple types of B-lymphocyte neoplasms.
In this proof-of-concept study, we will apply this methodology to differentiate between selected subtypes of B-lymphocyte neoplasms and identify biomarkers that contribute to the classification of certain subtypes, such as chronic lymphocytic leukemia (CLL) and follicular lymphoma (FL). Our goal is to develop software solutions to allow pathologists to quickly interpret the FC data without bias.
A multi-profile approach for lymphoma detection
The multi-profile lymphoma detection described in this article can detect whether the FC data of an individual matches the profile of a particular type of lymphoma or that of a healthy donor. The objectives of the computational detection system were: (1) minimal human intervention in the detection process, (2) the ability to detect various types of B-lymphocyte neoplasms, (3) low computational complexity, and (4) a reasonable detection rate with a low false-negative rate.
A 3-D 5-parameter flow cytometry data model
The details of the algorithm are listed below. Input:
n: number of observations
d: number of attributes in the observation (3 in the 3-D 5-P model)
k: number of clusters (k = 3 for normal profile and k = 2 for patient profile)
X[i, j]: observation data of size n × d, i = 1, ..., n and j = 1, ..., 3.
m: multiplier for the standard deviation used to determine the size of the ellipsoids (m = 2 in our analysis)
Output: k ellipsoids containing data points within m × std of the centers of the clusters represented by:
W[c]: percentage of the data points in cluster c, c = 1, ..., k.
M[c, i]: the i-th attribute of the mean of the c-th cluster (k × d)
V[c, i, j]: the covariance matrix of the c-th cluster (k × d × d)
Step 1: [Initialization] Given X, use the K-means algorithm to find k clusters of X. The outputs of K-means are M(i), V(i), and W(i): the means, covariances, and weights of the k clusters.
Step 2: [Clustering] Use the EM (expectation-maximization) algorithm to compute a better clustering of X with initial values M = M(i), V = V(i), and W = W(i).
Step 3: [Ellipsoid Construction] Construct k ellipsoids with Means M and Co-variance V and weight W. The ellipsoid should include all data points within m × std of the center of its cluster.
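Step 3 can be illustrated with a small membership test. The sketch below assumes that "within m × std of the center" means a Mahalanobis distance of at most m from the cluster mean. This is one common reading and not necessarily the authors' exact construction. The function names `solve_linear` and `inside_ellipsoid` are hypothetical, and the mean and covariance values in the usage lines are made up for illustration.

```python
def solve_linear(a, b):
    """Solve a * y = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    y = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * y[c] for c in range(r + 1, n))
        y[r] = (aug[r][n] - s) / aug[r][r]
    return y


def inside_ellipsoid(x, mean, cov, m=2.0):
    """Return True if x lies within m "standard deviations" of mean.

    Assumption: the ellipsoid is the set of points whose squared
    Mahalanobis distance (x - mean)^T cov^-1 (x - mean) is at most m^2.
    """
    diff = [xi - mi for xi, mi in zip(x, mean)]
    y = solve_linear(cov, diff)  # y = cov^-1 * diff
    d2 = sum(di * yi for di, yi in zip(diff, y))
    return d2 <= m * m


# Illustrative values only: one cluster with d = 3 attributes.
mean = [0.0, 0.0, 0.0]
cov = [[1.0, 0.0, 0.0],
       [0.0, 4.0, 0.0],
       [0.0, 0.0, 9.0]]
near = inside_ellipsoid([1.0, 0.0, 0.0], mean, cov)  # True: 1 std along axis 1
far = inside_ellipsoid([3.0, 0.0, 0.0], mean, cov)   # False: 3 std along axis 1
```

Solving the linear system instead of inverting the covariance matrix keeps the sketch short and works for any small d, not only d = 3.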
The ratio of the number of cells captured inside the ellipsoid(s) to the number of cells in the test clusters is defined as the cell capture rate (CCR) of the profile on the test case. In other words, the ratio requires two numbers: the number of captured B cells and a total cell count. The number of cells captured by a profile ellipsoid is used as the numerator of the CCR. For the denominator, there are three possibilities: all blood cells, all lymphocytes, or all B cells. In the next two paragraphs, we describe how the CCR is computed.
To find the B cells captured by an ellipsoid in the profile, it was necessary to partition the cells into clusters. However, most clustering algorithms are ineffective in dealing with clusters that are very close to or intersecting each other. Thus, our first step was to use a hierarchical divisive ("top-down") clustering approach, separating the T cells from the rest of the test cells by using the value of CD19. The parameter k is defined as the number of clusters (k = 3 for the normal profile and k = 2 for a patient profile) and X[c,j] represents the observation data of the c-th cell. In the first step, T cells were identified and assigned the label k. The next step was to find the center of the T cells, which is easily achieved by calculating the mean of the cells with label k.
Technical variation, such as different operators or machines, may cause the data to shift. Thus, the third step, alignment, corrects for this variation by moving the profile to "fit" the test data. We tried several methods for alignment. In one approach for fitting the normal profile, we divided the B cells into two clusters, representing lambda light chain dominant and kappa light chain dominant cells, and obtained the centers of the two clusters. We then aligned each ellipsoid individually to the corresponding center. This approach fails to detect changes in the distance between clusters. In addition, the clustering algorithm used to separate two clusters that lie very close together was not very effective, which may result in misclassifications. In our current work, we adopted a hierarchical approach: we first found the center of the T cells in the test case, and then calculated the difference between the centers of the T cells in the profile and in the test case. Finally, we shifted all ellipsoids by this difference. In our system, we used only the one or two ellipsoids that represent B cells and left out the ellipsoid that represents T cells, since we are detecting B-lymphocyte neoplasms.
After aligning the ellipsoids to the centers of the corresponding clusters, we obtained the number of captured B cells, which is the numerator of the CCR.
For the denominator of the CCR, we tried all three possibilities mentioned above. If the total number of blood cells is used as the denominator, the CCR is compressed into a small range, so it is difficult to distinguish healthy donors from patients. In our preliminary paper, the B cell CCR was calculated as the number of B cells inside the ellipsoid divided by the total number of lymphocytes.
That approach gave a higher CCR to compare, since the denominator is smaller. Even though the CCR defined in the preliminary paper was able to distinguish patients from healthy donors, the CCR of healthy donors against the normal profile was somewhat small (about 13% on average). In this paper, we decided to use the third approach, taking the total number of B cells as the denominator. This approach gives a much higher CCR for healthy donors tested against the normal profile (over 80% on average).
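The effect of the three candidate denominators can be seen in a short calculation. The sketch below is illustrative only: the function name `cell_capture_rates` is hypothetical and the cell counts are invented to show the trend described above. They are not data from the study.

```python
def cell_capture_rates(captured_b, total_b, total_lymphocytes, total_cells):
    """Return the CCR under each of the three candidate denominators.

    captured_b: number of B cells inside the profile ellipsoid(s)
    total_b, total_lymphocytes, total_cells: candidate denominators
    """
    if not 0 <= captured_b <= total_b <= total_lymphocytes <= total_cells:
        raise ValueError("counts must satisfy captured <= B <= lymphocytes <= all")
    if total_b == 0:
        raise ValueError("no B cells in the test case")
    return {
        "all_cells": captured_b / total_cells,
        "lymphocytes": captured_b / total_lymphocytes,
        "b_cells": captured_b / total_b,
    }


# Invented counts for one hypothetical healthy-donor sample.
rates = cell_capture_rates(
    captured_b=850, total_b=1000, total_lymphocytes=6500, total_cells=50000
)
for name, value in rates.items():
    print(f"{name}: {value:.1%}")
```

With these invented counts the same numerator yields a CCR of under 2% against all cells, about 13% against lymphocytes, and 85% against B cells, which mirrors why the B-cell denominator spreads the values over a wider, easier-to-compare range.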
The details of the fitting process are given below; the final B cell CCR is defined as the ratio of the number of B cells inside the ellipsoid to the total number of B cells. Input:
n: total number of observations
k: number of clusters (k = 3 for normal profile and k = 2 for patient profile)
d: number of attributes in the observation (d = 3)
X[c, j]: observation data of the c-th cell, c = 1, ..., n, and j = 1, ..., d.
P: a Profile (Normal, CLL, or FL) including M[c,j] and V[c,i,j], c = 1, ..., k.
Output: Cell capture rate of X against the profile P.
Step 1: [Clustering of cells] A hierarchical divisive clustering approach first identifies the T cells using the CD19 value. Let cluster[c] be the cluster of cell c, so that cluster[c] = k for every cell c in the T-cell cluster. For the remaining cells, use the K-means algorithm on X[c,j] to find the remaining k-1 cluster(s). The B-cell clusters are numbered 1, ..., k-1.
Step 2: [Finding the centers] For each cluster, find the center MC[c, i] (c = 1, ..., k, i = 1, ..., d) by computing the mean of the cells in that cluster.
Step 3: [Alignment] Find the difference δ[k,i] = T[k,i] - M[k,i] between T[k,i], the center of the T-cell cluster of the test case (the center MC[k,i] found in Step 2), and M[k,i], the center of the T cells in the profile P. Shift the means so that the T-cell ellipsoid aligns with the T-cell cluster, i.e., M[c,i] = M[c,i] + δ[k,i] for c = 1, ..., k.
The algorithm calculates and selects the axis with the smallest distance to the CCR of the test case.
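The alignment in Steps 2 and 3 amounts to one translation applied to every ellipsoid center. The following minimal sketch assumes the T cells of the test case have already been labeled in Step 1. The names `cluster_center` and `align_profile` are hypothetical, and the coordinates are invented for illustration.

```python
def cluster_center(cells):
    """Mean of a list of d-dimensional points (Step 2)."""
    d = len(cells[0])
    return [sum(cell[i] for cell in cells) / len(cells) for i in range(d)]


def align_profile(profile_means, t_index, test_t_cells):
    """Shift all profile means by the T-cell offset (Step 3).

    profile_means: list of k cluster means from the profile
    t_index: position of the T-cell cluster in profile_means
    test_t_cells: points of the test case labeled as T cells
    """
    t_center = cluster_center(test_t_cells)
    # delta = (T-cell center in test case) - (T-cell center in profile)
    delta = [t - p for t, p in zip(t_center, profile_means[t_index])]
    # The same delta is applied to every cluster, so distances
    # between the ellipsoid centers are preserved.
    return [[m + dl for m, dl in zip(mean, delta)] for mean in profile_means]


# Invented profile with two B-cell clusters and one T-cell cluster (last).
profile_means = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [5.0, 5.0, 5.0]]
test_t_cells = [[6.0, 5.0, 4.0], [6.0, 5.0, 6.0]]
aligned = align_profile(profile_means, t_index=2, test_t_cells=test_t_cells)
print(aligned)
```

Because only the means move and the covariance matrices are untouched, the ellipsoids keep their shape and relative positions, which is what allows a change in the distance between clusters of a test case to show up as a lower CCR.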
Multi-profile testing with cross validation
(24 + 21 + 15) × 3 = 180
(36 + 20 + 15) × 21 = 1491
(36 + 21 + 14) × 15 = 1065
Since we adopted the leave-one-out approach for building the CLL and FL profiles, some of the cancer patients fit the profile better than others. A more carefully selected profile is needed to improve the accuracy of the diagnosis, which is discussed in the next section.
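The CLL and FL test counts listed earlier in this section are consistent with one reading of the leave-one-out scheme: each patient case in turn supplies the profile, and that profile is tested against every other case. The sketch below checks the arithmetic under that assumption. The function name `leave_one_out_tests` is hypothetical, and the figure of 36 healthy test cases is taken from the counts themselves rather than from a stated protocol.

```python
def leave_one_out_tests(n_healthy, n_cll, n_fl, profile_group):
    """Count tests when each case of profile_group builds one profile.

    Assumption: a profile built from one case is tested against all
    remaining cases, i.e. every case except the one used to build it.
    """
    sizes = {"healthy": n_healthy, "CLL": n_cll, "FL": n_fl}
    n_profiles = sizes[profile_group]
    tests_per_profile = sum(sizes.values()) - 1  # leave the profile case out
    return n_profiles * tests_per_profile


cll_tests = leave_one_out_tests(36, 21, 15, "CLL")  # (36 + 20 + 15) * 21
fl_tests = leave_one_out_tests(36, 21, 15, "FL")    # (36 + 21 + 14) * 15
print(cll_tests, fl_tests)
```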
Multi-profile testing with a data selection strategy for profile building
As mentioned previously, there is no need to pre-select healthy donors to build the normal profile, since healthy donors' samples are fairly consistent in composition. To choose a better ellipsoid to represent CLL, we used the distance from the center of cluster 3 to the center of cluster 1 (or 2) as our selection criterion (Figure 7a and 7b). We selected the approximately 15% of CLL cases whose distance is closest to the mean distance. For FL (Figure 8), we performed the same process to pre-select 15% of the FL data as training cases. The CLL and FL profiles are built by merging the training cases.
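The selection rule described above can be sketched as a ranking by closeness to the mean distance. The function name `select_training_cases`, the case identifiers, and the distance values are hypothetical, and rounding the 15% fraction to the nearest whole case is an assumption, since the text says only "approximately 15%".

```python
def select_training_cases(distances, fraction=0.15):
    """Pick the cases whose inter-cluster distance is closest to the mean.

    distances: dict mapping a case identifier to the distance between
               the centers of its clusters (the selection criterion)
    fraction:  share of cases to keep for profile building
    """
    if not distances:
        raise ValueError("no cases given")
    mean_distance = sum(distances.values()) / len(distances)
    # Assumption: round to the nearest whole case, keeping at least one.
    n_select = max(1, round(fraction * len(distances)))
    ranked = sorted(distances, key=lambda case: abs(distances[case] - mean_distance))
    return ranked[:n_select]


# Invented distances for four hypothetical cases; mean distance is 4.0.
example = {"case_a": 1.0, "case_b": 2.0, "case_c": 3.0, "case_d": 10.0}
chosen = select_training_cases(example, fraction=0.5)
print(chosen)  # ['case_c', 'case_b']
```

Under this rounding assumption, 15% of 21 CLL cases gives 3 training cases and 15% of 15 FL cases gives 2, leaving 18 and 13 test cases respectively, which matches the test-set sizes reported in the abstract.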
As a proof-of-concept study, we have demonstrated a multi-profile B lymphocyte neoplasm analysis methodology to automate the detection of certain types of B lymphocyte neoplasms by FC. A profiling method was described that characterized both the healthy donors and patients with different types of B-lymphocyte neoplasms. A CCR was defined to measure the fitness of a test case against the profile. We have demonstrated that one can obtain an automated diagnosis of CLL and FL by examining the CCRs of a test case against all three profiles. Although we only looked at FL and CLL in this study, this novel 3-D 5-parameter detection system should be capable of identifying other types of B lymphocyte neoplasms. Moreover, since the analysis is computational, it is possible to track FC data for monitoring disease progression of a lymphoma patient.
Additionally, this 3-D 5-parameter detection system provides a novel way for pathologists to interpret FC data. Instead of manually gating on numerous 2-parameter plots, they can analyze 5 parameters in a 3-D image that can be rotated and viewed from various angles. This would allow them to see small clusters of cells that may be obscured in a 2-D image. In this way, the 3-D 5-parameter detection system has the potential to improve, through automation and improved data interpretation, a process that is labor-intensive, time-consuming, and subject to human error.
This article is an expanded version of a paper previously presented at the 2012 IEEE 2nd International Conference on Computational Advances in Bio and Medical Sciences (ICCABS). We expanded the preliminary results presented at the ICCABS conference and added the following new components.
1. Detailed algorithms of our method: In the ICCABS paper, we included only brief descriptions of building profiles and using a profile to test a new subject. In the current submission, we have included the detailed steps of the Profile Building Algorithm and the Fitting Algorithm.
2. Additional experimental results: After collecting more data from the Methodist Hospital, we added 7 more FL patient cases, which almost doubled the FL sample size.
3. A comprehensive analysis including cross-validation of the testing: In the current submission, we added (a) Single-Profile Testing, (b) Multi-Profile Testing with Cross Validation, and (c) a data selection strategy for profile building, which yields better profiles for CLL and FL.
4. New definition of the B cell CCR: The B cell CCR is calculated as the number of B cells inside the ellipsoid divided by the total number of B cells, rather than by the total number of lymphocytes as in the preliminary paper.
5. Other improvements: We presented a new overview of the methodology, which gives a better explanation of the system, and we used box plots to compare the cell capture rates obtained with the various profiles. This gives the reader a better understanding of the distribution of the CCRs.
This project was supported in part by NIH grants R01CA151955 (YZ) and R33CA173382 (YZ).
Publication of this article was supported by the Methodist Hospital.
This article has been published as part of BMC Genomics Volume 14 Supplement 7, 2013: Selected articles from the Second IEEE International Conference on Computational Advances in Bio and Medical Sciences (ICCABS 2012): Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/14/S7
- Lee G, Stoolman L, Scott C: Transfer learning for automatic gating of flow cytometry data. ICML 2011 Workshop on Unsupervised and Transfer Learning. 2011.
- Leary JF, Smith J, Szaniszlo P, Reece LM: Comparison of multidimensional flow cytometric data by a novel data mining technique. art no 64410N Imaging, Manipulation, and Analysis of Biomolecules, Cells, and Tissues V. 2007, 6441: N4410-
- Roederer M, Hardy RR: Frequency difference gating: a multivariate method for identifying subsets that differ between samples. Cytometry. 2001, 45 (1): 56-64. 10.1002/1097-0320(20010901)45:1<56::AID-CYTO1144>3.0.CO;2-9.
- Liu L, Xiong L, Lu JJ, Gernert KM, Hertzberg V: Comparing and clustering flow cytometry data. Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine. 2008, 305-309.
- Wilkins MF, Hardy SA, Boddy L, Morris CW: Comparison of five clustering algorithms to classify phytoplankton from flow cytometry data. Cytometry. 2001, 44 (3): 210-217. 10.1002/1097-0320(20010701)44:3<210::AID-CYTO1113>3.0.CO;2-Y.
- Pedreira CE, Costa ES, Barrena S, Lecrevisse Q, Almeida J, van Dongen JJ, Orfao A: Generation of flow cytometry data files with a potentially infinite number of dimensions. Cytometry A. 2008, 73 (9): 834-846.
- Cavenaugh JS, von Laszewski G, Mosmann TR, Rebhahn J, Datta S, Naim I, Sharma G, Pangborn A, Pendleberry S, Wismueller A, Espenshade J, Wu H: Rapid, theoretically sound multivariate clustering for a paradigm shift in flow cytometry data analysis. Cytometry B Clinical Cytometry. 2009, 76B (6): 397-398.
- Finn WG, Carter KM, Raich R, Stoolman LM, Hero AO: Analysis of clinical flow cytometric immunophenotyping data by clustering on statistical manifolds: Treating flow cytometry data as high-dimensional objects. Cytometry B Clinical Cytometry. 2008, 76B (1): 1-7.
- Pyne S, Hu X, Wang K, Rossin E, Lin TI, Maier LM, Baecher-Allan C, McLachlan GJ, Tamayo P, Hafler DA, De Jager PL, Mesirov JP: Automated high-dimensional flow cytometric data analysis. Clinical Immunology. 2008, 127: S152-
- Pyne S, Hu X, Wang K, Rossin E, Lin TI, Maier LM, Baecher-Allan C, McLachlan GJ, Tamayo P, Hafler DA, De Jager PL, Mesirov JP: Automated high-dimensional flow cytometric data analysis. Proceedings of Research in Computational Molecular Biology. 2010, 6044: 577-582. 10.1007/978-3-642-12683-3_41.
- Zare H, Shooshtari P, Gupta A, Brinkman RR: Data reduction for spectral clustering to analyze high throughput flow cytometry data. BMC Bioinformatics. 2010, 11: 403. 10.1186/1471-2105-11-403.
- Zeng QT, Pratt JP, Pak J, Ravnic D, Huss H, Mentzer SJ: Feature-guided clustering of multi-dimensional flow cytometry datasets. J Biomed Inform. 2007, 40 (3): 325-331. 10.1016/j.jbi.2006.06.005.
- Petrausch U, Haley D, Miller W, Floyd K, Urba WJ, Walker E: Polychromatic flow cytometry: a rapid method for the reduction and analysis of complex multiparameter data. Cytometry A. 2006, 69 (12): 1162-73.
- Jeffries D, Zaidi I, de Jong B, Holland MJ, Miles DJ: Analysis of flow cytometry data using an automatic processing tool. Cytometry Part A. 2008, 73A (9): 857-867. 10.1002/cyto.a.20611.
- Estes ML, Mund JA, Mead LE, Prater DN, Cai S, Wang H, Pollok KE, Murphy MP, An CS, Srour EF, Ingram DA, Case J: Application of polychromatic flow cytometry to identify novel subsets of circulating cells with angiogenic potential. Cytometry A. 2010, 77 (9): 831-839.
- Pyne S, Hu X, Wang K, Rossin E, Lin TI, Maier LM, Baecher-Allan C, McLachlan GJ, Tamayo P, Hafler DA, De Jager PL, Mesirov JP: Automated high-dimensional flow cytometric data analysis. Proceedings of the National Academy of Sciences of the United States of America. 2009, 106 (21): 8519-8524. 10.1073/pnas.0903028106.
- Toedling J, Rhein P, Ratei R, Karawajew L, Spang R: Automated in-silico detection of cell populations in flow cytometry readouts and its application to leukemia disease monitoring. BMC Bioinformatics. 2006, 7: 282. 10.1186/1471-2105-7-282.
- Shulman N, Bellew M, Snelling G, Carter D, Huang Y, Li H, Self SG, McElrath MJ, De Rosa SC: Development of an automated analysis system for data from flow cytometric intracellular cytokine staining assays from clinical vaccine trials. Cytometry A. 2008, 73 (9): 847-856.
- Tzircotis G, Thorne RF, Isacke CM: A new spreadsheet method for the analysis of bivariate flow cytometric data. BMC Cell Biology. 2004, 5: 10. 10.1186/1471-2121-5-10.
- Kalina T, Stuchlý J, Janda A, Hrusák O, Růzicková S, Sedivá A, Litzman J, Vlková M: Profiling of polychromatic flow cytometry data on B-cells reveals patients' clusters in common variable immunodeficiency. Cytometry A. 2009, 75 (11): 902-909.
- Luta G: On extensions of k-means clustering for automated gating of flow cytometry data. Cytometry A. 2011, 79 (1): 3-5.
- Liu HC, Wu DB, Yih JM, Liu SW: Fuzzy C-means algorithm based on standard mahalanobis distances. Proceedings of the 2009 International Symposium on Information Processing (ISIP'09). 2009, 422-427. ISBN 928-952-5726-02-2.
- Fiser K, Sieger T, Vormoor JH: Identifying candidate normal and leukemic B cell progenitor populations with hierarchical clustering of 6-color flow cytometry data - a better view. Blood. 2007, 110 (11): 428a-
- Balog AR, Meyerson HJ: Classification of low-grade B-cell lymphomas using hierarchical clustering of raw flow cytometry data. Laboratory Investigation. 2011, 91: 286a-
- Diaz-Romero J, Romeo S, Bovée JV, Hogendoorn PC, Heini PF, Mainil-Varlet P: Hierarchical clustering of flow cytometry data for the study of conventional central chondrosarcoma. Journal of Cellular Physiology. 2010, 225 (2): 601-611. 10.1002/jcp.22245.
- Fiser K, Sieger T, Schumich A, Irving J, Dworzak MN, Vormoor J: MRD monitoring of childhood ALL using hierarchical clustering and support vector machine learning of complex multi-parameter flow cytometry data. Blood. 2008, 112 (11): 536-
- Walther G, Zimmerman N, Moore W, Parks D, Meehan S, Belitskaya I, Pan J, Herzenberg L: Automatic clustering of flow cytometry data with density-based merging. Adv Bioinformatics. 2009, 686759-
- Lo K, Brinkman RR, Gottardo R: Automated gating of flow cytometry data via robust model-based clustering. Cytometry A. 2008, 73 (4): 321-332.
- Lakoumentas J, Drakos J, Karakantza M, Nikiforidis GC, Sakellaropoulos GC: Bayesian clustering of flow cytometry data for the diagnosis of B-chronic lymphocytic leukemia. J Biomed Inform. 2009, 42 (2): 251-261. 10.1016/j.jbi.2008.11.003.
- Boedigheimer MJ, Ferbas J: Mixture modeling approach to flow cytometry data. Cytometry A. 2008, 73 (5): 421-429.
- Chan C, Feng F, Ottinger J, Foster D, West M, Kepler TB: Statistical mixture modeling for cell subtype identification in flow cytometry. Cytometry A. 2008, 73 (8): 693-701.
- Naim I, Datta S, Sharma G, Cavenaugh JS, Mosmann TR: Swift: scalable weighted iterative sampling for flow cytometry clustering. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing. 2010, 509-512.
- Pedreira CE, Costa ES, Arroyo ME, Almeida J, Orfao A: A multidimensional classification approach for the automated analysis of flow cytometry data. IEEE Trans Biomed Eng. 2008, 55 (3): 1155-1162.
- Bashashati A, Lo K, Gottardo R, Gascoyne RD, Weng A, Brinkman R: A pipeline for automated analysis of flow cytometry data: preliminary results on lymphoma sub-type diagnosis. Proceedings of EMBC: Annual International Conference of the IEEE Engineering in Medicine and Biology Society. 2009, 4945-4948.
- Pedreira CE, Costa ES, Almeida J, Fernandez C, Quijano S, Flores J, Barrena S, Lecrevisse Q, Van Dongen JJ, Orfao A: A probabilistic approach for the evaluation of minimal residual disease by multiparameter flow cytometry in leukemic B-cell chronic lymphoproliferative disorders. Cytometry A. 2008, 73A (12): 1141-1150. 10.1002/cyto.a.20638.
- Shih MC, Huang SHS, Chang CCJ: A multidimensional flow cytometry data classification. Proceedings of the 9th IEEE International Conference on Bioinformatics and Bioengineering. 2009, 356-359.
- Shih MC, Huang SHS, Zu Y, Donohue R, Chang CCJ: Automatic B cell lymphoma detection using flow cytometry data. 2nd IEEE International Conference on Computational Advances in Bio and Medical Sciences (ICCABS). 2012, 1-6.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.