- Research article
- Open Access
CGHnormaliter: an iterative strategy to enhance normalization of array CGH data with imbalanced aberrations
© van Houte et al; licensee BioMed Central Ltd. 2009
- Received: 5 September 2008
- Accepted: 26 August 2009
- Published: 26 August 2009
Array comparative genomic hybridization (aCGH) is a popular technique for detection of genomic copy number imbalances. These play a critical role in the onset of various types of cancer. In the analysis of aCGH data, normalization is deemed a critical pre-processing step. In general, aCGH normalization approaches are similar to those used for gene expression data, even though the two data types differ inherently. A particular problem with aCGH data is that imbalanced copy numbers lead to improper normalization when conventional methods are used.
In this study we present a novel method, called CGHnormaliter, which addresses this issue by means of an iterative normalization procedure. First, provisional balanced copy numbers are identified and subsequently used for normalization. These two steps are then iterated to refine the normalization. We tested our method on three well-studied tumor-related aCGH datasets with experimentally confirmed copy numbers. Results were compared to a conventional normalization approach and two more recent state-of-the-art aCGH normalization strategies. Our findings show that, compared to these three methods, CGHnormaliter yields a higher specificity and precision in terms of identifying the 'true' copy numbers.
We demonstrate that the normalization of aCGH data can be significantly enhanced using an iterative procedure that effectively eliminates the effect of imbalanced copy numbers. This also leads to a more reliable assessment of aberrations. An R-package containing the implementation of CGHnormaliter is available at http://www.ibi.vu.nl/programs/cghnormaliterwww.
- Acute Lymphoblastic Leukemia
- Copy Number Change
- Array Comparative Genomic Hybridization
- Normalization Strategy
- Human Melanoma Cell Line
Array comparative genomic hybridization (aCGH) is an experimental approach used to scan an entire genome for copy number changes at a high resolution. These changes occur particularly in oncogenes where mutations can lead to either gains or losses of genetic material. Consequently, aCGH is a commonly used technique to identify aberrations leading to tumors [2–4]. In aCGH experiments, test and reference DNA samples are labeled with distinct dyes and hybridized to cloned DNA fragments of which the exact genomic location is known. For each DNA region the two dye intensities are measured by fluorescence, from which the corresponding log2 intensity ratios (M) are calculated. A ratio value close to zero indicates a normal copy number (e.g. two in diploids) while a value above or below zero indicates a gain or a loss, respectively. Nonetheless, the proper assessment of copy numbers is not a trivial task and several computational algorithms have been developed for normalization, smoothing, segmentation and calling [5–9]. The normalization procedure, the first stage of the aCGH analysis, aims to minimize the effect of technical bias (e.g. dye bias) on the log2 intensity ratios. Usually aCGH normalization is based upon methods applied to gene expression data, i.e. global-median and intensity-based LOWESS normalization. In global-median normalization a median M value is determined and subtracted from all M values. By doing so, the M values become centered around a median value of zero. Intensity-based LOWESS normalization instead fits a smooth regression line through all M value points. Normalization is achieved by subtracting from each M value its corresponding regression value. In the majority of cases, however, these conventional techniques are not suited to aCGH data, because relevant biological variation is often erroneously treated as technical bias and removed.
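The over-normalization effect of global-median normalization can be illustrated with a minimal sketch. The function name and the toy M values below are hypothetical and chosen only for illustration; they are not taken from any of the datasets in this study.

```python
from statistics import median

def global_median_normalize(m_values):
    """Subtract the global median from every log2 ratio (M value)."""
    centre = median(m_values)
    return [m - centre for m in m_values]

# Hypothetical toy profile: 4 normal probes near 0 and 6 gained probes near 0.58.
normals = [-0.02, 0.00, 0.01, 0.03]
gains = [0.55, 0.57, 0.58, 0.59, 0.60, 0.62]
normalized = global_median_normalize(normals + gains)

# Because gains outnumber normals, the median lies inside the gain population.
# The normal probes are therefore shifted well below zero and would resemble losses.
print([round(m, 2) for m in normalized[:4]])
```

In this toy case the biological signal (the gains) determines the centre, which is the kind of improper centralization described above.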
For instance, probes corresponding to gains (which on average have higher intensities) are generally 'over-normalized', making a proper assessment of gains more difficult. A recently developed method, called popLowess, attempts to tackle this problem by separating the aberrations from the normals through k-means (k = 3) clustering. In this manner the normalization is based only on the population of normal probe values. The problem, however, is that 'calling' through a clustering method is rather coarse-grained, while several more refined methods are available [9, 12–14]. Another recent normalization and centralization method that seeks to overcome over-normalization was proposed by Chen et al. In their algorithm normalization is performed by regressing the highest ridgeline of a 2-dimensional intensity distribution, which is assumed to correspond to normal probes. Subsequently, the most frequently occurring probe intensity (i.e. the highest peak in the intensity distribution) is used for centralization.
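The idea of separating normals from aberrations by k-means (k = 3) clustering can be sketched as follows. This is not the popLowess implementation: the one-dimensional Lloyd's algorithm, the deterministic initialization, the rule that the middle cluster holds the normals, and the use of a simple median offset instead of a LOWESS fit are all simplifying assumptions made for this example, as are the toy M values.

```python
from statistics import mean, median

def kmeans_1d(values, centres, max_iter=100):
    """Plain Lloyd's algorithm in one dimension; returns (centres, labels)."""
    centres = list(centres)
    labels = []
    for _ in range(max_iter):
        # Assign each value to its nearest centre.
        labels = [min(range(len(centres)), key=lambda k: abs(v - centres[k]))
                  for v in values]
        # Recompute each centre as the mean of its members (empty clusters stay put).
        new_centres = []
        for k in range(len(centres)):
            members = [v for v, lab in zip(values, labels) if lab == k]
            new_centres.append(mean(members) if members else centres[k])
        if new_centres == centres:
            break
        centres = new_centres
    return centres, labels

# Hypothetical M values: two losses, five normals, four gains.
m_values = [-0.61, -0.58, -0.03, -0.01, 0.00, 0.02, 0.04, 0.55, 0.57, 0.60, 0.62]
init = [min(m_values), median(m_values), max(m_values)]
centres, labels = kmeans_1d(m_values, init)

# Assumption: the cluster with the middle centre contains the normal probes.
middle = sorted(range(3), key=lambda k: centres[k])[1]
normal_probes = [v for v, lab in zip(m_values, labels) if lab == middle]

# Normalize all probes using the normal population only.
offset = median(normal_probes)
normalized = [v - offset for v in m_values]
```

The sketch also hints at why clustering is a coarse form of calling: each probe is labeled purely by its M value, without regard to its genomic position or its neighbours.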
Definition of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN).
Performance on the acute lymphoblastic leukemia (ALL) dataset
Effect of different normalization strategies on the M values.
Performance on the gastrointestinal stromal tumor (GIST) dataset
Performance on the human melanoma cell line dataset
Results on the melanoma dataset are shown in Figure 2C. CGHnormaliter performs best on specificity (0.90) and precision (0.81), while global-median normalization is slightly more sensitive than CGHnormaliter (0.77 versus 0.76). popLowess and the method of Chen et al. perform several percentage points worse than CGHnormaliter on all evaluation criteria. It should be stressed, however, that the somewhat higher sensitivity of global-median normalization can be attributed to a strikingly good performance on a single sample, the human melanoma cell line WM983. In this case centralization of the aCGH data is rather complicated, since more than half of the WM983 genome is aberrant. Overall, the results are in line with those for the previous two datasets, where CGHnormaliter outperforms the competitors tested.
The work we present here is based on a thorough comparison of a number of aCGH normalization methods involving several test sets. Further investigation should not only comprise larger datasets containing significantly more samples, but should also involve additional high-density platforms, such as Nimblegen. It goes without saying that future development of aCGH data analysis methods will be largely dependent on the size and quality of benchmark sets.
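For reference, the three evaluation criteria can be computed from the counts of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN) using their standard definitions. The sketch below assumes these standard formulas; the function name and the counts are hypothetical and do not correspond to any result reported here.

```python
def evaluate(tp, fp, tn, fn):
    """Standard sensitivity, specificity and precision from confusion counts."""
    return {
        "sensitivity": tp / (tp + fn),  # fraction of true aberrations that were called
        "specificity": tn / (tn + fp),  # fraction of true normals that were called normal
        "precision": tp / (tp + fp),    # fraction of called aberrations that are true
    }

# Hypothetical counts, for illustration only.
scores = evaluate(tp=40, fp=10, tn=90, fn=10)
print(scores)
```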
As a next step in the future development of our method we aim to extend the protocol by allowing single-channel data, or dual-channel data for which intensity values are not available. This could be achieved by implementing an iterative local-median strategy as an alternative to the local-LOWESS strategy currently used. In this way the general applicability of CGHnormaliter would be enhanced.
Finally, it should be stressed that a major pitfall of all methods occurs in cases that display many imbalanced copy number alterations. In samples where the number of gains or losses exceeds the number of normals, the data will be centralized around these gains or losses, leading to an incorrect normalization. Another drawback appears in sets where the ploidy of the reference and test sample differs, usually as a result of hypoploidy of the test sample. For instance, if the ploidy of reference and test are m and n (where m ≠ n), respectively, the centralization should be around log2(n/m) instead of zero as employed by current methods. The integration of prior knowledge concerning the ploidy, number and nature of aberrations is likely to be key in alleviating these complications.
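The ploidy issue can be made concrete with a small worked example. Since M is the log2 ratio of test to reference intensity, a region in which the test sample carries its baseline n copies against m copies in the reference would be expected near log2(n/m) rather than zero. The function name below is hypothetical, and the sketch ignores noise and any compression of the measured ratios.

```python
import math

def expected_normal_centre(reference_ploidy, test_ploidy):
    """Expected log2 ratio of a region at baseline copy number in both samples."""
    return math.log2(test_ploidy / reference_ploidy)

# Diploid reference versus triploid test sample: baseline regions sit near 0.585.
print(round(expected_normal_centre(2, 3), 3))
# Equal ploidy recovers the usual centre of zero.
print(expected_normal_centre(2, 2))
```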
We introduce a new strategy, called CGHnormaliter, for improved normalization of aCGH data displaying imbalanced aberrations. Our method was tested on three well-studied test sets (ALL, GIST and Melanoma), which are unique considering the large number of extensively validated samples and the occurrence of many imbalanced aberrations. The performance was compared with a conventional global-median approach and the recently published tools popLowess and that by Chen et al. We conclude that on average CGHnormaliter outperforms the three other methods in terms of specificity and precision, while its overall sensitivity is comparable to that obtained by popLowess and Chen et al. The global-median approach scores considerably lower on almost all data samples, mainly due to over-normalization: the presence of many imbalanced aberrations leads to an improper centralization of the intensity ratios. Furthermore, in a number of cases popLowess and Chen et al. achieve results similar to those of CGHnormaliter, since all three methods use only the normals for normalization. However, in some examples the identification of the normals is not trivial. In such cases the iterative refinement steps of CGHnormaliter yield better results than the single clustering step of popLowess or the 'highest ridgeline regression' strategy by Chen et al. It would be interesting to further investigate these findings and combine the iterative protocol with alternative normalization approaches. Nonetheless, this research emphasizes the importance of normalization based on properly defined normals and shows the added value of iteration for proper assessment of such normals.
CGHnormaliter is a normalization method tailored to aCGH data. Its novelty resides both in the fact that normalization is guided by a more sophisticated calling technique and that further refinement is attained through a new iteration procedure. The strategy can be summarized as follows. Initially the log2 intensity ratios are segmented using DNAcopy. The segmented data are then given as input to a recently developed calling tool named CGHcall to discriminate the normals from gains and losses. The assumption here is that the temporary exclusion of aberrations allows for a more appropriate calculation of the LOWESS regression curve. As a result, after normalization, the log2 intensity ratios of the normals will generally be closer to zero and better reflect the biological reality. We coin this normalization strategy 'local-LOWESS' because only a subset of the intensity ratios is considered in the LOWESS regression. The thus normalized data are then segmented again and called. It is likely that the new calls will be more accurate than the previous ones because these are now based on normalized data. In turn, further iterative normalization might benefit from these improved calls. To control iteration, CGHnormaliter needs to establish whether the normalization results have changed significantly. Iterations are terminated if each of the samples shows a mean difference relative to its value in the preceding iteration below α (default α = 0.01). Alternatively, the user can set a maximum number of iterations.
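The iterative scheme described above can be sketched for a single sample as follows. This is a simplified illustration under stated assumptions, not the CGHnormaliter implementation: a fixed-threshold caller (`call_probes`, hypothetical) stands in for segmentation with DNAcopy and calling with CGHcall, a median offset computed on the normals stands in for the local-LOWESS fit, and the convergence measure is taken to be the mean absolute change in M values between iterations. The toy M values are invented.

```python
from statistics import mean, median

def call_probes(m_values, threshold=0.3):
    """Hypothetical stand-in for segmentation + calling: -1 loss, 0 normal, 1 gain."""
    return [1 if m > threshold else -1 if m < -threshold else 0 for m in m_values]

def normalize_on_normals(m_values, calls):
    """Centre all probes on the median of the probes currently called normal."""
    normals = [m for m, c in zip(m_values, calls) if c == 0]
    offset = median(normals) if normals else 0.0
    return [m - offset for m in m_values]

def iterative_normalize(m_values, alpha=0.01, max_iter=5):
    """Alternate calling and normalization until the mean change drops below alpha."""
    current = list(m_values)
    iteration = 0
    for iteration in range(1, max_iter + 1):
        calls = call_probes(current)
        updated = normalize_on_normals(current, calls)
        change = mean(abs(u - c) for u, c in zip(updated, current))
        current = updated
        if change < alpha:
            break
    return current, call_probes(current), iteration

# Hypothetical biased profile: 4 normal probes shifted upward, 6 gained probes.
raw = [0.24, 0.28, 0.32, 0.36, 0.83, 0.85, 0.86, 0.87, 0.88, 0.90]
normalized, calls, n_iter = iterative_normalize(raw)
print(n_iter, calls)
```

In this toy profile two of the four normal probes are initially miscalled as gains because of the upward bias. After the first normalization they are called correctly, and the next pass refines the centre further, which mirrors the rationale for iterating.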
We also included a feature to prevent 'wandering' of the median during the iterative steps of CGHnormaliter. This might occur if a large number of gains or losses are present. In this situation it is likely that the calling algorithm will select many of these as normals. As a consequence an undesired upward (or downward) bias of the baseline can be observed, resulting in a biologically unrealistic number of losses (or gains), which will typically get worse during subsequent iterations. To prevent this we denote the largest copy number population as normals and adjust all calls accordingly.
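The safeguard against a wandering baseline can be sketched as a relabeling of calls. The text does not specify how calls are adjusted once the largest population has been designated as normal, so the shift-and-clip rule below is an assumption made for this example, as is the function name.

```python
from collections import Counter

def recentre_calls(calls):
    """Relabel calls (-1 loss, 0 normal, 1 gain) so the largest population is normal."""
    dominant, _ = Counter(calls).most_common(1)[0]
    # Assumption: shift every label by the dominant one and clip to the range -1..1.
    return [max(-1, min(1, c - dominant)) for c in calls]

# Hypothetical calls in which 'gain' is the largest population.
print(recentre_calls([1, 1, 1, 1, 1, 0, 0, -1]))
```

Under this rule, a profile in which normals already form the largest population is left unchanged.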
Other normalization methods
To test the global-median normalization strategy, we used the implementation in the R-package CGHcall version 1.2.0. In this routine standard global-median normalization is combined with a smoothing step to remove outliers. For popLowess we used the standalone version 1.0.1 (with a lower limit of 1 for the 'smoothing size' to guarantee normalization of all chromosomes). For the method by Chen et al. we used the MATLAB implementation provided by the authors. All programs were run using default parameter settings.
In this study we used three tumor-related benchmark aCGH datasets for method evaluation. These were selected because they contain a considerable number of samples with imbalanced copy numbers that have been cytogenetically verified using SKY, G-banding and/or FISH. The first dataset comprises 8 acute lymphoblastic leukemia (ALL) tissue samples, which were analyzed using 32 K BAC arrays (see Additional file 1). The second dataset consists of 17 gastrointestinal stromal tumors (GIST). These were analyzed using 3 K BAC and PAC arrays, where only spots with signal intensities of at least two times the background intensities were included (GSE5336, see Additional file 2). The third dataset includes samples from 4 human melanoma cell lines, which were analyzed using Agilent 44 K oligonucleotide-based CGH arrays (GSE7822, see Additional file 3).
Project name: CGHnormaliter R package
Project home page: http://www.ibi.vu.nl/programs/cghnormaliterwww
Operating system(s): Platform independent
Programming language: R
Licence: GNU GPLv3
We would like to thank Johan Staaf for kindly providing the leukemia dataset. We are also grateful to Shama Bhola, Desiree Linders and Hans Wessels (all from the VU University Medical Center) for their help with interpreting the cytogenetic profiles. Funding was provided by the Netherlands Genomics Initiative (BvH: NGI/Ecogenomics, HH: NGI/Centre for Medical Systems Biology) and the Netherlands Bioinformatics Centre (TB and WP: NBIC/BioRange).
- Pinkel D, Segraves R, Sudar D, Clark S, Poole I, Kowbel D, Collins C, Kuo WL, Chen C, Zhai Y, Dairkee SH, Ljung B, Gray JW, Albertson DG: High resolution analysis of DNA copy number variation using comparative genomic hybridization to microarrays. Nature Genetics. 1998, 20: 207-211. 10.1038/2524.
- Pinkel D, Albertson DG: Array comparative genomic hybridization and its applications in cancer. Nature Genetics. 2005, 37 (Suppl): 11-17. 10.1038/ng1569.
- Bejjani BA, Shaffer LG: Application of array-based comparative genomic hybridization to clinical diagnostics. J Mol Diagn. 2006, 8 (5): 528-533. 10.2353/jmoldx.2006.060029.
- Lockwood WW, Chari R, Chi B, Lam WL: Recent advances in array comparative genomic hybridization technologies and their applications in human genetics. Eur J Hum Genet. 2006, 14: 139-148. 10.1038/sj.ejhg.5201531.
- Chen HH, Hsu FH, Jiang Y, Tsai MH, Yang PC, Meltzer PS, Chuang EY, Chen Y: A probe-density based analysis method for array CGH data: simulation, normalization and centralization. Bioinformatics. 2008, 24 (16): 1749-1756. 10.1093/bioinformatics/btn321.
- Fridlyand J, Snijders AM, Pinkel D, Albertson DG, Jain AN: Hidden Markov models approach to the analysis of array CGH data. J Multivariate Anal. 2004, 90: 132-153. 10.1016/j.jmva.2004.02.008.
- Khojasteh M, Lam WL, Ward RK, MacAulay C: A stepwise framework for the normalization of array CGH data. BMC Bioinformatics. 2005, 6: 274. 10.1186/1471-2105-6-274.
- Wiel Van de MA, Van Wieringen WN: CGHregions: dimension reduction for array CGH data with minimal information loss. Cancer Informatics. 2007, 2: 55-63.
- Willenbrock H, Fridlyand J: A comparison study: applying segmentation to array CGH data for downstream analyses. Bioinformatics. 2005, 21: 4084-4091. 10.1093/bioinformatics/bti677.
- Hwa Yang Y, Dudoit S, Luu P, Lin DM, Peng V, Ngai J, Speed TP: Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res. 2002, 30 (4): e15. 10.1093/nar/30.4.e15.
- Staaf J, Jönsson G, Ringnér M, Vallon-Christersson J: Normalization of array-CGH data: influence of copy number imbalances. BMC Genomics. 2007, 8: 382. 10.1186/1471-2164-8-382.
- Price TS, Regan R, Mott R, Hedman Å, Honey B, Daniels RJ, Smith L, Greenfield A, Tiganescu A, Buckle V, Ventress N, Ayyub H, Salhan A, Pedraza-Diaz S, Broxholme J, Ragoussis J, Higgs DR, Flint J, Knight SJL: SW-ARRAY: a dynamic programming solution for the identification of copy-number changes in genomic DNA using array comparative genome hybridization data. Nucleic Acids Res. 2005, 33 (11): 3455-3464. 10.1093/nar/gki643.
- Wang P, Kim Y, Pollack J, Narasimhan B, Tibshirani R: A method for calling gains and losses in array CGH data. Biostatistics. 2005, 6: 45-58. 10.1093/biostatistics/kxh017.
- Wiel Van de MA, Kim KI, Vosse SJ, Van Wieringen WN, Wilting SM, Ylstra B: CGHcall: calling aberrations for array CGH tumor profiles. Bioinformatics. 2007, 23: 892-894. 10.1093/bioinformatics/btm030.
- Venkatraman ES, Olshen AB: A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics. 2007, 23 (6): 657-663. 10.1093/bioinformatics/btl646.
- Paulsson K, Heidenblad M, Mörse H, Borg Å, Fioretos T, Johansson B: Identification of cryptic aberrations and characterization of translocation breakpoints using array CGH in high hyperdiploid childhood acute lymphoblastic leukemia. Leukemia. 2006, 20: 2002-2007. 10.1038/sj.leu.2404372.
- Wozniak A, Sciot R, Guillou L, Pauwels P, Wasag B, Stul M, Vermeesch JR, Vandenberghe P, Limon J, Debiec-Rychter M: Array CGH analysis in primary gastrointestinal stromal tumors: cytogenetic profile correlates with anatomic site and tumor aggressiveness, irrespective of mutational status. Genes, Chromosomes & Cancer. 2007, 46 (3): 261-276. 10.1002/gcc.20408.
- Greshock J, Feng B, Nogueira C, Ivanova E, Perna I, Nathanson K, Protopopov A, Weber BL, Chin L: A comparison of DNA copy number profiling platforms. Cancer Research. 2007, 67 (21): 10173-10180. 10.1158/0008-5472.CAN-07-2102.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.