Normalization of array-CGH data: influence of copy number imbalances
BMC Genomics volume 8, Article number: 382 (2007)
High-resolution microarray-based comparative genomic hybridization (CGH) techniques have successfully been applied to study copy number imbalances in a number of settings such as the analysis of cancer genomes. For normalization of array-CGH data, methods initially developed for gene expression microarray analysis have, in general, been directly adopted and used. However, these methods are designed to work under assumptions that may not be valid for array-CGH data when copy number imbalances are present. We therefore sought to investigate the effect on normalization imposed by copy number imbalances.
Here we demonstrate that copy number imbalances correlate with intensity in array-CGH data thereby causing problems for conventional normalization methods. We propose a strategy to circumvent these problems by taking copy number imbalances into account during normalization, and we test the proposed strategy using several data sets from the analysis of cancer genomes. In addition, we show how the strategy can be applied to conveniently define adaptive sample-specific boundaries between balanced copy number, losses, and gains to facilitate management of variation in tissue heterogeneity when calling copy number changes.
We highlight the importance of considering copy number imbalances during normalization of array-CGH data, and show how failure to do so can deleteriously affect data and hamper interpretation.
Microarray-based techniques for genome-wide investigation of copy number aberrations (CNAs) have recently gained much attention. Initially employing arrays developed for gene expression analysis , or low-density arrays produced from large-insert genomic clones such as bacterial artificial chromosomes (BACs) , the application has evolved rapidly. Currently, specialized high-density arrays with oligonucleotide probes or probes derived from BAC clones are predominately used. Two-channel array-based comparative genomic hybridization (aCGH) is a direct successor to conventional metaphase CGH . In both cases, DNA from two samples are differentially labeled with fluorescent dyes and co-hybridized to immobilized genomic capture probes. By use of aCGH, DNA derived from tumor tissue can be compared with reference DNA, e.g., normal whole blood DNA, and genomic imbalances can effectively be investigated. The main advantage of aCGH over conventional CGH is the increased resolution achieved by microarrays with a large number of individual probes, routinely up to hundreds of thousands, covering the entire genome . The power of aCGH has been demonstrated in tumor studies [5–8], as well as in the field of clinical genetics , and the basis of the technique is reviewed elsewhere . In essence, relative ratios of copy number between two DNA samples are obtained by comparing the two fluorescent signal intensities for each probe under the assumption that intensities reflect the amount of corresponding genomic DNA in the respective sample.
In much the same way as for gene expression microarray analysis, relative ratios must be normalized to account for systemic technical bias while retaining relevant biological changes . Although much effort has been invested in developing methods for analysis of aCGH data, including break-point identification and segmentation [12–14], less attention has been devoted to normalization. For this latter purpose, methods originally developed for gene expression microarray data, such as global-median (Median) and intensity-based lowess (Lowess) normalization, have been adopted [5, 6]. Recent reports have evaluated the performance of gene expression normalization strategies when applied on aCGH data and have proposed more specific approaches [15, 16]. Although valid concerns about directly adopting existing normalization techniques are expressed, proposed strategies rely on available conventional methods and the inherent properties of aCGH data have, rather than being incorporated in the strategies, mainly been used for calibration and validation. Microarray data is frequently visualized using M-A plots in which the log ratio, referred to as M, is plotted as a function of log mean intensity, referred to as A (Figure 1a) . When normalizing data using Median, the median M value is identified and subtracted from all M values. This procedure centers data such that the median M value becomes zero. Lowess normalization works in much the same way but use a locally fitted regression curve along the full range of A to identify M values to center data at. This intensity-based strategy has the added advantage over Median normalization of correcting for intensity-based bias of M. Intensity-based bias can introduce curvature in M across A (Figure 1a) which remains uncorrected for after Median based normalization.
Using self-self comparisons, in which a sample is compared with itself, it has been observed that other forms of technical bias, e.g., spatial- or plate bias, exist that can skew measured M values enough to revoke the validity of the aforementioned normalization methods . Both methods have therefore been implemented in ways that include stratification of M values in groups of data that are individually subjected to the correction. Stratification can be performed based on, e.g., spatial probe location, or probe source . The general thought is that stratification will result in groups, i.e., populations, of data in which the validity of the normalization method is upheld. It has also been observed that the assumptions, required for conventional normalization methods to work, can fail as a result of a true biological distribution of M, e.g., in situations where the majority of probes measure true differences between compared samples .
We here highlight a well known and commonly displayed property of tumor cells, namely the presence of biologically true CNAs. Figure 1b shows a genome plot of raw M values obtained by aCGH of a female breast cancer tumor xenograft  compared with male normal whole blood DNA. In the genome plot, M is plotted as a function of the genomic location of the probe sequence. In figure 1b, several genomic regions with different and discrete M can readily be observed. We sought to investigate the effect on normalization imposed by this property of aCGH data. We show that this property results in consequential drawbacks when using conventional normalization methods and propose a strategy that incorporates any populations present in the data into the normalization.
The proposed strategy can be integrated with any of several existing normalization methods and results in improved data quality. Also, spatial effects resulting in non-biological, but relevant, populations that can bias normalization are handled when calculating corrections. We also note that part of the procedure can be applied to assign adaptive sample-specific thresholds for calling copy number changes. The proposed normalization strategy, as well as the adaptive sample-specific level scaling, provides powerful and convenient means for improved copy number analysis using aCGH.
Results and Discussion
This study is outlined as follows with results and discussion presented accordingly. To investigate the influence of copy number imbalances on normalization we first created a set of mimicked data representing states of an increasing fraction of genomic gain. Using the mimicked data we demonstrate the effects of gain on normalization using Median and Lowess. We then evaluated an alternative normalization strategy in which data is stratified into separate populations representing gain and balanced copy number respectively. Whereas mimicked data provide prior knowledge facilitating stratification, most experiments lack this information. Therefore, we developed a method for stratification of data and evaluated the method using previously characterized cases. By applying our procedure for stratification and normalization to tumor specimens on different aCGH platforms we compare performance with standard methods. We investigate the implication of technical spatial effects and propose a strategy for improved normalization. In addition, we evaluate the possibility to apply our method to assess noise levels in data and assign sample-specific thresholds for detection of copy number imbalances.
Normalization of aCGH data using Median
We assumed that aCGH data from samples with a substantial amount of imbalances could be erroneously corrected using Median normalization. This problem is not unexpected and the effect is well known in corresponding cases when gene expression microarray data is normalized . We investigated this issue using aCGH data derived from tiling BAC arrays comparing copy number between DNA from a normal female with karyotype 46, XX and a cell line with 47, XXX . In this case, autosomes are expected to yield log ratio values of M = 0 and the X chromosome is expected to yield log ratio values of M = 0.58 corresponding to XXX/XX. By first removing Y chromosome values and then randomly omitting a varying number of values for autosomes, while retaining all X chromosome values, we could mimic cases with different percentage of gain. In this way we created mimicked data sets with 5, 10, 15, 20, 25, 30, 35, and 40 percent gain, respectively, where 5 percent gain corresponds to not omitting any autosome values. Data sets were created from raw data and then subjected to normalization using Median. After normalization we investigated ratios for autosomes and the X chromosome (Figure 2). As a result of an increased fraction of gain, the median M for the X chromosome is shifted from 0.42 to 0.30 (Figure 2a), confirming our belief that normalization strategies for aCGH should account for the presence of different copy-number populations. The observed shift is a direct result of the composition of aCGH data with respect to copy number populations and can also be observed when looking at autosomes for which the median M is shifted from -0.01 to -0.13 (Figure 2b). When visualizing the normalized data in genome plots the shift clearly appears: M = 0 is in between the two populations (Figure 2c).
Genomic imbalances correlate with intensity in aCGH data
Importantly, when creating the mimicked data sets we did not generate any simulated ratio values; rather, we formed different selections of values using real experimental data. We believe that this use of real experimental data is of significance for aCGH data. This belief is founded on that, in contrast to expression levels, copy number levels are restricted to a, by comparison, moderate dynamic range. Therefore, when a genomic region is subjected to gain or amplification, the increase of genomic material is relatively substantial. Thus, we reasoned that probes for regions of gain would yield comparably higher average intensities than those for regions of normal copy number and that this, in turn, would result in a correlation between M and A: probes measuring ratios of gain will have higher average intensities. The opposite relationship would apply for probes measuring ratios of loss. Consequently, utilizing normalization strategies based on Lowess would possibly correct for correlations between M and A related to genomic imbalances, resulting in loss of biologically relevant variation. To test this, we subjected the mimicked XXX/XX data sets to Lowess normalization. Once again, as a result of an increased fraction of gain the median M for the X chromosome is shifted, this time from 0.42 to 0.22. The shift can also be observed when looking at autosomes for which the median M is shifted from -0.01 to -0.14 (Figures 2d and 2e). Notably, the variation in M for the X chromosome increases with the fraction of gain (Figure 2d). The interquartile range (IQR) for X chromosome M values increases from 0.21 to 0.24, indicating that Lowess normalization is less suitable when discrete copy number populations exist. When visualizing the normalized data in genome plots the shift, as well as the increase in variation of the X chromosome, is apparent (Figure 2f). To illustrate the differences between how Median and Lowess fail in normalizing the data, and to explain the introduced variation, we created M-A plots for the different data sets including correction lines for the two methods (Figure 3). With an increased fraction of gain the median M value is shifted as seen for the correction line for Median (Figures 3a to d, yellow lines). The correction line for Lowess follows the same shift in the lower range of intensities but diverge at higher intensities (Figures 3a to d, green lines). This divergence indicates that the X chromosome ratios yield higher average intensities and that when the percentage of gain increases an intensity bias is introduced for M. Importantly, this intensity bias is not of a technical nature but represents biologically relevant changes and is a result of inherent properties of aCGH data. In the low range of intensities the Lowess correction line is fitted to local means of M reflecting predominantly autosomes. However, at some point as intensities increase, local means are affected by the X chromosome and then reflect a mixed population of autosome and X chromosome M values, i.e., balanced copy number and gain respectively. As the intensities increase further the locally fitted line will be affected by increasing fractions of X chromosome M values and when normalization is applied this will result in differences in the corrections for X chromosome M values. Thus, this normalization introduces variation. We concluded that Lowess normalization erroneously corrects for biological gain – as gain correlates with intensity in aCGH data – resulting in suppressed ratios and increased variation within copy number populations.
Normalization of aCGH data using population-based intensity-based lowess
We sought to develop a method that corrects for intensity dependence of M due to technical bias while retaining intensity dependence of biological relevance. We reasoned that if we could stratify aCGH ratios from an experiment with respect to copy number populations, we could use this information to circumvent the drawbacks with Lowess. One way to do this would be to run Lowess on one selected population and then apply the resulting correction line on all M values. We refer to this general strategy of considering copy number populations when using Lowess as population-based intensity-based lowess (popLowess). Applying popLowess would serve two purposes. Firstly, data would be centered at a copy number population rather than a mean or median of a mixture of different and possibly diverse copy number levels. Secondly, correlations between M and A related to technical bias would be identified and corrected for without affecting the intensity dependence due to different copy numbers. To test this strategy, we subjected the mimicked XXX/XX data sets to popLowess. Since we had prior knowledge about this case we could stratify values into copy number populations based on chromosome mapping. All values for autosomes were considered to comprise one population and all values from the X chromosome another.
After stratification, raw M and A values for the largest population were used to create a Lowess correction curve. The correction curve was generalized to cover the entire range of A and used to correct all values. Results are presented in figure 4. As expected, no apparent shift in median M or in variation for the X chromosome or autosomes can now be seen between the different percentages of gain (Figures 4a and 4b), demonstrating the effectiveness of popLowess. Notably, the correction line for popLowess exhibits a slight curvature (Figure 4c), indicating that certain intensity dependence of M exists for autosomes, possibly of a technical nature. Albeit small in the presented case, the observed intensity dependence underlines the importance of being able to correct aCGH data for technical bias while retaining biological variation. Based on the results in figure 4, we argue that the strategy behind popLowess offers improved means for normalizing aCGH data. However, we utilized prior knowledge about copy number populations, which guided us in data stratification. Equivalent information for tumor samples can be obtained by karyotyping using, e.g., G-banding, multicolor FISH (M-FISH) analysis, or SKY. This information can also be used to relate a ratio level to an absolute copy number. Having verified copy numbers can guide in centering of data, assuring that gains and losses are presented as relative changes in an appropriate fashion. Then again, these are not trivial experimental procedures and, thus, do not provide a plausible solution in most cases.
Stratification of M values into copy number populations
We aimed at developing a method for stratifying data into populations without prior knowledge regarding copy number allowing us to perform popLowess, and sought to identify populations in an automated fashion that requires minimal manual input and that adapt to varying noise levels. To accomplish this, we took advantage of the naively simplistic form of aCGH data, with a predetermined sequential genomic order of probes, and created a procedure described schematically in figure 5, steps 1–5. By removing outlier data based on ratio similarity between adjacent probes, the proposed strategy enriches populations from genomic regions with similar copy number. That is, regions with high variation in M, e.g., breakpoints or high level amplifications or deletions, are filtered out (Figure 5, steps 1–3). The enrichment of copy number populations can be observed in M-A plots and genome plots displaying data before and after the filter is applied (Figures 6a to 6d). We use a sample adaptive cut off for variation inferred from the data to account for a varying noise level between samples. The filtered data is subsequently segmented to further accentuate the underlying copy number populations and clustered into three distinct groups of values by k-means clustering (k = 3) (Figure 5, steps 4–5). The resulting clusters would roughly correspond to dividing the data into three copy number populations. To address situations where less than three populations exist, a merge cluster criterion can be used to merge clusters with insufficient centre-to-centre distance in M.
To test the performance of our stratification procedure in identifying copy number populations, we used a sample set (data set 8) containing eight hyperdiploid childhood acute lymphoblastic leukemia (ALL) cases previously investigated with aCGH, G-banding and M-FISH . All cases show multiple whole chromosome gains and some cases also minor chromosomal regions of gain. For each case, a population of genomic regions affected by copy number gain was identified based on available karyotyping data. Remaining regions were identified as a diploid population. We performed steps 1–5 of the popLowess stratification procedure on each case using a merge cluster criterion of M = 0.3. Effectively, two popLowess populations were obtained for each case, corresponding closely to the karyotyping data of a normal diploid population and a population of copy number gain. For both the gain and diploid popLowess populations the total number of called probes divided by expected total number of probes from karyotyping data was calculated. Furthermore, the fraction of correctly called probes by popLowess for the specific regions of gain defined by karyotyping was calculated. The results demonstrate that the procedure can effectively stratify data into enriched populations that represents discrete copy number levels (Table 1).
A procedure for normalization of aCGH data using popLowess
Once data is stratified into sets of enriched copy number populations we can select one, e.g., the largest, to perform Lowess normalization on. The generated correction curve must be generalized to cover the full range of A allowing for correction of all M values (Figure 6e). This procedure will ensure that the lowess derived correction line trails one population and remains unaffected by adjacent ones. We refer to this action as popLowess-o (where the letter o is a mnemonic for one) as it makes use of one population to derive a correction line for all data. The complete procedure of data stratification and popLowess normalization is shown in figure 5, steps 1–8. Naturally, once data is stratified alternative variants of calculating normalization corrections are imaginable. For example, one could fit lowess lines to each population and correct them individually or one could individually center populations and then use the combined data to create a lowess derived correction line. We refer to these alternatives as popLowess-i (where the letter i is a mnemonic for individual) and popLowess-c (where the letter c is a mnemonic for common) respectively. The latter alternative has the added advantage of reducing the degree to which the correction line needs to be extrapolated to cover the full range of A. Both alternatives require an additional step to center a selected copy number population at M = 0. The variants popLowess-o and popLowess-c rely on that the intensity-based curvature in M-A space is reasonably shared between populations.
Selecting a population to represent intrinsic copy number
The normalization procedure presented herein will center a population with unknown copy number at M = 0. The rationale for selecting an appropriate population for this purpose can differ depending on samples analyzed and the aim of a project. For instance, in the field of cytogenetics, gains and losses in tumors are by convention described as net changes relative to intrinsic balanced copy number, i.e., relative ploididy. As the number of centromeres determines ploidity, a parallel rationale would be to relate imbalances relative to the largest identified population and therefore center this population at M = 0. However, in some applications it might be more appropriate to relate imbalances to a normal diploid state. Thus, selecting a population to center data at can include using prior knowledge about regions with known copy number or selecting the middle population out of three, if present. Irrespectively of preferences of how data best be centered, the proposed popLowess procedure will alleviate the normalization problems related to mixed copy number populations. Importantly, when performing focused aCGH with specialized arrays that do not cover the entire genome, or comprise probes with a disproportioned focus on specific genomic regions, even CNAs that affect a minor part of the genome can introduce a significant correlation between copy number and intensity, and can result in misinterpretations of how a given ratio level relate to copy number.
Application to tumor specimens on different aCGH platforms
We next set out to test the proposed popLowess strategy on tumor aCGH data that display a more complex pattern of genomic imbalances and to test its performance on data derived from different array platforms. Figures 7a and 7c show genome- and M-A plots of a primary BRCA1 mutation positive breast cancer analyzed on a tiling 32 K BAC array. The genomic profile (Figure 7a) shows clear regions of aberration; however factors such as normal cell contamination and potential tumor heterogeneity have decreased the range in M for the sample specific CNAs. In the M-A plot (Figure 7c) the different copy number populations are not as distinct as for the sample in figure 6, likely making it more difficult to identify the copy number populations. We used the proposed popLowess strategy to identify copy number populations and visualized the result in a contour plot. Results are shown in figure 7e together with correction lines for Median, Lowess and popLowess. As observed in figure 7e neither the Lowess, nor the Median correction curve, accurately track a single copy number population. Figure 7g shows the normalized genomic profile after popLowess with the identified populations colored. The genomic profile has now been centered correctly and matches a previous report with detailed investigations of this tumor . Figure 7 (panels b, d, f and h) show the same collection of plots for the same tumor profiled using Agilent 244 K CGH oligonucleotide arrays (DLR-value 0.196). The vast number of probes and a considerably higher level of technical noise for raw data, renders it virtually impossible to visually distinguish the populations, clearly seen in the genomic profile (Figure 7b), using the 2D M-A plot (Figure 7d). Employment of the popLowess strategy enriches copy number populations of data as observed in the contour plot (Figure 7f). Similarly to the BAC array case, neither the Median, nor the Lowess correction curve, accurately track a single copy number population. Figure 7h shows the genomic profile after popLowess with identified populations colored. To assess the effect of normalization on variation for data in Figure 7 we calculated IQR for M values of identified populations. In the BAC case the average change in IQR for the three identified populations was an increase by 0.0012 when Lowess normalization was applied. Contrary, after popLowess the average change in IQR was a decrease by -0.00029. For the Agilent case the corresponding changes were an average increase by 0.059 after Lowess compared to an average decrease by -0.0011 after popLowess. Again, we conclude that Lowess, by not tracking a single population, erroneously corrects for CNAs resulting in an increased variation within copy number populations.
In order to illustrate the differences between alternative popLowess strategies we used variants to derive correction lines (Figure 8). In figure 8a correction lines for individual populations are presented. The popLowess strategy (popLowess-o) used to produce the results in figure 7 corresponds to normalizing data by selecting one of the correction lines in figure 8a. For the results in figure 7, the correction line for the largest population was selected (colored yellow in Figures 7 and 8). In figure 8b the correction line derived from popLowess-c is shown together with individually median centered populations. As mentioned, popLowess rely on that the intensity-based curvature in M-A space is reasonably shared between populations. When inspecting the individual correction lines in figure 8a, populations appear to display similar intensity-based curvature although small differences appear. Differences may partly be a result of extrapolating correction curves at the ends. A thorough investigation of these differences, although outside the scope of this study, would be of interest.
Comparison of popLowess strategy to standard normalization methods
We set out to test if the popLowess strategy could systematically reduce variation in M within copy number populations in different aCGH data sets. We hypothesized that when correction curves cross, or not accurately track, copy number populations; or when intensity-based curvature is not properly addressed, a larger variation in M is obtained after normalization. To this aim, we compared the performance of the popLowess strategy versus Median and Lowess using seven different aCGH data sets (data sets 1–6, 8). The data sets cover three different types of aCGH platforms hybridized with a variety of cell line and tumor samples displaying a large variation of CNAs.
We used the strategy in figure 5 to identify copy number populations in each of the data sets. We then normalized each data set in parallel using popLowess, Lowess, and Median. After normalization, we calculated standard deviations of M for each identified population for each method and compared results.
Results from the comparison are displayed in table 2, showing that the popLowess strategy generated normalized copy number data with smaller standard deviations in M within identified populations for all comparisons and data sets. We repeated the test using the inter-quartile range (IQR) of M for each population instead of the standard deviation and obtained similar results (data not shown).
Since we do not have prior knowledge of CNAs in most of the cases we cannot evaluate variation within confirmed genomic regions of similar copy number. Therefore, one could argue that the better performance of popLowess, resulting in lower variation within populations when compared with conventional normalization, is biased by the fact that populations are inferred from the data. However, from looking at the data in table 1, and at the genome plots in figure 7 (panel g and h) we note that the identified populations reflect regions with discrete copy number levels. Therefore, we argue that decreased intra population variation is beneficial to both interpretation and downstream analysis and provides improved data quality.
Presence of technical artifacts in array data resulting in correlation between M and spatial probe location on the array is a well-known and previously described phenomenon. We focused on two plausible consequences of such spatial effects in aCGH data. Firstly, affected values can introduce populations that compromise normalization in the same way as copy number populations. Secondly, affected values will be incorrectly scaled compared to non-affected.
We reasoned that ratios biased by spatial artifacts are controlled for by our proposed popLowess strategy as it filters outlier data guided by genomic mapping. Thus, when calculating an intensity dependent correction for normalization, our strategy would not be compromised by spatial bias as affected values are disregarded together with values from break points, high-level amplifications, and homozygous deletions. On the other hand, popLowess does not correct for spatial effects and affected values would remain incorrectly scaled after normalization even if the intensity bias is removed.
As the proposed popLowess strategy does not correct for spatial effects, we reasoned that a pre-normalization step might be appropriate for data displaying spatially related bias in order to properly scale affected values. This could be accomplished by applying one of many available spatial correction methods [15–17], or variations thereof, prior to popLowess. However, since we have shown that genomic imbalances correlate with intensity, we are cautious about addressing spatial effects using pre-normalization algorithms that are intensity-based.
To test our reasoning we applied popLowess to data set 7. Samples in this set have little to no genomic alterations but the data display variation in M-A curvature and spatial effects. Data set 7 was normalized using popLowess, block-based Median followed by popLowess, or block-based Lowess followed by popLowess. For popLowess, by itself or in combination with a pre-normalization step, a merge cluster criteria of 0.3 in M was employed to account for the presence of only two copy number populations.
As a measurement of spatial effects we calculated the standard deviation of medians of M from pin-tip blocks before and after normalization. We found that spatial bias may be corrected for by a pre-normalization step, preceding popLowess (Table 3).
We conclude that the proposed popLowess strategy is robust in the sense that it can handle the presence of otherwise deleterious populations without relying on them. We also conclude that, whereas popLowess is inert to spatial effects, in the sense that it does not compromise calculation of an intensity dependent correction, a pre-normalization step that correct for spatial bias is warranted.
Adaptive sample-specific thresholds for calling copy number change
During development of the popLowess strategy, we recognized that the sample-specific cut-off value (Figure 5, step 3) could be used to assess noise level in data and to assign thresholds for copy number imbalances on a sample-specific basis. Several reports [5, 8, 22, 23] have utilized global thresholds in M for calling CNA as gains or losses. These thresholds are assigned by adding/subtracting a value in M from a base line typically at M = 0. Determining suitable thresholds may be problematic in large sample sets with samples of varying quality and heterogeneity, often the case for tumor studies , and may result in setting too conservative thresholds for certain samples in order to avoid erroneous CNA calls. Deriving sample-specific threshold values scalable for desired stringency in an automated fashion is then of relevance.
A parallel can be made to the derivative log ratio spread (DLR) value calculated by the Agilent CGH Analytics software. The DLR-value can be used to assess hybridization quality and provide a sample scalable threshold for calling CNAs using, e.g., the Z-scoring algorithm in the CGH Analytics software.
We used sample specific level thresholds derived from popLowess on aCGH data for a BRCA1 mutation positive tumor analyzed on two array platforms (Figure 9). Figure 9a shows thresholds after popLowess normalization for the BAC array data and figure 9b after application of a 250 kBp smoothing window. Figure 9c shows the same tumor analyzed on the Agilent platform after popLowess and figure 9d after application of a 50 kBp smoothing window. As shown in figure 9, thresholds are automatically adapted to specifically match data. We believe that the use of sample-specific adaptive thresholds will greatly facilitate the analysis of larger aCGH data sets that include samples of varying heterogeneity and quality.
Normalization affects downstream analysis
To exemplify how normalization can affect downstream analysis and interpretation we used data generated from the Agilent array presented in Figure 7. We normalized the raw data shown in Figure 7 (panels b and d) with either Lowess or popLowess. Correction lines for both normalization methods are shown in Figure 7f. We then smoothed data (50 kBp window) and performed segmentation using the CGHplotter algorithm . Results for chromosome 4 are shown in Figure 10. In the given example, segmentation after Lowess and popLowess (Figures 10a and 10b respectively) broadly identifies the same break points. However, after Lowess the data is not centered on any of the identified segments as a result of the correction line not tracking a specific population in the raw data (Figure 10a). Contrary, after popLowess the data is centered on a specific segment level (Figure 10b; e.g. blue arrow). Shifting data to center it on a specific population can be done after any conventional normalization method. An example is shown in Figure 10c where data have been centered after Lowess. Determining the point at which data is centered can for example be achieved by stratifying data into populations using the method presented herein or by the method proposed by Lipson et al. . Importantly, to center data after Lowess does not alleviate the aforementioned problem of introduction of variation and the inappropriate correction of biological gain and loss. As a result, in the example given the dynamic range between segments of gain (Figure 10, red arrows) and loss (Figure 10, green arrows) is reduced after Lowess compared with popLowess, 1.38 versus 1.58. For Lowess a smaller dynamic range between levels is present in both directions relative to the baseline level (Figure 10). Reduced dynamic range or inappropriate centralization of aCGH data can result in misinterpretations when investigating genomic copy number profiles.
We show that the presence of copy number populations in aCGH data deleteriously affects normalization using curve-generating algorithms such as intensity-based lowess and may cause erroneous centering of data. We demonstrate that genomic imbalances correlate with intensity in aCGH data and therefore must be accounted for during normalization in order to correct for intensity dependence of M due to technical bias while retaining intensity dependence of biological relevance. Here we propose a population-based normalization strategy that accounts for the presence of copy number populations. We show that benefits of a population-based normalization approach are clearly recognized for data displaying numerous CNAs. We also demonstrate that the proposed procedure can be applied to assign adaptive sample-specific thresholds for calling copy number changes. We appreciate that the suggested strategy represents only one conceivable way of implementing population-based normalization and that any implementation that effectively discerns copy number populations in aCGH data, whether utilizing prior knowledge regarding samples or inference from the data itself, could be used. In addition, once copy number populations are identified, this information can be used in a variety of ways to circumvent highlighted problems related to conventional normalization of aCGH data. Taken together, we demonstrate that copy number populations in aCGH data should be accounted for during normalization and that the proposed normalization strategy, as well as the adaptive sample-specific level scaling, provides powerful and convenient means for improved copy number analysis using aCGH.
We used eight data sets derived from BAC arrays and from Agilent 244 K oligonucleotide CGH arrays to evaluate normalization methods. Data set 1 consists of seven breast cancer cell lines analyzed using tiling 32 K BAC arrays . Data set 2 consists of 28 lung cancer cell lines analyzed using tiling 32 K BAC arrays . Data set 3 consists of ten breast cancer cell lines analyzed using tiling 32 K BAC arrays . Data set 4 consists of 52 breast cancer tumors analyzed in dye-swaps on 1 Mb BAC arrays . Data set 5 consists of 8 breast cancer tumors and one dye-swap analyzed using Agilent 244 K oligonucleotide CGH arrays . These tumors displayed DLR values between 0.196 and 0.364 when analyzed with Agilent CGHAnalytics software ver 3.4.27 . Data set 6 was created from data set 5 by matching the oligonucleotide probe IDs from the 244 K arrays to the Agilent 44B probe IDs available through Agilent eArray , thus creating a virtual 44 K oligonucleotide CGH array. Of 42,447 genome-mapped probe IDs on the 44B array, 41,599 were found on the 244 K arrays (98%). Data set 7 consists of nine hybridizations of chromosome X aberrant cell lines with karyotype 47, XXX and 48, XXXX, and male 46, XY and female 46, XX samples in various combinations . Samples in data set 7 are expected to display a normal karyotype for chromosomes 1–22. Data set 8 consists of eight hyperdiploid childhood ALL cases analyzed using tiling 32 K BAC arrays .
Pre-filtering and conventional normalization of aCGH data
All data sets were loaded into BioArray Software Environment (BASE)  for analysis. Positive and non-saturated spots were background corrected using the median foreground minus the median background signal intensity for each channel and log ratios (M) were calculated from the background corrected intensities. In all analysis we used M = log2(int1/int2) and A = log10(sqrt(int1*int2)), where int1 and int2 are background corrected intensities from the investigated sample and reference, respectively. Data sets 1–4 and 7–8 were filtered for signal-to-noise ratio for each spot in both channels according to published reports and the remaining data sets for signal-to-noise ratio > 5 in both channels before BASE implemented software plug-ins of the different normalization strategies were employed. A lowess smooth factor of 0.33, delta of 0.1, and four iterations were used for standard Lowess, popLowess and block-based lowess normalization. Block group size was set to 1 for all block-based normalizations.
Population-based intensity-based lowess
A schematic overview of the proposed popLowess normalization strategy is shown in figure 5. The approach is applied on a per sample basis starting with genomic mapping and raw intensities (int1 and int2) for N probe IDs (step 1, Figure 5). The probes are sorted according to genomic position and M and A are calculated for each probe (step 2, Figure 5). Next, a standard deviation in M is calculated for each probe in sliding windows of user-defined size along the genome. The resulting distribution of N standard deviations is subjected to a cut-off criterion generating K probes with standard deviations < cut-off for continued population analysis (step 3, Figure 5). A moving window size of 11 probes was used and the median of the standard deviation distribution was used as cut-off value. This selection criterion is sample adaptive avoiding problems with using a global cut-off criterion. The K selected probes are next segmented on a per chromosome basis using, e.g., the CGHplotter algorithm  or the faster circular binary segmentation (CBS) algorithm  (step 4, Figure 5). Herein, the segmentation algorithm proposed by Autio et al. was used with the constant for computing the number of changes (c-parameter) set to 10 . Segmented values are used to cluster the K probes into three distinct clusters by means of robust k-means clustering (step 5, Figure 5). After clustering, there is an option to merge clusters with cluster centers close to each other. Merging is typically useful for samples not displaying three populations, e.g., samples with 1 or 2 copy number populations. When indicated, a merge cluster criterion of 0.2 or 0.3 in M was used. The resulting data consists of 1–3 distinct populations of data that contains information about the genomic mapping, M, and A for each probe. The largest population is selected for lowess normalization  generating a population specific correction curve (step 6, Figure 5). The correction curve is next extrapolated to the entire range of A and used to correct M for all N reporters similar to Lowess (step 7, Figure 5). The extrapolation is done conservatively in the end points of A by using the first/last data point of the population specific correction curve to level out the global correction curve horizontally in the M-A plot thereby moderating the impact of extreme points or missing values. After lowess correction, one population is selected as the center population and all data is shifted such that this population obtains median M equal to 0. Selection of a center population can be based on different assumptions. Finally, the normalized int1 and int2 intensities are returned (step 8, Figure 5). By not segmenting the entire set of observations, and by setting the crucial segmentation parameters for detecting breakpoints in the lower scale, speed is gained while still retaining robustness as long as the standard deviation cut off is not set too low. The purpose of segmentation is to refine large regions with identical copy number and not to detect small complex copy number alterations.
Comparison of normalization methods
For comparisons, the R implemented lowess function was used to create lowess-normalized data. For each identified population (step 1–5, Figure 5) in every sample in data sets 1–6 and 8, the standard deviations in M of the reporters in the population after Lowess, popLowess, and no normalization (equal to Median) were calculated separately. The number of populations in a data set for which the popLowess strategy rendered a lower standard deviation compared to the competitor was calculated. To evaluate if popLowess resulted in a significant number of populations with lower standard deviations, one sided p-values were calculated using the binomial distribution with p = 0.5. This binomial test corresponds to the null hypothesis that lower standard deviations for popLowess are obtained by chance. This comparison was done both when studying all populations as a whole and for each population individually.
Sample adaptive gain/loss thresholds
Sample adaptive thresholds for calling gain or loss can be generated by performing steps 1–3 in Figure 5 using the same form of data input and standard deviation cut-off criteria. The identified standard deviation cut-off value can be scaled by multiplicative factors to generate sample specific gain/loss thresholds of desired stringency for downstream applications, e.g., calling CNAs after segmentation. Before creating sample adaptive thresholds, data was pre-filtered and normalized using the popLowess strategy. Sample adaptive thresholds for the Ca13928 breast tumor were created before and after a smoothing window of 250 kBp size for 32 K BAC data and 50 kBp for Agilent 244 K data. Thresholds were estimated using a chromosomal moving window of size 1% of the total probe number for each chromosome separately and the standard deviation cut-off value was selected as the median of the standard deviation distribution. The cut-off value was scaled by a factor 2 to create the ± thresholds in M displayed in figure 9.
Availability and requirements
Project name: popLowess
Project home page: http://baseplugins.thep.lu.se/wiki/se.lu.onk.popLowess
Operating system(s): Platform independent
Programming language: R
License: GNU GPL
acute lymphoblastic leukemia
bacterial artificial chromosome
BioArray Software Environment
comparative genomic hybridization
copy number aberration
copy number variation
Fluorescence in situ hybridization
Inter Quartile Range
Global intensity-based lowess normalization
Global median normalization
population-based intensity-based lowess normalization
Spectral karyotyping technique
Pollack JR, Perou CM, Alizadeh AA, Eisen MB, Pergamenschikov A, Williams CF, Jeffrey SS, Botstein D, Brown PO: Genome-wide analysis of DNA copy-number changes using cDNA microarrays. Nat Genet. 1999, 23 (1): 41-46. 10.1038/12640.
Pinkel D, Segraves R, Sudar D, Clark S, Poole I, Kowbel D, Collins C, Kuo WL, Chen C, Zhai Y, Dairkee SH, Ljung BM, Gray JW, Albertson DG: High resolution analysis of DNA copy number variation using comparative genomic hybridization to microarrays. Nat Genet. 1998, 20 (2): 207-211. 10.1038/2524.
Kallioniemi A, Kallioniemi OP, Sudar D, Rutovitz D, Gray JW, Waldman F, Pinkel D: Comparative genomic hybridization for molecular cytogenetic analysis of solid tumors. Science. 1992, 258 (5083): 818-821. 10.1126/science.1359641.
Ylstra B, van den Ijssel P, Carvalho B, Brakenhoff RH, Meijer GA: BAC to the future! or oligonucleotides: a perspective for micro array comparative genomic hybridization (array CGH). Nucleic Acids Res. 2006, 34 (2): 445-450. 10.1093/nar/gkj456.
Hicks J, Krasnitz A, Lakshmi B, Navin NE, Riggs M, Leibu E, Esposito D, Alexander J, Troge J, Grubor V, Yoon S, Wigler M, Ye K, Borresen-Dale AL, Naume B, Schlicting E, Norton L, Hagerstrom T, Skoog L, Auer G, Maner S, Lundin P, Zetterberg A: Novel patterns of genome rearrangement and their association with survival in breast cancer. Genome Res. 2006, 16 (12): 1465-1479. 10.1101/gr.5460106.
Fridlyand J, Snijders AM, Ylstra B, Li H, Olshen A, Segraves R, Dairkee S, Tokuyasu T, Ljung BM, Jain AN, McLennan J, Ziegler J, Chin K, Devries S, Feiler H, Gray JW, Waldman F, Pinkel D, Albertson DG: Breast tumor copy number aberration phenotypes and genomic instability. BMC Cancer. 2006, 6: 96-10.1186/1471-2407-6-96.
Chin K, DeVries S, Fridlyand J, Spellman PT, Roydasgupta R, Kuo WL, Lapuk A, Neve RM, Qian Z, Ryder T, Chen F, Feiler H, Tokuyasu T, Kingsley C, Dairkee S, Meng Z, Chew K, Pinkel D, Jain A, Ljung BM, Esserman L, Albertson DG, Waldman FM, Gray JW: Genomic and transcriptional aberrations linked to breast cancer pathophysiologies. Cancer Cell. 2006, 10 (6): 529-541. 10.1016/j.ccr.2006.10.009.
Jonsson G, Naylor TL, Vallon-Christersson J, Staaf J, Huang J, Ward MR, Greshock JD, Luts L, Olsson H, Rahman N, Stratton M, Ringner M, Borg A, Weber BL: Distinct genomic profiles in hereditary breast tumors identified by array-based comparative genomic hybridization. Cancer Res. 2005, 65 (17): 7612-7621.
Vissers LE, Veltman JA, van Kessel AG, Brunner HG: Identification of disease genes by whole genome CGH arrays. Hum Mol Genet. 2005, 14 (Spec No 2): R215-223. 10.1093/hmg/ddi268.
Pinkel D, Albertson DG: Comparative genomic hybridization. Annu Rev Genomics Hum Genet. 2005, 6: 331-354. 10.1146/annurev.genom.6.080604.162140.
Quackenbush J: Microarray data normalization and transformation. Nat Genet. 2002, 32 (Suppl): 496-501. 10.1038/ng1032.
Jong K, Marchiori E, Meijer G, Vaart AV, Ylstra B: Breakpoint identification and smoothing of array comparative genomic hybridization data. Bioinformatics. 2004, 20 (18): 3636-3637. 10.1093/bioinformatics/bth355.
Olshen AB, Venkatraman ES, Lucito R, Wigler M: Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics. 2004, 5 (4): 557-572. 10.1093/biostatistics/kxh008.
Autio R, Hautaniemi S, Kauraniemi P, Yli-Harja O, Astola J, Wolf M, Kallioniemi A: CGH-Plotter: MATLAB toolbox for CGH-data analysis. Bioinformatics. 2003, 19 (13): 1714-1715. 10.1093/bioinformatics/btg230.
Neuvial P, Hupe P, Brito I, Liva S, Manie E, Brennetot C, Radvanyi F, Aurias A, Barillot E: Spatial normalization of array-CGH data. BMC Bioinformatics. 2006, 7: 264-10.1186/1471-2105-7-264.
Khojasteh M, Lam WL, Ward RK, MacAulay C: A stepwise framework for the normalization of array CGH data. BMC Bioinformatics. 2005, 6: 274-10.1186/1471-2105-6-274.
Smyth GK, Speed T: Normalization of cDNA microarray data. Methods. 2003, 31 (4): 265-273. 10.1016/S1046-2023(03)00155-5.
Oshlack A, Emslie D, Corcoran LM, Smyth GK: Normalization of boutique two-color microarrays with a high proportion of differentially expressed probes. Genome Biol. 2007, 8 (1): R2-10.1186/gb-2007-8-1-r2.
Johannsson OT, Staff S, Vallon-Christersson J, Kytola S, Gudjonsson T, Rennstam K, Hedenfalk IA, Adeyinka A, Kjellen E, Wennerberg J, Baldetorp B, Petersen OW, Olsson H, Oredsson S, Isola J, Borg A: Characterization of a novel breast carcinoma xenograft and cell line derived from a BRCA1 germ-line mutation carrier. Lab Invest. 2003, 83 (3): 387-396.
Jonsson G, Staaf J, Olsson E, Heidenblad M, Vallon-Christersson J, Osoegawa K, de Jong P, Oredsson S, Ringner M, Hoglund M, Borg A: High-resolution genomic profiles of breast cancer cell lines assessed by tiling BAC array comparative genomic hybridization. Genes Chromosomes Cancer. 2007, 46 (6): 543-558. 10.1002/gcc.20438.
Paulsson K, Heidenblad M, Morse H, Borg A, Fioretos T, Johansson B: Identification of cryptic aberrations and characterization of translocation breakpoints using array CGH in high hyperdiploid childhood acute lymphoblastic leukemia. Leukemia. 2006, 20 (11): 2002-2007. 10.1038/sj.leu.2404372.
de Leeuw RJ, Davies JJ, Rosenwald A, Bebb G, Gascoyne RD, Dyer MJ, Staudt LM, Martinez-Climent JA, Lam WL: Comprehensive whole genome array CGH profiling of mantle cell lymphoma model genomes. Hum Mol Genet. 2004, 13 (17): 1827-1837. 10.1093/hmg/ddh195.
Shadeo A, Lam WL: Comprehensive copy number profiles of breast cancer cell model genomes. Breast Cancer Res. 2006, 8 (1): R9-10.1186/bcr1370.
Lipson D, Ben-Dor A, Yakhini Z: Determining the center of array-CGH data. Computational aspects of DNA copy number measurement. 2007, Technion – Israel Institute of Technology, Computer Science Department, 105-110.
Garnis C, Lockwood WW, Vucic E, Ge Y, Girard L, Minna JD, Gazdar AF, Lam S, Macaulay C, Lam WL: High resolution analysis of non-small cell lung cancer cell lines by whole genome tiling path array CGH. Int J Cancer. 2006, 118 (6): 1556-1564. 10.1002/ijc.21491.
Agilent Technologies. [http://www.agilent.com]
Agilent eArray. [http://earray.chem.agilent.com/earray]
Saal LH, Troein C, Vallon-Christersson J, Gruvberger S, Borg A, Peterson C: BioArray Software Environment (BASE): a platform for comprehensive management and analysis of microarray data. Genome Biol. 2002, 3 (8): SOFTWARE0003-. 10.1186/gb-2002-3-8-software0003.
Yang MC, Ruan QG, Yang JJ, Eckenrode S, Wu S, McIndoe RA, She JX: A statistical method for flagging weak spots improves normalization and ratio estimates in microarrays. Physiol Genomics. 2001, 7 (1): 45-53.
We wish to thank Patrik Edén and Mattias Höglund for helpful comments on the manuscript. This work was supported by the Knut and Alice Wallenberg Foundation via the SWEGENE program (JS and JVC), the Swedish Cancer Society (GJ), the American Cancer Society (GJ and JVC), John och Augusta Perssons stiftelse (GJ and JVC), and the Swedish Foundation for Strategic Research through CREATE Health – the Lund Strategic Centre for Clinical Cancer Research (MR).
The author(s) declares that there are no competing interests.
All authors participated in the development of the model. JS implemented and developed the methods. JS and MR performed the statistical tests. JVC conceived the study. JS and JVC drafted the manuscript. All authors participated in the design of the study and in completing the manuscript. All authors read and approved the final manuscript.
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
About this article
Cite this article
Staaf, J., Jönsson, G., Ringnér, M. et al. Normalization of array-CGH data: influence of copy number imbalances. BMC Genomics 8, 382 (2007). https://doi.org/10.1186/1471-2164-8-382