Copynumber: Efficient algorithms for single- and multi-track copy number segmentation

Background: Cancer progression is associated with genomic instability and an accumulation of gains and losses of DNA. The growing variety of tools for measuring genomic copy numbers, including various types of array-CGH, SNP arrays and high-throughput sequencing, calls for a coherent framework offering unified and consistent handling of single- and multi-track segmentation problems. In addition, there is a demand for highly computationally efficient segmentation algorithms, due to the emergence of very high density scans of copy number.
Results: A comprehensive Bioconductor package for copy number analysis is presented. The package offers a unified framework for single sample, multi-sample and multi-track segmentation and is based on statistically sound penalized least squares principles. Conditional on the number of breakpoints, the estimates are optimal in the least squares sense. A novel and computationally highly efficient algorithm is proposed that utilizes vector-based operations in R. Three case studies are presented.
Conclusions: The R package copynumber is a software suite for segmentation of single- and multi-track copy number data using algorithms based on coherent least squares principles.


MicMa data set
The biomaterial from early breast cancers was collected from patients included in the Oslo Micrometastasis (MicMa) Study - Oslo1 (Wiedswang et al. 2003). The samples are a subset from a larger cohort of stage I-II primary breast tumors collected in 1995-98 at five hospitals in Norway (Ullevål University Hospital, Norwegian Radium Hospital, Aker University Hospital, Baerum Hospital and Buskerud Hospital, the first three now part of Oslo University Hospital). DNA from a subset of 49 tumor samples from primary surgery (47 primary tumors and 2 lymph node metastases) was analyzed using Agilent's Human Genome CGH 244k Microarrays (Agilent Technologies, Santa Clara, California, USA) (Mathiesen et al. 2011). These samples were used in Figures 2 and 4 in the paper, and also when comparing computational performance.
For a subset of patients, bone marrow mononuclear cells were available, and single cell array comparative genomic hybridization was performed on these cells (Mathiesen et al. 2011). The bone marrow samples were collected at the time of surgery by aspiration from the posterior or anterior iliac crests. Potential disseminated tumor cells (DTCs) were detected by cytochemistry, as described previously (Wiedswang et al. 2003). These cases were the basis for the disseminated tumor cells example in the main text, including Figure 5.
For many single cells, a fraction of the probe values were very low. To keep these probes from unduly influencing the copy number estimation, a lowest detectable value was defined and all values below this limit were set equal to the limit (typically -2 on the log2 scale). Following this procedure, the zero line was redefined to give an average value of 0 across all chromosomes for each cell. It was furthermore observed that data from single DTCs showed systematic fluctuations in probe values, probably due to the amplification process. These fluctuations were similar across all cells, including the blood cells. To avoid false aberrations due to these fluctuations, an initial step in the estimation was to construct a moving average based on the available blood cells. For each probe value from the DTCs, the corresponding moving average value was subtracted, and the resulting residuals were used in the copy number estimation.
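The three preprocessing steps described above (truncation at a lowest detectable value, re-centering of each cell, and subtraction of a blood-cell moving average) can be illustrated with a minimal Python sketch. It is not the code used in the study and is not part of the copynumber package. The function names, the window width, the choice to average the blood cells probe by probe before smoothing, and the treatment of all probes as one ordered sequence (ignoring chromosome boundaries) are assumptions made for illustration only.

```python
from statistics import mean


def floor_and_center(values, floor=-2.0):
    """Set values below `floor` equal to `floor`, then shift the cell to mean 0."""
    truncated = [max(v, floor) for v in values]
    offset = mean(truncated)
    return [v - offset for v in truncated]


def moving_average(values, half_window):
    """Centered moving average; the window is shortened at both ends."""
    n = len(values)
    smoothed = []
    for i in range(n):
        lo = max(0, i - half_window)
        hi = min(n, i + half_window + 1)
        smoothed.append(mean(values[lo:hi]))
    return smoothed


def blood_reference(blood_cells, half_window):
    """Average the blood cells probe by probe, then smooth along the probe order.

    Assumption: the source only states that a moving average was built from
    the available blood cells; this particular construction is hypothetical.
    """
    per_probe = [mean(column) for column in zip(*blood_cells)]
    return moving_average(per_probe, half_window)


def correct_dtc(dtc_values, reference):
    """Subtract the reference from a DTC profile, probe by probe."""
    return [v - r for v, r in zip(dtc_values, reference)]


# Toy log2 ratios for two blood cells and one DTC (made-up numbers).
blood = [
    floor_and_center([0.1, -0.2, -4.0, 0.3, 0.0]),
    floor_and_center([0.2, -0.1, -3.5, 0.2, 0.1]),
]
dtc = floor_and_center([0.9, 0.7, -5.0, 0.4, 0.1])
residuals = correct_dtc(dtc, blood_reference(blood, half_window=1))
```

In this sketch the residuals, rather than the raw DTC values, would be passed on to segmentation, which mirrors the order of operations given in the text.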

Follicular lymphoma data set
Samples from follicular lymphomas (FL) were selected from the archives of the Pathology Clinic at The Norwegian Radium Hospital, Oslo University Hospital, Norway. The selection criteria were a diagnosis of FL according to the WHO criteria and the presence of fresh frozen tissue from multiple samples of FL positive for the translocation t(14;18). A total of 100 samples from 44 patients (median age 44, range 29-71) diagnosed between 1987 and 2005 were selected; for 39 patients, two or more successive biopsies were available. Median observation time was 88 months (range 10-294). The tumor cell content was above 50% in all samples (except one sample with 40% tumor cells).
DNA was isolated from frozen tissue. Copy number alterations (CNAs) were examined by array comparative genomic hybridization on in-house arrays containing approximately 4,500 BAC/PACs in quadruplicate at a resolution of 1 Mb. The construction and preparation of the microarrays as well as the procedures for DNA labeling and hybridization have been previously described (Meza-Zepeda et al. 2006). Arrays were scanned using the Agilent G2565BA scanner. Images were segmented and raw data were filtered in GenePix Pro 6.0. The whole dataset is available in the ArrayExpress database (www.ebi.ac.uk/arrayexpress, accession no. E-TABM-930).
As stated in the paper, a central aim of the study was to compare the aberration patterns seen in biopsies taken at successive relapses in the same patient. Figure B below shows whole-genome copy number estimates from three successive biopsies. Note the apparent loss of aberrations from an early to a late biopsy (see, e.g., chromosomes 7q and 13). This indicates that the disease develops in parallel in different lymph nodes, as opposed to the late biopsies being direct descendants of the earlier ones.

Figure B.
Whole-genome copy number estimates from three successive biopsies from the same patient. The graph was created with the function plotGenome in copynumber.