Copynumber: Efficient algorithms for single and multitrack copy number segmentation
 Gro Nilsen†^{1, 2},
 Knut Liestøl†^{1, 2},
 Peter Van Loo^{3, 4},
 Hans Kristian Moen Vollan^{5, 6, 7},
 Marianne B Eide^{2, 8},
 Oscar M Rueda^{9},
 SuetFeung Chin^{9},
 Roslin Russell^{9},
 Lars O Baumbusch^{5},
 Carlos Caldas^{9, 10},
 AnneLise BørresenDale^{5, 6} and
 Ole Christian Lingjærde^{1, 2, 5}Email author
DOI: 10.1186/1471216413591
© Nilsen et al.; licensee BioMed Central Ltd. 2012
Received: 16 May 2012
Accepted: 15 October 2012
Published: 4 November 2012
Abstract
Background
Cancer progression is associated with genomic instability and an accumulation of gains and losses of DNA. The growing variety of tools for measuring genomic copy numbers, including various types of arrayCGH, SNP arrays and highthroughput sequencing, calls for a coherent framework offering unified and consistent handling of single and multitrack segmentation problems. In addition, there is a demand for highly computationally efficient segmentation algorithms, due to the emergence of very high density scans of copy number.
Results
A comprehensive Bioconductor package for copy number analysis is presented. The package offers a unified framework for single sample, multisample and multitrack segmentation and is based on statistically sound penalized least squares principles. Conditional on the number of breakpoints, the estimates are optimal in the least squares sense. A novel and computationally highly efficient algorithm is proposed that utilizes vectorbased operations in R. Three case studies are presented.
Conclusions
The R package copynumber is a software suite for segmentation of single and multitrack copy number data using algorithms based on coherent least squares principles.
Keywords
Copy number aCGH Segmentation Allelespecific segmentation Penalized regression Least squares BioconductorBackground
In cancer, the path from normal to malignant cell involves multiple genomic alterations including losses and gains of genomic DNA. A long series of studies have demonstrated the biological and clinical relevance of studying such genomic alterations (see, e.g., [1, 2] and references therein). Genomewide scans of copy number alterations may be obtained with arraybased comparative genomic hybridization (aCGH), SNP arrays and highthroughput sequencing (HTS). After proper normalization and transformation of the raw signal intensities obtained from such technologies, the next step is usually to perform segmentation to identify regions of constant copy number. Many segmentation algorithms are designed to analyse samples individually (see, e.g., [3–16] and references therein), while most studies involve multiple samples, multiple tracks, or both. Joint handling of multiple samples is computationally and conceptually challenging, see e.g. [17, 18]. Most systematic approaches for this problem are based on individual segmentation of each sample followed by postprocessing to combine results across samples (see, e.g., [18] and references therein), while some recent publications propose strategies for joint segmentation of all samples [19–23]. Recently, the emergence of new technologies have pushed the limit of genomic resolution, opening new vistas for studying very short aberrations, including aberrations affecting only part of a gene or gene regulatory sites in the DNA. A major challenge raised by these novel technologies is the steadily growing length of the data tracks, which drastically increases the demand for computationally efficient algorithms. The occurrence of extreme observations (outliers) of biological or technical origin pose an additional challenge, as most segmentation methods are substantially affected by such observations. Picard et al. [6] propose a least squares based segmentation method that results in a piecewise constant fit to the copy number data. Their approach assumes that the user either supplies the desired number of segments or leaves to the method to automatically determine this number. In this paper, we describe a related approach. In particular, the proposed method utilizes penalized least squares regression to determine a piecewise constant fit to the data. Introducing a fixed penalty γ>0 for any difference in the fitted values of two neighboring observations induces an optimal solution of particular relevance to copy number data: a piecewise constant curve fully determined by the breakpoints and the average copy number values on each segment. The user defined penalty γ essentially controls the level of empirical evidence required to introduce a breakpoint. Given the number of breakpoints, the solution will be optimal in terms of least squares error.
To achieve high processing efficiency, dynamic programming is used (see [24]). To further increase computational efficiency, a novel vector based algorithm is proposed, and even further speed optimization is obtained through heuristics. A central aim of the present work has been to provide methodology and highperformance algorithms for solving single and multipletrack problems within a statistically and computationally unified framework. All proposed algorithms are embedded in a comprehensive software suite for copy number segmentation and visualization, available as the Bioconductor package copynumber. Main features of the package include:

Independent as well as joint segmentation of multiple samples

Segmentation of allelespecific SNP array data

Preprocessing tools for outlier detection and handling, and missing value imputation.

Visualization tools
Implementation
Systems overview
Segmentation methods for three different scenarios (single sample, multisample and allelespecific segmentation) are implemented in the package. All these methods are referred to as Piecewise Constant Fitting (PCF) algorithms and seek to minimize a penalized least squares criterion. In single sample PCF, individual segmentation curves are fitted to each sample. In multisample PCF, segmentation curves with common segment borders are simultaneously fitted to all samples. In allelespecific PCF, the segmentation curves are fitted to bivariate SNParray data, providing identical segment borders for both data tracks. A set of graphical tools are also available in the package to visualize data and segmentation results, and to plot aberration frequencies and heatmaps. Also included are diagnostics to explore different tradeoffs between goodnessoffit and parsimony in terms of the number of segments. In the remaining part of this section, a formal description of the algorithms is given. However, note that these details are not a prerequisite for reading later sections or for using the copynumber package.
Preprocessing: Outlier handling
Here, θ>0 determines how extreme an observation must be to be relocated, as well as the replacement value. A common choice is θ=τs, where typically τ∈[1.5,3] and s is a robust estimate of the standard deviation (SD). A robust scale estimator is the Median Absolute Deviation (MAD), defined as the median of the values ${y}_{j}\widehat{m}$, where $\widehat{m}$ is the median of y_{1},…,y_{ p }. For normally distributed observations, s_{ M }=1.4826·MAD corresponds to SD.
Winsorization of copy number data may be achieved by first estimating the trend in the data and then Winsorizing the residuals. Let the observations representing copy numbers in p genomic loci be y=(y_{1},…,y_{ p }), ordered according to genomic position. A simple estimator of the trend is the median filter. The trend estimate ${\widehat{m}}_{j}$ in the j th locus is then given by the median of y_{j−k},…,y_{j + k}for some k>0, e.g. k=25. The SD of the residuals ${y}_{j}{\widehat{m}}_{j}$ may then be estimated with the MAD estimator s_{ M }, and Winsorized observations ${y}_{1}^{w},\dots ,{y}_{p}^{w}$ obtained by ${y}_{j}^{w}={\widehat{m}}_{j}+\Psi ({y}_{j}{\widehat{m}}_{j}\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}\tau {s}_{M})$. Often, such simple and fast Winsorization is sufficient. However, copynumber also includes an iterative procedure with improved trend estimation based on the segmentation procedures described below (see Additional file 1).
Single sample segmentation
where (k:1)=(k,k−1,…,1) and operators are vectorbased. Hence, addition of a vector and a scalar adds the latter to each component of the former, and multiplications and divisions are performed componentwise on the operands, e.g., ${\mathbf{a}}_{k}\ast {\mathbf{a}}_{k}=[{\left({a}_{1}^{k}\right)}^{2},\dots ,{\left({a}_{k}^{k}\right)}^{2}]$. Algorithm 1 summarizes the computations.
Algorithm 1: Single sample PCF
 1.Calculate scores by letting a _{0}=[ ] and e _{0}=0, and iterate for k=1…p

a_{ k }=[a_{k−1} 0] + y_{ k }

d_{ k }=−a_{ k }∗a_{ k }/(k:1)

${\mathbf{e}}_{k}=[{\mathbf{e}}_{k1}\phantom{\rule{2.77695pt}{0ex}}\phantom{\rule{2.77695pt}{0ex}}min({\mathbf{d}}_{k}+{\mathbf{e}}_{k1}+\gamma \left)\right]$
storing also the index t_{ k }∈{1,2,…,k} at which the minimum in the last step is achieved.

 2.
Find segment start indices (right to left) $\left(\right)close="">{s}_{1}={t}_{p},{s}_{2}={t}_{{s}_{1}1}\dots ,{s}_{M}=1$, where M≥1.
 3.
Find segment averages ${\stackrel{\u0304}{y}}_{m}=\text{ave}({y}_{{s}_{m}},\dots ,{y}_{{s}_{(m1)}1})$ for m=1,…,M, where s _{0}=p + 1.
Throughout the paper we will tacitly assume that the penalty for the i th sample is ${\gamma}_{i}=\gamma {\widehat{\sigma}}_{i}^{2}$, where ${\widehat{\sigma}}_{i}^{2}$ is the estimated sample specific residual variance. In this way, we avoid scale dependency, and obtain consistent results for samples with equal signaltonoise ratios. Such rescaling is also done by default in copynumber. Note that replacing the data ${y}_{j}^{i}$ for the i th sample with ${y}_{j}^{i}/{\widehat{\sigma}}_{i}$ for j=1,…,p, and rescaling after estimation, has the same effect. In copynumber, the algorithm has also been extended to allow a constraint on the least number of probes in a segment.
Multisample segmentation
where L(S  y,γ) is defined as in (3) and S is a given segmentation common to all samples.
Algorithm 2: Multisample PCF
 1.Calculate scores by letting A _{0}=[ ] and e _{0}=0, and iterate for k=1…p

A_{ k }=[A_{k−1} 0] + y_{ k }

${\mathbf{d}}_{k}={\mathbf{1}}^{T}({\mathbf{A}}_{k}\ast {\mathbf{A}}_{k})/(k:1)$

${\mathbf{e}}_{k}=[{\mathbf{e}}_{k1}\phantom{\rule{2.77695pt}{0ex}}\phantom{\rule{2.77695pt}{0ex}}min({\mathbf{d}}_{k}+{\mathbf{e}}_{k1}+\mathrm{n\gamma}\left)\right]$
storing also the index t _{ k }∈{1,2,…,k} at which the minimum in the last step is achieved.

 2.
Find segment start indices (right to left) $\left(\right)close="">{s}_{1}={t}_{p},{s}_{2}={t}_{{s}_{1}1}\dots ,{s}_{M}=1$, where M≥1.
 3.
Find segment averages ${\stackrel{\u0304}{\mathbf{y}}}_{m}=\text{ave}({\mathbf{y}}_{{s}_{m}},\dots ,{\mathbf{y}}_{{s}_{(m1)}1})$ for m=1,…,M, where s _{0}=p + 1.
The multisample PCF algorithm (see Algorithm 2) is in principle quite similar to single sample PCF. However, when updating the solution from k−1 to k, the sums and sums of squares for the segments must be accumulated and stored separately for each sample. This can still be done iteratively, implying that the computational effort will be approximately equal to carrying out single sample PCF on the same set of samples. Since the noise level may vary between samples, normalisation of the samples prior to segmentation and corresponding rescaling after estimation is advisable. It may also be desirable to scale the samples, e.g. to adjust for different tumor percentages. Thus, prior to running multisample PCF, we may replace y^{ i } by ${w}_{i}{\mathbf{y}}^{i}/{\widehat{\sigma}}_{i}$ for i=1,…,n, where w_{ i }are weights and ${\widehat{\sigma}}_{i}$ is an estimate of the SD. In copynumber normalization is performed by default for multisample PCF while further weighting is left as an option for the user.
Allelespecific segmentation
The PCF algorithm is easily adapted to variants of the basic segmentation problem discussed above. Here, we consider an adaptation to handle SNP genotype data. We then have for each SNP locus a measurement of (total) copy number (logR) as well as the B allele frequency (BAF). We may also have measurements of copy number only for a number of additional loci. The B allele frequency is a number between 0 and 1 indicating the allelic imbalance of a SNP. For a homozygous locus we have BAF close to 0 or 1, while for a heterozygous locus with an equal number of the two alleles A and B, BAF will be close to 0.5. An imbalance between the number of A’s and B’s results in a BAF value deviating from 0.5. A change in the total number of copies of a segment will alter the logR value, hence result in a level shift in the logR track. Unless the copy number change is balanced with respect to the two alleles, the BAF value will also change. In cases involving multiple copy number events at the same locus, the change may manifest itself only in one of the two tracks. For example, a loss of one copy of A followed by a gain of one copy of B would lead to unchanged logR and changed BAF. The purpose of the allelespecific PCF algorithm is to detect breakpoints for all such events. It fits piecewise constant curves simultaneously to the logR and the BAF data, forcing breakpoints to occur at the same positions in both. We emphasize that the purpose of the allelespecific PCF algorithm is segmentation only and not to make allelespecific copy number calls. However, such calls can be made on the basis of the segmentation described below, and this is done e.g. in the ASCAT algorithm (AlleleSpecific Copy number Analysis of Tumors) which estimates allelespecific copy numbers as well as the percentage of cells with aberrant DNA and the tumor ploidy [26]. Suppose the data are given by (r_{ j }b_{ j }) for j=1,…,p, where r_{ j }denotes the logR value and b_{ j }the BAF value at the j th locus. For copy number probes, only r_{ j }is given and b_{ j }will be missing (henceforth coded as NA). For germline homozygous probes, the BAF values are noninformative and should be omitted from the analysis. If the germline genotype is known (e.g. from a matching blood sample), the user should replace the corresponding BAF values by NA. If the genotype is not known, the algorithm will apply a proxy to handle this issue (see below). Prior to segmentation, the allelespecific PCF algorithm performs the following steps:
The BAF data are mirrored around 0.5 by replacing b_{ j }with 1−b_{ j }if b_{ j }>0.5.
BAF values b_{ j }<θ are replaced by NA. By default θ=0.1. If germline homozygous probes have previously been replaced by NA’s, let θ=0.
Let ${\stackrel{~}{b}}_{1},\dots ,{\stackrel{~}{b}}_{m}$ denote the nonmissing B allele frequencies. Corresponding copy number values ${\stackrel{~}{r}}_{1},\dots ,{\stackrel{~}{r}}_{m}$ are found by pairing each logR probe with the nearest Ballele probe (ignoring those with missing values) and then averaging logR values paired to the same Ballele probe. Finally, let ${\mathbf{y}}_{1}=({\stackrel{~}{b}}_{1},\dots ,{\stackrel{~}{b}}_{m})$ and ${\mathbf{y}}_{2}=({\stackrel{~}{r}}_{1},\dots ,{\stackrel{~}{r}}_{m})$.
where L(S  ·,γ) is defined as in (3).
Fast implementations of PCF
The PCF algorithms may be generalized to allow breakpoints only at certain prespecified positions. Combined with simple heuristics, this may be used to further enhance the computational speed of PCF. For brevity we describe only the single sample segmentation case here; however the copynumber package contains fast implementations of both single and multisample PCF. Computationally inexpensive methods can be used to identify a set of potential breakpoints among which the breakpoints of the solution to (3) are highly likely to be found. Suppose we restrict our attention to such a set of potential breakpoints. All relevant information for solving the optimization problem in (3) may then be condensed into three arrays containing the number of observations between two potential breakpoints, the corresponding sum of the observations and the sum of squares. Based on these quantities, PCF may be used with straightforward modifications. Since the algorithm is of order O(q^{2}), where q is the number of potential breakpoints, the potential increase in speed is substantial. Algorithm 3 outlines the procedure, while possible heuristics for finding potential breakpoints are discussed below. One way to identify potential breakpoints is to use highpass filters, i.e. a filter obtaining high absolute values when passing over a breakpoint. The simplest such filter uses for each position i the difference $\sum _{j=i+\phantom{\rule{0.3em}{0ex}}1}^{i+k}{y}_{j}\sum _{j=i\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}k\phantom{\rule{0.3em}{0ex}}+\phantom{\rule{0.3em}{0ex}}1}^{i}{y}_{j}$ for some k. To reduce artifacts due to the abrupt edges of such a filter, the copynumber implementation assigns half weight to the outer 1/3 of the observations on each side. Fast implementations of such filters in R may be obtained using the cumsum function. We currently use two filters with k=3 and 12, respectively; additionally the single sample PCF implementation includes a filter searching for aberrations of length equal to the lowest accepted one. These filters together identify about 15% of the probe positions as potential breakpoints. An additional way to speed up the computations on long sequences is to initially divide the sequence into overlapping subsequences, and iteratively find the solution.
Having found the solution for the m first subsequences, we use highpass filters to detect potential breakpoints for subsequence m + 1, and then use the fast PCF algorithm with the latter potential breakpoints as well as those found by PCF on earlier subsequences. The intention behind this iterative approach is to reduce potential boundary effects. Due to the quadratic order of the algorithm, this division into subsequences implies a substantial efficiency gain. In copynumber, subsequences are used when the chromosomal arm length exceeds 15000 probes, with subsequences of length 5000 and overlap 1000.
Algorithm 3: Fast PCF
 1.
Apply heuristics to find potential breakpoints r _{0},r _{1},…,r _{ q }, where r _{0}=1 and r _{ q }=p + 1.
 2.
Form aggregates by letting ${u}_{k}=\sum _{j={r}_{k1}}^{{r}_{k}1}{y}_{j}$, where k=1,…,q.
 3.Calculate scores by letting a _{0}=[ ], c _{0}=[ ], e _{0}=0, and iterate for k=1,…,q

a_{ k }=[a_{k−1} 0] + u_{ k }

c_{ k }=[c_{k−1} 0] + r_{ k }−r_{k−1}

d_{ k }=−a_{ k }∗a_{ k }/c_{ k }

${\mathbf{e}}_{k}=[{\mathbf{e}}_{k1}\phantom{\rule{2.77695pt}{0ex}}\phantom{\rule{2.77695pt}{0ex}}min({\mathbf{d}}_{k}+{\mathbf{e}}_{k1}+\gamma \left)\right]$
storing also the index t_{ k }∈{1,2,…,k} at which the minimum in the last step is achieved.

 4.
Find segment start indices (right to left) ${s}_{1}={r}_{{t}_{q}},{s}_{2}={r}_{{t}_{{s}_{1}1}}\dots ,{s}_{M}=1$, where M≥1.
 5.
Find segment averages ${\stackrel{\u0304}{y}}_{m}=\text{ave}({y}_{{s}_{m}},\dots ,{y}_{{s}_{(m1)}1})$ for m=1,…,M, where s _{0}=p + 1.
Results and discussion
Selection of penalty
Aberration calling
Single versus multisample segmentation
Comparing tracks: Analysis of disseminated tumor cells
Defining variables: Genetic evolution in follicular lymphoma
Although the delineation of segments varied between biopsies, several areas with a high frequency of aberrations could be detected. To try to identify aberrations with prognostic potential, we therefore found a common segmentation for the initial biopsies taken from each of the 44 patients using the multisample PCF algorithm. Removing very low variance segments, 93 segments remained. The corresponding copy number estimates were used as covariates in a multivariate Cox proportional hazards regression. This revealed 11 segments for which gains were significantly associated with a survival disadvantage. A particularly strong association was detected for gains on chromosome X in male patients. To study the relation between successive biopsies taken from the same patient, multisample PCF was applied to each patient individually (see Additional file 2). As expected, many aberrations are common, but interestingly, some aberrations are present in early biopsies and not in later ones. This contradicts the hypothesis of linear development which states that late tumor clones arise from earlier ones, and supports the alternative hypothesis of parallel evolution in different lymph nodes.
Allelespecific copy number analysis in breast cancer
If for a certain SNP locus one allele is substantially more frequently gained than the other allele, one may hypothesize that the former allele is subject to a larger selective pressure to change copy number. This, in turn, may be an indication of different roles being played by the two alleles with respect to cancer progression and evolution, suggesting that loci subject to allelic skewness can be potential unique markers for breast cancer development. Even from a relatively small number of samples, probes with highly significant allelic skewness have been identified in a genomewide statistical evaluation [26].
Outliers and Winsorization
Outlier effects
Type  Distribution  Sensitivity(%)  Specificity(%)  False aberrations (%) 

Normal  79.5  96.5  0.15  
A  Normal w/5% contam.  78.8  93.7  1.04 
Normal w/5% contam., Winsor.  78.1  96.0  0.13  
Normal  78.9  93.6  0.20  
B  Normal w/5% contam.  77.8  90.6  1.06 
Normal w/5% contam., Winsor.  77.5  93.3  0.15 
Another way to avoid that a few extreme observations result in a segment is to impose a lower limit on the length (number of probes) of a segment. With a lower length limit of five probes, we found about twice as many false spikes as with Winsorization when adjusting γ to give equal sensitivity for true aberrations. Still, simulations indicate that a lower limit on segment length is valuable in combination with Winsorization. Note that outliers of biological origin will be more extreme if the technology has an inherent low noise level, as is the case, e.g., for BAC arrays and for high throughput sequencing. Thus, outliers are not a sign of inappropriate functioning of a technique, but a characteristic of the data requiring consideration in the analysis. In summary, copy number data tend to contain a high fraction of outliers. These outliers often induce false aberrations, but simple procedures like Winsorization will efficiently reduce these undesired effects.
Computational performance
Computational performance
Method  R package  Agilent 244K  Illumina 1.1M  

Raw data  Outliers removed  Raw data  Outliers removed  
PCF  copynumber  4 (0.2)  4 (0.2)  23 (0.7)  22 (0.4) 
Fused Lasso  cghFLasso  5 (0.2)  5 (0.2)  97 (0.7)  99 (3.3) 
CBS  DNAcopy  15 (4.7)  35 (4.1)  71 (12.9)  219 (12.8) 
The deviations between the solutions found by the exact PCF and fast PCF on the MicMa set were small; in terms of reduction in variance (difference between sample variance and residual variance after fitting PCF curves) below 0.01%. The differences observed for the curves were typically small shifts in the border of aberrations. Thus, we conclude that the results from the fast procedure for practical purposes may be regarded as global solutions to (3), and the fast version is therefore used by default in copynumber. We also compared the performance of PCF with two other segmentation methods: Circular Binary Segmentation (CBS) [4, 36] and Fused Lasso Regression (FL) [12]. In comparison studies [5, 8], CBS has shown good performance in terms of sensitivity and false discovery rate. It is probably the most commonly used freely available algorithm and is also implemented in several commercial analysis tools. CBS is available in the R package DNAcopy, which is used for this comparison. FL is a more recent proposal implemented in the R package cghFLasso, and is one of three preferred methods in the webbased segmentation tool CGHweb [37]. Using default parameter settings, we compared the computing times of PCF, CBS and FL on the 49 samples in the 244 K MicMa data set, and on 6 samples from a 1.1 M Illumina SNP array (using the logR values). Table 2 gives the average computation time (in seconds) per sample. With no preprocessing of the data, PCF is on average 34 times faster than CBS on both data sets, and about 4 times faster than FL on the largest data set. Note that copynumber detects and operates on chromosome arms, while DNAcopy operates on whole chromosomes. This partly explains the difference in performance between PCF and CBS for the MicMa data set; for the Illumina data this has little impact due to the iterative approach used in PCF for the longest sequences. PCF was also markedly faster in evaluations based on simulated data; however, comparisons are complicated by the fact that the speed of CBS depends on the data in a nontrivial manner. As seen from the IQRs listed in parentheses in Table 2, the speed of CBS is quite variable from sample to sample while PCF and FL is nearly constant. Moreover, the table shows that CBS runs 23 times slower when outliers have been removed using Winsorization, underlining that the performance of CBS is highly data dependent. We underline that the abovementioned results only relate to the current R implementations. As mentioned in the introduction, PCF is conceptually similar to the CGHseg method described by Picard et al.[6], and we also examined the computational performance of this method using the implementation in the R package cghseg. Using the version of CGHseg that requires a prespecified number of segments for each chromosome, the algorithm is fast, although the speed depends on the number of segments. Using the full CGHseg algorithm that automatically determines the number of segments, the algorithm is very slow for highresolution data. Hence, making a fair comparison between PCF and CGHseg is difficult.
Segmentation accuracy
We further compare the accuracy of the segmentation solutions found by PCF and CBS. Figure 3c shows ROC curves using MLPA classifications as the truth, and then applying a range of aberration calling thresholds to PCF estimates, CBS estimates, a running median with window size 50 and raw copy number data (details in Additional file 2). Results for PCF and CBS are similar, both achieving high sensitivity and specificity. The running median also gives good results, illustrating that many probes are fairly easy to classify and that the gain obtained by using methods like CBS and PCF is mainly an improved classification close to borders between segments. We also repeated the simulation study in [8] where CBS was found to be the most sensitive method while also having the lowest false discovery rate. Again, we found that PCF and CBS had very similar performance (results not shown). A more detailed comparison of segmentation results shows that overall results are quite similar for single sample PCF and CBS, however for both methods the results depend on the choice of parameter values and the handling of extreme observations, see Additional file 3. In conclusion, PCF and CBS typically provide similar results and have equivalent accuracy when parameters are tuned appropriately.
Conclusions
Copy number segmentation based on least squares principles and combined with a suitable penalization scheme is appealing, since the solution will be optimal in a least squares sense for a given number of breakpoints. We have proposed a suite of platform independent algorithms based on this principle for independent as well as joint segmentation of copy number data. The algorithms perform similarly as other leading segmentation methods in terms of sensitivity and specificity. Furthermore, the proposed algorithms are easy to generalize and are computationally very efficient also on highresolution data. The Bioconductor package copynumber offers a userfriendly interface to the proposed algorithms.
Several extensions and modifications of the proposed leastsquares framework are possible. In principle, the L2based distance measure used in the current implementation of PCF is easily extended to general Lpdistances. However the current implementation is highly optimized for L2, and other distance measures would require substantial heuristics to obtain comparable computational performance. Another extension is to introduce locus specific penalties for breakpoints, thus essentially introducing a prior on the location of breakpoints. Work in progress includes specialized routines to handle high throughput sequencing data more efficiently and joint analysis of multiple samples in allelespecific PCF.
Availability and requirements
Project name: Copynumber
Project home page: http://heim.ifi.uio.no/bioinf/Projects/Copynumber/
Operating system(s): All systems supporting the R environment
Programming language: R
Other requirements: No
License: GNU Artistic License 2.0.
Notes
Abbreviations
 aCGH:

Array Comparative Genomic Hybridization
 AIC:

Akaike’s Information Criterion. ASCAT: AlleleSpecific Copy number Analysis of Tumors
 BAC:

Bacterial Artificial Chromosome
 BAF:

BAllele Frequency
 BIC:

Schwarz’s Bayesian Information Criterion
 CBS:

Circular Binary Segmentation
 CNV:

Copy Number Variation
 DTC:

Disseminated Tumor Cells
 FL:

Fused Lasso
 HTS:

HighThroughput Sequencing
 IQR:

Interquartile Range
 MAD:

Median Absolute Deviation
 MLPA:

Multiplex Ligationdependent Probe Amplification
 PCF:

Piecewise Constant Fitting (the method used for segmentation in this paper)
 ROC:

Receiver Operating Characteristic curve
 SNP:

Singlenucleotide Polymorphism.
Declarations
Acknowledgements
GN, KL and OCL received funding from the Centre of Cancer Biomedicine (CCB) at the University of Oslo for equipment and travelling. PVL is a postdoctoral researcher of the Research Foundation  Flanders (FWO).
Authors’ Affiliations
References
 Beroukhim R, Mermel CH, Porter D, Wei G, Raychaudhuri S, Donovan J, Barretina J, Boehm JS, Dobson J, Urashima M, McHenry KT, Pinchback RM, Ligon AH, Cho YJ, Haery L, Greulich H, Reich M, Winckler W, Lawrence MS, Weir BA, Tanaka KE, Chiang DY, Bass AJ, Loo A, Hoffman C, Prensner J, Liefeld T, Gao Q, Yecies D, Signoretti S, et al: The landscape of somatic copynumber alteration across human cancers. Nature. 2010, 18: 899905.View ArticleGoogle Scholar
 Russnes HG, Vollan HKM, Lingjærde OC, Krasnitz A, Lundin P, Naume B, Sørlie T, Borgen E, Rye IH, Langerød A, Chin SF, Teschendorff AE, Stephens PJ, Månér S, Schlichting E, Baumbusch LO, Kåresen R, Stratton MP, Wigler M, Caldas C, Zetterberg A, Hicks J, BørresenDale AB: Genomic architecture characterizes tumor progression paths and fate in breast cancer patients. Sci Transl Med. 2010, 2: 3847.View ArticleGoogle Scholar
 Hupe P, Stransky N, Thiery J, Barillot E: Analysis of array CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics. 2004, 20: 34133422. 10.1093/bioinformatics/bth418.View ArticlePubMedGoogle Scholar
 Olshen AB, Venkatraman ES, Lucito R, Wigler M: Circular binary segmentation for the analysis of arraybased copy number data. Biostatistics. 2004, 5: 557572. 10.1093/biostatistics/kxh008.View ArticlePubMedGoogle Scholar
 Lai WR, Johnson MD, Kucherlapati R, Park PJ: Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics. 2005, 21: 37633770. 10.1093/bioinformatics/bti611.PubMed CentralView ArticlePubMedGoogle Scholar
 Picard F, Robin S, Lavielle M, Vaisse C, Daudin J: A statistical approach for array CGH data analysis. BMC Bioinf. 2005, 6: 2710.1186/14712105627.View ArticleGoogle Scholar
 Wang P, Kim Y, Pollack J, Narasimhan B, Tibshirani R: A method for calling gains and losses in array CGH data. Biostatistics. 2005, 6: 4558. 10.1093/biostatistics/kxh017.View ArticlePubMedGoogle Scholar
 Willenbrock H, Fridlyand J: A comparison study: applying segmentation to array CGH data for downstream analyses. Bioinformatics. 2005, 21: 40844091. 10.1093/bioinformatics/bti677.View ArticlePubMedGoogle Scholar
 Marioni JC, Thorne NP, Valsesia A, Fitzgerald T, Redon R, Fiegler H, Andrews TD, Stranger BE, Lynch AG, Dermitzakis ET, Carter NP, Tavaré S, Hurles ME: Breaking the waves: improved detection of copy number variation from microarraybased comparative genomic hybridization. Genome Biol. 2007, 8: R22810.1186/gb2007810r228.PubMed CentralView ArticlePubMedGoogle Scholar
 Yu T, Ye H, Sun W, Li K, Chen Z, Jacobs S, Bailey D, Wong D, Zhou X: A forwardbackward fragment assembling algorithm for the identification of genomic amplification and deletion breakpoints using highdensity single nucleotide polymorphism (SNP) array. BMC Bioinf. 2007, 8: 14510.1186/147121058145.View ArticleGoogle Scholar
 BenYaacov E, Eldar YC: A fast and flexible method for the segmentation of aCGH data. Bioinformatics. 2008, 24 (16): i139i145. 10.1093/bioinformatics/btn272.View ArticlePubMedGoogle Scholar
 Tibshirani R, Wang P: Spatial smoothing and hot spot detection for CGH data using the fused lasso. Biostatistics. 2008, 9: 1829.View ArticlePubMedGoogle Scholar
 Scharpf RB, Parmigiani G, Pevsner J, Ruczinski I: Hidden Markov models for the assessment of chromosomal alterations using highthroughput SNP arrays. Ann Appl Stat. 2008, 2: 687713. 10.1214/07AOAS155.PubMed CentralView ArticlePubMedGoogle Scholar
 Wang L, Abyzov A, Korbel JO, Snyder M, Gerstein M: MSB: A meanshiftbased approach for the analysis of structural variation in the genome. Genome Res. 2009, 19: 106117.PubMed CentralView ArticlePubMedGoogle Scholar
 Coe BP, Chari R, MacAulay C, Lam WL: FACADE: a fast and sensitive algorithm for the segmentation and calling of high resolution array CGH data. Nucleic Acids Res. 2010, 38 (15): e15710.1093/nar/gkq548.PubMed CentralView ArticlePubMedGoogle Scholar
 Chen C, Lee H, Ling Q, Chen H, Ko Y, Tsou T, Wang S, Wu L, Lee HC: An allstatistics, highspeed algorithm for the analysis of copy number variation in genomes. Nucleic Acids Res. 2011, 39 (13): e8910.1093/nar/gkr137.PubMed CentralView ArticlePubMedGoogle Scholar
 Scherer SW, Lee C, Birney E, Altshuler DM, Eichler EE, Carter NP, Hurles ME, Feuk L: Challenges and standards in integrating surveys and structural variation. Nat Genet. 2007, 39: S7S15. 10.1038/ng2093.PubMed CentralView ArticlePubMedGoogle Scholar
 Rueda OM, DiazUriarte R: Finding recurrent copy number alteration regions: A review of methods. Curr Bioinf. 2010, 5: 117. 10.2174/157489310790596402.View ArticleGoogle Scholar
 Shah SP, Lam WL, Ng RT, Murphy KP: Modeling recurrent DNA copy number alterations in array CGH data. Bioinformatics. 2007, 23: i450i458. 10.1093/bioinformatics/btm221.View ArticlePubMedGoogle Scholar
 Zhang NR, Senbabaoglu Y, Li JZ: Joint estimation of DNA copy number from multiple platforms. Bioinformatics. 2010, 26: 153160. 10.1093/bioinformatics/btp653.PubMed CentralView ArticlePubMedGoogle Scholar
 Zhang NR, Siegmund DO, Ji H, Li JZ: Detecting simultaneous changepoints in multiple sequences. Biometrica. 2010, 97: 631645. 10.1093/biomet/asq025.View ArticleGoogle Scholar
 Picard F, Lebarbier E, Hoebeke M, Rigaill G, Thiam B, Robin S: Joint segmentation, calling and normalization of multiple CGH profiles. Biostatistics. 2011, 12: 413428. 10.1093/biostatistics/kxq076.View ArticlePubMedGoogle Scholar
 Teo SM, Pawitan Y, Kumar V, Thalamuthu A, Seielstad M, Chia KS, Salim A: Multiplatform segmentation for joint detection of copy number variants. Bioinformatics. 2011, 27: 15551561. 10.1093/bioinformatics/btr162.View ArticlePubMedGoogle Scholar
 Winkler G, Liebscher V: Smoothers for discontinuous signals. J Nonparametr Stat. 2002, 14: 203222. 10.1080/10485250211388.View ArticleGoogle Scholar
 Venables WN, Ripley BD: Modern Applied Statistics with SPlus. 1994, New York: SpringerVerlagView ArticleGoogle Scholar
 Van Loo P, Nordgard SH, Lingjærde OC, Russnes HG, Rye IH, Sun W, Weigman VJ, Marynen P, Zetterberg A, Naume B, Perou CM, BørresenDale AL, Kristensen VN: Allelespecific copy number analysis of tumors. Proc Natl Acad Sci US. 2010, 107: 1691016915. 10.1073/pnas.1009843107.View ArticleGoogle Scholar
 Zhang NR, Siegmund DO: A modified Bayes information criterion with applications to the analysis of comparative genomic hybridization data. Biometrics. 2007, 63: 2232. 10.1111/j.15410420.2006.00662.x.View ArticlePubMedGoogle Scholar
 Gerlinger M, Rowan AJ, Horswell S, Larkin J, Endesfelder D, Gronroos E, Martinez P, Matthews N, Stewart A, Tarpey P, Varela I, Phillimore B, Begum S, McDonald NQ, Butler A, Jones D, Raine K, Latimer C, Santos C, Nohadani M, Eklund AC, SpencerDene B, Clark G, Pickering L, Stamp G, Gore M, Szallasi Z, Downward J, Futreal PA, Swanton C: Intratumor heterogeneity and branched evolution revealed by multiregion sequencing. N Engl J Med. 2012, 366: 883892. 10.1056/NEJMoa1113205.View ArticlePubMedGoogle Scholar
 Beroukhim R, Getz G, Nghiemphu L, Barretina J, Hsueh T, Linhart D, Vivanco I, Lee JH, Huang JC, Alexander S, Du J, Kau T, Thomas RK, Shah K, Soto H, Perner S, Prensner J, Debiasi RM, Demichelis F, Hatton C, Rubin MA, Garraway LA, Nelson SF, Liau L, Mischel P, Cloughesy T, Meyerson M, Golub T, Lander ES, Mellinghoff IK, Sellers WR: Genomic alterations reveal potential for higher grade transformation in follicular lymphoma and confirm parallel evolution of tumor cell clones. Proc Natl Acad Sci USA. 2007, 104: 2000720012. 10.1073/pnas.0710052104.PubMed CentralView ArticlePubMedGoogle Scholar
 Klijn C, Holstege H, de Ridder J, Liu X, Reinders M, Jonkers J, Wessels L: Identification of cancer genes using a statistical framework for multiexperiment analysis of nondiscretized array CGH data. Nucleic Acids Res. 2008, 36: e13PubMed CentralView ArticlePubMedGoogle Scholar
 Aguirre AJ, Brennan C, Bailey G: Highresolution characterization of the pancreatic adenocarcinoma genome. Proc Natl Acad Sci US. 2004, 101: 90679072. 10.1073/pnas.0402932101.View ArticleGoogle Scholar
 Mathiesen RR, Fjelldal R, Liestøl K, Due EU, Geigl JB, Riethdorf S, Borgen E, Rye IH, Schneider IJ, Obenauf AC, Mauermann O, Nilsen G, Lingjærde OC, BørresenDale AL, Pantel K, Speicher MR, Naume B, Baumbusch LO: High resolution analysis of copy number changes in disseminated tumor cells of patients with breast cancer. Int J Cancer. 2011, 131 (4): E405E415. 10.1002/ijc.26444.View ArticlePubMedGoogle Scholar
 Wiedswang G, Borgen E, Kaaresen R, Kvalheim G, Nesland JM, Qvist H, Schlichting E, Sauer T, Janbu J, Harbitz T, Naume B: Detection of isolated tumor cells in bone marrow is an independent prognostic factor in breast cancer. J Clin Oncol. 2003, 21: 34693478. 10.1200/JCO.2003.02.009.View ArticlePubMedGoogle Scholar
 Eide MB, Liestøl K, Lingjærde OC, Hystad ME, Kresse SH, MezaZepeda L, Myklebost O, Trøen G, Aamot HV, Holte H, Smeland EB, Delabie J: Genomic alterations reveal potential for higher grade transformation in follicular lymphoma and confirm parallel evolution of tumor cell clones. Blood. 2010, 116: 14891497. 10.1182/blood201003272278.View ArticlePubMedGoogle Scholar
 Tukey JW: A survey of sampling from contaminated distributions. Contributions to Probability and Statistics: Essays in Honor of Harold Hotelling. Edited by: Olkin I, Ghurye SG, Hoeffding W, Madow WG, Mann HB. 1960, Stanford University Press: Stanford, 448485.Google Scholar
 Venkatraman ES, Olshen AB: A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics. 2007, 23: 657663. 10.1093/bioinformatics/btl646.View ArticlePubMedGoogle Scholar
 Lai W, Choudhary V, Park PJ: CGHweb: a tool for comparing DNA copy number segmentations from multiple algorithms. Bioinformatics. 2008, 24 (7): 10141015. 10.1093/bioinformatics/btn067.PubMed CentralView ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License(http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.