 ORIGINAL PAPER
 Open Access
 Published:
CNARA: reliability assessment for genomic copy number profiles
BMC Genomics volume 17, Article number: 799 (2016)
Abstract
Background
DNA copy number profiles from microarray and sequencing experiments sometimes contain wave artefacts which may be introduced during sample preparation and cannot be removed completely by existing preprocessing methods. Besides, large derivative log ratio spread (DLRS) of the probes correlating with poor DNA quality is sometimes observed in genome screening experiments and may lead to unreliable copy number profiles. Depending on the extent of these artefacts and the resulting misidentification of copy number alterations/variations (CNA/CNV), it may be desirable to exclude such samples from analyses or to adapt the downstream data analysis strategy accordingly.
Results
Here, we propose a method to distinguish reliable genomic copy number profiles from those containing heavy wave artefacts and/or large DLRS. We define four features that adequately summarize the copy number profiles for reliability assessment, and train a classifier on a dataset of 1522 copy number profiles from various microarray platforms. The method can be applied to predict the reliability of copy number profiles irrespective of the underlying microarray platform and may be adapted for those sequencing platforms from which copy number estimates could be computed as a piecewise constant signal. Further details can be found at https://github.com/baudisgroup/CNARA.
Conclusions
We have developed a method for the assessment of genomic copy number profiling data, and suggest to apply the method in addition to and after other stateoftheart noise correction and quality control procedures. CNARA could be instrumental in improving the assessment of data used for genomic data mining experiments and support the reliable functional attribution of copy number aberrations especially in cancer research.
Background
Since the introduction of molecularcytogenetic technologies for whole genome copy number aberration screening [1, 2], considerable advances have been made to work with a variety of suboptimal material (e.g. micro dissected samples, aspiration biopsies, paraffin embedded tissue), both in the areas of DNA preparation, labeling and platform technologies as well as in bioinformatic processing of the experimental readout. However, DNA copy number profiles from current microarray and sequencing experiments sometimes suffer from the presence of systematical “wave patterns” [3] throughout the whole genome, where within each genomic segment the estimated copy number deviates from the true value which is supposed to be a constant. These wave artefacts disrupt the piecewise constant signal of the copy number data and may lead to false positives or negatives in identifying CNAs.
One of the known causes to the wave artefacts is differential DNA retrieval across chromosomal regions, which may be due to GCcontent bias [4], DNA replication timing [5], differences in chromatin organization during DNA isolation [6] and damages to the DNA by fixation procedures [7]. Copy number profiles with heavy wave artefacts sometimes can be corrected if certain requirements are met. Marioni et al. developed a method to remove wave artefacts in copy number profiles for normal samples without obvious CNAs [8]. Wiel et al. suggested to eliminate waves in tumor profiles with many CNAs using calibration profiles [3]. Some methods correcting GCcontent bias have also been implemented for microarray [4] and sequencing experiments [9]. However, the methods proposed in these studies have a limited ability to remove wave artefacts or put many restrictions on the type and variability of the input data itself. In addition to wave artefacts, large derivative log ratio spread (DLRS) [10] correlating with poor DNA quality also leads to unreliable copy number profiles.
The limited ability of existing experimental and bioinformatic methods to remove wave artefacts or to correct for source dependent DNA quality variations motivated us to devise a method for assessing if genomic copy number can be reliably estimated from a preprocessed, technology agnostic copy number profiling dataset. Rather than developing a method for improving the sample derived copy number profiles themselves, our primary intention here is to provide measures for the contamination of copy number profiles through artefacts and thereby to support decisions regarding the suitability of these copy number profiles for downstream data analysis and interpretation.
Zhang and Zhang previously designed such a measure in an explorative study [7] where they proposed using autocorrelation scanning profile (ASP) to evaluate data quality, and demonstrated on simulated data that the median of ASP (medASP) can be used as a discriminative metric. However, it will be shown in the “Discussion” section that medASP is not an adequate measure for realworld data of different scenarios.
In this paper, we assess the reliability of copy number profiles using a machine learning approach. From our experience and by experiment, we selected four features which are able to adequately represent the copy number profiles for reliability assessment, i.e. the number of steps that can be detected by the stepfitting algorithm in the copy number profile, a quantitative value indicating how much the copy number data is steplike, the number of segments induced by both CNAs and wave artefacts detected by circular binary segmentation [11, 12], and the mean of DLRS within segments. We will explain in detail about these features in the “Results and discussion” section and discribe the stepfitting algorithm by which the first two features are generated in the “Methods” section. Based on these features, we trained a classifier to predict the reliability of the copy number profile. We will describe the way of classifying reliable and unreliable profiles, and subsequently assigning them into one of the five subcategories each having a biological or experimental correspondence. Our reliability assessment method can also be adapted to assess copy number profiles generated by those sequencing platforms from which copy number estimates could be computed as a piecewise constant signal.
Results
Piecewise constant model and segmentation
Tumor samples often contain CNAs, in which chromosomal segments are found gained or lost in copy number, deviating from the normal diploid status. Genomewide copy number can be depicted as a piecewise constant signal, where the changepoints are the boundaries of the chromosomal segments that differ in copy number, and the constant value between a pair of changepoints is the copy number of the corresponding segment. Copy number profiling by microarray and sequencing techniques gives noisy estimates of the true copy number at specific genomic positions, which can be modeled as follows:
Assume a series of n log ratio copy number estimates x={x _{ i }:i=1,2,⋯,n} ordered by genomic position. The piecewise constant model for the series is
where μ={μ _{ i }:i=1,2,⋯,n} is a piecewise constant function and ε={ε _{ i }:i=1,2,⋯,n} is a sequence of independent and identically distributed errors. Assuming a series of m+1 changepoints τ={τ _{ j }:j=0,1,2,⋯,m} where 1=τ _{0}<τ _{1}<⋯<τ _{ m }=n+1 delimit m segments with copy number level θ={θ _{ j }:j=1,2,⋯,m} such that
The errors are usually assumed to be Gaussian ε∼N(0,σ ^{2}) and supported by experimental data on selfself hybridizations [13], although this assumption is not crucial if the distances between successive τ _{ j }’s are large [14]. The DLRS is denoted by σ and estimated by the standard deviation of the error ε.
Segmentation is applied to recover the genomic position of the boundaries and the underlying copy number for chromosomal segments from the noisy copy number estimates. Under the piecewise constant model, the segmentation problem is to find the changepoints τ delimiting the segments and the copy number levels θ for each segment. If τ is known, θ _{ j } can be estimated by the mean of the copy number estimates that fall in the jth segment, that is
Many segmentation algorithms have been proposed (see [15] and [16] for excellent review), among which the popular circular binary segmentation (CBS) algorithm [11, 12] was found to be one of the most accurate methods. Starting with the whole chromosome, the CBS algorithm detects changepoints delimiting a subsegment with a different copy number level in the middle of a larger segment, and does it recursively until no more changepoints can be found in any of the segments. For any interval [a,b) and 1≤a<b≤n, let the null hypothesis be that the log ratio copy number estimates x _{ a },x _{ a+1},⋯,x _{ b } are independent and identically distributed Gaussian and the alternative be that there is a subsegment with different mean and same variance. CBS calculates the maximal tstatistic T=max _{ a≤i<j≤b }T _{ ij },
where \(\bar {Y}_{ij} = (x_{i+1}+\cdots +x_{j})/(ji)\), \(\bar {Z}_{ij} = (x_{a}+\cdots +x_{i}+x_{j+1}+\cdots +x_{b})/(ba+ij+1)\), and \(s^{2}_{ij}\) is the corresponding mean squared error.
Step detection and steplikeness quantification by stepfitting
Ideally, DNA is uniformly retrieved from the chromosomes during sample preparation, that is, the amount of DNA retrieved is proportional to the true copy number of the chromosomal segments present in the cells. In this case, the log ratio copy number estimates contain only abrupt jumps which satisfy the piecewise constant assumption. However, in practice, due to GCcontent bias [4], DNA replication timing [5] and other biological phenomena such as differences in chromatin organization during DNA isolation [6] or damages to the DNA caused by formalin fixation [7], DNA is sometimes retrieved differentially across the genome, which adds artefacts in the form of waves to the otherwise piecewise constant signal. These wave artefacts adversely affect detecting changepoints which truly delimit the gained and lost chromosomal segments.
Existing segmentation algorithms such as CBS are poorly suited to discriminate the changepoints that are boundaries of the CNA segments from those introduced by wave artefacts. When the magnitude of the waves are less than that of the CNA signal, CBS leads to “hypersegmentation”, in which many changepoints caused by waves are detected in a single CNA segment. In the worst case where the true copy number signal is buried in the waves with comparable or even greater magnitude, changepoints detected by CBS largely depend on the wave artefacts and are no longer boundaries of the CNA segments.
Fortunately, in practice, CNAs induce abrupt jumps which are visualized as steps and wave artefacts only induce gradual changes. As a result, for copy number profiles containing CNAs, reliable samples are usually steplike, with few waves in each CNA segment, whereas unreliable ones contain a lot of waves or the overall shape of the signal does not resemble steps any more. Based on these observations we implemented a fast stepfitting algorithm of time complexity O(nlogn) adapted from the method originally proposed by Kerssemakers et al. (refer to Supplementary Methods 3 in [17]) to capture the step signal which are mostly CNAs regardless of the changepoints introduced by wave artefacts, and assess how much the copy number profile is steplike. See “Methods” section for a detailed explanation about the stepfitting algorithm and the adaptation.
Computer simulation: CBS versus the stepfitting algorithm
To demonstrate the differences between CBS and the stepfitting algorithm in detecting changepoints, 3 groups of copy number profiles each containing 200 samples of 10,000 dimensions were generated (Figs. 1 and 2). Group A are reliable copy number profiles containing many CNAs; Group B are unreliable copy number profiles in which the piecewise constant CNA signal is buried in waves; Group C are hypersegmented copy number profiles in which many changepoints caused by waves are present in a single copy number segment. See Additional file 1 for details on how simulations were performed.
Figure 1 shows a set of simulated copy number profiles having the same true CNA segments. A1 to A3 demonstrate the segmentation on the same reliable copy number profile, where the red lines in A1 are the predefined true CNA segments. Red lines in A2 and A3 are copy number segments recovered by CBS and stepfitting respectively, from which it can be concluded that both methods recovered the majority of copy number segments. While CBS performs better in terms of detecting small segments of low signal to noise ratio, stepfitting is capable of capturing the overall steplike signal. B1 to B3 show the same unreliable copy number profile with heavy wave artefacts generated by introducing large autocorrelation to the reliable copy number profile in group A, for which neither CBS nor stepfitting can locate the true CNA segments correctly. In B2, CBS found greater amount of changepoints most of which are noise, whereas in B3 stepfitting also fitted noise but only those with more abrupt changes and therefore the number of changepoints detected are less than CBS; Nevertheless, changepoints detected by both methods mainly depend on noise and are no longer boundaries of the true CNA segments. C1 to C3 show the same hypersegmented copy number profile created by passing the reliable copy number profile in group A through a median filter to add small waves at the same time preserving the overall steplike structure. While in C2 CBS detected too many changepoints within individual true CNA segments, stepfitting significantly outperformed by recovering majority of copy number segments well. Note here the possibility of merging multiple segments detected by CBS using additional methods is not considered, as our main goal is to find a good proxy for the extent of wave artefacts which is represented by the number of changepoints that CBS can detect under its standard setting.
The observation from Fig. 1 can be generalized to the set of 200 simulated copy number profiles as shown in Fig. 2. For reliable copy number profiles in group A, the number of segments recovered by CBS is a very good approximation to the true CNA segments (Spearman correlation coefficient ρ=0.89; Regressing the number of recovered segments on the number of true segments by robust linear regression results in slope w=0.93 and intercept b=0.28), while stepfitting recovered majority of CNA segments well except for those very short segments of low signal to noise ratio (ρ=0.41; w=0.67 and b=0.79). For copy number profiles in group B containing heavy wave artefacts, both methods fitted noise. CBS recovered more segments than the number of true segments (w=1.86 and b=−2.26) and although in our simulation there is still correlation (ρ=0.70) as the waves introduced in different samples are of comparable size, the correlation is not expected in real data where the waves are of various size coming from different sources. Furthermore, in group B, the number of segments found by stepfitting is largely less than CBS and independent of the number of true segments (ρ=0.04; w=0.13 and b=19.56). Last, for hypersegmented copy number profiles in group C, CBS recovered far more segments than the number of true segments (ρ=0.36; w=2.84 and b=222.80), whereas stepfitting found majority of CNA segments well (ρ=0.38, w=0.79 and b=2.21).
Reliability assessment metrics
In this section we further develop the reliability assessment method on 1522 previously published copy number profiles from different experimental batches and microarray platforms (See “Methods” section for detail).
Figure 3 shows five typical cases each corresponding to one of the reliability groups that can often be observed among genomic copy number profiles. Apart from those containing many CNAs which can be classified as hypersegmented, reliable and unreliable with heavy wave artefacts (Fig. 3 a to c), there exist copy number profiles that have no or very few CNAs (e.g. control samples from normal tissues; Fig. 3 d); but also copy number profiles with extraordinarily large DLRS suggesting poor DNA quality prohibiting the detection of any CNA (Fig. 3 e). The upper panel of each subgraph in Fig. 3 a to e shows copy number profile of a particular case segmented by CBS, and the lower panel shows the same copy number profile segmented by stepfitting in the optimal iteration when S _{ peak }, the maximum value of S generated by the stepfitting algorithm was attained. The corresponding S values throughout iterations are plotted in Fig. 3 f (See The stepfitting algorithm in “Methods” section for definition of S and S _{ peak }).
In Fig. 3 a to c, the performance of CBS versus stepfitting is consistent with that on simulated data in the previous section: In reliable copy number profiles (Fig. 3 b), segments recovered by CBS are comparable to those from stepfitting; in unreliable copy number profiles with heavy wave artefacts (Fig. 3 c) both methods fitted noise in which CBS found more changepoints than stepfitting; in hypersegmented copy number profiles (Fig. 3 a) CBS detected a great many changepoints within individual steps recovered by stepfitting, resulting in a higher number of CBS derived segments. An additional feature associated with reliability can be discovered by looking closer into the three cases in which both the reliable and the hypersegmented copy number profile have clear steplike structures, while in the unreliable case the boundaries of the CNA segments are much less defined. This property is reflected by the peak value of S: As shown in Fig. 3 f, the hypersegmented (Case 1) and the reliable copy number profile (Case 2) both attained relatively high S _{ peak } whereas S _{ peak } for the unreliable copy number profile (Case 3) is much lower.
In Fig. 3 d and e, no apparent CNA exists or can be detected. The former is unreliable due to large DLRS so that the CNAs, even if are present, cannot be detected properly, whereas the latter is a reliable control sample diploid across the genome. Nevertheless, in both cases stepfitting merely fitted noise and the corresponding S values for Case 4 and 5 in Fig. 3 f increase throughout iterations yet remain close to 1 and no apparent peak can be observed.
So far we have discussed 4 features related to the reliability of copy number profiles, i.e. S _{ peak }, an indicator of how much the copy number profile is steplike; l, the number of steps detected by stepfitting; v, the number of segments detected by CBS which could be induced by both CNAs and wave artefacts; and σ, which is the DLRS estimated by the standard deviation of the error ε in Eq. (1).
The full set of reliability assessment metrics is summarized in Table 1, where l and v combined represent the wave density.
CNARA: reliability assessment for copy number profiles
To turn the qualitative metrics into quantitative ones and therefore allow to predict the reliability of a given copy number profile, a support vector machine (SVM) classifier was trained on the 1522 previously published copy number profiles labeled as reliable or unreliable by experts (see “Methods” section).
The 4 features S _{ peak }, l, v and σ were extracted for each of the 1522 samples and were used with the SVM classifier. The svm function in the e1701 R package [18] which is the interface to the C++ implementation libsvm [19] was called, where the cost was set to 1 and the radical basis function (RBF) kernel e x p(−0.4u−v^{2}) was chosen in which the parameters were set by cross validation. Tenfold crossvalidation on the 1522 training samples resulted in a total prediction accuracy of 99.08 % with standard deviation 0.0083 (a 3D visualization can be found in Additional file 1: Figure S5). The extremely high accuracy suggests that the 4 features chosen can represent the copy number profile very well for reliability assessment. The resulted classifier can predict any new copy number profiles as reliable or unreliable in terms of wholegenome CNA evaluation. For fine tuning of the methodology to particular sample compositions or evaluation goals, we have included the option to use user provided training sets (see respective subsection below).
For a copy number profile predicted as reliable, the value of S _{ peak } tells if it has many CNAs or not (case 2 and 5 in Table 1). Usually control samples from normal tissues or samples without many CNAs have S _{ peak }≤t h r _{1}, while tumor samples containing CNAs that we are interested in have S _{ peak }>t h r _{1}, where t h r _{1} is a threshold with default value 1.5.
Copy number profiles predicted as unreliable can be further divided into 3 subcategories (case 1, 3 and 4 in Table 1), by carrying out the following procedure: Denote samples in the training set by \((S_{peak}^{(i)},\ l^{(i)},\ v^{(i)},\ \sigma ^{(i)})\) where i=1, 2, ⋯, 1522. Given any unreliable sample b=(S _{ peak }, l, v, σ), create two dummy samples d _{1}=(S _{ peak }, l, m i n(v ^{(i)}), σ) and d _{2}=(S _{ peak }, l, v, m i n(σ ^{(i)})) by substituting v or σ by the minimum of the training set correspondingly, and predict the reliability of d _{1} and d _{2}. If d _{1} is reliable, b contains wave artefacts and therefore can be subcategorized as either Case 1 or 3; if d _{2} is reliable, b has large DLRS and thus belongs to Case 4; otherwise b suffers from both wave artefacts and large DLRS and can be either Case 1 or 3. As hypersegmented samples in Case 1 may still be of interest for certain tasks such as looking for CNA regions but not counting the number or keeping track of the size of copy number gains or losses, a more stringent threshold t h r _{2} greater than t h r _{1} for S _{ peak } could be set, for example, to accept those hypersegmented samples having S _{ peak }>t h r _{2} with caution where the default value of t h r _{2} is 2.5.
Discussion
A comparison between CNARA and medASP
The performance of CNARA and medASP [7] was compared and shown in Fig. 4. The 1522 samples were randomly split into training and validation sets at the proportion of 50:50 % and the SVM classifier of CNARA was trained on the training set where the parameters for the cost and the RBF kernel were set the same as in the previous subsection. The probability of being predicted as unreliable for each sample in the validation set was computed by the libsvm implementation [19] and the median of ASP was computed for the same sample in the validation set. The receiver operating characteristic (ROC) curves for the two values were then generated against the true class labels for the validation set.
As shown in Fig. 4, the area under the ROC curve (AUC) for medASP is 0.7372, which suggests that the assumptions made in the medASP study [7], i.e. simulating copy number profiles with a gain and a loss region of fixed size, does not comform well with feature distribution in realworld data. By contrast, the AUC of CNARA is 0.9994 which is consistent with the extremely high prediction accuracy of the SVM classifier stated in the previous subsection. The value approaching 1 once again implies that the 4 features chosen summarize the copy number profile very well in terms of its reliability.
Custom training set
Apart from the training set consists of absolutely reliable and unreliable copy number profiles, CNARA is flexible to take in additional average quality samples labeled by experts to the training set to accomodate to their own need. For example, the boundaries of the CNA segments for formalinfixed paraffin embedded (FFPE) samples are known to be less well defined than freshfrozen samples in general [7, 20]. Therefore, when working with a large set of FFPE samples, one may wish to keep those slightly contaminated samples and reject others that are highly contaminated. To achieve this, people can create their own training set by including several slightly contaminated samples as reliable and highly contaminated as unreliable according to expert knowledge to the original training set (See Additional file 1 for more details on building custom training set).
Conclusions
During our efforts on the curation of human cancer copy number data from genomic screening experiments [21, 22], we have frequently encountered data sets challenging the stateoftheart quality assessment procedures. While several methods have been proposed to improve the readout of genomic copy number profiling both through improvements of experimental design [6] as well as application of advanced bioinformatic methods [3, 4, 8, 9], many data sets still contain artefacts that can not be removed sufficiently by these existing methods. As a result, when working with tens of thousands of genomic copy number profiles derived from a multitude of platforms and different preprocessing methods, a robust method capable of identifying the low quality data sets based on extractable features is needed.
Previously, some downstream quality control measures such as genotyping call rate [23] and derivative logratio spread (DLRS) [10] have been reported for microarrays to describe the noise induced during sample hybridization onto the arrays. However, existing evaluation methods are limited in controlling for inherent noise due to differential DNA retrieval across the genome during sample preparation, which is a severe problem in the generation of genomic copy number profiles, and therefore reliability assessment for copy number profiles remained an open problem. Zhang and Zhang previously did an explorative study [7] where they proposed medASP as a discriminative metric for evaluating the quality of copy number profiles and achieved good accuracy on simulated data. However, it has been shown in the previous section that medASP is not an adequate measure for real copy number data which contains different amount of copy number gains and losses of various sizes at different genomic positions.
In this paper, we proposed a method for assessing the reliability of DNA copy number profiles. We showed five typical cases each corresponding to one of the reliability groups that can often be observed in practice and discussed in detail about the 4 features which can represent the copy number profiles well in terms of reliability, namely the number of steps detected by stepfitting, an indicator of how much the copy number data is steplike, the number of segments induced by both CNAs and wave artefacts detected by CBS and the mean of DLRS within segments. To obtain the first two features we proposed a fast stepfitting algorithm of time complexity O(nlogn) which is scalable to highthroughput copy number data. Taking these 4 features as input, an SVM classifer was trained on 1522 samples labeled as reliable or unreliable according to expert knowledge. By tenfold crossvalidation on the whole dataset, the resulting classifier achieved a total accuracy of 99.08 % with standard deviation 0.0083. Predicted samples can be further subcategorized into the five reliablility classes, each having a biological or experimental correspondence. To the best of our knowledge, this is the first application developed for realworld data filling the gap of controlling the quality problem regarding differential DNA retrieval for copy number profiles, which is nonoverlapping with and complementary to the objectives of any stateoftheart quality control measures.
Our method can be applied to log ratio copy number data before or after normalization. Applied before normalization, it helps to judge if the log ratio copy number data needs normalization and noise correction; applied to the normalized data, it assesses the utility of the data in downstream analysis. Nonetheless, since it evaluates an aspect of the quality different from any existing quality control measures, we suggest our reliability assessment method to be the last step after any possible downstream quality control and bias correction methods, to decide if the normalized sample can be finally included in data mining experiments for biomedical knowledge generation. We have to emphasize that, while our method can deliver information about the quality of the signal derived from whole genome copy number screening experiments, by itself it does not address problems arising from the possible clonal heterogeneity, e.g. in biosamples derived from cancer tissues. Also, the impact of reliability assessment will depend on the intended downstream analyses; for instance, the reliability of wholegenome CNA profiles, as determined by our method, may be of less concern when using statistical CNA peak finding tools like GISTIC [24].
Thanks to the competing efforts on estimating genomic copy number from exome and wholegenome sequencing [25] especially recent development of CopywriteR [26] which is capable of extracting uniformly distributed copy number information from sequencing data, we are optimistic that in the near future piecewise constant signal for copy number estimates could be computed reliably with comparable acurracy to that of microarray platforms. Due to the universality of our method in dealing with samples from different platforms and the flexibility in taking new training samples, at that time our reliability assessment method can also be easily adapted to those copy number profiles generated by sequencing platforms.
Methods
The copy number profile dataset
1522 previously published copy number profiles (Additional file 2: Table S1) were used in our study. The samples were obtained from arrayMap [21, 22] with the original data retrieved from NCBI’s GEO repository [27], with preprocessing and noise correction performed through the standard arrayMap data processing pipeline [21]. Probelevel plots with added segmentation markers were visually inspected and selected by experts with respect to the empirically assessed copynumber calling reliability, and with the goal of a balanced representation of reliable and unreliable copy number profiles as well as a sufficient coverage of cases from different reliability groups. This selection resulted in 804 absolutely reliable copy number profiles (cf. cases 2 and 5 in Table 1) and 718 absolutely unreliable copy number profiles including hypersegmented ones (cf. cases 1, 3 and 4 in Table 1). See Additional file 1 for data preprocessing and Additional file 2: Table S1 for a complete list of the arrays being analyzed and their reliability labels. The data is available at arraymap.org [21, 22].
The stepfitting algorithm
To recover the set of changepoints τ={τ _{ j }:j=0,1,2,⋯,m} delimiting the CNA segments from the log ratio copy number estimates x={x _{ i }:i=1,2,⋯,n}, the algorithm updates τ iteratively, starting from τ ^{(0)}={1,n+1}, in each iteration k introduces an additional changepoint τ ^{(k)} to τ ^{(k−1)}, called the bestfit, which minimizes the cost function
by scanning through all possible locations i=2,⋯,n, i∉τ ^{(k−1)} and adding i to τ ^{(k−1)} as the temporary changepoint set \(\boldsymbol {\tau }_{temp}^{(k)}\) such that \(\boldsymbol {\tau }_{temp}^{(k)} = \text {sort}(\boldsymbol {\tau }^{(k1)} \cup i)\), in which the elements are ordered from the smallest to the largest; μ _{ i } in Eq. (5) is computed from Eqs. (2) and (3) where \(\{\tau _{j1}, \tau _{j}\}\subseteq \boldsymbol {\tau }_{temp}^{(k)}\), j=1,2,⋯,k+1. The location i that minimizes Eq. (5) is therefore τ ^{(k)}, and τ in the kth iteration is updated as τ ^{(k)}=sort(τ ^{(k−1)}∪τ ^{(k)}).
Next, in between each pair of τ _{ j−1} and τ _{ j } in τ ^{(k)} where j=1,2,⋯,k+1, find the changepoint c _{ j } which is the bestfit for x _{ i }, i∈[τ _{ j−1},τ _{ j }) such that it minimizes
where c _{ j }∈(τ _{ j−1},τ _{ j }) and
The set of changepoints c _{ j }’s plus the boundary denoted by c ^{(k)}={c _{ j }:j=1,2,⋯,k+1}∪{c _{0}=1, c _{ k+2}=n+1} is called the counterfit changepoints for iteration k. The cost function Q for the counterfit is defined as
where
The algorithm proceeds iteratively adding the bestfit changepoint τ ^{(k)} each time to the changepoint set τ ^{(k−1)} in the previous iteration and finding the set of counterfit changepoints c ^{(k)} correspondingly, until the number of iterations reaches a predefined threshold K. This results in a set of bestfit changepoints and a set of counterfit changepoints located in between one another.
The stepfitting algorithm stated above (refer to Supplementary Methods 3 in [17] for more information) has time complexity of ∼2n K (for definition of the tilde notation see [28]) where n is the dimension of the copy number estimates x and K is the total number of predefined iterations usually greater than 100. An important observation which helps to improve the computational efficiency is that, of all the counterfit changepoints in c ^{(k)} in the kth iteration, the most prominent changepoint c _{ j } if added to τ ^{(k)} decreasing the cost function H the most is always the bestfit changepoint τ ^{(k+1)} to be included in the next iteration, specifically,
and
where i∈[τ _{ j−1},τ _{ j }), {τ _{ j−1},τ _{ j }}⊆τ ^{(k)}, c _{ j }∈(τ _{ j−1},τ _{ j }) and j=1,2,⋯,k+1; μ _{ i } and ψ _{ i } are computed as in Eqs. (2), (3) and (7) respectively. Here we keep track of the set of d _{ j }’s for each corresponding c _{ j } as the difference set d ^{(k)}. Furthermore, after the bestfit changepoint τ ^{(k+1)} being included in τ ^{(k+1)}, to update the corresponding counterfit set c ^{(k+1)} and difference set d ^{(k+1)} we only need to exclude τ ^{(k+1)}=c _{ t } from c ^{(k)} and the corresponding d _{ t } from d ^{(k)} first, and then scan the region delimited by the two bestfit breakpoints directly next to τ ^{(k+1)} in τ ^{(k+1)} for two additional counterfit breakpoints and the corresponding two differences. To be specific, given τ _{ r }=τ ^{(k+1)}, {τ _{ r−1},τ _{ r },τ _{ r+1}}⊆τ ^{(k+1)}, find c _{ r }∈[τ _{ r−1},τ _{ r }) and c _{ r+1}∈[τ _{ r },τ _{ r+1}) as computed in Eqs. (6) and (7), and compute d _{ r } and d _{ r+1} as in Eq. (11); then update c ^{(k+1)}=(c ^{(k)}∖τ ^{(k+1)})∪c _{ r }∪c _{ r+1} and d ^{(k+1)}=(d ^{(k)}∖d _{ t })∪d _{ r }∪d _{ r+1}. In this way, the time complexity is reduced to O(nlogn).
To estimate the number of steps in the copy number profile and assess how much the data is steplike, the same model selection criterion is adopted as in Kerssemakers’ method, where the stepindicator S is introduced and defined as
For reliable copy number profiles or hypersegmented profiles containing many steps which are mostly CNAs, in early iterations changepoints in τ ^{(k)} and c ^{(k)} both locate significant steps such that Q is close to H and therefore S is close to 1. S increases until it peaks in the optimal iteration when changepoints in τ ^{(k)} cover all significant steps and all changepoints in c ^{(k)} merely fit noise so that Q and H differ the most. After that S decreases as changepoints in both τ ^{(k)} and c ^{(k)} begin to fit noise and Q and H become closer again. As a result, the number of iterations it takes for S to reach the peak, denoted by l, is an estimation of the number of steps in the copy number profile.
In Kerssemakers’ literature they also mentioned the peak value of S denoted by S _{ peak } approximates quadratic of signal to noise ratio given by \(S_{peak} \approx 1+\frac {\Delta ^{2}}{4\sigma ^{2}}\), where Δ is the mean of the absolute difference of μ _{ i }’s around each bestfit changepoint τ _{ j } weighted by \(\sqrt {\tau _{j+1}\tau _{j1}}\), where τ _{ j }∈τ ^{(l)}, j=1,2,⋯,l, and σ is the standard deviation of ε in Eq. (1). For unreliable copy number profiles of which the steps are drowned by wave artefacts of large magnitude, S _{ peak } is significantly smaller than that of reliable copy number profiles containing comparable amount of CNAs. For those profiles containing few significant CNAs or control samples which are diploid across genome, S is close to 1 throughout iterations and no apparent peak can be observed so that S _{ peak } is close to 1. Therefore S _{ peak } is an indicator of how much the copy number profile is steplike.
The adapted stepfitting algorithm is summarized in Algorithm 1.
Software
The CNARA software and a tutorial is available at https://github.com/baudisgroup/CNARA.
Abbreviations
 ASP:

Autocorrelation scanning profile
 AUC:

Area under the ROC curve
 CBS:

Circular binary segmentation
 CNA/CNV:

Copy number alterations/variations
 CNARA:

Reliability assessment for copy number profiles
 DLRS:

Derivative log ratio spread
 FFPE:

Formalinfixed paraffin embedded
 medASP:

Median of ASP
 RBF:

Radical basis function
 ROC:

Receiver operating characteristic
 SVM:

Support vector machine
References
Kallioniemi A, Kallioniemi OP, Sudar D, Rutovitz D, Gray JW, Waldman F, Pinkel D. Comparative genomic hybridization for molecular cytogenetic analysis of solid tumors. Science. 1992; 258(5083):818–21.
du Manoir S, Speicher MR, Joos S, Schröck E, Popp S, Döhner H, Kovacs G, RobertNicoud M, Lichter P, Cremer T. Detection of complete and partial chromosome gains and losses by comparative genomic in situ hybridization. Hum Genet. 1993; 90(6):590–610.
van de Wiel MA, Brosens R, Eilers PH, Kumps C, Meijer GA, Menten B, Sistermans E, Speleman F, Timmerman ME, Ylstra B. Smoothing waves in array CGH tumor profiles. Bioinformatics. 2009; 25(9):1099–104.
Redon R, Carter NP. Comparative genomic hybridization: microarray design and data interpretation. DNA Microarrays for Biomedical Research: Methods and Protocols. 2009:37–49.
Koren A, Handsaker RE, Kamitaki N, Karlić R, Ghosh S, Polak P, Eggan K, McCarroll SA. Genetic variation in human DNA replication timing. Cell. 2014; 159(5):1015–26.
van Heesch S, Mokry M, Boskova V, Junker W, Mehon R, Toonen P, de Bruijn E, Shull JD, Aitman TJ, Cuppen E, et al. Systematic biases in DNA copy number originate from isolation procedures. Genome Biol. 2013; 14(4):33.
Zhang L, Zhang L. Use of autocorrelation scanning in DNA copy number analysis. Bioinformatics. 2013; 29(21):2678–82.
Marioni JC, Thorne NP, Valsesia A, Fitzgerald T, Redon R, Fiegler H, Andrews TD, Stranger BE, Lynch AG, Dermitzakis ET, et al. Breaking the waves: improved detection of copy number variation from microarraybased comparative genomic hybridization. Genome Biol. 2007; 8(10):228.
Risso D, Schwartz K, Sherlock G, Dudoit S. GCcontent normalization for RNASeq data. BMC Bioinformatics. 2011; 12(1):480.
Largo C, Saéz B, Alvarez S, Suela J, Ferreira B, Blesa D, Prosper F, Calasanz MJ, Cigudosa JC. Multiple myeloma primary cells show a highly rearranged unbalanced genome with amplifications and homozygous deletions irrespective of the presence of immunoglobulinrelated chromosome translocations. Haematologica. 2007; 92(6):795–802.
Olshen AB, Venkatraman E, Lucito R, Wigler M. Circular binary segmentation for the analysis of arraybased DNA copy number data. Biostatistics. 2004; 5(4):557–72.
Venkatraman E, Olshen AB. A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics. 2007; 23(6):657–63.
Fridlyand J, Snijders AM, Pinkel D, Albertson DG, Jain AN. Hidden Markov models approach to the analysis of array CGH data. J Multivariate Anal. 2004; 90(1):132–53.
Zhang NR. DNA Copy Number Profiling in Normal and Tumor Genomes. Frontiers in Computational and Systems Biology. London: Springer: 2010. p. 259–81.
Lai WR, Johnson MD, Kucherlapati R, Park PJ. Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics. 2005; 21(19):3763–70.
Willenbrock H, Fridlyand J. A comparison study: applying segmentation to array CGH data for downstream analyses. Bioinformatics. 2005; 21(22):4084–91.
Kerssemakers JW, Munteanu EL, Laan L, Noetzel TL, Janson ME, Dogterom M. Assembly dynamics of microtubules at molecular resolution. Nature. 2006; 442(7103):709–12.
Meyer D, Dimitriadou E, Hornik K, Weingessel A, Leisch F. e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien. 2015. R package version 1.67. http://CRAN.Rproject.org/package=e1071.
Chang CC, Lin CJ. LIBSVM: A library for support vector machines. ACM Trans Intell Syst Technol. 2011; 2:27–12727. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Mc Sherry EA, Mc Goldrick A, Kay EW, Hopkins AM, Gallagher WM, Dervan PA. Formalinfixed paraffinembedded clinical tissues show spurious copy number changes in arrayCGH profiles. Clin Genet. 2007; 72(5):441–7.
Cai H, Kumar N, Baudis M. arrayMap: a reference resource for genomic copy number imbalances in human malignancies. PLoS One. 2012; 7(5):36944.
Cai H, Gupta S, Rath P, Ai N, Baudis M. arrayMap 2014: an updated cancer genome resource. Nucleic acids res. 2014; gku1123.
Laurie CC, Doheny KF, Mirel DB, Pugh EW, Bierut LJ, Bhangale T, Boehm F, Caporaso NE, Cornelis MC, Edenberg HJ, et al. Quality control and quality assurance in genotypic data for genomewide association studies. Genet Epidemiol. 2010; 34(6):591–602.
Beroukhim R, Getz G, Nghiemphu L, Barretina J, Hsueh T, Linhart D, Vivanco I, Lee J, Huang J, Alexander S, Du J, Kau T, Thomas R, Shah K, Soto H, Perner S, Prensner J, Debiasi R, Demichelis F, Hatton C, Rubin M, Garraway L, Nelson S, Liau L, Mischel P, Cloughesy T, Meyerson M, Golub T, Lander E, Mellinghoff I, Sellers W. Assessing the significance of chromosomal aberrations in cancer: methodology and application to glioma. Proc Natl Acad Sci U S A. 2007; 104(50):20007–12.
Zhao M, Wang Q, Wang Q, Jia P, Zhao Z. Computational tools for copy number variation (CNV) detection using nextgeneration sequencing data: features and perspectives. BMC Bioinformatics. 2013; 14(Suppl 11):1.
Kuilman T, Velds A, Kemper K, Ranzani M, Bombardelli L, Hoogstraat M, Nevedomskaya E, Xu G, de Ruiter J, Lolkema MP, et al. CopywriteR: DNA copy number detection from offtarget sequence data. Genome Biol. 2015; 16(1):49.
Edgar R, Domrachev M, AE L. Gene expression omnibus: Ncbi gene expression and hybridization array data repository. Nucleic Acids Res. 2002; 30(1):207–10.
Sedgewick R, Wayne K. Algorithms, 4th edition: AddisonWesley; 2011. ISBN 9780321573513, pp. IXII, 1955.
Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, Marshall KA, Phillippy KH, Sherman PM, Holko M, et al. NCBI GEO: archive for functional genomics data sets–update. Nucleic Acids Res. 2013; 41(D1):991–5.
Northcott PA, Nakahara Y, Wu X, Feuk L, Ellison DW, Croul S, Mack S, Kongkham PN, Peacock J, Dubuc A, et al. Multiple recurrent genetic events converge on control of histone lysine methylation in medulloblastoma. Nat Genet. 2009; 41(4):465–72.
Chen R, Nishimura MC, Bumbaca SM, Kharbanda S, Forrest WF, Kasman IM, Greve JM, Soriano RH, Gilmour LL, Rivers CS, et al. A hierarchy of selfrenewing tumorinitiating cell types in glioblastoma. Cancer Cell. 2010; 17(4):362–75.
Liu L, Greger J, Shi H, Liu Y, Greshock J, Annan R, Halsey W, Sathe GM, Martin AM, Gilmer TM. Novel mechanism of lapatinib resistance in HER2positive breast tumor cells: activation of AXL. Cancer Res. 2009; 69(17):6871–878.
Haverty PM, Fridlyand J, Li L, Getz G, Beroukhim R, Lohr S, Wu TD, Cavet G, Zhang Z, Chant J. Highresolution genomic and expression analyses of copy number alterations in breast tumors. Genes, Chromosomes Cancer. 2008; 47(6):530–42.
Acknowledgements
The authors want to thank Henrik Bengtsson, Reinhard Furrer and Mark Robinson for helpful discussions. This study was supported by a grant from the Swiss Federation through the Swiss Contribution to the enlarged European Union.
Availability of supporting data
The GEO accession numbers of the arrays used in the study are provided in the Additional files and can be used to access GEO datasets as well as the corresponding processed data through the arrayMap resource. The CNARA software and a tutorial is available at https://github.com/baudisgroup/CNARA.
Authors’ contributions
NA conceived the study and developed computational methods; HC collected and labeled the data; CS provided additional test samples; NA and MB wrote the paper; All authors contributed to the biological insight, read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Ethics approval and consent to participate
The human array datasets utilized in this study represent publicly available datasets that were derived from the Gene Expression Onmibus (GEO [27]) resource at the National Center for Biotechnology Information (NCBI), with the possibly required informed consent lying with the orginal data submitters. For additional samples used for evaluation purposes (data not shown), informed consent was acquired from the patients with approval of the study design through the ethics committee of the University of Medicine and Pharmacology Timisoara.
Author information
Authors and Affiliations
Corresponding authors
Additional files
Additional file 1
Supplementary Methods. This document describes the computer simulation procedure for the 3 groups of copy number profiles in Figs. 1 and 2, the preprocessing procedure for the 1522 copy number profile dataset and the supporting Figures S1S5, and the procedure of building custom training set. (PDF 933 kb)
Additional file 2
Supplementary Table S1. A complete list of the 1522 copy number profiles, including GEO accession number and reliability label. (TSV 58 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Ai, N., Cai, H., Solovan, C. et al. CNARA: reliability assessment for genomic copy number profiles. BMC Genomics 17, 799 (2016). https://doi.org/10.1186/s1286401630747
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1286401630747
Keywords
 Copy number profile
 CNA
 Reliability assessment