Somatic copy number alterations (CNAs) are common genetic events in the development and progression of various human cancers, and significantly contribute to tumorigenesis [1, 2]. The coverage of CNAs in tumors varies from a few hundred to several million nucleotide bases, consisting of both deletions and amplifications with highly complex patterns [3, 4]. Recent advances in oligonucleotide-based single nucleotide polymorphism (SNP) arrays have made it possible to detect regional amplifications and deletions with high resolution on a genome-wide scale [5, 6]. A critical challenge in the genome-wide analysis of CNAs is to distinguish between the “driver” mutations that allow the tumor to initiate, grow, and persist, and the “passenger” mutations that represent random somatic events accumulated during tumorigenesis [1, 3, 7]. Identification of these “driver” alterations can provide important insights into the cellular defects that cause cancer and suggest potential diagnostic, prognostic, and targeted therapeutic strategies [1, 7, 8].
Significant Copy Number Aberrations (SCAs), defined as significantly recurrent CNAs that affect the same region in multiple tumors, are widely regarded as informative surrogates of “driver” mutations; when a sufficiently large collection of cancer samples is studied, they may help pinpoint novel cancer-causing genes [3, 9]. Past studies have detected many SCAs in a wide range of cancer types, covering many known oncogenes and tumor suppressor genes [1, 2, 7]. Several methods for finding SCA regions from CNA data have been described in the literature, in which the task of distinguishing sporadic CNAs from SCAs is largely a matter of statistical significance testing. Two reviews offering qualitative comparisons of these methods have been published [10, 11]. Despite their different algorithms, these methods often share a four-step strategy: (1) detect CNAs and separate deletions from amplifications; (2) design and calculate ensemble test statistics associated with a genomic locus; (3) construct and/or estimate the probability distribution of the test statistics under the null hypothesis; (4) perform multiple testing on a pool of genomic loci.
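The four-step strategy can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of any published method: the thresholds of ±0.3, the per-locus frequency statistic, the within-sample probe shuffling (which treats probes as independent), and the max-statistic correction are all illustrative choices, and every function name is hypothetical.

```python
import random

def call_cnas(log_ratios, gain_thr=0.3, loss_thr=-0.3):
    """Step 1: threshold log-ratios into separate amplification and deletion calls."""
    amps = [[1 if x > gain_thr else 0 for x in row] for row in log_ratios]
    dels = [[1 if x < loss_thr else 0 for x in row] for row in log_ratios]
    return amps, dels

def locus_frequency(calls):
    """Step 2: per-locus statistic, here the fraction of samples carrying a call."""
    n_samples = len(calls)
    return [sum(col) / n_samples for col in zip(*calls)]

def permutation_null(calls, n_perm=200, seed=0):
    """Step 3: null distribution of the genome-wide maximum statistic.

    Assumption: calls are shuffled across probes within each sample,
    which ignores the correlation between neighboring probes.
    """
    rng = random.Random(seed)
    null_max = []
    for _ in range(n_perm):
        shuffled = [rng.sample(row, len(row)) for row in calls]
        null_max.append(max(locus_frequency(shuffled)))
    return null_max

def adjusted_p_values(stats, null_max):
    """Step 4: max-statistic permutation p-values (one way to handle multiple testing)."""
    n = len(null_max)
    return [(1 + sum(m >= s for m in null_max)) / (n + 1) for s in stats]

# Toy data: 20 samples x 50 probes, a recurrent gain at probes 10-12
# and one sporadic gain per sample at a sample-specific probe.
log_ratios = []
for i in range(20):
    row = [0.0] * 50
    row[10:13] = [0.8, 0.8, 0.8]
    row[20 + i] = 0.5
    log_ratios.append(row)

amps, dels = call_cnas(log_ratios)
stats = locus_frequency(amps)
p_values = adjusted_p_values(stats, permutation_null(amps))
```

In this toy example the recurrent region receives the smallest attainable p-value, while the sporadic gains do not stand out from the permutation null.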
Significance testing for aberrant copy number (STAC) starts by converting the normalized log-ratios into a binary matrix, with zeros indicating no change and ones indicating losses and gains. STAC then proposes two statistics (footprint and frequency) to define regions of SCAs while adjusting for multiple comparisons, where the null hypothesis is that the CNAs detected in single-sample analysis are realizations of random CNA placements whose probability distribution is generated by permutations on CNA segments. Genomic Identification of Significant Targets in Cancer (GISTIC) works on the real-valued step function of log-ratios, which allows it to exploit both the type (amplification/deletion) and the amplitude of CNAs [1, 3]. Using a semi-parametric permutation that assumes independence between probes, GISTIC calculates a score based on both the amplitude and frequency of CNAs at each probe position and subsequently identifies regions of SCAs; amplifications and deletions are handled separately, and arm-level and focal CNAs are further analyzed independently. Aiming to combine information from neighboring probes with the amplitude and frequency of CNAs at each probe position, Kernel Convolution – a Statistical Method for Aberrant Regions detection (KC-SMART) uses varying-width kernel functions to calculate the test statistics from the original log-ratios across multiple samples, producing the kernel smoothed estimate (KSE) at each locus by locally weighted regression. SCAs are selected based on a permutation-generated null distribution and Bonferroni correction. To substantially reduce the computational burden of analyzing high-resolution, large-population data, correlation matrix diagonal segmentation (CMDS) identifies SCAs through a between-chromosomal-site correlation analysis that directly uses the raw intensity ratios across all samples. CMDS uses a correlation statistic to detect SCAs, with a standard normal null distribution whose parameters are estimated directly from the data, and adjusts for multiple comparisons by controlling the false discovery rate.
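To make the differences between these per-probe statistics concrete, the sketch below computes three simplified analogues on the same toy data: a frequency on a binary aberration matrix (in the spirit of STAC), an amplitude-and-frequency score for amplifications only (in the spirit of GISTIC), and a Gaussian-kernel weighted average of summed positive log-ratios (in the spirit of KC-SMART). These are illustrative approximations only; the threshold, the fixed bandwidth, the zero-degree (weighted-average) form of local regression, and all function names are assumptions and do not reproduce the published algorithms.

```python
import math

def binary_frequency(log_ratios, thr=0.3):
    """Fraction of samples with any aberration (gain or loss) at each probe."""
    n = len(log_ratios)
    return [sum(abs(x) > thr for x in col) / n for col in zip(*log_ratios)]

def amplification_score(log_ratios, thr=0.3):
    """Sum of above-threshold amplitudes divided by the number of samples,
    i.e. frequency times mean amplitude among amplified samples."""
    n = len(log_ratios)
    return [sum(x for x in col if x > thr) / n for col in zip(*log_ratios)]

def kernel_smoothed_score(log_ratios, positions, bandwidth):
    """Gaussian-kernel weighted average of positive log-ratios summed over samples."""
    summed = [sum(max(x, 0.0) for x in col) for col in zip(*log_ratios)]
    smoothed = []
    for p in positions:
        weights = [math.exp(-0.5 * ((p - q) / bandwidth) ** 2) for q in positions]
        smoothed.append(sum(w * s for w, s in zip(weights, summed)) / sum(weights))
    return smoothed

# Toy data: 4 samples x 10 probes spaced 1 kb apart; probe 5 is gained in
# two samples and lost in a third.
positions = [1000 * j for j in range(10)]
toy = [[0.0] * 10 for _ in range(4)]
toy[0][5] = 1.0
toy[1][5] = 1.0
toy[2][5] = -1.0

freq = binary_frequency(toy)
amp = amplification_score(toy)
kse = kernel_smoothed_score(toy, positions, bandwidth=1000.0)
```

At probe 5 the binary frequency counts the loss together with the two gains, whereas the amplification score reflects the gains alone; the kernel-smoothed score additionally spreads signal onto neighboring probes whose raw scores are zero.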
Existing methods have several limitations. When working with unprocessed raw intensity ratios [13, 15, 16], most methods are oblivious to noise that can significantly confound estimation of the null distribution for true yet sporadic CNAs [9, 17]. Furthermore, these methods cannot distinguish between the contributions of amplifications and deletions to the overall test statistics, which may reduce the power to detect SCAs. While some effort has been made to incorporate correlation among neighboring probes into the test statistics, most methods assign a score to, and test the significance at, each individual probe locus [14, 15]. In addition, while it is widely accepted that CNA signals at adjacent probes are highly correlated [9, 13–15], probe independence is often assumed when constructing and learning the null distribution, probably for mathematical convenience [3, 16]. Moreover, existing permutation experiments using multiple samples cannot separate the contributions of sporadic CNAs (which obey the null distribution) from those of actual SCAs (which deviate from it) when estimating the null distribution, resulting in theoretically conservative estimates, especially when the number of true SCAs participating in the permutation is large.
We now report Significant Aberration in Cancer (SAIC), a carefully motivated method for accurately identifying SCAs using CNA data from multiple samples. To distinguish between the different biological roles of CNA types, and between noise and sporadic CNAs, we use discretized CNA data and analyze copy number amplifications and deletions separately. By exploiting the intrinsic correlation among consecutive probes, we calculate and assign a score (test statistic) to each CNA unit instead of each single probe, based on both the amplitude and frequency of CNAs within the unit. To accurately estimate the null distribution governing sporadic CNAs, we perform random positional permutations on CNA units, which preserve the correlations inherent to the copy number data. More importantly, to minimize the unwanted participation of true SCAs in determining the null distribution [3, 14], we iteratively detect SCAs and estimate an unbiased null distribution using an SCA-exclusive permutation scheme.
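The idea of an SCA-exclusive permutation scheme can be illustrated with a short sketch. It is not the SAIC implementation: it assumes that CNA units have already been defined and discretized into a samples-by-units matrix for one aberration type, that the unit score is the mean amplitude across samples, and that significance is assessed against the permutation distribution of the maximum score. All names and parameter values are hypothetical.

```python
import random

def unit_scores(units):
    """Score per CNA unit: mean discretized amplitude across samples."""
    n_samples = len(units)
    return [sum(col) / n_samples for col in zip(*units)]

def null_max_scores(units, keep, n_perm, rng):
    """Permute unit positions within each sample, using only the units in `keep`.

    Whole units are moved, so probes inside a unit stay together.
    """
    pool = [[row[u] for u in keep] for row in units]
    null = []
    for _ in range(n_perm):
        shuffled = [rng.sample(row, len(row)) for row in pool]
        null.append(max(unit_scores(shuffled)))
    return null

def iterative_sca_detection(units, alpha=0.05, n_perm=500, seed=0, max_iter=20):
    """Alternate between detecting SCAs and re-estimating the null without them."""
    rng = random.Random(seed)
    scores = unit_scores(units)
    significant = set()
    for _ in range(max_iter):
        keep = [u for u in range(len(scores)) if u not in significant]
        if not keep:
            break
        null = null_max_scores(units, keep, n_perm, rng)
        newly = {
            u for u in keep
            if (1 + sum(m >= scores[u] for m in null)) / (n_perm + 1) < alpha
        }
        if not newly:
            break  # the null no longer changes once no new SCA is found
        significant |= newly
    return sorted(significant)

# Toy data: 20 samples x 30 units. Unit 3 is amplified in every sample,
# unit 12 in 8 samples, and each sample carries one sporadic amplification.
units = []
for i in range(20):
    row = [0] * 30
    row[3] = 1
    if i < 8:
        row[12] = 1
    row[15 + (i % 10)] = 1
    units.append(row)

detected = iterative_sca_detection(units)
```

Each pass removes the units already called significant from the permutation pool, so later passes estimate the null from units that are more plausibly sporadic.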
We tested SAIC on extensive simulation data sets and observed significantly improved performance, with larger areas under the Receiver Operating Characteristic (ROC) curves and higher sensitivities at acceptably low false discovery rates, compared with four popular peer methods (GISTIC, STAC, KC-SMART, and CMDS). We then applied SAIC to four real benchmark data sets, where it identified the majority (84%) of previously reported SCAs harboring regions associated with well-known tumor-causing genes and, more importantly, detected some novel SCAs that are partially validated by the presence of known cancer-related genes.