seqCNA: an R package for DNA copy number analysis in cancer using high-throughput sequencing
BMC Genomics volume 15, Article number: 178 (2014)
Deviations in the amount of genomic content that arise during tumorigenesis, called copy number alterations, are structural rearrangements that can critically affect gene expression patterns. Additionally, copy number alteration profiles allow insight into cancer discrimination, progression and complexity. On data obtained from high-throughput sequencing, improving quality through GC bias correction and keeping false positives to a minimum help build reliable copy number alteration profiles.
We introduce seqCNA, a parallelized R package for an integral copy number analysis of high-throughput sequencing cancer data. The package includes novel methodology on (i) filtering, reducing false positives, and (ii) GC content correction, improving copy number profile quality, especially under great read coverage and high correlation between GC content and copy number. Adequate analysis steps are automatically chosen based on availability of paired-end mapping, matched normal samples and genome annotation.
seqCNA, available through Bioconductor, provides accurate copy number predictions in tumoural data, thanks to the extensive filtering and better GC bias correction, while providing an integrated and parallelized workflow.
Genomic structural rearrangements are a hallmark of cancer. Among them, deviations in the amount of genomic content that arise during tumorigenesis, called copy number alterations (CNAs), can critically affect gene expression patterns [1, 2]. A priori, expression could be expected to correlate with gene dosage. However, although a certain global correlation exists, individually, it has not been observed to be linear and, in some cases, it can even be inverse . This is due to a range of dosage regulation mechanisms  that confer the cell with robustness to the presence of CNAs. Still, greater expression variability arises from such regulation  and, if large amounts of DNA are affected by CNAs, cell control cannot be kept . Furthermore, post-transcriptional and post-translational modifications, folding stability and gene-protein and protein-protein interactions greatly mask the effect of copy number changes, but correlation between gene dosage and protein expression has been found to be greater in the case of oncogenes . Indeed, the importance of CNAs in cancer is demonstrated by the existence of CNA patterns that allow to differentiate between cancer types [2, 8] and to analyze cancer progression and complexity .
High-throughput technologies, including array comparative genomic hybridization (aCGH), single nucleotide polymorphism (SNP) arrays and high-throughput sequencing (HTS) follow a similar computational analysis workflow for the detection of CNAs: preprocessing of raw data, copy number profile segmentation and CNA calling, based on the average values of the resulting segments. The first step, preprocessing, is vital for improved CNA detection and is often underrated . For HTS data, this step starts with read summarization, which involves counting the number of reads that fall within genomic windows, typically non-overlapping and fix-sized. The result is a window read count (RC) profile, which is a proxy to the true copy number profile. Some reads, such as those with low mapping quality , can be filtered during summarization, while whole windows can be filtered afterwards. Preprocessing may continue with normalization, which corrects for technical or biological factors that confound the true copy number profile, mainly the GC content . Normalization against a matched normal sample allows a better correction of confounding factors  and reduces the need for filters, but it is not always available , hence the relevance of optimal filtering and GC content bias correction.
Here, we present a user-friendly and highly-parallelized R package, called seqCNA, which allows an integrated copy number analysis workflow. The package includes novel methodology on (i) window filtering, reducing false positives in comparison to assessed existing methods, and (ii) GC content correction, improving profile quality, especially under great read coverage and high correlation between GC content and copy number.
seqCNA is available as an R package through the Bioconductor project . It depends on the GLAD , adehabitatLT , doSNOW  and seqCNA.annot R packages, which are automatically downloaded from the Bioconductor and CRAN  repositories as needed. The seqCNA.annot companion package contains annotation on GC content, mappability and presence of common CNVs for the included genome builds, enabling several optional steps of the analysis.
An integrated read summarization function, seqsumm, written in C++ and interfacing with the R code through Rcpp, makes seqCNA the only tool required to obtain copy number profiles from SAM alignment files (SAMtools  is necessary to read BAM files). The subsequent functions in the package return visual feedback throughout the analysis and require little parameterization. A vignette with a worked example and detailed help on functions and parameters are included within the package.
Results and discussion
The seqsumm function summarizes read counts into windows of the selected size, but it also classifies paired-end mapping (PEM) reads based on their SAM flags - which consider read pairing, separation and orientation - and calculates mean window mapping quality, enabling two of the five window filters available in seqCNA. The first filter involves PEM read classification, which has previously been used to select reads prior to summarization  and to detect the limits of structural genomic rearrangements, including CNVs . Improper reads are considered those that are not in read pairs with correct separation and orientation. We saw that genomic windows with an elevated proportion of improper reads tend to be outliers in the RC profile, probably indicating the presence of intra-window structural polymorphisms. If these windows are not of interest, they can be filtered by setting a maximum proportion of improper reads within each window. Second, directly filtering low mapping quality reads reduces the signal-to-noise ratio (SNR) of the RC profile  so, instead, seqCNA provides a filter that discards windows, based on the mean mapping quality of proper reads. A third filter, the trimming filter, removes windows with extreme RC values, with the distinctive feature that a prior correction against GC content is performed, avoiding trimming extreme RCs that are only due to extreme GC content. The remaining two filters discard windows with the presence of common CNVs described by Altshuler et al.  and low mappability, where the mappability of a window reflects the uniqueness of 35-nucleotide long sequences within it .
A matched normal sample is preferable (but not necessary) for the PEM-based, mapping quality and trimming filters, in order to prevent biases that arise due to the presence of CNAs. For instance, some extreme RCs on an unpaired tumoural profile can be due to CNAs and should be kept, so the process of trimming should be able to spot them to avoid their filtering. For the matter, the trimming filter in seqCNA uses an algorithm based on the Wald-Wolfowitz runs test to tell apart CNAs from outliers on unpaired tumoural profiles. It measures the randomness of the position of those windows with RC above a certain threshold, where the higher the threshold, the greater the randomness due to outliers. The threshold is set where a sudden change in randomness occurs due to the inclusion of adjacent windows, not likely to be outliers (see Additional file 1, Automatic trimming Section, for more details). The five filters are independent and are applicable based on availability of PEM reads, matched normal sample and genome build annotation. While each filter targets windows with a specific behavior, many windows are captured by more than one filter, increasing the filtering robustness (see Figure 1).
The next step of the analysis is normalization, which accounts for confounding factors in the RC profile. If a matched normal sample is not available, normalization involves GC content bias correction, which removes a great part of the observed bias. The described relationship between GC content and RC  is generally non-linear, so, until recently, GC content correction has been tackled through local regression (LOESS) or polynomial fitting [10, 11, 26]. Such approach is adequate for genomes where (i) gains and losses account for a small fraction of the genome or they are rather balanced and (ii) GC content distribution is similar among regions with different copy numbers, but this is not typically the case of cancer genomes, where a linear relationship between copy number and GC content produces a bias in the regression. The GC correction algorithm in FREEC  tries to improve mere regression by looking for the GC content curve of the main copy number, sampling window densities at specific GC content levels. Thus, FREEC is able to handle the possible correlation between copy number and GC content, as long as the correlation does not affect the convergence of the algorithm towards the main copy number. Furthermore, such algorithm works under the assumption of known sample ploidy and proportionality between RC and copy number, which may be shifted by the presence of subclones in the cell population.
The approach we propose for GC bias correction, called seqnorm, accounts for the correlation between copy number and GC content independently of sample characteristics. textitseqnorm is a two-iteration algorithm, with a first pass regression that removes much of the GC bias and a second step that accounts for the correlation between GC content and copy number before a second pass regression. While the first regression is sensitive to the correlation and can, therefore, under- or over-correct the GC curve, the GC bias generally decreases. Afterwards, GLAD  produces a segmented profile in a way that segments represent the maximal neighbourhoods in which the local constant assumption of the statistical model holds. Such property is interesting because low intra-segment variability is key to the second regression, which is applied segment-wise. Namely, segments with the highest RC variability, as well as those spanning few windows, may not provide robust enough fits. In turn, those with little GC content variability do not allow estimating the effect of extreme GC content. Therefore, segments undergo a selection process (see Additional file 1, seqnorm Section, for more details). Centering the selected segments removes the read count differences due to copy number changes, essentially removing the undesired correlation. Thus, the subsequent segment-wise regressions provide good approximations to the genome-wide effect of GC content on read counts without the bias that emerges from the correlation and their median gives a robust estimate of the true effect (see Figure 2). Although devised to improve GC content normalization, seqnorm can also be used to normalize against matched paired normal.
Existing tools for CNA detection on HTS data perform a basic preprocessing and put their focus on the copy number calling step, taking advantage of allele-specific information [14, 27, 28] and automatically predicting copy number profiles [10, 14, 27]. On the other hand, seqCNA focuses on preprocessing and provides additional simple methods to obtain the final copy number profiles. Hence, the analysis is completed by segmenting the seqnorm-corrected profile with GLAD and, through visual assessment, defining copy number limits (Figure 3).
For the methodology assessment, we used data from human samples according to the Declaration of Helsinki, the European Guidelines on Good Clinical Practice, relevant national and regional authority requirements and Hospital Clinico San Carlos’s Clinical Investigation Ethics Committee (Madrid, Spain). Informed consent was obtained from every subject.
In a recent comparison , we found GAP  to be the best performing CNA-detecting method on SNP-array data. Knowing that, we compared the results from seqCNA, GAP and state-of-the-art CNA-detecting methods on HTS data, namely FREEC , CNAnorm  and Patchwork , over two colon cancer samples we hybridized on SNP-arrays and paired-end sequenced (raw data is deposited at the European Genome-phenome Archive (EGA) under accession number EGAS00001000558). seqCNA’s copy number profiles were the closest ones to GAP’s (Additional file 1: Table S3) and presented the lowest departure from consensus profiles: 0.04% in both cases (Additional file 1: Table S2). The novel PEM-based and mapping quality filters are the main reason behind the reduced false positive rate in comparison to FREEC and CNAnorm (Additional file 1: Figures S1 and S3).
In order to assess seqnorm’s performance, we built a semi-simulated dataset from the combination of the copy number profiles of 676 cell-lines , determined using PICNIC , and the RC variability of 7 non-tumoural samples (see Additional file 1, Simulated dataset Section, for more details). On the semi-simulated dataset, real cell-line  and prostate cancer samples  the median signal-to-noise ratio (SNR) improvement with respect to typical regression ranged from 1% to 4% (Additional file 1: Figure S9), with a maximum improvement between 27% and 77%, where 2% already yields a visibly more defined density map (see Figure 2). We investigated the factors that affect this improvement and saw that it directly depends on: (i) the SNR of the RC profile and (ii) the correlation between GC content and copy number, which tends to be greater when the top main copy number (spanning at least 5% of the genome) is 4 or 5 and there are between 3 and 4 distinct main copy numbers (Additional file 1: Figure S11).
As of the initial release, the annotation package covers the human genome with builds hg18 and hg19. In the future, we aim at extending the package to include further genomes and builds based on users’ necessities.
We are also pondering possible extensions to the main package, including relevant region annotation and support for side-by-side displaying of copy number profiles in existing databases. Increased automation is also plausible, especially in the final calling step, but we reckon that improved methodology needs to be developed in this regard in order to replace human judgment. In general, while GC bias correction is a mature issue, we expect the filtering and calling steps to see further developments. Specifically, additional intelligent filter threshold selection and multifactorial filtering are issues that remain open.
We have presented seqCNA, a tool that allows integral analyses for the detection of CNAs in HTS tumoural data and provides relevant advancements in the preprocessing steps. Namely, it incorporates a novel normalization method, seqnorm, which significantly improves the performance of typical regression, especially on samples with high SNR (e.g. due to greater coverage) and under high correlation between GC content and copy number. The tool also incorporates novelties for the filtering of windows in RC profiles - thus reducing the amount of false positives, including a PEM-based filter, a method that automatically sets trimming thresholds and a sensible window filter that replaces the removal of low quality reads.
Availability and requirements
Project name: seqCNA
Operating system(s): Platform independent
Programming language: R
Other requirements: SAMtools (only if using BAM files)
Any restrictions to use by non-academics: None
Stratton MR, Campbell PJ, Futreal PA: The cancer genome. Nature. 2009, 458 (7239): 719-724. 10.1038/nature07943.
Zack TI, Schumacher SE, Carter SL, Cherniack AD, Saksena G, Tabak B, Lawrence MS, Zhang C-Z, Wala J, Mermel CH, Sougnez C, Gabriel SB, Hernandez B, Shen H, Laird PW, Getz G, Meyerson M, Beroukhim R: Pan-cancer patterns of somatic copy number alteration. Nat Genet. 2013, 45 (10): 1134-1140. 10.1038/ng.2760.
Schlattl A, Anders S, Waszak SM, Huber W, Korbel JO: Relating cnvs to transcriptome data at fine resolution: assessment of the effect of variant size, type, and overlap with functional regions. Genome Res. 2011, 21 (12): 2004-2013. 10.1101/gr.122614.111.
Straub T, Becker PB: Dosage compensation: the beginning and end of generalization. Nat Rev Genet. 2007, 8 (1): 47-57. 10.1038/nrg2013.
Henrichsen CN, Vinckenbosch N, Zöllner S, Chaignat E, Pradervand S, Schütz F, Ruedi M, Kaessmann H, Reymond A: Segmental copy number variation shapes tissue transcriptomes. Nat Genet. 2009, 41 (4): 424-429. 10.1038/ng.345.
Veitia RA, Bottani S, Birchler JA: Cellular reactions to gene dosage imbalance: genomic, transcriptomic and proteomic effects. Trends Genet. 2008, 24 (8): 390-397. 10.1016/j.tig.2008.05.005.
Geiger T, Cox J, Mann M: Proteomic changes resulting from gene copy number variations in cancer cells. PLoS Genet. 2010, 6 (9): 1001090-10.1371/journal.pgen.1001090.
Beroukhim R, Mermel CH, Porter D, Wei G, Raychaudhuri S, Donovan J, Barretina J, Boehm JS, Dobson J, Urashima M, Mc Henry KT, Pinchback RM, Ligon AH, Cho YJ, Haery L, Greulich H, Reich M, Winckler W, Lawrence MS, Weir BA, Tanaka KE, Chiang DY, Bass AJ, Loo A, Hoffman C, Prensner J, Liefeld T, Gao Q, Yecies D, Signoretti S, et al: The landscape of somatic copy-number alteration across human cancers. Nature. 2010, 463 (7283): 899-905. 10.1038/nature08822.
Liu P, Lacaria M, Zhang F, Withers M, Hastings P, Lupski JR: Frequency of nonallelic homologous recombination is correlated with length of homology: evidence that ectopic synapsis precedes ectopic crossing-over. Am J Hum Genet. 2011, 89 (4): 580-588. 10.1016/j.ajhg.2011.09.009.
Gusnanto A, Wood HM, Pawitan Y, Rabbitts P, Berri S: Correcting for cancer genome size and tumour cell content enables better estimation of copy number alterations from next-generation sequence data. Bioinformatics. 2012, 28 (1): 40-47. 10.1093/bioinformatics/btr593.
Yoon S, Xuan Z, Makarov V, Ye K, Sebat J: Sensitive and accurate detection of copy number variants using read depth of coverage. Genome Res. 2009, 19 (9): 1586-1592. 10.1101/gr.092981.109.
Dohm JC, Lottaz C, Borodina T, Himmelbauer H: Substantial biases in ultra-short read data sets from high-throughput dna sequencing. Nucleic Acids Res. 2008, 36 (16): 105-105. 10.1093/nar/gkn425.
Janevski A, Varadan V, Kamalakaran S, Banerjee N, Dimitrova N: Effective normalization for copy number variation detection from whole genome sequencing. BMC Genomics. 2012, 13 (Suppl 6): 16-10.1186/1471-2164-13-S6-S16.
Boeva V, Zinovyev A, Bleakley K, Vert J-P, Janoueix-Lerosey I, Delattre O, Barillot E: Control-free calling of copy number alterations in deep-sequencing data using gc-content normalization. Bioinformatics. 2011, 27 (2): 268-269. 10.1093/bioinformatics/btq635.
Stransky N, Thiery J-P, Radvanyi F, Barillot E, Hupé P: Analysis of array cgh data: from signal ratio to gain and loss of dna regions. Bioinformatics. 2004, 20 (18): 3413-3422. 10.1093/bioinformatics/bth418.
Calenge C: The package adehabitat for the r software: tool for the analysis of space and habitat use by animals. Ecol Model. 2006, 197: 1035-
Tierney L, Rossini AJ, Li N: Snow: A parallel computing framework for the r system. Int J Parallel Program. 2009, 37 (1): 78-90. 10.1007/s10766-008-0077-2.
The Comprehensive R Archive Network. [http://cran.r-project.org/],
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project DataProcessingSubgroup: The sequence alignment/map format and samtools. Bioinformatics. 2009, 25 (16): 2078-2079. 10.1093/bioinformatics/btp352.
Teo SM, Pawitan Y, Ku CS, Chia KS, Salim A: Statistical challenges associated with detecting copy number variations with next-generation sequencing. Bioinformatics. 2012, 28 (21): 2711-2718. 10.1093/bioinformatics/bts535.
Korbel JO, Urban AE, Affourtit JP, Godwin B, Grubert F, Simons JF, Kim PM, Palejev D, Carriero NJ, Du L, Taillon BE, Chen Z, Tanzer A, Saunders AC, Chi J, Yang F, Carter NP, Hurles ME, Weissman SM, Harkins TT, Gerstein MB, Egholm M, Snyder M: Paired-end mapping reveals extensive structural variation in the human genome. Science. 2007, 318 (5849): 420-426. 10.1126/science.1149504.
Magi A, Tattini L, Pippucci T, Torricelli F, Benelli M: Read count approach for dna copy number variants detection. Bioinformatics. 2012, 28 (4): 470-478. 10.1093/bioinformatics/btr707.
Altshuler DM, Gibbs RA, Peltonen L, Dermitzakis E, Schaffner SF, Yu F, Bonnen PE, De Bakker P, Deloukas P, Gabriel SB, Gwilliam R, Hunt S, Inouye M, Jia X, Palotie A, Parkin M, Whittaker P, Chang K, Hawes A, Lewis L Ren Y, Wheeler D, Muzny DM, Barnes C, Darvishi K, Hurles M, Korn JM, Kristiansson K, Lee C, McCarrol SA, et al: Integrating common and rare genetic variation in diverse human populations. Nature. 2010, 467 (7311): 52-58. 10.1038/nature09298.
ENCODE Mapability Data. [http://hgdownload.cse.ucsc.edu/goldenPath/hg18/encodeDCC/wgEncodeMapability/],
Abyzov A, Urban AE, Snyder M, Gerstein M: Cnvnator: an approach to discover, genotype, and characterize typical and atypical cnvs from family and population genome sequencing. Genome Res. 2011, 21 (6): 974-984. 10.1101/gr.114876.110.
Yau C: Oncosnp-seq: a statistical approach for the identification of somatic copy number alterations from next-generation sequencing of cancer genomes. Bioinformatics. 2013, 29 (19): 2482-2484. 10.1093/bioinformatics/btt416.
Mayrhofer M, DiLorenzo S, Isaksson A: Patchwork: allele-specific copy number analysis of whole genome sequenced tumor tissue. Genome Biol. 2013, 14: 24-10.1186/gb-2013-14-3-r24.
Mosén-Ansorena D, Aransay A, Rodríguez-Ezpeleta N: Comparison of methods to detect copy number alterations in cancer using simulated and real genotyping data. BMC Bioinformatics. 2012, 13 (1): 192-10.1186/1471-2105-13-192.
Popova T, Stoppa-Lyonnet D, Rigaill G, Barillot E, Stern MH, Manié E: Genome alteration print (gap): a tool to visualize and mine complex cancer genomic profiles obtained by snp arrays. Genome Biol. 2009, 10 (11): 128-128. 10.1186/gb-2009-10-11-r128.
Bignell GR, Greenman CD, Davies H, Butler AP, Edkins S, Andrews JM, Buck G, Chen L, Beare D, Latimer C, Widaa S, Hinton J, Fahey C, Fu B, Swamy S, Dalgliesh GL, Teh BT, Deloukas P, Yang F, Campbell PJ, Futreal PA, Stratton MR: Signatures of mutation and selection in the cancer genome. Nature. 2010, 463 (7283): 893-898. 10.1038/nature08768.
Greenman CD, Bignell G, Butler A, Edkins S, Hinton J, Beare D, Swamy S, Santarius T, Chen L, Widaa S, Futreal PA, Stratton MR: Picnic: an algorithm to predict absolute allelic copy number variation with microarray cancer data. Biostatistics. 2010, 11 (1): 164-175. 10.1093/biostatistics/kxp045.
Chiang DY, Getz G, Jaffe DB, O’Kelly MJ, Zhao X, Carter SL, Russ C, Nusbaum C, Meyerson M, Lander ES: High-resolution mapping of copy-number alterations with massively parallel sequencing. Nat Methods. 2008, 6 (1): 99-103.
Berger MF, Lawrence MS, Demichelis F, Drier Y, Cibulskis K, Sivachenko AY, Sboner A, Esgueva R, Pflueger D, Sougnez C, Onofrio R, Carter SL, Park K, Habegger L, Ambrogio L, Fennell T, Parkin M, Saksena G, Voet D, Ramos AH, Pugh TJ, Wilkinson J, Fisher S, Winckler W, Mahan S, Ardlie K, Baldwin J, Simons JW, Kitabayashi N, MacDonald TY, et al: The genomic complexity of primary human prostate cancer. Nature. 2011, 470 (7333): 214-220. 10.1038/nature09744.
DMA is supported by the Navarra Government. AMA, partial research expenses and article-processing charges are supported by the Basque Country Government (Etortek Research Programs 2010/2012) and Bizkaia County’s Innovation Technology Department. This research was also funded by the Spanish Government (CDTI 2008-2010) and the Basque Country Government (Gaitek 2008-2009) under the Eurotransbio 2007 call.
The authors declare that they have no competing interests.
DMA developed the methodology, software and corresponding testing. AMA supervised the project, was responsible for the sequencing of the contributed samples and assessed the user interface. NT worked on the clinical data classification and in the sample selection. SV, VDO and MLM collected the tumoural and non-tumoural tissues from colorectal neoplasms after surgery, extracted the DNA and purified it. All authors have read and approved the manuscript as submitted.
Electronic supplementary material
About this article
Cite this article
Mosen-Ansorena, D., Telleria, N., Veganzones, S. et al. seqCNA: an R package for DNA copy number analysis in cancer using high-throughput sequencing. BMC Genomics 15, 178 (2014). https://doi.org/10.1186/1471-2164-15-178