Inferring single-cell copy number profiles through cross-cell segmentation of read counts

Background Copy number alteration (CNA) is one of the major genomic variations that frequently occur in cancers, and accurate inference of CNAs is essential for unmasking intra-tumor heterogeneity (ITH) and tumor evolutionary history. Single-cell DNA sequencing (scDNA-seq) makes it convenient to profile CNAs at single-cell resolution, and thus aids in better characterization of ITH. Despite that several computational methods have been proposed to decipher single-cell CNAs, their performance is limited in either breakpoint detection or copy number estimation due to the high dimensionality and noisy nature of read counts data. Results By treating breakpoint detection as a process to segment high dimensional read count sequence, we develop a novel method called DeepCNA for cross-cell segmentation of read count sequence and per-cell inference of CNAs. To cope with the difficulty of segmentation, an autoencoder (AE) network is employed in DeepCNA to project the original data into a low-dimensional space, where the breakpoints can be efficiently detected along each latent dimension and further merged to obtain the final breakpoints. Unlike the existing methods that manually calculate certain statistics of read counts to find breakpoints, the AE model makes it convenient to automatically learn the representations. Based on the inferred breakpoints, we employ a mixture model to predict copy numbers of segments for each cell, and leverage expectation–maximization algorithm to efficiently estimate cell ploidy by exploring the most abundant copy number state. Benchmarking results on simulated and real data demonstrate our method is able to accurately infer breakpoints as well as absolute copy numbers and surpasses the existing methods under different test conditions. DeepCNA can be accessed at: https://github.com/zhyu-lab/deepcna. Conclusions Profiling single-cell CNAs based on deep learning is becoming a new paradigm of scDNA-seq data analysis, and DeepCNA is an enhancement to the current arsenal of computational methods for investigating cancer genomics. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-023-09901-5.


Background
Cancer genomes are often featured with extensive aberrations such as copy number alterations (CNA) and point mutations [1].Accurate detection of the mutation profiles is essential for deciphering underlying intra-tumor heterogeneity (ITH) and elucidating tumor evolutionary history [2,3].Specifically, characterizing CNAs from high-throughput sequencing data has emerged as an important field of cancer related studies.
Conventional bulk sequencing has spawned a number of computational methods for CNA calling based on read counts data [4][5][6].A bulk sample contains the DNA of thousands or even millions of cells, therefore inferred CNAs are the weighted average of the mutational profiles of present tumor clones.To understand the clonal structure, one needs to deconvolve the mixed profile into distinct clones and simultaneously infer their respective mutations.A main challenge of this deconvolution comes from the ambiguity that the number of tumor clones, their relative sizes, their mutational states and their genealogy are all undetermined [7].Nowadays singlecell DNA sequencing (scDNA-seq) is becoming a powerful means for profiling single-cell mutations [8], and makes it convenient to elucidate ITH based on single-cell CNAs.scDNA-seq uses whole-genome amplification (WGA) process to generate sufficient DNA material for sequencing from a single cell.The generated nucleotide sequences from WGA are often confounded by amplification bias and errors [9,10], which results in overdispersed read counts and makes CNA calling methods originally designed for bulk data less effective to noisy scDNA-seq data.
To date, a number of computational methods [11][12][13][14][15][16][17][18][19] have been specifically developed to infer single-cell CNAs.These methods adopt a general pipeline that mainly consists of three steps: 1) divide the genome into bins and count reads within each bin (the size of bins can be fixed or variable); 2) preprocess read counts to correct bias caused by GC-content and mappability, remove outlier bins having extremely high or low read counts, and exclude cells that show significantly uneven coverage or are low in sequencing coverage; 3) identify breakpoints that partition the genome into segments (any two adjacent segments have different copy numbers), and call absolute copy number for each segment [20].Most of the approaches require normal cells or a composite normal sample as negative control.For instance, SCOPE [13] uses Gini coefficient to identify normal cells and remove outlier cells, then jointly segments the cells using a generalized likelihood ratio test as well as a modified Bayes information criterion (BIC).SCNV [12] uses normal cells to construct a composite control, and detects CNAs bases on a bin-free segmentation algorithm.Another interesting method SCICoNE [14] is developed to detect CNAs based on reconstruction of a CNA tree.By using SCOPE to preprocess the read counts, SCYN [15] employs a dynamic programming algorithm to segment the genomes.SeCNV [18] is developed to estimate copy number profiles from large scDNA-seq dataset, it needs to find normal cells from input data and uses them as the negative control.A recently proposed method SCONCE [17] uses hidden Markov model (HMM) to call copy number segments for tumor cells, and uses normal cells as controls to attenuate the effects of read counts fluctuation.Aforementioned advanced methods generally require normal cells to normalize read counts for better detection of CNAs, and how to accurately infer single-cell CNAs without negative controls is a hard challenge.To address this problem, rcCAE [19] employs a convolutional autoencoder to distill copy number information from noisy read counts without normal cells as the controls, it can jointly infer tumor clones and singlecell CNAs.However, rcCAE tends to over-segment the genome because it detects breakpoints under a cell-bycell manner, which may complicate downstream phylogenetic analysis.Cross-cell segmentation tends to deliver more accurate predictions of the breakpoints since breakpoints are shared among single cells and could be treated as a process that identifies consecutive bins having highly similar read counts across cells.Considering current scDNA-seq data could contain thousands of cells, segmentation over such a high-dimensional space may suffer from high computation intensity and degraded accuracy, therefore methods for this purpose is still highly required especially on large datasets.
Deep learning models have shown excellent performance in learning latent representations of high-dimensional single-cell data.For instance, variational autoencoder models enable more accurate clustering of single cells over a latent space [21,22].The encoding-decoding process could be treated as a distillation process that fetches effective latent representations from noisy single-cell data by reconstructing the original data.For cross-cell segmentation of the read count sequence, each genomic bin could be encoded into a latent space using an autoencoder model, thus breakpoints could be found by efficient segmentation over the learned low-dimensional sequential data.
In this paper, we introduce a novel computational method called DeepCNA for inferring single-cell CNAs (as shown in Fig. 1).In DeepCNA, we use an autoencoder to learn low-dimensional latent representation of each genomic bin, then apply the circular binary segmentation (CBS) algorithm [23] on each dimension of the latent data to find the breakpoints, and finally employ a mixture model to estimate copy numbers for each cell.We adopt several data filtering and normalization approaches to improve the quality of read counts data.By minimizing reconstruction loss between the recovered and original data, the autoencoder is encouraged to deliver effective representations of genomic bins.Given the inferred segments by the CBS algorithm, we formulate observed data with a mixture model where copy number state of each segment acts as a latent variable and is efficiently optimized by using the expectation-maximization (EM) algorithm.Unlike previous methods in this field, DeepCNA performs cross-cell segmentation through encoding each genomic bin into a low-dimensional space, and thus makes it convenient to directly apply conventional segmentation approaches on the data.We comprehensively evaluate DeepCNA on both simulated and real datasets to show its effectiveness in detecting breakpoints and deciphering single-cell CNAs from complex scDNA-seq data.

Simulated datasets
To fully assess the effectiveness of DeepCNA, we generate various datasets by emulating different cell ploidy.
Fig. 1 The workflow of DeepCNA.To improve read counts obtained from scDNA-seq data, DeepCNA takes several data correction and normalization procedures including correcting GC-content and mappability bias, eliminating outlier bins with extremely high or low read counts, and removing outlier cells with low sequencing quality.Given the normalized read counts, DeepCNA employs an autoencoder network to learn low-dimensional latent representation of each genomic bin, such that the breakpoints can be accurately detected along each latent dimension and merged together to form the final breakpoints.Finally, a mixture model is adopted to estimate copy numbers of the inferred segments for each cell The simulation pipeline is implemented by following our previous study [19].Specifically, the simulation consists of three steps: 1) construct a clonal tree following the approach adopted in [24]; 2) generate genome sequence for each cell based on simulated CNAs; and 3) produce reads of each cell given biological and technological parameters.When generating the clonal tree, we assume the ancestral cell state is homogeneous with the defined ploidy and then all clones evolve from that state.Nodes of the clonal tree represent tumor clones and edges are labeled with CNAs.The size of simulated CNAs ranges between 3 and 20 Mb, the number of clones is set to 4, and the number of cells is set to 100.Given the CNAs of each cell, SCSsim [25] tool is employed to generate reads under sequencing coverage of 0.02.Read alignments are obtained using BWA [26] tool under default parameters, and further processed with SAMtools [27] to generate BAM files.We generate diploid, triploid and tetraploid datasets to examine the ability of DeepCNA in distinguishing between different tumor ploidy.For each tumor ploidy, the simulation is repeated 10 times, resulting in 30 datasets for benchmarking.

Real datasets
Two real datasets are employed in this study.The first dataset consists of 100 single cells from a breast ductal carcinoma patient [28], and sequencing data can be downloaded from NCBI SRA under accession number SRA018951.The second dataset is a 10X Genomics dataset containing 2053 cells from a triple negative ductal carcinoma [16], and sequencing data are freely available at https:// suppo rt.10xge nomics.com/ single-cell-dna/ datas ets/1.0.0/ breast_ tissue_ E_ 2k.

Fetch and preprocess read counts
Read alignments of cells could be stored in a merged BAM file or per-cell BAM files.When a merged BAM file as used in 10X genomics is provided, we use barcodes of the cells to fetch read alignments of each cell and calculate read counts.When per-cell BAM files are available, read counts of each cell are obtained from the corresponding BAM file.The read counts are measured in consecutive equal-sized bins along the genome.We have implemented a tool that uses the APIs of BamTools [29] software to fetch read alignment information from the BAM files.To improve data quality, we also calculate GC-content and mappability of each bin from reference sequence, which are utilized to exclude outlier bins and correct read counts bias.Bins with < 10% GC percent, > 90% GC percent, or > 0.9 mappability score are marked as outliers [30] and excluded from downstream analysis.We then perform library size normalization for each cell by dividing read counts with the mean read count, which makes read counts comparable between cells.To further filter "bad" bins that have extremely high or low read counts, we exclude bins whose mean read counts are in the lower or upper 1% quantile.Moreover, cells with extensively fluctuated read counts should be eliminated as these cells may act as confounding outliers.Specifically, Gini coefficient is calculated for each cell based on the read counts, and cells that have ≥ 0.3 Gini coefficient are excluded.Finally, we employ a median normalization approach [5] to correct read counts bias caused by mappability and GC-content for each cell: where r k is the read count of the k-th bin, m is the median read count of all bins, m k is the median read count of bins that show same GC-content (mappability) as the kth bin, and rk is the corrected read count.We then cal- culate log2 read counts (LRC) for downstream analysis.The resulted LRC data of N cells in M bins are stored in a N × M matrix X.

Learn low-dimensional representations of genomic bins
Each genomic bin is represented by a feature vector y (one col of X) with size of N, and cross-cell segmentation of the LRC data could be treated as a process that identifies consecutive bins having highly similar features.As current scDNA-seq data frequently contain hundreds even thousands of cells, measuring similarity between bins over a high-dimensional space is challenging and requires appropriate selection of the distance metric.To obviate this inconvenience, we project the feature vector of each bin into a d-dimensional space where segmentation of the bins could be efficiently conducted along each dimension and optimize the latent representation by reconstructing the original features.We use a function α to project y into a d-dimensional latent representation z, i.e. z = α y , and employ a function β to recover y from z, i.e. ŷ = β(z) .Specifically, the function α and β are implemented as the encoder and decoder of an autoencoder network, respectively.The encoder consists of three fully-connected layers each with 256, 128 and 64 neurons, while the decoder is composed of the mirrored structure.The network is trained by minimizing the following reconstruction loss: we use gradient decent algorithm to update the network weights by minimizing the mean loss of a batch of samples in each iteration.After the model converges, we get a d-dimensional latent representation of each genomic bin (1) by taking the LRC data of the bin as input of the encoder network, and latent features of M bins are denoted by a d × M matrix Z.

Perform segmentation on the learned low-dimensional sequential data
Based on the learned representation Z, we employ the CBS algorithm [23] to segment each row of Z, and merge the breakpoints from all rows to get the final breakpoints.
To minimize the number of false negative calls for the breakpoints, we set p-value to 0.1 in the CBS algorithm.This may tend to result in over-segmented results containing false positive calls of the breakpoints, while the results could be recalibrated by merging consecutive segments that have same copy number.We use K to denote the number of inferred segments, and L k to represent the number of bins contained in the k-th segment.

Estimate copy numbers for each cell
We estimate absolute copy numbers of the segments for each cell by maximizing the likelihood of LRC data.Without loss of generality, we use x = (x 1 , x 2 , ..., x K ) to denote LRC data of a cell, where x k indicates LRC data of the bins in the k-th segment.Suppose there are S copy number states, the probability density of LRC in the k-th segment is formulated with a mixture model: where we assume the segments are independent of each other, θ denotes model parameters, s is a copy number state, z k denotes copy number state of the k-th segment, π s is the proportion of the s-th component in the mixture model, and ω s represents parameters associated with the s-th component.The conditional likelihood f (x k |ω s ) can be formulated as , here x ki is LRC of the i- th bin of the k-th segment.For computational convenience, we assume x ki follows a normal distribution: with σ being standard deviation and µ s being the mean value.Considering the mean value of LRC in copy-neutral regions may deviate from 0 due to aneuploidy, we define µ s by introducing a parameter o that quantifies baseline shift of LRC as adopted in [19]: where c s is the copy number under state s .The complete likelihood of all segments is then computed as: (3) To infer the hidden states z = (z 1 , z 2 , ..., z K ) and model parameters, we employ EM algorithm to train the model.During the n-th iteration of the EM algorithm, the posterior probability of z K is measured as: where θ (n−1) denotes the estimated parameters in the pre- vious iteration.We then maximize the following objective function to update model parameters θ = (o, σ , π): By solving the optimization problem, we get following rule to update parameter o: and then update σ with We update parameter π under the constraint By initializing σ to the standard deviation of LRC data and performing a linear search of initial values of o, we iteratively update the model parameters until the algorithm converges and the optimal parameters are denoted by θ * .Copy number of each segment is then inferred from the state that has the largest posterior probability: As the number of copy number states considered in the mixture model is limited, copy numbers of highly amplified segments are underestimated.To address this issue, we use the post-processing approach adopted in rcCAE [19] to recalculate copy number of each segment given the learned model parameters and mean of LRC data of the segment.( 6)

Performance evaluation
We compare DeepCNA to six advanced methods including SCOPE [13], SCYN [15], SCICoNE [14], SeCNV [18], SCONCE [17] and rcCAE [19] on the simulated datasets.CHISEL [16] is excluded from performance evaluation since it additionally exploits allele frequency to call allelespecific CNAs, and phasing based on single nucleotide variations is critically challenging due to low coverage of scDNA-seq data.As data preprocessing strategy varies across different methods and size of the resulted data may be inconsistent, we extract common cells and common bins from the results of all methods for performance evaluation.To evaluate the accuracy of breakpoint detection, we adopt two performance metrics that are similar to the ones adopted in SCONCE: 1) on each simulated dataset, we fetch the set of ground truth breakpoints, and measure the distance to the nearest predicted breakpoint for each real breakpoint, then sum the distances across the genome to indicate the breakpoint detection accuracy; 2) as inferring large number of false positive breakpoints tends to decrease the breakpoint distance, we calculate w as the ratio between the number of inferred breakpoints and the number of real breakpoints, such that lower breakpoint distance and w closer to 1 indicate higher detection accuracy.To evaluate copy number estimation performance of each method, we calculate mean of absolute differences (MAD) between real copy number and estimated copy number for each cell across the common bins [17].To assess ploidy estimation performance, average copy number (ACN) is calculated for each cell based on the inferred copy number profiles and compared to the ground truth [19].We follow same strategy as used in rcCAE to run each method, and specific parameters used to run each method are given in Supplementary Table 1 (Additional file 1).

Evaluation of DeepCNA on simulated datasets
We run DeepCNA on the simulated datasets to examine its ability of detecting breakpoints and single-cell CNAs.
The read counts data are obtained from every 200 kb bins (∼3450 bins), and ∼2955 bins are remained after applying the preprocessing steps.For model training, we set the latent dimension d to 1, and train the autoencoder network 500 epochs with learning rate of 0.0001 and batch size of 256.The maximum copy number is set to 10 when calling single-cell CNAs.The following sections give the detailed performance evaluation results.

DeepCNA shows better performance in inferring breakpoints
DeepCNA uses an autoencoder network to project the high-dimensional read counts into a low-dimensional latent sequence, then employs the CBS algorithm to segment the chromosomes into distinct regions.An example of inferred latent representation and breakpoints on the simulated datasets is depicted in Supplementary Fig. S1 (Additional file 1).It is observed that breakpoints associated with some small CNAs (located on chromosome 2) are also effectively characterized in the latent space and thus accurately detected by the CBS algorithm.The autoencoder network is trained to automatically learn the representation of read count sequence, obviating the need for design and calculation of specific statistics for each bin as adopted in the existing methods.This distinctive feature makes DeepCNA well identify the changes along the read count sequence.
Fig. 2 Breakpoint inference accuracy of the investigated methods on simulated datasets.Two performance metrics are calculated for evaluation: 1) we measure the distance to the nearest predicted breakpoint for each real breakpoint, then sum the distances for all real breakpoints; 2) w is defined as the ratio between the number of inferred breakpoints and the number of real breakpoints.A lower breakpoint distance coupled with w closer to 1 indicate higher inference accuracy.Performance of each method on simulated diploid (Simulation A), triploid (Simulation B) and tetraploid (Simulation C) datasets are evaluated We make comparison between DeepCNA and other methods in inferring breakpoints, and the results are shown in Fig. 2 (lower breakpoint distance and w closer to 1 indicate better results).On datasets from Simulation A, SCOPE and SCYN exhibit higher breakpoint distance than other methods, and the mean distances are 288.2 and 391.5, respectively.They may fail to find breakpoints that are only present in small proportion of cells.In addition, SCONCE shows the lowest breakpoint distance but yields many false positives (the mean of w is 8.6).Similarly, rcCAE achieves good performance (the mean distance is 1.5) in finding true breakpoints, nevertheless it produces high proportion of false positives (the mean of w is 3.2).As SCONCE and rcCAE detect breakpoints under a cell-by-cell manner, they are sensitive to read count fluctuation occurring in single cells, and thus tend to over-segment the genome.Giving large number of false positive breakpoints complicates downstream ITH analysis such as inferring tumor subclones and phylogeny.Compared to the existing methods, DeepCNA achieves a good tradeoff between precision and recall, it yields the mean breakpoint distance of 3.2 and mean w of 1.0, which suggests DeepCNA successfully disentangles real breakpoints from technically confounded factors, and effectively represents the read count sequences over the low-dimensional latent space.On datasets from Simulation B and Simulation C, we find each method shows similar performance as observed on Simulation A datasets, and DeepCNA generally surpasses other methods in either finding true breakpoints or suppressing false positives.

DeepCNA achieves competitive performance in estimating single-cell copy numbers
Given the inferred breakpoints, we employ a mixture model to estimate copy numbers of segments for each cell.An example of copy number estimation results on the simulated datasets is shown in Fig. 3, which shows DeepCNA correctly identifies cell ploidy and precisely predicts copy number of each segment.As our method explicitly formulates the relationship between LRC and cell ploidy as well as copy number, it is able to find the We evaluate the copy number and cell ploidy estimation accuracy of all methods by calculating the MAD as well as difference between the ground truth and inferred ACNs (∆ACN).Lower MAD and ∆ACN closer to 0 imply higher estimation accuracy.The copy number and cell ploidy estimation results are shown in Fig. 4 and Supplementary Fig. S3, respectively.On diploid datasets (Simulation A), SCOPE, SCYN, SeCNV and SCONCE exhibit good performance with mean MADs of 0.06, 0.03, 0.02 and 0.09 respectively, while SCICoNE is less accurate in calling absolute copy numbers (mean MAD is 0.72) because it tends to overestimate cell ploidy (median ∆ACN is -0.99).rcCAE accurately estimates copy numbers for most of the cells, but overestimates cell ploidy for 65 cells of which 39 cells are recognized to be tetraploid and 21 cells are predicted to be pentaploid, therefore it produces mean MAD of 0.16.By comparison, our method presents high accuracy in identifying both cell ploidy (median ∆ACN is -0.002) and absolute copy numbers across different datasets, thus reaches the lowest mean MAD of 0.01.On triploid datasets (Simulation B), the mean MADs of SCOPE, SCYN, SCICoNE, SeCNV, SCONCE, rcCAE and DeepCNA are 0.59, 0.43, 0.23, 0.15, 0.81, 0.20 and 0.02, respectively.The lower accuracy of existing methods results from their less effectiveness in estimating cell ploidy.For instance, given 915 common cells covered by the results of all methods, SCOPE erroneously predicts 360 cells as diploid, SCYN tends to underestimate cell ploidy and erroneously identifies 378 cells as diploid, SCICoNE marks all normal cells as triploid, and SeCNV incorrectly labels 107 triploid cells either diploidy or tetraploidy.Despite that rcCAE explicitly takes cell ploidy into consideration in the copy number computation model, it still overestimates ploidy of 74 normal cells.On tetraploid datasets (Simulation C), our method also gains better or comparable performance compared to existing methods.It achieves 0.02 mean MAD and accurately predicts cell ploidy on all datasets.SCOPE, SCYN, SCICoNE, SeCNV and SCONCE tend to underestimate cell ploidy, while rcCAE overestimates ploidy of 54 cells.These results demonstrate DeepCNA is robust to the change of cell ploidy and read count fluctuation.

DeepCNA performs robustly against the change of hyperparameters
There are several hyperparameters, such as bin size and latent dimension, that may affect the detection results of DeepCNA.To fully investigate DeepCNA' robustness against the changes of these hyperparameters, we simulate various datasets for performance evaluation.First, we compare the results on Simulation A datasets by using different bin sizes in {100 kb, 200 kb, 500 kb, 1000 kb}.The results in Supplementary Fig. S4 suggest DeepCNA performs robustly in identifying cell ploidy and copy numbers of segments.Increasing the bin size tends to yield less breakpoints since more real breakpoints will locate inside the bins and thus are undetectable.Second, to assess the effects of latent dimension d on breakpoint inference, we compare the breakpoint distance and metric w under different values of d, and the results in Supplementary Fig. S5 imply the larger the value of d, the smaller the breakpoint distance and the larger the w.This is in concordance with the intuitive conviction that larger latent dimension will result in more breakpoints and thus probably decrease the breakpoint distance.In addition, Supplementary Fig. S6 shows example segmentation results on three simulated datasets when the latent dimension is set to 2. The results suggest the first dimension well captures most of the breakpoints, while Fig. 4 Copy number estimation accuracy of the investigated methods on simulated datasets.We first extract common bins covered by the results of all methods, then calculate mean of absolute differences (MAD) between real copy number and estimated copy number for each cell across the common bins.Performance of each method on simulated diploid (Simulation A), triploid (Simulation B) and tetraploid (Simulation C) datasets are evaluated sometimes underrepresents the breakpoints that may be shared by only a few cells, and these breakpoints are detectable from the second dimension.It is also observed that some breakpoints can be simultaneously detected from both dimensions due to the fact that we do not introduce any constraints when learning the different latent dimensions with the autoencoder network.Our results indicate setting latent dimension to 1 or 2 is appropriate to capture local changes of LRC data, accurately detect breakpoints and simultaneously suppress false positives.

DeepCNA shows good robustness against the change of data size and heterogeneity
We also assess DeepCNA' ability of coping with varying data size and heterogeneity.To check if DeepCNA generalizes well on different-sized datasets, we generate simulation datasets by sampling the number of cells from {100, 200, 300} and fixing the number of clones to 8. The evaluation results (Supplementary Fig. suggest our method scales well to larger datasets and performs consistently well in identifying cell ploidy as well as copy numbers.In addition, runtime performance analysis (Supplementary Table 2) shows our method is more efficient than other methods and requires less than 130 s to process a dataset with 300 cells.To further examine the effect of data heterogeneity, we compare detection results of DeepCNA on simulated datasets with different numbers of clones.Specifically, the number of clones changes from 4 to 8, and the number of cells is fixed to 100.The results (Supplementary Fig. S8) show increasing the number of clones does not appreciably affect the performance of Deep-CNA, suggesting our method can accurately disentangle copy number segments form noisy read counts regardless of the data heterogeneity.These results demonstrate the applicability of DeepCNA to complex datasets.
We proceed to check if our method can still accurately estimate copy number segments without normal cells mixed in the input data.To achieve this, we exclude normal cells from the simulated datasets (Simulations A-C), and run DeepCNA under same parameter configurations as used on full datasets.The results in Supplementary Fig. S9 demonstrate our method can still yield accurate results, which suggests DeepCNA is applicable to datasets containing only tumor cells.

Evaluation of DeepCNA on a breast cancer dataset
To further verify the effectiveness of DeepCNA, we apply it to a triple negative ductal carcinoma dataset (denoted as T10) consisting of 100 cells [28].Sequencing data of the cells are obtained from NCBI SRA (accession number SRA018951), and further processed using BMA to generate read alignments.A previous study [28] has analyzed copy number profiles of the cells and identified 4 subpopulations including D + P, H, AA and AB.The copy number profiles indicate D + P comprises mainly diploid cells, H is a tumor clone that undergoes hemizygous deletions on multiple chromosomes, AA and AB are aneuploid clones showing copy number amplifications across the chromosomes.
When running DeepCNA on this dataset, we use bin size of 500 kb to calculate the read counts, and finally obtain an LRC matrix with size of 100 × 5120 after applying the preprocessing steps.For training the autoencoder network, the latent dimension, number of epochs, batch size and learning rate are set to 2, 500, 256 and 0.0001, respectively.We randomly select a cell from each subpopulation to compare the copy number profiling results of DeepCNA as shown in Fig. 5.As expected, cell "SRR053611" from the D + P subpopulation is a diploid cell that exhibits few CNAs across the genome, and cell "SRR053678" from the H clone has extensive hemizygous deletions on the chromosomes such as 1p, 4q, 5q, 13q and 14.Instead, cells "SRR054594" and "SRR054634" are mainly characterized with copy number gains on most of the chromosomes.The ploidy of the cells is similar to the previously reported results [19,28].
We also run SCOPE, SCYN, SCICoNE and rcCAE on this dataset by setting bin size to 500 kb.SeCNV and SCONCE are excluded from evaluation as SeCNV throws runtime errors when calling CNAs, and SCONCE needs to be manually provided with normal cells to create negative control but we have no prior knowledge about which cells are normal cells.As the ground truth breakpoints and copy number profiles are not available for the T10 dataset, we just perform a qualitative analysis of the results inferred by different methods.In general, Deep-CNA yields results that better explains the observed data (Supplementary Figs.S10-13), and SCICoNE tends to overestimate the cell ploidy as observed on the simulated datasets.For instance, a copy number amplification on chromosome 7 of cell "SRR053675" is detected by Deep-CNA (Supplementary Fig. S10), while SCOPE, SCYN and rcCAE predict it as normal copy number, and their predictions obviously deviate from the observed distribution of depth or LRC data.In addition, our method detects a small copy number segment on chromosome 7 of cell "SRR054618" that is missed by SCOPE, SCICoNE and rcCAE (Supplementary Fig. S11).Similarly, existing methods except SCYN underestimate the number of breakpoints on chromosome 2q of cell "SRR089402" (Supplementary Fig. S12), thus yield biased predictions of the copy numbers.

Evaluation of DeepCNA on a 10X Genomics dataset
We also evaluate DeepCNA on a large dataset including 1446 cells from a breast tumor tissue [16], and sequencing data of this 10X Genomics dataset (denoted as E) can be freely downloaded from https:// suppo rt.10xge nomics.com/ singl ecell-dna/ datas ets/1.0.0/ breast_ tissue_ E_ 2k.Sequencing coverage of each cell varies between 0.02X and 0.05X.The original BAM file contains read alignment results of 2053 cells, of which 607 cells are identified to be low sequencing coverage, doublets or with S-phase of the cell cycle [16], and therefore eliminated from CNA analysis.For the remaining 1446 cells, CHISEL algorithm [16] has found one diploid subpopulation comprised of 388 cells and also 5 tumor clones (labeled as I-V), of which each contains 168, 58, 20, 782 and 30 cells, respectively.We use DeepCNA to call breakpoints as well as estimate single-cell copy numbers from this dataset, and check if it could give reliable detection results.As the ground truth breakpoints and copy numbers are not available, we therefore use the results of CHISEL as the baseline to evaluate different methods.This is appropriate since the copy number profiles inferred by CHISEL are further verified by allele frequency data.CHISEL gives a CNA matrix with size of 1446 × 570 obtained from 5 Mb consecutive bins, and we find 104 breakpoints by scanning the columns of the matrix.For performance comparison, the existing methods except rcCAE are unable to process the merged BAM file, therefore we only obtain the results of rcCAE on this dataset.
Read counts in 500 kb bins are fetched from the BAM file and further preprocessed to improve the data quality, which results in an LRC matrix with size of 1446 × 5139.By setting the latent dimension d to 2, we train the autoencoder network 500 epochs with learning rate of 0.0001 and batch size of 256.We randomly select a cell from each subpopulation inferred by CHISEL and compare the breakpoints and copy number estimation results of DeepCNA as shown in Fig. 6.There are significant differences between copy number profiles of distinct cell subpopulations.Cell with barcode "TAA GAG ATC AAG AAAC" is a diploid cell that presents neutral copy number across the genome.The distinctive feature of cell "ATT GGA CTC CCA AAGT" from tumor clone I is that it has copy number of 3 on chromosomes 2-3 Fig. 5 Copy number estimation results of DeepCNA on the breast cancer dataset.A previous study [28] has clustered the cells into 4 subpopulations, and we select four cells for comparison including "SRR053611" from D + P subpopulation, "SRR053678" from H clone, "SRR054594" from AA clone, and "SRR054594" from AB clone.Cell "SRR053611" is a diploid cell, and cell "SRR053678" shows hemizygous deletions on multiple chromosomes.Instead, cells "SRR054594" and "SRR054634" have copy number amplifications on most of the chromosomes while copy number of 4 on chromosome 4.By comparison, the corresponding copy numbers of cell "CTG GTC TGT TGC GGCT" from clone II are 4 and 3, respectively.Cell "GGC CGA TGT AAC CGAG" from clone III has very similar copy number profiles to the cell "CTG GTC TGT TGC GGCT" on most of the genome regions, but shows different copy number on chromosome 9q.Compared to the cell "GGC CGA TGT AAC CGAG", the main difference of cell "TGA CTT TGT TTC CATT" from clone IV is that its copy number on chromosome 2 is 3, and cell "TGG CTG GGT TAC GCGC" from clone V shows copy number of 3 on chromosomes 2-3.Taken together, these results imply our method successfully deciphers the copy number profiles of underlying tumor clones.
To check if our method gains advantage over rcCAE in inferring breakpoints and estimating copy numbers, we calculate three performance metrics including breakpoint distance, w and MAD by treating CHISEL results as the ground truth.Common bins covered by both methods are used for evaluation.rcCAE infers 450 breakpoints of which most are false positive calls and yields breakpoint distance of 462 and w of 4.3.By comparison, Deep-CNA infers 103 breakpoints, it shows larger breakpoint distance of 1270 while a lower w of 0.99.As CHISEL infers copy numbers in low-resolution 5 Mb bins, the real breakpoints may locate far from the bin boundaries, and this may explain why DeepCNA gives similar number of breakpoints but large breakpoint distance.The copy number estimation results may also give evidence that DeepCNA achieves better performance than rcCAE in suppressing false positive calls of breakpoints.The MADs of rcCAE and DeepCNA are 0.08 and 0.07, respectively, which suggests breakpoints inferred by DeepCNA are more likely to be real breakpoints and thus aid in accurate estimation of copy numbers.We further investigate if copy numbers shared by few cells are accurately Fig. 6 Copy number estimation results of DeepCNA on the 10X Genomics dataset.CHISEL algorithm [16] has clustered the cells into 6 subgroups, and we select one cell from each subpopulation for illustration.Copy number profiles of these cells exhibit significant differences.For instance, cell with barcode "TAA GAG ATC AAG AAAC" is a diploid cell that shows few CNAs across the genome; Cells "CTG GTC TGT TGC GGCT" and "GGC CGA TGT AAC CGAG" show different copy number on chromosome 9q; compared to cell "GGC CGA TGT AAC CGAG", cell "TGA CTT TGT TTC CATT" has copy number of 3 on chromosome 2, and cell "TGG CTG GGT TAC GCGC" shows copy number of 3 on chromosomes 2-3 detected by each method, and results of cell"AAG GCA GGT TCG CGTG" are given in Supplementary Fig. S14.rcCAE produces highly similar predictions to DeepCNA on all chromosomes except chromosome 2, while it probably yields biased predictions on chromosome 2 because the expected LRC mean deviates from the observed distribution of LRC.As the LRC data used in rcCAE are obtained from the outputs of an autoencoder network, and this distillation process may suppress LRC signal of low-prevalence copy numbers.By comparison, the copy numbers estimated by DeepCNA well match the distribution of LRC data.Furthermore, we calculate Pearson correlation to compare DeepCNA with rcCAE by treating CHISEL results as the baseline.The results in Supplementary Fig. S15 show our method achieves a median Pearson correlation coefficient of 0.92, while rcCAE gets a median Pearson correlation coefficient of 0.89.This gives enhanced evidence that our method performs better than rcCAE on the 10X Genomics dataset.

Evaluation of DeepCNA on non-tumor datasets
We further assess if DeepCNA can still accurately estimate copy numbers from the scDNA-seq datasets that only contain normal/non-tumor cells.Such datasets can be used as negative controls to find somatic single-cell CNAs.
First, we apply DeepCNA to a composite control dataset that is a subset of the 10X Genomics dataset.Specifically, as the diploid subpopulation of the 10X Genomics dataset is mainly composed of normal cells, we use the 388 cells contained in this subpopulation to construct a negative control dataset, and run DeepCNA under the same parameters as used for the full dataset.The results show our method achieves high accuracy in measuring copy numbers of the normal cells with mean MAD of 0.01.Despite that DeepCNA yields more breakpoints than the expected (78 vs. 45), copy numbers of the oversegmented regions are still precisely estimated.
Next, we evaluate DeepCNA on a non-tumor dataset consisting of 1094 cells from BJ Fibroblast Euploid Cell Line, and the merged BAM file containing read alignments of all cells can be freely downloaded from https:// www.10xge nomics.com/ resou rces/ datas ets/1-k-cellsfrom-bj-fibro blast-euplo id-cell-line-1-stand ard-1-1-0.As the ground truth copy number profiles are not available for this dataset, we only perform qualitative analysis of the results of DeepCNA.Read counts are measured for 500 kb bins and preprocessed to filter outlier cells and bins, which results in an LRC matrix with size of 1094 × 5120.We then train the autoencoder model 500 epochs with learning rate of 0.0001, batch size of 256 and latent dimension of 1. DeepCNA predicts all cells as diploidy with the minimum ACN of 1.931 and maximum ACN of 2.072, which suggests our method correctly identifies the ploidy of all cells.An example of the copy number estimation results as depicted in Supplementary Fig. S16 indicates the cells present broad copy neutral regions across the whole genome.These results demonstrate that our method can still yield reliable estimations of singlecell copy number profiles from scDNA-seq datasets composed of only non-tumor cells, and thus help to find somatic single-cell CNAs.

Discussion
Breakpoint detection and copy number estimation for each segment are two basic tasks of single-cell copy number analysis.As tumor evolves by accumulating CNAs, breakpoints are shared among different cell subpopulations, and cross-cell segmentation of high-dimensional read count sequence is essential for reducing false positive calls of the breakpoints caused by local fluctuation of read counts.Existing methods for breakpoint detection are usually based on manually designed statistics of read counts, and less sensitive to real breakpoints or less robust to local noise of read counts.In this paper, we develop a novel method for cross-cell segmentation and single-cell copy number estimation, the proposed Deep-CNA treats read counts of all cells of each bin as a feature vector and automatically learns its latent representations by leveraging an autoencoder network, which makes it convenient to call breakpoints along each latent dimension and yield the final set of breakpoints by merging the results from all dimensions.The approach exploits the distillation process of the autoencoder to effectively represent each bin over a low-dimensional space and gets higher robustness to the outliers present in read counts.To the best of our knowledge, DeepCNA is the first deep learning method that performs cross-cell segmentation of read count sequence.For estimating single-cell copy numbers, a mixture model is employed to cluster segments into different copy number states, and cell ploidy is determined by exploring the baseline shift of LRC data.Although DeepCNA follows the same idea as proposed in rcCAE to estimate cell ploidy, its main difference to rcCAE lies in the fact that DeepCNA jointly segments the read counts of all cells while rcCAE segments the read counts under a cell-by-cell manner, and this makes Deep-CNA more powerful in suppressing false positive calls of breakpoints.
To comprehensively evaluate the performance of Deep-CNA, we generate various datasets by mimicking different cell ploidy.Evaluation results suggest our method is highly effective in finding breakpoints and recognizing cell ploidy, thus yields accurate estimations of copy numbers.With DeepCNA, we elucidate single-cell copy number profiles from a breast cancer dataset, and obtain similar prediction results as previously reported.Compared to the existing methods, DeepCNA generally better explains the observed depth data.In addition, we test DeepCNA on a large 10X Genomics dataset containing 1446 cells, and the estimation results provide clear indication of copy number profile difference between cell subpopulations.We also show DeepCNA is able to call low-prevalence CNAs, and these are missed by rcCAE due to deficient recovery of the LRC signal for low-prevalence copy numbers.
Despite that DeepCNA shows advantages in inferring breakpoints and single-cell copy number profiles, there are still two potential to further improve its accuracy.First, DeepCNA does not consider the relative positions of bins when learning latent representation, which probably makes DeepCNA yield over-segmented results when processing extensively fluctuated read counts.Positional encoding is an effective way to represent the relative position of each bin and can be used as additional features when learning the latent representation.Therefore, how to encode the positions and utilize positional encoding features in the representation learning model is one of our future works to improve DeepCNA.Second, DeepCNA implements breakpoint detection and copy number estimation as two separate procedures, and errors made in breakpoint detection may be propagated to the CNA analysis.Joint inference of breakpoints and copy numbers from all cells is a feasible solution to improve the accuracy, while this may be a very challenging task as the candidate breakpoints and copy number states form a huge search space.Some of the existing methods have employed HMMs to simultaneously detect breakpoints and CNAs under a cell-by-cell manner, and mixture hidden Markov model (MHMM) [31] could be used to implement cross-cell segmentation and copy number estimation in a single model provided that the hidden states and mixture components are properly defined.The mixture components could be defined as cell subpopulations showing distinct copy number profiles between each other, which makes it convenient to identify tumor clones.

Conclusions
In summary, we introduce a new method DeepCNA for profiling single-cell CNAs from scDNA-seq data.Deep-CNA leverages deep representation learning to cope with the difficulty of segmenting high-dimensional read count sequence and employs an effective mixture model to decipher copy numbers for each cell.Evaluation results demonstrate DeepCNA has advantages over other methods in inferring breakpoints and single-cell CNAs.We believe our work will promote development of computational methods for downstream ITH and phylogenetic analysis.
segment on chromosome 7 that is missed by SCOPE, SCICoNE and rcCAE.Copy number deletion, normal copy number and copy number amplification are marked by green, blue and red, respectively.Fig. S12.Copy number estimation results on cell "SRR089402" from the breast cancer dataset.Existing methods except SCYN underestimate the number of breakpoints on chromosome 2q, thus yield biased predictions of the copy numbers.SCICoNE overestimates the cell ploidy.Copy number deletion, normal copy number and copy number amplification are marked by green, blue and red, respectively.Fig. S13.Copy number estimation results cell "SRR054632" from the breast cancer dataset.SCOPE and SCICoNE predict asmall segmenton chromosome 6 to copy number amplification.rcCAE fails to find a small segment on chromosome 7. Copy number deletion, normal copy number and copy number amplification are marked by green, blue and red, respectively.Fig. S14.Copy number estimation results on the cell with barcode "AAG GCA GGT TCG CGTG" from the 10X Genomicsdataset.rcCAE' predictions on chromosome 2 probably deviate fromthe ground truth.Copy number deletion, normal copy number and copy number amplification are marked by green, blue and red, respectively.Fig. S15.Comparison of Pearson correlation coefficients on the 10X Genomics dataset.By using CHISEL results as the baseline, Pearson correlation coefficientsare calculatedbased on copy numbers inferred byeach method.Fig. S16.An example of copy number estimation results of DeepCNA onthe BJ Fibroblast Euploid Cell Linedataset.DeepCNA predicts all cells as diploidy.Table S1.The parameters used to run each method on simulated datasets.Table S2.The runtime performance comparison results of all investigated methods.The results are obtainedfromsimulated datasets with varying number of cells.All the experiments are conducted on a computational server with64 CPU cores,128 GB RAM and 1 GeForce RTX 2080 Ti GPU.

Fig. 3
Fig. 3 An illustration of copy number estimation results on simulated datasets.The top four sub-figures depict LRC data of four tumor clones (copy number deletion, neutral and amplification are marked by green, blue and red, respectively), and the bottom subfigure shows the inferred absolute copy numbers of each tumor clone