SCYN: single cell CNV profiling method using dynamic programming

Background Copy number variation is crucial in deciphering the mechanism and cure of complex disorders and cancers. The recent advancement of scDNA sequencing technology sheds light upon addressing intratumor heterogeneity, detecting rare subclones, and reconstructing tumor evolution lineages at single-cell resolution. Nevertheless, the current circular binary segmentation based approach proves to fail to efficiently and effectively identify copy number shifts on some exceptional trails. Results Here, we propose SCYN, a CNV segmentation method powered with dynamic programming. SCYN resolves the precise segmentation on in silico dataset. Then we verified SCYN manifested accurate copy number inferring on triple negative breast cancer scDNA data, with array comparative genomic hybridization results of purified bulk samples as ground truth validation. We tested SCYN on two datasets of the newly emerged 10x Genomics CNV solution. SCYN successfully recognizes gastric cancer cells from 1% and 10% spike-ins 10x datasets. Moreover, SCYN is about 150 times faster than state of the art tool when dealing with the datasets of approximately 2000 cells. Conclusions SCYN robustly and efficiently detects segmentations and infers copy number profiles on single cell DNA sequencing data. It serves to reveal the tumor intra-heterogeneity. The source code of SCYN can be accessed in https://github.com/xikanfeng2/SCYN. Supplementary Information The online version contains supplementary material available at (10.1186/s12864-021-07941-3).

prevailing in pan-cancer studies [7]. The loss of tumor suppressor genes like KDM6A and KAT6B are proclaimed indirectly amplifies harmful cancer-related pathways [8,9].
Conventional experimental protocols for CNV segmentation lies in the following scenarios. Researchers may infer a coarse CNV profiles utilizing bulk RNA sequencing [10] and single cell RNA sequencing [11][12][13] [10]. Moreover, scientists may leverage bulk genome techniques such as DNA array comparative genomic hybridization (aCGH) [14], single-nucleotide polymorphism (SNP) arrays [15,16], and DNA next generation sequencing (NGS) [17,18] to generate high resolution CNV. Although bulk genome sequencing studies have contributed insights into tumor biology, the data they provide may mask a degree of heterogeneity [19]. For instance, if the averaged readout overrepresents the genomic data from the dominant group of the tumor cells, rare clones will be masked from the signals. The advent of single-cell DNA sequencing (scDNA-Seq) delivers a potential solution to overwhelm the deficiencies of bulk approaches [20][21][22]. By assigning a unique barcode to each single cell, scDNA-seq is able to record the minority cell population, thus to address intratumor heterogeneity (ITH) [22], detect rare subclones [19], and reconstruct tumor evolution lineages [20,23].
In this study, we concentrate on the CNV segmentation and turning points detection approaches customized for single cell DNA sequencing. CNV Segmentation refers to partitioning the genome into non-overlapping segments with the objective of that each segment shares intra-homogeneous CNV profile, and the segment boundaries are often termed to be checkpoints or turning points [24]. Although numerous CNV segmentation tools have emerged leveraging high throughput sequencing data such as Circular Binary Segmentation (CBS) [25,26] and Hidden Markov Model (HMM) [27,28], the methods customized for scDNA data is in its infancy. Gingko [29], SCNV [30], and SCOPE [31] applied diverse strategies to normalize the scDNA intensities through simultaneously considering sparsity, noise, and cell heterogeneity, and adopted variational CBS for checkpoint detection. While after in silico experiment, we argue that those CBS approaches might not lead to an optimal segmentation result, some turning points might be masked. Furthermore, with the advance of large scale high throughput technologies, the scale of cells for a single dataset climbs exponentially. For instance, the newly emerged 10x Genomics CNV solution can profile the whole genome sequencing of thousands of cells at one time [22]. Thus, efficiently processing scDNA-seq data is crucial. However, current scDNA CNV segmentation methods are too time-consuming to process thousands of cells.
Therefore, in this paper, we propose SCYN (Single Cell and dYNamic programming), an effecient and effective dynamic programming approach for single cell data CNV segmentation and checkpoint detection. SCYN resolves the precise turning points on in silico dataset, while existing tools fail. SCYN manifested more precise copy number inference on a triple-negative breast cancer scDNA dataset, with array comparative genomic hybridization results of purified bulk samples as ground truth validation. We tested SCYN on two datasets of the newly emerged 10x Genomics CNV solution. SCYN successfully recognizes gastric cancer cells from 1% and 10% spikeins 10x datasets. Last but not least, SCYN is about 150 times faster than state of the art tool when dealing with thousands of cells.

Overview of SCYN
We developed an algorithm, SCYN, that adopts a dynamic programming approach to find optimal single-cell CNV profiles. The framework for SCYN displayed in Fig. 1A. First, the raw scDNA-seq reads of FASTQ format are preprocessed with standard procedures (Fig. 1A). SCYN then takes the aligned BAM files as the input. SCYN integrates SCOPE [31], which partitions chromosomes into consecutive bins and computes the cell-by-bin read depth matrix, to process the input BAM files and get the raw and normalized read depth matrices. The segmentation detection algorithm is then performed on the raw and normalized read depth matrices using our dynamic programming to identify the optimal segmentation along each chromosome. The segmentation results are further applied to copy number calculation. Finally, SCYN outputs the cellby-bin copy number matrix and the segmentation results of all chromosomes for further CNV analysis.

SCYN effectively identifies all turning points on synthetic trial
To evaluate the segmentation power of SYCN against SCOPE, we conducted one simulation experiment. We first generated a synthetic CNV profile of 100 singe cells on chromosome 22, with 50M bp as one bin, resulting in a 100 ×70 CNV matrix. As illustrated in Fig. 2B, there is a large proportion of normal cells with average diploid copy number and four tumor subclones, which manifests six turning points and seven segments on chr22. Then, we fit the ground truth CNV profiles into single cell sequencing simulator SCSsim [32] to get the synthetic FASTQ reads ( Fig. 2A). Figure 2C-D shows the inferred CNV profiles on the simulated reads from SCYN and SCOPE, respectively. Both SCYN and SCOPE able to recognize the normal cells and mask the noises. SCYN did sound work on CNV segmentation to correctly identify all six turning points and uncovered the cell heterogeneity. Nevertheless, SCOPE adds one nonexistent turning point inside segmentation S1, and drops two critical turning points which discriminate S4-S5 and S6-S7. These then lead to erroneous CNV segmentation and CNV estimations. Furthermore, we conducted a series of in silico spike-in experiments for CNV turning points detection with different proportion of normal cells (Additional file 1, Supplementary Figure S1A), different number of cell clusters (Additional file!1, Supplementary Figure S1B), and different number of CNV segments (Additional file 1, Supplementary Figure S1C), respectively. Our results show that SCYN call turning points with 100% accuracy regardless of the cell and segment CNV complexity of ground-truth settings, whilst SCOPE always call false positive and false negative points. As previously mentioned, the core principle of CNV segmentation is partitioning the genome into non-overlapping areas with the objective of that each area shares intra-homogeneous CNV profile [24,30]. SCOPE fails to hit the correct answer as its turning point detection fails. Overall, our experiment on synthetic data suggest that empowered with dynamic programming, SCYN can achieve the correct copy number turning point detection against the segmentation schema SCOPE proposed.

SCYN successfully identifies subclones in wet-lab cancer datasets
We illustrate the performance of SCYN in cancer singlecell datasets. We collected two cancer data sets, namely the Nature_TNBC (two triple-negative breast cancers) [33] and 10x_Gastric (gastric cancer spike-ins). We illustrated the tumor intra-heterogeneity discovered by SCYN and validated the results of SCYN against the estimation made by SCOPE for ground truth available datasets.
The first benchmark dataset we investigated is Nature_TNBC. 100 single cells were separately sequenced from two triple-negative breast cancer samples, namely, T10 and T16 [33]. For T10, we removed cell SRR054599 as it did not pass the quantity control, resulting 99 single cells from held four subgroups: Diploid (D), Hypodiploid (H), Aneuploid A (A1), and Aneuploid B (A2). We first verified if SYCN could replicate the subclone findings previously reported. Figure 3A Figure 4A records T16 genome-wide copy number profiles across the 100 single cells. In all, the cell subclones recognized by SCYN are consistent with SCOPE (Additional file 1, Supplementary Figure S2B) and Navin et al.'s findings. Hierarchical clustering characterizes T16 into seven subgroups. As depicted in Fig.4 and Additional file 1 Supplementary Figure S3B-4B, cluster 1 mates the primary diploid (PD) cells. Cluster 3 represents metastasis aneuploid (MA), and cluster 6,7 together pictures primary aneuploid (PA). As Navin et al. only profiled four bulk dissections using of T16 aCGH [34], there lacks the CNV gold standard for 16T in su subclones. So we calculated the CNV correlation and RMSE between inferred primary aneuploid (PA) subpopulation and the four dissections, respectively. From Fig. 4B-C, although the association between PA group and four bulk dissections is relatively low, SCYN profiles a closer correlation than SCOPE with higher correlation and lower discrepancy.
We next employed SCYN and SCOPE to the lately published single cell DNA spike-in demo datasets available at The heatmap of copy number profiles with single cells as row and genomics bins as columns, from ground truth, SCYN, and SCOPE, respectively. The colorbar (blue, white, dark red) represents the value of copy number from 1 to 9, respectively. In (B), vertical lines signify the ground-truth turning points, which divide the genomic bins into seven segments from S1 to S7 the 10x Genomics official website. 10x Genomics mixed BJ fibroblast euploid cell line with 1% and 10% spike-in of cells from MKN-45 gastric cancer cell line. As illustrated in the CNV heatmap Fig.5A and Additional file 1 Supplementary Figure S5, SCOPE successfully distinguished the two spike-in gastric cancer cells. Furthermore, we visualized the first two principal components of the estimated CNV profiles in Fig.5B-C. Cells whose Gini coefficient more massive than 0.12 were highlighted in yellow and regarded as gastric cancer cells from the 1% and 10% spike-ins, respectively. Then, we checked if SYCN produced CNV profiles better preserves the cell subpopulation information against SCOPE. Leveraging Gini 0.12 as the cut-off value, we partitioned cells into normal and cancer subset as benchmark labels. Next, we practiced hierarchical clustering into CNV matrices attained from SYCN and SCOPE, and get two clusters for each spikein sets. Then, we adopt four metrics to inquire about the clustering accuracy of SYCN against SCOPE. The adjusted Rand index (ARI) [35], Normalized mutual information (NMI) [36], and Jaccard index (JI) [37] measures the similarity between the implied groups and goldenstandard labels; a value approaching 0 purports random assignment, and one reveals accurate inferring. As evidenced in Tables 1 and 2, with ARI, NMI, and JI as measurements, SYCN holds equal clustering accuracy to SCOPE on both 1% and 10% spike-in sets, which indicates SYCN captures substantial interior tumor heterogeneity.

SCYN segmentation is fast
Recall that efficient processing of scRNA-seq data is essential, especially in today's thousands of single cells throughput. To evaluate the efficiency of SCYN against SCOPE, we measured the checkpoint detection step CPU running time of SCYN and SCOPE on T10, T16M, T16P, 10x 10% spike-in, 10x 1% spike-in, and several simulation data sets (90-1, 90-2, 2000-1, 2000-2, 2000-3 , 2000-4, and 2000-5), with the cell number ranging from 48 to around 2000. We respectively ran SCYN and SCOPE on each dataset ten times and calculated the mean CPU running  Table 3 and Fig. 6, the CPU consuming time of SCYN is almost linear in log scale with the increase of cell number. However, the CPU time of SCOPE rises dramatically when the cell number goes to hundreds or thousands. For instance, for large datasets with 2k cells, SCYN is around 150 times faster than SCOPE, SCYN finished the tasks within eight minutes, while SCOPE is unable to scale 2k cells within 16 hours. In all, SCYN is super fast in respective of datasets scale up to hundreds or thousands.

SCYN segmentation has better mBIC values
SCYN is fast because we only adopt the simplified version (Equation 1 in Method) of total SCOPE-mBIC [31] as the objective of segmentation and optimize it utilizing dynamic programming. Experiments on synthetic datasets and real cancer datasets successfully validated the tumor intra-heterogeneity exposure efficacy of SCYN against SCOPE. Here we further evaluate SCYN optimization effectiveness against SCOPE in respective of the original SCOPE-mBIC objective. We compared SCOPE-mBIC The colorbar (blue, white, dark red) represents the value of copy number from 0 to ≥10, respectively. N/A denotes the missing of copy number in corresponding genomic region. An interactive visualization is available on scSVAS [42] (https://sc.deepomics.org/ oviz-project/analyses/view) with demo set "10x_10%spike-ins". (B-C) PCA plots on 1% and 10% spike-in datasets respectively. The yellow and purple dots denote cancer and normal cell, respectively value by adopting the segmentation results of SCYN and SCOPE on real cancer datasets T10, T16P, T16M, and 10x spike-ins. As illustrated in Fig. 7A and Supplementary Figure S6A, the mBICs yielded from SCYN on samples across all chromosomes are always more massive than the mBICs produced by SCOPE, except chromosome 16 of 1% spike-in. Clearly, SCYN achieves better segmentation concerning the tedious SCOPE objective. Furthermore, as illustrated in Fig. 7B and Supplementary Figure  S6B, the proportions of the simplified mBIC against overall SCOPE-mBICs are overwhelming across all chromosomes except chr16, indicating the residual terms actually can be neglected without loss of accuracy and the minor fluctuations of mBIC will not affect the ability of SCYN to detect subclones, as proved in the previous section.

Discussion
In this study, we proposed SCYN, a fast and accurate dynamic programming approach for CNV segmentation and checkpoint detection customized for single cell DNA sequencing data. We demonstrated SCYN guaranteed to resolve the precise turning points on in silico dataset against SCOPE. Then we proved SCYN manifested a more accurate copy number inferring on triple-negative For 10x merged BAM (One bam file), SCYN provides the function to split merged bam to cell bams based on the barcodes. SCYN outputs the segmented CNV profiles and cell meta-information for downstream analysis, such as hierarchical clustering and phylogeny reconstruction.
We neglected one crucial issue. Cancer scDNA-seq intensities should be regarded as a mixture of subclone cell signals with confounding of sparsity, GC bias, and amplification bias [31]. The perfect CNV segmentation heavily relies on the cross-cell normalization of intensities in the first place. While we brutely adopt the normalization schema from SCOPE; there lacks a comprehensive evaluation of scDNA intensities normalization. Speaking to further work, inferring CNV profiles from single-cell RNA sequencing (scRNA-seq) is trending [11][12][13]38]. Incorporating DNA and RNA to profile single cell CNV segmentation might lead to tumor intra-heterogeneity to a higher resolution.
Copy number variation is crucial in deciphering the mechanism and cure of complex disorders and cancers. The recent advancement of scDNA sequencing technology sheds light upon addressing intratumor heterogeneity, detecting rare subclones, and reconstructing tumor  evolution lineages at single-cell resolution. Nevertheless, the current circular binary segmentation based approach proves to fail to efficiently and effectively identify copy number shifts on some exceptional trails.

Conclusion
To summarize, we propose SCYN, a CNV segmentation method powered with dynamic programming. Experiments on in silico and wet-lab data demonstrate that SCYN robustly and efficiently detects segmentations and infers copy number profiles on single cell DNA sequencing data. It serves to reveal the tumor intra-heterogeneity.

Data sets Synthetic data
The workflow for synthetic data generation is displayed in Fig. 2 A. Firstly, we pre-defined a CNV matrix which contains 100 cells and 70 bins for chromosome 22 and each bin has a fixed length of 500M. Also, 10% random noise is applied to this CNV matrix. Secondly, the bed file of each cell was generated according to the corresponding CNV number. Thirdly, we adopted the tool, SCSsim [32], which accepts the bed file as the input, to generate the DNA sequence of each cell(FASTQ format). Finally, the DNA sequence data of 100 cells were generated.

Single-end real scDNA-seq data
Two single-end breast cancer scDNA-seq datasets were downloaded from NCBI Sequence Read Archive with the SRA number of SRA018951. The raw fastq files were aligned using BWA-mem [39] to the human hg19 reference genome, and the BAM files were sorted using SAMtools [40]. Picard toolkit [41] was used to remove duplicate reads. The clean BAM files were fed as the input of SCYN package.

Ten-x (10x) data
The 10x spike-in scDNA-seq data was collected from the 10x Genomics official dataset with the accession link https://support.10xgenomics.com/single-celldna/datasets. The cell-mixed BAM files were demultiplexed to cellular BAMs according to cellular barcodes using Python scripts.

Notations
To profile the CNV along genomes, first, we partition the genome into fix-size bins. Assume the number of bins as m. If the number of cells is n, then the input matrices, Y m×n andŶ m×n , contain the raw and normalized reads counts, respectively; that is, Y i,j includes the number of raw reads count belong to bin i at cell j andŶ m×n contains the number of normalized reads count belong to bin i at cell j, where 1 ≤ i ≤ m and 1 ≤ j ≤ n.

Segmentation
The first task is to partitioning the bins into segments to optimize an objective function. Here, we choose the objective function to maximize the simplified version of modified Bayesian information criteria (mBIC) proposed by Wang et al. [31].
To calculate the simplified mBIC, we need to partition the sequence of bins into segments s 1 , ..., s , where s k = Denote the number of bins in segment s k as |s k | With the partitioning, we can calculate two matrices X ×n ,X ×n , where X k,j = 1 Given a segmentation S = (s 1 , ..., s ), its simplified mBIC is calculated as where log L τ L 0 is the generalized log-likelihood ratio, κ 1 and κ 2 are two pre-defined constants and For more details on the interpretation of the terms in mBIC, we refer the readers to Wang et al. [31]. Our objective here is to find a segmentation S opt such that β(S opt ) is maximized.

Optimal algorithm
Let β(k, i) store the simplified mBIC value for the optimal segmentation which partitions bins 1, ..., i into k segments. Associated with β(k, i), we also store the corresponding generalized log-likelihood ratio L(k, i), which is the first term in Eq. 1, the log-likelihood ratio l(i, j) for a single segment starting at the i-th bin and ending at the j-th bin, and The β(k, i) is calculated by the following recursive formulations: where C is the sum of last two terms in Equation 1. As demonstrated in Equation 3, the value of each cell β(k, i) in table β can be computed based on the earlier store data L(k − 1, i ) and l(i + 1, ..., i). The computed β(k, i) is then used to incrementally with k and i to compute the correct values of β. Clearly, the values of β and L for one segment can be initialized to equal to l.
The values of β can be stored in a two dimensional array, i.e., a table. The procedure for computing the table β is also displayed in Algorithm 1. The table β will be constructed starting from a single segment β(1, i), and moving towards more segments β(k, i). The β(1, i) and L(1, i) are initialized to l(1, i) and T(0, i) is initialized to 0 when there is only one segment. When computing a cell β(k, i)(k > 1), we will checks all possible i , (k ≤ i < i) and compute all values of (L(k − 1, i ) + l(i + 1, ..., i) + C) and β(k, i) is determined by max (L(k−1,i )+l(i +1,...,i))+C . Processing the bins form in increasing order on length guarantees that the final optimal segmentation can be detected when i is equal to the total number of bins m. At the last, the positions of k − 1 turning points are stored in table T. for segment number k from 1 to pre-defined K do 3: for each bin i from 1 to m do 4: if k == 1 then 5: β(1, i) = l(1, i) 6: L(1, i) = l(1, i) 7: T(0, i) = 0 8: else 9: β(k, i) = max 1≤i <i (L(k − 1, i ) + l(i + 1, ..., i) + C) 10: L(k, i) = arg max i (β(k, i))(L(k−1, i )+ l(i + 1, ..., i)) 11: T(k − 1, i) = arg max i (β(k, i)) 12: end if 13: end for 14: end for 15: end procedure

Backtracking
The backtracking process of finding the positions of the optimal turning points is demonstrated in Fig. 1B. Let the table at the left-side of Fig. 1B as T, where i and j are the indexes of turning points and bins respectively. T(i, j) is the position of the i-th optimal turning point for a segment s(0, j). The optimal total turning points number is determined by the maximum value of β(i, m), where m is the total number of bins. Then the positions of the optimal turning points can be found by the following formulation: T(k − 2, j) = T(k − 2, T(k − 1, m) − 1) where k is the total segmentation number (1 < k ≤ K), j is the index of bin and m is the total number of bins.

Time complexity
The

Benchmark settings
SCOPE is a state-of-the-art tool for single cell CNV calling. We followed the steps in SCOPE README tutorial to perform the call CNV tasks in all datasets and the default parameters were used in all experiments. For SCYN, the function 'call()' was used and all parameters were set to default values. For running time analysis experiments, all experiments were run on a Dell server with an Intel(R) Xeon(R) CPU E5-2630 v3 with a clock speed of 2.40GHz. The mean value of 5 independent runs was regarded as the final running time for each tool.