Visualization and inspection of genomics aberrations yielded by SCSilicon
We first applied BWA 0.7.17 [23] to align the synthetic single-cell FASTQ reads yielded by SCSilicon to the human reference (hg19). Then we visualized the ground-truth genomics aberration produced by SCSilicon and inspected whether the single-cell DNA reads carry the ground-truth abnormalities in SNP, SNV, Indel, and CNV, respectively.
Figure 2A exhibits the SNPs profile SCSilicon automatically generated across 10 single-cells. The cell population was assumed to share similar but slightly varied SNPs profiles. In Fig. 2A, we only visualized 100 randomly selected SNPs as we found when the number of SNP data increases (for example, 1000 SNPs), the heatmap would look a little fuzzy to clearly reflect the above characteristics.
Next, we leveraged IGV browser [24] to visualize the landscape of simulated SNPs across 25kb local genome region on gene SEZ6L (chr22:26,615,000-26,640,000) in two cells. As expected in Fig. 2B, the synthetic SNPs are randomly scattered in the reads. Likewise, cell 1 and cell 2 share alike but slightly different SNPs profiles. We then manually checked three SNPs (Fig. 2C). Located in chr22:36,750,551, SNP1 has three reference alleles G and three alternative alleles A in cell 1, and four alternative alleles A in cell 2. Located in chr22:36,750,587, SNP2 has seven reference alleles T and two alternative alleles A in cell 1, and five reference alleles T in cell 2. Located in chr22:36,750,622, SNP3 has five reference alleles G and two alternative alleles T in cell 1, and three reference alleles G in cell 2. Similarly, Additional file 1 Supplementary Fig. S1 and Fig. S2 demonstrates the SNV and Indel events SCSilicon generated. We also evaluated the generating-accuracy (the percentage of correctly generated SNPs, SNVs or Indels in sequence data from all SNPs, SNVs or Indels in ground truth data) of SCSilicon. We generated three dataset, SNP dataset, SNV dataset and Indel dataset respectively. Each dataset contained 10 cells and the average generating-accuracy was calculated for each dataset. The result show that the average generating-accuracies of these three dataset are all 100% which reflects the stability of SCSilicon.
Figure 3A and Fig. 3B are illustrations of two CNV matrices automatically generated by SCSilicon’s CNVSimulator with the random seed. The configuration is 100 single cells among chr22 with 50M as a bin, leading to matrices size of 100 × 70. The left-side matrix offers 20 normal cells and seven tumor cell clusters, with four CNV breakpoints and five CNV segments. The right-side matrix is more complicated. It owns 40% healthy cells and eight tumor subclones, with nine CNV breakpoints and ten CNV segments. Figure 3C is a snapshot of IGV visualization of CNV breakpoint chr22:49,500,000 across two cells. In cell 3, the breakpoint’s downstream region’s coverage is much higher than the upstream region. In cell 4, the breakpoint’s downstream region’s coverage is much lower than the upstream region. Meanwhile, the cell 3 breakpoint’s upstream region coverage is lower than the cell 4 breakpoint’s downstream region. These observations are concordant with the synthetic CNV ground-truth (cell 3 upstream region: 1, cell 3 downstream region: 8, cell 4 upstream region: 8, cell 4 downstream region: 3).
Benchmarking state-of-the-art single-cell CNV caller
Recall that copy number variation (CNV) is considered to be a driving force in cancer progression and metastasis in single-cell genomics study [4, 5, 25]. Over the decades, an arsenal of scDNA-Seq CNV caller has been proposed. AneuFinder automatically qualifies the CNV profile leveraging a Hidden Markov model [10]. SCOPE [11] detected CNV by a Poisson latent factor model. SCYN [12] adopts SCOPE’s normalization policy and utilizing dynamic programming to conduct CNV segmentation. Be aware of each scDNA-seq CNV caller’s merits and demerits and choose the most robust one is essential to conduct single-cell genomics studies. Herein, we utilized the synthetic single-cell DNA reads generated by SCSilicon to evaluated three state-of-the-art CNV callers: AneuFinder, SCOPE, and SCYN.
We have mimicked two CNV matrices with 100 single cells on chr22 (50M bp/bin), dataset1 and dataset2 with noise rate 10% and 12%, respectively. For CNV dataset1, Fig. 4A and Fig. 4E displays the noisy and clean CNV ground-truth. This benchmark set has five cell subpopulations, with one normal cells clusters (average CNV is 2) and four tumor cell clusters with different CNV gains and losses. The ground-truth CNV matrix harbours six CNV breakpoints (chr22:29,500,000, chr22:31,500,000, chr22:39,000,000, chr22:40,500,000, chr22:43,000,000, chr22:49,500,000), leading to seven CNV segments. Figure 4B,C,D illustrates the estimated CNV matrix on the synthetic reads from AneuFinder, SCOPE, and SCYN, respectively. From bare-eye checking, SCOPE and SCYN can absorb the noise and distinguish the healthy cells, whereas AneuFinder’s performance is hugely skewed by the bias, mistakenly recognizing a large proportion of healthy cells as aneuploidy. However, AneuFinder successfully detected all six CNV breakpoints just like SCYN, while SCOPE attaches one fictional breakpoint between CNV segment chr22:29,500,000-31,500,000, and fails to call two vital breakpoints (chr22:40,500,000 and chr22:49,500,000). Furthermore, Fig. 4F,G,H reveals that the CNV inferred from SCYN has the highest Pearson correlation (R=0.99,p<2.2e−16) with ground truth CNV. We checked the CNV calling accuracy as well. In terms of neutral, gain, and loss, we treat them as three binary classification problems. We labeled the ground-truth and inferred CNV of each cell bin region with “neutral” (CN = 2) and “not neutral” (CN ≠ 2), “gain” (CN > 2) and “not gain” (CN ≤ 2), and “loss” (CN < 2) and “not loss” (CN ≥ 2). We defined the CNV calling accuracy as the correct predictions divided by the total number of predictions and used Python sklearn “metrics.accuracy_score” function to calculate it. Figure 4I and Additional file 1 Supplementary Table S1 demonstrates that SCYN manifests the highest CNV calling accuracy in neutral, gain, and loss region, respectively. For CNV dataset2 with a higher noise rate, Additional file 1 Supplementary Fig. S3 and Supplementary Table S2 demonstrate SCYN shows the highest Pearson correlation (R=0.96,p<2.2e−16) and the highest CNV calling accuracy in neutral, gain, and loss region respectively.
Overall, SCYN demonstrates the best efficacy in both breakpoint detection and CNV estimation, whereas AneuFinder has a deficiency in CNV normalization, and SCOPE is limited to correct breakpoint detection.