Millefy highlights cellular heterogeneity in gene expression and transcribed gene structures
Researchers often merge read alignment files of single cells and visualize “synthetic bulk” data using standard genome browsers. However, in such cases, the merged (or averaged) read coverage cannot capture heterogeneity in read coverage. For example, a change in the merged read coverage cannot indicate whether the number of cells expressing a gene increased or the expression level of that gene increased across all cells. In contrast, Millefy visualizes read coverage of all individual cells in a scRNA-seq dataset as a heat map and thereby provides detailed information on cellular heterogeneity in read coverage.
To demonstrate the usefulness of Millefy’s ability to visualize read coverage in scRNA-seq data, we used a time-course RamDA-seq dataset derived from mouse embryonic stem cells (mESCs) upon induction of cell differentiation to primitive endoderm cells (at 0, 12, 24, 48, and 72 h) [12]. The dataset consists of 421 single cells.
Figure 2 shows the read coverage at Sox17, a differentiation marker gene. Cells were reordered according to the first diffusion component values calculated by a diffusion map of read coverage data for the locus, either within user-defined cell groups (Fig. 2a) or across all cells (Fig. 2b). While the height of the mean read coverage increased along the differentiation time course, the reordered heat map highlights the heterogeneity of read coverage among cells from the same time points (e.g., the 12 h group) (Fig. 2). Specifically, Millefy showed that the number of cells with Sox17 expression increased, indicating asynchronous cell differentiation progression among cells.
Another example is Zmynd8, a transcriptional repressor. Figure 3 shows read coverage of 421 individual cells at the Zmynd8 locus. The cells were dynamically reordered using diffusion maps based on the read coverage in the focal region. Expression of the Zmynd8 short isoform is known to be associated with the expression of its antisense RNA Zmynd8as [24]. While Zmynd8as is unannotated in the current gene annotation, the heat map by Millefy clearly showed differential regulation of the long isoform of Zmynd8 and Zmynd8as, facilitating visual inspection of the two separated transcription units (Figure 3). We note that the averaged read coverage for each time point cannot distinguish whether the long and short isoforms of Zmynd8 and Zmynd8as are correlated or uncorrelated. These results demonstrate that Millefy’s functionality for displaying read coverage as a reordered heat map reveals cell-to-cell heterogeneity at the focal locus.
Millefy application on scRNA-seq data from triple-negative breast cancer patients
We also applied Millefy to a scRNA-seq dataset from triple-negative invasive cancer (TNBC) patients: the Smart-seq2 data of single sorted cells from six tumors collected from six women with primary, non-metastatic triple-negative invasive ductal carcinomas [25]. The dataset consists of B-cells (n=19), endothelial cells (n=14), epithelial cells (n=868), macrophages (n=64), stromal cells (n=94), and T-cells (n=53), according to the cell type annotation published by the authors. Epithelial cells were further classified into five clusters by the authors [25]: Clusters 1 (n=22), 2 (n=398), 3 (n=231), 4 (n=170), and 5 (n=47).
We checked whether 3 ′ UTR shortening is observed in the endothelial cell clusters. The Neuroblastoma RAS viral (v-Ras) oncogene homolog (NRAS) and the Jun proto-oncogene (c-JUN) are known to show alternative polyadenylation (APA)-dependent 3 ′ UTR shortening in TNBC cells [26]. Millefy appears to show that there is cell-to-cell heterogeneity in the length of the 3 ′ UTRs of c-JUN and NRAS (Fig. 4). Specifically, for c-JUN, some cells showed short read coverage and others showed long read coverage (Fig. 4a). In the last exon of NRAS, many cells showed long 3 ′ UTR read coverage but some cells showed a shortened 3 ′ UTR read coverage (Fig. 4b). Such heterogeneity cannot be determined by the aggregated (averaged) read coverage alone (Fig. 4). The triple-negative breast tumors with the 3 ′ UTR shortening of c-JUN and/or NRAS are reported to be smaller and less proliferative but more invasive than those without the 3 ′ UTR shortening [26]. Therefore, such subpopulations may confer heterogeneous invasiveness even when the population size is small; thus, further investigation is warranted.
Millefy associates read coverage with genomic contexts to facilitate interpretation of read coverage
Genomic contexts are crucial for interpretation of read coverage in bulk and single-cell RNA sequencing data. For example, read coverage overlapped with gene annotations can confirm known and reveal novel exon-intron structures. Moreover, read coverage overlapped with enhancer annotations can be interpreted as eRNA expression [6]. Using Millefy, single-cell read coverage can be compared with genomic and epigenomic features like enhancer elements.
To demonstrate the usefulness of the simultaneous visualization of single-cell read coverage and genomic contexts, we compared read coverage of the RamDA-seq data from mESCs (0 h) with mESC enhancer regions. We downloaded H3K4me1 and H3K4me3 ChIP-seq peak regions for mESCs from the ENCODE project [27] and defined mESC enhancers as the H3K4me1 peaks that (1) did not overlap with the H3K4me3 peaks, (2) were at least 2 kbp away from the transcriptional start sites, and (3) were not included in the gene bodies of the GENCODE gene annotation (vM9) [28].
Figure 5 displays read coverage at the Myc locus, with the positions of enhancers active in mESCs. The Myc gene models and read coverage reveal that Myc was transcribed in mESCs. In addition, Millefy showed that some of the intergenic regions with transcribed RNA overlapped with the Myc downstream enhancer regions (Fig. 5). This is consistent with the previous report that RamDA-seq can detect eRNAs [12]. This result exemplifies how Millefy can help to interpret read coverage of scRNA-seq data in genomic contexts.
Millefy facilitates quality control in full-length scRNA-seq methods
Millefy can also be used for QC in full-length scRNA-seq methods. For example, scRNA-seq read coverage of long transcripts indicates whether the method employed provided full-length transcript coverage. Full-length transcript coverage provides accurate information about isoform expression and gene structures and is a fundamental feature of full-length scRNA-seq methods [29].
We applied Millefy to C1-RamDA-seq data (n=96) and C1-SMART-Seq V4 (n=95) data from a dilution of 10 pg of mESC RNA. Figure 6 shows the read coverage at Mdn1, a gene with a long transcript (17,970 bp) consisting of 102 exons. C1-RamDA-seq detected all exons in most samples, while C1-SMART-seq V4 failed to detect a fraction of known exons. Interestingly, the patterns of missing exons in C1-SMART-seq V4 seemed to vary among the samples. The lower reproducibility in read coverage of C1-SMART-seq V4 relative to C1-RamDA-seq is likely owing to technical noise because the samples were prepared not from living cells but from a dilution of 10 pg of RNA. We note that mean read coverage cannot provide such detailed information on reproducibility in read coverage (Fig. 6). This result exemplifies how Millefy can be used for quality control of scRNA-seq methods.
Computational time
We measured the computational time of Millefy for visualizing whole gene bodies using RamDA-seq data with 793 samples from mESCs [12]. For 1000 randomly selected gene loci (of expressed genes with average TPM >5), Millefy processed BigWig and BAM files in 39.3 and 138.9 s, respectively, on average.