Read coverage of RNA sequencing data reflects gene expression and RNA processing events. Single-cell RNA sequencing (scRNA-seq) methods, particularly “full-length” ones, provide read coverage of many individual cells and have the potential to reveal cellular heterogeneity in RNA transcription and processing. However, visualization tools suited to highlighting cell-to-cell heterogeneity in read coverage are still lacking.
Here, we have developed Millefy, a tool for visualizing read coverage of scRNA-seq data in genomic contexts. Millefy is designed to show read coverage of all individual cells at once in genomic contexts and to highlight cell-to-cell heterogeneity in read coverage. By visualizing read coverage of all cells as a heat map and dynamically reordering cells based on diffusion maps, Millefy facilitates discovery of “local” region-specific, cell-to-cell heterogeneity in read coverage. We applied Millefy to scRNA-seq data sets of mouse embryonic stem cells and triple-negative breast cancers and showed variability of transcribed regions including antisense RNAs, 3 ′ UTR lengths, and enhancer RNA transcription.
Single-cell RNA sequencing (scRNA-seq) has been increasingly important in many areas, including developmental biology and cancer biology. In scRNA-seq data analyses, visualization is crucial for quality control (QC) as well as exploratory data analyses. For example, dimensionality reduction techniques such as principal component analysis (PCA)  or t-distributed stochastic neighbor embedding  are applied to a gene expression matrix to visualize individual cells as points in 2- or 3-dimensional space. Heat maps are also used to visualize gene expression matrices and highlight latent clusters of genes and cells. To date, many tools for visualizing gene expression matrices of scRNA-seq data have been proposed .
In contrast to visualization of gene expression matrices, visualization of read coverage, which is the distribution of mapped reads along genomic coordinates, helps reveal diverse aspects of RNA sequencing data and thus RNA biology and functional genomics. For example, read coverage reflects transcribed gene structures (e.g., exon-intron structures and transcript isoforms) , RNA processing events (e.g., normal and recursive splicing ), and transcription of intergenic and unannotated regions (e.g., enhancer RNAs [eRNAs]) . Moreover, visual inspection of read coverage enables quality assessment of experimental methods (e.g., whether amplification is biased ) and bioinformatic methods (e.g., the accuracy of expression level estimation).
Given that scRNA-seq has revealed cellular heterogeneity in gene  and splicing isoform expression [9, 10], visualization of read coverage of scRNA-seq data is expected to reveal cellular heterogeneity in read coverage, which can be interpreted as biological (e.g., transcription and RNA processing) and technical (e.g., amplification biases) heterogeneity. Read coverage is informative, especially for so-called “full-length” scRNA-seq methods such as Smart-seq2  and RamDA-seq , compared with “3 ′-tag sequencing” scRNA-seq methods, which sequence only the 3 ′ ends of RNAs and cannot be used to extract rich information from read coverage [13, 14]. Despite their potential importance, however, tools specifically for the visualization of read coverage of scRNA-seq data are still lacking.
To explore cell-to-cell heterogeneity in read coverage, we propose several requirements of a tool for visualization of read coverage in scRNA-seq data (Table 1). First, the tool must be able to display read coverage of all individual cells in a scRNA-seq dataset at once. This is because scRNA-seq data consist of many (102– 103) cells and frequently includes latent heterogeneity that is masked by the summation of expression across cells. Second, the tool must associate read coverage with genomic contexts, such as gene structures and epigenomic features, because read coverage data can be interpreted only when it is displayed simultaneously with their genomic contexts. Third, the tool must highlight the cell-to-cell heterogeneity of read coverage within focal regions. This is because there should be “local” region-specific cell-to-cell heterogeneity in read coverage at transcriptional (e.g., antisense RNAs and eRNAs) and post-transcriptional (e.g., alternative splicing) levels, and such heterogeneity is difficult to notice in advance by cell groupings defined according to global similarity among cells.
Genome browsers and heat maps are two major tools for read coverage visualization. However, they are insufficient for fulfilling the above requirements.
Genome browsers, such as IGV  and JBrowse , utilize “tracks” to display various types of biological data, including gene annotations, positions of regulatory elements, and read mapping of next-generation sequencing (NGS) data along genomic coordinates. By stacking tracks in genome browsers, read coverage can easily be compared with other features and be interpreted in genomic contexts like gene models and epigenomic signals, which helps to generate and validate biological hypotheses. However, existing genome browsers are not suited for the large numbers of samples (i.e., cells) in scRNA-seq experiments. Indeed, efforts to visualize read coverage of scRNA-seq data using genome browsers have been limited to displays of a few dozen cells without the need to scroll [17, 18]. Although IGV and JBrowse implement heat map representations of tracks to show many cells at once, they cannot dynamically reorder tracks to reveal local cell-to-cell heterogeneity in read coverage.
Tools for heat maps combined with clustering algorithms have been used in the analysis of scRNA-seq data. Thus, heat maps can be used to visualize read coverage of all cells at once and reveal heterogeneity in read coverage. However, tools for generating heat maps are unsuited for visualizing read coverage of scRNA-seq data in genomic contexts, or they lack functionality to directly extract read coverage from standard NGS data formats.
Here, we have developed Millefy, which combines genome-browser-like visualization, heat maps, and dynamic reordering of single-cell read coverage and thus facilitates the examination of local heterogeneity within scRNA-seq data. Millefy extracts and organizes various types of useful information from read coverage of scRNA-seq data.
Millefy visualizes read coverage from each individual cell as a heat map in which rows represent cells and columns represent genomic bins within a focal region. The heat map is aligned with tracks for gene annotations, genomic features, and bulk NGS data, enabling comparisons of single-cell read coverage with genomic contexts. To highlight latent cell-to-cell heterogeneity in read coverage, the heat map rows (i.e., cells) are automatically and dynamically reordered by ’local’ pseudo-time, which is calculated using diffusion maps , a nonlinear dimensionality reduction method. Specifically, diffusion maps are applied to matrices of single-cell read coverage, where rows are cells and columns are genomic bins, and the first diffusion component is used to dynamically reorder cells either in an“all cells” manner or in a “group-wise” manner when groupings of cells are provided by users. Alternatively, PCA can also be used to rearrange the order of cells.
Effective visualization requires iterative adjustment of various aspects of plots by visual inspection. Millefy supports iterative adjustment of plots by the millefy_adjust() function, which reuses the read coverage matrices of the last plot, enabling faster adjustment than simply replotting. For example, after viewing plots, users can easily adjust the maximal value of the color scale, which is essential in cases with exceptionally high read coverages in the focal region.
Millefy visualizations consist of five types of tracks (Fig. 1): (1) scRNA-seq tracks, which display scRNA-seq read coverage as a heat map with ordered cells, (2) mean scRNA-seq read coverage tracks, (3) bulk NGS data tracks, which display read coverage of other NGS data, (4) BED tracks, which display genomic intervals defined by BED files, and (5) gene annotation tracks. In scRNA-seq tracks and bulk NGS data tracks, read coverage is normalized by user-provided normalization factors to correct for differences in the number of mapped reads among samples. Using the above tracks, Millefy can simultaneously display read coverage of each cell and mean read coverage of cells in each user-defined cell group as well as align scRNA-seq data with genome annotation data and NGS data.
Millefy was implemented in R and can import scRNA-seq data without the need for format conversion. For scRNA-seq data, Millefy accepts BAM and BigWig formats, which are standard file formats for NGS data analysis. Millefy is dependent on the rtracklayer package  and Rsamtools package  for importing BAM and BigWig files, respectively. For gene annotation data and genomic features, Millefy accepts GTF and BED formats, respectively. The data.table package  is used to import GTF and BED files. For performing diffusion maps on read coverage data, Millefy utilizes the destiny package .
Millefy highlights cellular heterogeneity in gene expression and transcribed gene structures
Researchers often merge read alignment files of single cells and visualize “synthetic bulk” data using standard genome browsers. However, in such cases, the merged (or averaged) read coverage cannot capture heterogeneity in read coverage. For example, a change in the merged read coverage cannot indicate whether the number of cells expressing a gene increased or the expression level of that gene increased across all cells. In contrast, Millefy visualizes read coverage of all individual cells in a scRNA-seq dataset as a heat map and thereby provides detailed information on cellular heterogeneity in read coverage.
To demonstrate the usefulness of Millefy’s ability to visualize read coverage in scRNA-seq data, we used a time-course RamDA-seq dataset derived from mouse embryonic stem cells (mESCs) upon induction of cell differentiation to primitive endoderm cells (at 0, 12, 24, 48, and 72 h) . The dataset consists of 421 single cells.
Figure 2 shows the read coverage at Sox17, a differentiation marker gene. Cells were reordered according to the first diffusion component values calculated by a diffusion map of read coverage data for the locus, either within user-defined cell groups (Fig. 2a) or across all cells (Fig. 2b). While the height of the mean read coverage increased along the differentiation time course, the reordered heat map highlights the heterogeneity of read coverage among cells from the same time points (e.g., the 12 h group) (Fig. 2). Specifically, Millefy showed that the number of cells with Sox17 expression increased, indicating asynchronous cell differentiation progression among cells.
Another example is Zmynd8, a transcriptional repressor. Figure 3 shows read coverage of 421 individual cells at the Zmynd8 locus. The cells were dynamically reordered using diffusion maps based on the read coverage in the focal region. Expression of the Zmynd8 short isoform is known to be associated with the expression of its antisense RNA Zmynd8as . While Zmynd8as is unannotated in the current gene annotation, the heat map by Millefy clearly showed differential regulation of the long isoform of Zmynd8 and Zmynd8as, facilitating visual inspection of the two separated transcription units (Figure 3). We note that the averaged read coverage for each time point cannot distinguish whether the long and short isoforms of Zmynd8 and Zmynd8as are correlated or uncorrelated. These results demonstrate that Millefy’s functionality for displaying read coverage as a reordered heat map reveals cell-to-cell heterogeneity at the focal locus.
Millefy application on scRNA-seq data from triple-negative breast cancer patients
We also applied Millefy to a scRNA-seq dataset from triple-negative invasive cancer (TNBC) patients: the Smart-seq2 data of single sorted cells from six tumors collected from six women with primary, non-metastatic triple-negative invasive ductal carcinomas . The dataset consists of B-cells (n=19), endothelial cells (n=14), epithelial cells (n=868), macrophages (n=64), stromal cells (n=94), and T-cells (n=53), according to the cell type annotation published by the authors. Epithelial cells were further classified into five clusters by the authors : Clusters 1 (n=22), 2 (n=398), 3 (n=231), 4 (n=170), and 5 (n=47).
We checked whether 3 ′ UTR shortening is observed in the endothelial cell clusters. The Neuroblastoma RAS viral (v-Ras) oncogene homolog (NRAS) and the Jun proto-oncogene (c-JUN) are known to show alternative polyadenylation (APA)-dependent 3 ′ UTR shortening in TNBC cells . Millefy appears to show that there is cell-to-cell heterogeneity in the length of the 3 ′ UTRs of c-JUN and NRAS (Fig. 4). Specifically, for c-JUN, some cells showed short read coverage and others showed long read coverage (Fig. 4a). In the last exon of NRAS, many cells showed long 3 ′ UTR read coverage but some cells showed a shortened 3 ′ UTR read coverage (Fig. 4b). Such heterogeneity cannot be determined by the aggregated (averaged) read coverage alone (Fig. 4). The triple-negative breast tumors with the 3 ′ UTR shortening of c-JUN and/or NRAS are reported to be smaller and less proliferative but more invasive than those without the 3 ′ UTR shortening . Therefore, such subpopulations may confer heterogeneous invasiveness even when the population size is small; thus, further investigation is warranted.
Millefy associates read coverage with genomic contexts to facilitate interpretation of read coverage
Genomic contexts are crucial for interpretation of read coverage in bulk and single-cell RNA sequencing data. For example, read coverage overlapped with gene annotations can confirm known and reveal novel exon-intron structures. Moreover, read coverage overlapped with enhancer annotations can be interpreted as eRNA expression . Using Millefy, single-cell read coverage can be compared with genomic and epigenomic features like enhancer elements.
To demonstrate the usefulness of the simultaneous visualization of single-cell read coverage and genomic contexts, we compared read coverage of the RamDA-seq data from mESCs (0 h) with mESC enhancer regions. We downloaded H3K4me1 and H3K4me3 ChIP-seq peak regions for mESCs from the ENCODE project  and defined mESC enhancers as the H3K4me1 peaks that (1) did not overlap with the H3K4me3 peaks, (2) were at least 2 kbp away from the transcriptional start sites, and (3) were not included in the gene bodies of the GENCODE gene annotation (vM9) .
Figure 5 displays read coverage at the Myc locus, with the positions of enhancers active in mESCs. The Myc gene models and read coverage reveal that Myc was transcribed in mESCs. In addition, Millefy showed that some of the intergenic regions with transcribed RNA overlapped with the Myc downstream enhancer regions (Fig. 5). This is consistent with the previous report that RamDA-seq can detect eRNAs . This result exemplifies how Millefy can help to interpret read coverage of scRNA-seq data in genomic contexts.
Millefy facilitates quality control in full-length scRNA-seq methods
Millefy can also be used for QC in full-length scRNA-seq methods. For example, scRNA-seq read coverage of long transcripts indicates whether the method employed provided full-length transcript coverage. Full-length transcript coverage provides accurate information about isoform expression and gene structures and is a fundamental feature of full-length scRNA-seq methods .
We applied Millefy to C1-RamDA-seq data (n=96) and C1-SMART-Seq V4 (n=95) data from a dilution of 10 pg of mESC RNA. Figure 6 shows the read coverage at Mdn1, a gene with a long transcript (17,970 bp) consisting of 102 exons. C1-RamDA-seq detected all exons in most samples, while C1-SMART-seq V4 failed to detect a fraction of known exons. Interestingly, the patterns of missing exons in C1-SMART-seq V4 seemed to vary among the samples. The lower reproducibility in read coverage of C1-SMART-seq V4 relative to C1-RamDA-seq is likely owing to technical noise because the samples were prepared not from living cells but from a dilution of 10 pg of RNA. We note that mean read coverage cannot provide such detailed information on reproducibility in read coverage (Fig. 6). This result exemplifies how Millefy can be used for quality control of scRNA-seq methods.
We measured the computational time of Millefy for visualizing whole gene bodies using RamDA-seq data with 793 samples from mESCs . For 1000 randomly selected gene loci (of expressed genes with average TPM >5), Millefy processed BigWig and BAM files in 39.3 and 138.9 s, respectively, on average.
In this paper, we proposed Millefy, a tool for visualizing cell-to-cell heterogeneity in read coverage in scRNA-seq data. Millefy combines genome-browser-like visualization, heat maps, and dynamic reordering of single-cell read coverage. Thereby, Millefy can display read coverage of all cells at once, associate read coverage with genomic contexts, and highlight the cell-to-cell heterogeneity of read coverage within focal regions (Table 1).
Using scRNA-seq data of mESCs and TNBC, we demonstrate the effectiveness of Millefy to reveal local heterogeneity in read coverage within scRNA-seq data. First, Millefy showed cellular heterogeneity in gene expression and transcribed gene structures through the cell differentiation time course of mESCs (Figs. 2 and 3). Second, we found cellular heterogeneity in the 3 ′ UTR shortening of c-JUN and/or NRAS in TNBC data (Fig. 4), which was not mentioned in the original paper . Third, by associating read coverage with enhancer annotations, Millefy helped to interpret RNA transcription events in non-coding regions (Fig. 5). Collectively, these results indicate that Millefy enables the exploration of cellular heterogeneity of various biological events from read coverage of scRNA-seq data, which could be missed by conventional visualization tools.
Quality control is essential for developing new experimental and computational tools. Using scRNA-seq data of diluted RNA with different full-length scRNA-seq methods (Fig. 6), we demonstrate that Millefy visualizes read coverage in scRNA-seq as a QC measure and complements existing scRNA-seq QC pipelines based primarily on gene expression matrices . In the development of bioinformatics methods using rule-based and machine learning approaches for profiling alternative splicing or novel RNAs by scRNA-seq, read coverage visualization tools like Millefy will become more important for evaluating and representing the predictions of algorithms. Indeed, Millefy was recently used for evaluating and visualizing the results of software developed to discover differentially expressed (DE) gene regions .
Millefy, which is integrated with Jupyter Notebook and provided as a Docker image, can easily be utilized in exploratory analyses of scRNA-seq data. In conclusion, Millefy will provide new opportunities to analyze scRNA-seq data from the point of view of cell-to-cell heterogeneity in read coverage, and help researchers assess cellular heterogeneity and RNA biology using scRNA-seq data.
Song Y, Botvinnik OB, Lovci MT, Kakaradov B, Liu P, Xu JL, Yeo GW. Single-cell alternative splicing analysis with expedition reveals splicing dynamics during neuron differentiation. Mol Cell. 2017; 67(1):148–161.
Buels R, Yao E, Diesh CM, Hayes RD, Munoz-Torres M, Helt G, Goodstein DM, Elsik CG, Lewis SE, Stein L, et al. Jbrowse: a dynamic web platform for genome visualization and analysis. Genome Biol. 2016; 17(1):66.
Coifman RR, Lafon S, Lee AB, Maggioni M, Nadler B, Warner F, Zucker SW. Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps. Proc Natl Acad Sci. 2005; 102(21):7426–31.
Miles WO, Lembo A, Volorio A, Brachtel E, Tian B, Sgroi D, Provero P, Dyson N. Alternative polyadenylation in triple-negative breast tumors allows nras and c-jun to bypass pumilio posttranscriptional regulation. Cancer Res. 2016; 76(24):7231–41.
Frankish A, Diekhans M, Ferreira A-M, Johnson R, Jungreis I, Loveland J, Mudge JM, Sisu C, Wright J, Armstrong J, et al.Gencode reference annotation for the human and mouse genomes. Nucleic Acids Res. 2018; 47(D1):766–73.
Ramsköld D, Luo S, Wang Y-C, Li R, Deng Q, Faridani OR, Daniels GA, Khrebtukova I, Loring JF, Laurent LC, et al.Full-length mrna-seq from single-cell levels of rna and individual circulating tumor cells. Nat Biotechnol. 2012; 30(8):777.
We thank the members of the Bioinformatics Research Unit, particularly Hirotaka Matsumoto and Mika Yoshimura, for discussion of data analyses and Manabu Ishii and Akihiro Matsushima for IT infrastructure management.
HO was supported by the Special Postdoctoral Researchers Program from RIKEN. This work was supported by the Projects for Technological Development, Research Center Network for Realization of Regenerative Medicine by Japan (18bm0404024h0001), the Japan Agency for Medical Research and Development (AMED), and JST CREST grant number JPMJCR16G3, Japan to I.N. The funding bodies played no role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.
Authors and Affiliations
Bioinformatics Laboratory, Faculty of Medicine, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki, 305-8575, Japan
Center for Artificial Intelligence Research, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki, 305-8577, Japan
Laboratory for Bioinformatics Research, RIKEN Center for Biosystems Dynamics Research, 2-1 Hirosawa, Wako, Saitama, 351-0198, Japan
Haruka Ozaki, Tetsutaro Hayashi, Mana Umeda & Itoshi Nikaido
Bioinformatics Course, Master’s/Doctoral Program in Life Science Innovation, School of Integrative and Global Majors, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki, 305-8575, Japan
HO designed and implemented the software, performed the data analyses, and wrote the manuscript. TH and MU performed the C1-RamDA-seq and C1-SMART-Seq V4 experiments. IN contributed to the design of the data analyses. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.