 Research
 Open Access
 Published:
TSEE: an elastic embedding method to visualize the dynamic gene expression patterns of time series singlecell RNA sequencing data
BMC Genomics volume 20, Article number: 224 (2019)
Abstract
Background
Time series singlecell RNA sequencing (scRNAseq) data are emerging. However, the analysis of time series scRNAseq data could be compromised by 1) distortion created by assorted sources of data collection and generation across time samples and 2) inheritance of celltocell variations by stochastic dynamic patterns of gene expression. This calls for the development of an algorithm able to visualize time series scRNAseq data in order to reveal latent structures and uncover dynamic transition processes.
Results
In this study, we propose an algorithm, termed time series elastic embedding (TSEE), by incorporating experimental temporal information into the elastic embedding (EE) method, in order to visualize time series scRNAseq data. TSEE extends the EE algorithm by penalizing the proximal placement of latent points that correspond to data points otherwise separated by experimental time intervals. TSEE is herein used to visualize time series scRNAseq datasets of embryonic developmental processed in human and zebrafish. We demonstrate that TSEE outperforms existing methods (e.g. PCA, tSNE and EE) in preserving local and global structures as well as enhancing the temporal resolution of samples. Meanwhile, TSEE reveals the dynamic oscillation patterns of gene expression waves during zebrafish embryogenesis.
Conclusions
TSEE can efficiently visualize time series scRNAseq data by diluting the distortions of assorted sources of data variation across time stages and achieve the temporal resolution enhancement by preserving temporal order and structure. TSEE uncovers the subtle dynamic structures of gene expression patterns, facilitating further downstream dynamic modeling and analysis of gene expression processes. The computational framework of TSEE is generalizable by allowing the incorporation of other sources of information.
Background
Singlecell RNA sequencing (scRNAseq) technology provides snapshots of transcriptomes at singlecell resolution, offering a comprehensive approach to study complex biological processes, such as cell fate decisions [1–3]. Given the challenges raised by highdimensional gene expression profiles of largescale heterogeneous cell populations at diverse stages during cell statetransitions, many computational methods have been proposed for the visualization, clustering, and reconstruction of scRNAseq data (see [4], a recent review).
Cells used for a singlecell experiment are snapshot of heterogeneous dynamic populations. However, the dynamic range is usually limited, and only specific cell fate decisions can be analyzed when the samples are collected at a single stage or condition. Therefore, to extend the dynamic range and account for whole processes of cell development, time series scRNAseq data are emerging. These time series data are generated by sampling single cells collected and sequenced at multiple time stages along the time course of cell developmental processes [5, 6]. Yet, by integrating time series analysis into scRNAseq, new questions and challenges arise. Importantly, we know that scRNAseq data across time stages are contaminated by assorted sources of variations during data collection and generation [7]. Moreover, the celltocell gene expression is highly variable across the stochastic dynamics of gene transcription [8]. To address these issues, a handful computational time series scRNAseq methods have been developed to (1) determine temporal trajectories [9], (2) reconstruct cell development landscapes based on the sophisticated mathematical tool of optimal transport [10], and (3) identify genegene interactions, as well as gene networks [8, 11].
However, for such methods to perform model reduction, computation and validation, they must incorporate dimensionality reduction and/or visualization of singlecell highdimensional gene expression profiles. Accordingly, dimensionality reduction would ideally reveal the intrinsic structure of the data by representing both global structure by preserving the topology and geometry and local structure by preserving the neighborhood relationship. Among the numerous dimensionality reduction and visualization methods, principal component analysis (PCA) and tdistributed stochastic neighbor embedding (tSNE) are most widely used in the singlecell community to visualize data structures [4, 12]. However, PCA is designed to preserve linear structure based on eigendecomposition of a matrix into canonical form, and as such, it is incapable of handling nonlinear structures. While tSNE emphasizes neighborhood information to reveal the local cluster structures of the data, it tends to shatter trajectories and fails to preserve the global structures [12].
To address the limits of dimensionality reduction methods, like PCA and tSNE, we recently adopted the nonlinear dimensionality reduction algorithm, Elastic Embedding (EE) [13], to accurately visualize and reconstruct the embedded intrinsic latent space of cell development trajectory[14]. Both tSNE and EE embed highdimensional data points into lowdimensional latent space by modeling the data points interactively with two terms. One term is an attractive force that attracts pairs of points towards each other, while the other term activates a repulsive force that simultaneously separates all pairs of points. A similar embedding idea was adopted in forcedirected layout embedding [15], which has also been used in the visualization of scRNAseq [16]. As an extension of tSNE, the EE algorithm penalizes placing latent points far from similar data points, as well as penalizes placing latent points from dissimilar data points close together [13], thereby preserving both local and global intrinsic data structures [17].
However, to the best of our knowledge, no dimensionality reduction and visualization methods can now incorporate temporal information of singlecell experiments into times series scRNAseq data, i.e., time stages when samples are collected. Instead, scRNAseq data from different time stages are combined as the input for methods like PCA and tSNE to obtain 2dimensional visualization of the data structure [5, 6, 10]. TSEE can carefully incorporate experimental temporal information, resulting in significant improvement of temporal resolution of the cells on the 2dimensional plane, as well as uncovering the subtle structures of dynamic gene expression patterns.
In this study, we propose a time series elastic embedding (TSEE) algorithm for dimensionality reduction and visualization of time series scRNAseq data by incorporating the temporal information of the experiments. TSEE is an extension of EE by introducing an additional repulsive force term when pairs of data points are collected at distinct time stages (see Eq. 1). TSEE penalizes placing latent points in close proximity to data points otherwise separated by experimental time interval. For developmental processes, the rationale that underlies TSEE holds that data points sampled at the same time stage should be more similar than those sampled from adjacent time stages, while those data points sampled from adjacent time stages should be more similar than data points sampled from the nonadjacent time stages. In this way, TSEE preserves the temporal order and structure of time series data through the input of experimental temporal information. A similar motivation was successfully applied in the timedependent community detection of timevarying networks [18].
In this paper, we first introduce the TSEE algorithm, and then provide an efficient numerical implementation of TSEE based on the partialHessian method developed by EE [19]. Next, we demonstrate the power of TSEE in the visualization of two datasets of time series scRNAseq: human preimplantation embryo (hereinafter denoted as HPE) dataset [5] and early zebrafish embryonic development (hereinafter denoted as Zebrafish) dataset [6]. By establishing a new constraint term of temporal repulsive force, TSEE dilutes the distortions of the assorted sources of data variations across time stages and achieves temporal resolution enhancement. Compared to existing methods such as PCA, tSNE and EE, our TSEE shows superior ability to gain time resolution on 2dimensional space by preserving local, global and temporal structures of time series scRNAseq data. Furthermore, the visualization represented by TSEE uncovers the subtle patterns of dynamic gene expression by showing continuous patterns with regularity at the interface between samples from adjacent time stages. For example, TSEE reveals the oscillating waves of gene expression for HER1, HER7, SOX2 and etc. along the time course, providing a solid foundation for downstream mathematical modeling and analysis. We also demonstrate robustness in the choice of TSEE.
Methods
Time series elastic embedding (TSEE)
Given the time series scRNAseq dataset, the singlecell samples are collected at n time stages {t_{1},t_{2},…,t_{n}}, and for each time stage t_{i}(1≤i≤n), a number of m_{i} single cells are sequenced with the corresponding gene expression vectors \(y_{1}^{(t_{i})},\ldots,y_{m_{i}}^{(t_{i})} \in \mathbb {R}^{D}\), where D is the number of genes for single cells. Thus, the single cell gene expression profile of a total number of \(N=\sum _{i=1}^{n} m_{i}\) cells is contained in the data matrix
TSEE is an extension of EE, a nonlinear dimensionality reduction method [13] by incorporating the information of t_{i}, embedding the N data points in Ddimensional space into the latent ddimensional (d≪D) coordinates of
In a manner that minimizes the pseudo potential energy function as
where the weights \(w_{nm}^{P},w_{nm}^{N},t_{nm} \) are defined as
Among these, N^{+} and N^{−} are normalization factors of weights for the two summation terms which, respectively, are \(N^{+} = \left (\sum _{n,m=1}^{N} {w_{nm}^{+}} \right)^{1}\) and \(N^{} = \left (\sum _{n,m=1}^{N}{w_{nm}^{}} + \beta t(n)t(m) \right)^{1}\); t(n)−t(m) is the time interval between sample y_{n} and y_{m}. TSEE is different from EE by incorporating additional temporal information of t_{nm} in the model.
The pseudo potential energy function of TSEE has two terms on the right hand side of Eq. 1. The first term (attractive term) attracts pairs of data points towards each other, while the second term (repulsive term) separates all pairs of points. The weights \(W^{P}=\left (w_{nm}^{P} \right)\) of the attractive term are defined based on the similarity between data points in highdimensional space (original space) [20], enabling TSEE to place neighboring points in highdimensional space still close to each other in lowdimensional embedding space. The weights of the repulsive term of TESS are composed of two parts. The weights \(W^{N}=\left (w_{nm}^{N}\right)\) are the disparities between data points in highdimensional space (original space), and T^{−}=(t_{nm}) are the time intervals between samples. Thus, in the constructed latent space, TSEE not only preserves the local and global structures of data hidden in original highdimensional space, as EE does [14, 17], but also further preserves temporal structure of data by penalizing placing close together latent points that correspond to data points separated by experimental time intervals. Intuitively, the experimental temporal information of the experimental time of samples reflects, to some extent, cell development stages. In other words, the later a cell is collected in time, the more likely it will be in later developmental stages, and a longer experimental time interval between two cells typically indicates a longer distance between the cells, along the developmental process.
TSEE has two regularization parameters in Eq. 1. Parameter λ trades off the attractive and repulsive terms, while parameter β trades off the composite weights of data disparities in highdimensional space and time intervals on the repulsive term. By fixing λ, a smaller β will have smaller effects by the temporal order/structure of samples, leading to less consideration of temporal information. If β=0, TSEE degenerates into EE, and no temporal information will be incorporated. If fixing β, a smaller λ makes TSEE focus on local structures, determining X_{N×d} based on the attractive term, and when λ=0, TSEE degenerates to Laplacian Eigenmap methods [13]. Choices and robustness analysis of both λ and β on embeddings in latent space are discussed in “Results” section. If either λ or β is too large, the result will either be distortion of local structures or loss of global structures in latent space.
The pseudocode of TSEE is in Algorithm 1.
The numerical implementation of TSEE
In the numerical implementation of TSEE, we adopt the efficient computational method of partialHessian proposed in [19] to find X_{N×d} that minimizes the function E of TSEE (Eq. 1).
For a given symmetric weighted graph (matrix) W=(w_{nm}), the Laplacian matrix graph is defined as L=D−W, where \(D = \text {diagonal}\left (\sum _{m=1}^{N} w_{nm}\right)\) is defined as degree matrix. L is positive semidefinite if W is nonnegative. The gradient of objective function E of TSEE can be represented in terms of Laplacian as
where L is the Laplacian matrix of W=(w_{nm}) with
This optimization problem can be solved by an iterative approach in the form of
where α_{k}>0 is the step size in each iteration that satisfies the linear search principles (e.g., Armijo rule), and \(p_{k} \in \mathbb {R}^{ND}\) is the search direction. In each iteration, p_{k} is determined by
where g_{k} is the gradient at kth iteration, and B_{k} is a positivedefinite matrix to achieve a descent direction satisfying \(p_{k}^{\mathrm {T}}g_{k}<0\). The procedures iterate until certain conditions are satisfied, for example, the iteration terminates when the number of iterations achieves the given constraint, or the distance between x_{k} and x_{k+1} obtained by two adjacent iterations is less than a given threshold.
The partialHessian method relies on the spectral direction based on partial Hessian, the attractive Hessian L^{P}⊗I_{d}, to strike the best compromise between deep descent and efficient computation, where L^{P} is the Laplacian matrix of W^{P}. In order to prevent the problem of singularity, a small μ_{k}I is added to the partial Hessian, ensuring positive definiteness of B_{k}, that is
and μ_{k} is set as 10^{−10} mini,j(L^{P})_{i,j} in this study. Algorithm 2 shows the detailed steps of PartialHessian strategies. The authors of EE [19] demonstrated that the spectral direction obtained, as described above, can be rapidly computed, leading to global and fast convergence. Compared with EE, TSEE adds an additional repulsive factor in the objective function, which results in a simple modification of W and L with no effect on computational performance, making the complexity of TSEE comparable to that of EE. In terms of the largescale Zebrafish dataset with sample size ∼40k, the computational time of EE and TSEE is 38 min and 45 min, respectively.
Datasets
Two publicly available datasets of time series scRNAseq are used in this study. The Zebrafish dataset [6] was obtained across 12 closely spaced stages of early zebrafish development spanning from high blastula stage (3.3 hourspostfertilization (hpf), just after transcription from the zygotic genome begins) to sixsomite stage (12 hpf, shortly after the completion of gastrulation). The number of cells at each time stage ranges from 200 to 7162, comprising a total of 39,505 cells. The UMI count data can be accessed at NCBI GEO (accession no.GSE106587). The human preimplantation embryo (HPE) dataset [5] consists of individually isolated embryonic cells during the human preimplantation process, starting from the 8cell stage at embryonic day 3 (E3) up to the time point just prior to implantation at E7, consisting of a total of 1529 single cells. The data can be downloaded from https://www.ebi.ac.uk/arrayexpress/experiments/EMTAB3929/.
In addition, the scRNAseq dataset of Drosophila embryos [21] is utilized here as a demonstration of the generalizability of TSEE for incorporating spatial information. The highquality data of a total 1297 cells, as defined in [21], were selected. The data are used to reconstruct the spatial positions of cells in the precisely staged embryos. The 84 genes used as predictors of spatial positions of cells in the precisely staged embryos as in [21] are utilized as our spatial genes. The data can be obtained at https://shiny.mdcberlin.de/DVEX/.
Preprocessing of time series scRNAseq data
Prior to the implementation of TSEE, the time series scRNAseq data are preprocessed as follows. First, we select out the most variable genes according to the Zscores of their variations across all samples. Second, the selected gene expression profile is normalized with all values, subtracting the minimum in the matrix, followed by dividing all values by their maximum of the processed matrix, such that the values of the elements of the normalized profile matrix range from 0 to 1. The time points of samples are also normalized to range from 0 to 1 with the initial time stage set as 0 and the final time stage set as 1. Third, the PCA algorithm is utilized to select the top principal components which preserve most variations, and the dimension of components is determined following the procedure in [14]. Finally, the data of top PCA components are further normalized, as described in the second step.
Performance comparisons
We compare TSEE with PCA, tSNE and EE to evaluate their individual performance on visualizing data, preserving local and global structures, and revealing gene expression patterns.
The scatting plots of cells on the 2dimensional embedded space are first displayed to compare their visualization results. A good visualization result should reveal the local and global structures, as well as uncover the latent specific gene expression patterns.
To measure the preservation of structures quantitatively, we adopt 1) the ingroup proportion (IGP) [22] and 2) IGP2, a modification of IGP, to evaluate local structure preservation. In addition, we adopt the Pearson correlation coefficient (PCC) between experimental time and pseudotime obtained by DensityPath [14], to evaluate global structure preservation.
The IGP, defined as the proportion of samples in a group, the nearest neighbors of which are also in the same group [22], can be used to assess the accuracy of distinction of cells on 2dimensional space. In this study, we use temporal information to define groups. For a given dataset X={x_{1},x_{2},…,x_{N}}, consisting of N samples which belong to n groups U={1,2,…,u,…,n}, we define j^{N}= arg mink≠j∥x_{j}−x_{k}∥ as the index of x_{j}’s nearest sample, and we regard Class_{x}(j) as the label for sample x_{j}. Based on the definition above, IGP of group u based on dataset X is defined as
The value of IGP ranges from 0 to 1, and a higher value indicates a better cell distinction between groups.
Because of the heterogeneity of cells, the cells at distinct time stages may still belong to the same cell type, which means that IGP tends to underestimate the performance of cellspecific distinction. Consequently, we weaken the requirements of IGP and propose a new IGP score by defining the proportion of samples in a group, the nearest neighbors of which are also in the same, or adjacent group (adjacent time stages here). The socalled IGP2 of the adapted index is formulated as
In this study, the tSNE algorithm is implemented by the Rtsne function in the R package Rtsne with default parameters, and the EE algorithm is performed according to the code downloaded from http://faculty.ucmerced.edu/mcarreiraperpinan/software.html.
Results
We analyze two time series datasets: Zebrafish [6] and HPE [5] (see “Methods” section for details). To validate the performance of TSEE, we compare it with three other dimensionality reduction algorithms, PCA, tSNE and EE.
TSEE outperforms other dimensionality reduction methods in visualization and structure preservation
We apply the 4 dimensionality reduction algorithms noted above to HPE and Zebrafish datasets and demonstrate the embedding results on the 2dimensional embedding space to visualize cell development structures. TSEE not only preserves the local and global data structures, but also has higher temporal resolution of cells with different time stages (Fig. 1).
More specifically, in the HPE dataset, PCA combines cells from E4 to E7 tightly, grouping cells into two groups visually, one mainly containing cells at E3 and the other filled with cells from E4 to E7, which would affect downstream analysis, such as clustering. The embedded points with various stages obtained by tSNE and EE are both evenly distributed as a whole, but cells from E5 to E7 heavily overlap, with some E7 cells even falling into the area of E5 cells (Fig. 1a). For TSEE, cells at different time stages are well separated so that only cells at adjacent time stages are mixed in agreement with the heterogeneity of cells (Fig. 1a).
For the Zebrafish dataset, all four methods arrange cells along time with relatively large gaps arising in early developmental stages (Fig. 1b). However, PCA and EE results show that cells from 8 to 12 hpf are highly mixed, almost stacked together, even when two or more time intervals exist among them. Although clearly separating points with different time stages, tSNE shows distorted 2dimensional temporal structures in that cells from 3.3 hpf to 4.3 hpf are distributed along time course, while cells at 4.7 hpf are separated with the cells before 4.7 hpf by cells at 5.3 hpf (Fig. 1b). In contrast, for TSEE, the cells from 3.3 to 6 hpf are obviously separated. Even though cells from 7 hpf to 12 hpf are not separated as distinctly as the cells before 6 hpf, significant gaps still exist between cells at adjacent time stages, thus showing the best performance on preserving structures along time (Fig. 1b).
To quantitatively evaluate the performance of TSEE, we apply IGP and IGP2 scores to measure the distinction of cells, regarding time stages as standard of cell groups. TSEE always has highest averaged values of IGP and IGP2 (see Tables 1 and 2), and its performance is similar to that of tSNE based on the two scores across all time stages in HPE and Zebrafish (Fig. 2), indicating that TSEE preserves the local structures and differentiates cells at various time stages better than the three other methods.
To evaluate the preservation of global and temporal structures by the four methods, we measure performance quantitatively by applying DensityPath [14], which has high accuracy in the reconstruction of cell development trajectory, as well as pseudotime calculation, for both datasets. Since our purpose is to investigate the performance of the four dimensionality reductions, we only apply DensityPath to dimensionality reduced data of 2dimensional space by the four methods to reconstruct cell statetransition path and calculate pseudotime. After setting the root as 1472th and 1st cell in the HPE and Zebrafish datasets, respectively, the accuracy of calculated pseudotime is measured by PCC between the calculated pseudotime and experimental time of the cells. A larger value of PCC indicates more consistency in global structure preservation along time. We find that TSEE preserves temporal structure information best, while the PCC value of tSNE on the Zebrafish dataset is negative, indicating that the embedding of cells by tSNE distorts whole structures of data along time (Fig. 3).
The hierarchical clustering algorithm is also applied to measure the preservation of distance of time stage. The complete linkage method is adopted here based on centroids of samples at each time stages, where the calculated distance of the centroids is based on 2dimensional Euclidean distances of the four dimensionality reduction methods separately. Although the adjacent time stages are always close to each other in hierarchical clustering trees based on the results of all four methods (Fig. 4), the clustering tree constructed on the basis of TSEE results shows the best consistency between the distance of centroids and the time order. Specifically, the lengths of leaf nodes (3.3, 3.8, 4.3, 4.7, 5.3, and 6) to the root of the clustering tree decrease with increasing time stage from 3.3 to 6 hpf, and then the lengths of leaf nodes (7, 8, 9, 10, 11, and 12) to the root of clustering tree increase with the increasing time stage from 7 to 12 hpf (Fig. 4). As an indication of cell development processes, the clustering tree result by TSEE indicates that the centroids of cell populations propagate along one direction from 3.3 to 6 hpf and then propagate along another direction from 7 to 12 hpf, which can be further supported by the large gap between 6 and 7 hpf on the 2dimensional space by tSNE, EE, and TSEE (Fig. 1b).
TSEE reveals subtle dynamic patterns of gene expression during zebrafish embryogenesis on 2dimensional space
The largescale Zebrafish dataset with closely spaced stages allows us to further explore dynamic patterns of gene expression during zebrafish embryogenesis.
The power of TSEE to allow visualization of structures that show dynamic patterns of gene expression can first be illustrated through the expression of marker genes HER1 and HER7 on the 2dimensional space by PCA, tSNE, EE and TSEE in Zebrafish data (Fig. 5). HER1 and HER7, for which the existence of oscillation expression patterns by negative feedback has been reported in presomitic mesoderm of zebrafish [23], clearly show oscillatory expression patterns along time stages on the 2dimensional space of TSEE, indicating that TSEE successfully reveals the underlying dynamic patterns of gene expression. On the other hand, PCA only shows two wide strips of highly expressed genes on the whole plot, and tSNE gathers cells with highly expressed genes together without any oscillation pattern whatsoever. Meanwhile EE just shows a narrow region with oscillation expression pattern in early developmental stages, and the cells with highlevel expression are stacked in the latter developmental stages, resulting in low temporal resolution. For TSEE, however, points on the 2dimensional plane clearly demonstrate the oscillation patterns of gene expression with high resolution, validating the effectiveness of TSEE in preserving the intrinsic structures of data with high resolution, while, at the same time, allowing visualization of the underlying dynamic structures of genes.
We also explore the expression patterns of genes related to embryogenesis from the list provided by [6] on the 2dimensional spaces by the four methods to examine the developmental patterns. Here, visualization by TSEE can also show clearer oscillation waves of gene expressions along time compared to the other methods (see Fig. 6 for four examples). PCA tends to distribute the cells with highlevel gene expression uniformly, or accumulate them in the region of late developmental stages. The tSNE methods tends to gather together the cells with high gene expression, while small gaps occur among them, but this only shows meaningful clusters of cells without any evidence of patterns of gene expression. EE displays the products resulting from mixtures of cells at various time stages as the region of highlevel gene expression, where genes are continuously expressed. The three methods place together the regions where the specific genes are highly expressed, failing to uncover the dynamic structures even if some special gene expression patterns exist among all cells. In TSEE, different patterns of gene expression can be revealed. As examples, WNT8A, an essential transcript for zebrafish axis development [24], displays an oscillatory gene expression pattern throughout all samples. TBX6L, which contributes to posterior paraxial mesoderm formation during zebrafish embryogenesis [25], reflects a periodically expressed pattern in the right region. SOX2, identified as an essential transcription factor to maintain selfrenewal or pluripotency [26], reflects oscillation, as well, in the left region in the whole stages. Finally, NOTO displays a periodic gene expression pattern entirely, but fades away gradually along time. More genes with an oscillating pattern if expression can be found at https://github.com/ShaokunAn/TSEE.
Oscillating patterns of gene expression revealed by TSEE may be linked to the genomescale oscillations in DNA methylation during exit from pluripotency [27]. Therefore, our TSEE results provide new insights and perspectives for subsequent analysis of dynamic transitions and regulation mechanisms of key genes.
Parameter choices and robustness analysis of TSEE
Two tuning parameters, λ and β, are found in the pseudo potential energy function E of TSEE (Eq. 1). The parameter choices of λ and β are critical to the embedding results by TSEE. In the implementation of TSEE, we set the default values of both λ and β equal to 10.
To test the robustness of the parameter choices, we first choose the values of each parameter separately in a wide range from 1 to 1000, while keeping the other parameter fixed at default, and calculate weighted mean of IGP scores for both HPE and Zebrafish datasets (Fig. 7). The weighted mean of IGP is the sum of IGPs at each time stage, weighting by the proportion of samples (cells) at each time stage.
When fixing β=10, TSEE achieves the highest weighted mean of IGP at λ=100 in both datasets (Fig. 7a,b). However, the large λ(≥100) tends to separate samples into clusters on the 2dimensional space of TSEE, breaking the structure into discontinuities (Fig. 8a,c), which is inappropriate for continuous embryonic development. Therefore, λ=10 best balances the preservation of local and global structures.
When fixing λ=10, the weighted mean of IGP increases sharply when increasing β from 1 to 10 and tends to be saturated after β≥100 (Fig. 7c,d). Although the global structures of cells on the 2dimensional space of TSEE are stable by varying β from 1 to 1000, TSEE may sacrifice some local structures with large β(≥100) since the repulsive terms are determined by the difference between time stages, while the dissimilarities based on gene expression are seldom considered. For example, the two corner sections at stage E7 and the one at stage E4 become less distinguishable as β grows in HPE data. Therefore, β=10 is sufficient for the incorporation of temporal information data and trades off the weights between gene expression and time stage.
To determine the parameters more precisely, as well as analyze the robustness of TSEE, we further study the performance of TSEE when the parameters are tested in detail in the region from 1 to 100. The three metrics IGP, IGP2 and PCC are employed here. Based on IGP and IGP2, TSEE has similar performance for various β and λ from 1 to 50 in the HPE and Zebrafish datasets, indicating that TSEE is quite robust to the change of parameters on preserving local structures. We calculate PCC for HPE and Zebrafish data to analyze the robustness to parameter choice, as well. Figure 9 demonstrates that TSEE is quite robust for the choices of β, as well as choices λ, when varying them separately in the range from 1 to 50.
Based on all the results above, we set the optimal default values of both β and λ to be equal to 10.
Discussion
In summary, we propose a novel visualization method, TSEE, for time series scRNAseq data. To the best of our knowledge, in the singlecell community, TSEE is the first visualization method that considers additional temporal information.
TSEE enhances resolution of the map by correcting for unknown noise variation. The repulsive force based on time intervals of samples enables TSEE to align cells along time, preserving temporal structures, while the intrinsic structures remain well preserved owing to the incorporation of the attractive term and the repulsive term based on their distance in gene expression space. Figure 10 displays the correction of cells by TSEE. The samples colored in red are those for which the nearest neighboring points are at least two time intervals apart, indicating with high probability that they are misplaced in the 2dimensional plane. The misplaced points by EE are distributed uniformly (Fig. 10a), but when displayed on the 2dimensional space obtained by TSEE (Fig. 10b), they are aligned into their respective time stages, indicating that TSEE corrects the misplacement of samples by EE. Moreover, the misplaced cells by TSEE are generally located in the interface of two time stages (Fig. 10c), which makes sense since cells are typically heterogeneous, especially at late developmental stages. Because of the correction of cells along time, gene expression patterns can be revealed, such as the oscillation process of HER1 and HER7 in Zebrafish data, as demonstrated in the “Results” section. The increasing number of genes with oscillatory expression pattern discovered is further supported by the genomescale oscillations in DNA methylation[27], and the uncovered subtle dynamic structures of time series scRNAseq data, as, for example, through the use of TSEE, can be utilized for further analysis of gene expression.
The visualization tool of the forceddirected layout [15] utilizes the concept of attractive force and repulsive force, as well, to visualize network structures in lowdimensional space. The data tend to be collapsed along branches, covering celltocell heterogeneity, as well as latent gene expression pattern, and the results in lowdimensional space generally need to be adjusted manually. For example, the marker WNT8A demonstrates an oscillating expression pattern in the TSEE visualization result, but the development tree result in the Supplementary File of [6], which was obtained by a handtuned forcedirect layout merely shows a growing tendency along time, and the heterogeneity of cells is hidden because cells collapse along branches.
The computational framework of TSEE can also be potentially extended to incorporate other sources of information (e.g., spatial information) of scRNAseq data. We utilize the scRNAseq data of embryo cells from [21] to demonstrate the extended application based on spatial information of cells. A set of 84 marker genes are considered to uniquely classify almost every position within the embryo [21]. Therefore, we employ the information from the 84 marker genes to quantify the spatial distance of single cells, defining as 1−PCC(x_{i},x_{j}) between samples i and j, where x_{i} and x_{j} are the expression vectors of the 84 marker genes. Figure 11 displays the expression of three markers of dorsal ectoderm in 2dimensional space, as obtained by tSNE, EE, and TSEE, respectively. Compared to the results of EE and tSNE, the highly expressed genes tend to locate along the boundary of the 2dimensional space by the modified spatial TSEE, showing better consistency with the spatial patterns of the genes in the embryo. The accurate reconstruction of spatial location of single cells will be challenging, and this will be an interesting topic for our future study.
Conclusions
In this study, we propose an efficient algorithm, TSEE, for the visualization of time series scRNAseq data. TSEE is an extension of EE by introducing an additional repulsive force term when pairs of data points are collected at distinct time stages, thereby balancing the effects from disparities between samples in highdimensional gene expression space.
To incorporate the temporal information of data, TSEE adds additional terms of the time intervals between samples into the repulsive terms to balance the effects from the disparities between samples in highdimensional gene expression space. In this way, TSEE dilutes the distortions of the assorted sources of variations of the data across time stages and achieves temporal resolution enhancement by preserving temporal order and structure. In addition, TSEE uncovers the subtle dynamic structures of gene expression patterns, as exemplified by oscillating waves in our results, facilitating further downstream dynamic modeling and analysis of gene expression processes. The computational framework of TSEE is generalizable for the incorporation of other sources of information.
Abbreviations
 EE:

Elastic embedding
 HPE:

Human preimplantation embryo
 IGP:

Ingroup proportion
 IGP2:

A modification of ingroup proportion
 PCA:

Principal component analysis
 PCC:

Pearson correlation coefficient
 scRNAseq:

Singlecell RNA sequencing
 TSEE:

Time series elastic embedding
 tSNE:

tdistributed stochastic neighbor embedding
References
Trapnell C, Cacchiarelli D, Grimsby J, Pokharel P, Li S, Morse M, Lennon NJ, Livak KJ, Mikkelsen TS, Rinn JL. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat Biotechnol. 2014; 32(4):381–6. https://doi.org/10.1038/nbt.285.
Qiu X, Mao Q, Tang Y, Wang L, Chawla R, Pliner HA, Trapnell C. Reversed graph embedding resolves complex singlecell trajectories. Nat Methods. 2017; 14(10):979–82. https://doi.org/10.1038/NMETH.440.
Setty M, Tadmor MD, ReichZeliger S, Ange O, Salame TM, Kathail P, Choi K, Bendall S, Friedman N, Pe’er D. Wishbone identifies bifurcating developmental trajectories from singlecell data. Nat Biotechnol. 2016; 34(6):637–45. https://doi.org/10.1038/nbt.356.
Chen J, Renia L, Ginhoux F. Constructing cell lineages from singlecell transcriptomes. Mol Asp Med. 2018; 59:95–113. https://doi.org/10.1016/j.mam.2017.10.00.
Petropoulos S, Edsgard D, Reinius B, Deng Q, Panula SP, Codeluppi S, Reyes AP, Linnarsson S, Sandberg R, Lanner F. SingleCell RNASeq Reveals Lineage and X Chromosome Dynamics in Human Preimplantation Embryos. Cell. 2016; 165(4):1012–26.
Farrell JA, Wang Y, Riesenfeld SJ, Shekhar K, Regev A, Schier AF. Singlecell reconstruction of developmental trajectories during zebrafish embryogenesis. Science. 2018; 360(6392):979. https://doi.org/10.1126/science.aar313.
Butler A, Hoffman P, Smibert P, Papalexi E, Satija R. Integrating singlecell transcriptomic data across different conditions, technologies, and species. Nat Biotechnol. 2018; 36(5):411. https://doi.org/10.1038/nbt.409.
Gao NP, UdDean SMM, Gandrillon O, Gunawan R. SINCERITIES: inferring gene regulatory networks from timestamped single cell transcriptional expression profiles. Bioinformatics. 2018; 34(2):258–66. https://doi.org/10.1093/bioinformatics/btx57.
Rashid S, Kotton DN, BarJoseph Z. TASIC: determining branching models from time series single cell data. Bioinformatics. 2017; 33(16):2504–12. https://doi.org/10.1093/bioinformatics/btx17.
Schiebinger G, Shu J, Tabaka M, Cleary B, Subramanian V, Solomon A, Liu S, Lin S, Berube P, Lee L, Chen J, Brumbaugh J, Rigollet P, Hochedlinger K, Jaenisch R, Regev A, Lander E. Reconstruction of developmental landscapes by optimaltransport analysis of singlecell gene expression sheds light on cellular reprogramming. bioRxiv. 2017. https://doi.org/10.1101/191056. https://doi.org/10.1101/191056. https://www.biorxiv.org/content/early/2017/09/27/191056.full.pdf.
Ding J, Aronow BJ, Kaminski N, Kitzmiller J, Whitsett JA, BarJoseph Z. Reconstructing differentiation networks and their regulation from time series singlecell expression data. Genome Res. 2018; 28(3):383–95. https://doi.org/10.1101/gr.225979.11.
Moon KR, van Dijk D, Wang Z, Burkhardt D, Chen W, van den Elzen A, Hirn MJ, Coifman RR, Ivanova NB, Wolf G, Krishnaswamy S. Visualizing transitions and structure for high dimensional data exploration. bioRxiv. 2017. https://doi.org/10.1101/120378.
CarreiraPerpiñán MÁ. The elastic embedding algorithm for dimensionality reduction. In: Proceedings of the 27th International Conference on Machine Learning (ICML10), June 2124, 2010, Haifa, Israel: 2010. p. 167–74. http://www.icml2010.org/papers/123.pdf.
Chen Z, An S, Bai X, Gong F, Ma L, Wan L. Densitypath: a levelset algorithm to visualize and reconstruct cell developmental trajectories for largescale singlecell rnaseq data. bioRxiv. 2018. https://doi.org/10.1101/276311. https://www.biorxiv.org/content/early/2018/03/05/276311.full.pdf.
Jacomy M, Venturini T, Heymann S, Bastian M. ForceAtlas2, a Continuous Graph Layout Algorithm for Handy Network Visualization Designed for the Gephi Software. PLoS ONE. 2014; 9(6). https://doi.org/10.1371/journal.pone.009867.
Weinreb C, Wolock S, Klein AM. SPRING: a kinetic interface for visualizing high dimensional singlecell expression data. Bioinformatics. 2018; 34(7):1246–8. https://doi.org/10.1093/bioinformatics/btx79.
Wasserman L. Topological data analysis. Annu Rev Stat Appl. 2018; 5(1):501–32. https://doi.org/10.1146/annurevstatistics031017100045.
Mucha PJ, Richardson T, Macon K, Porter MA, Onnela JP. Community Structure in TimeDependent, Multiscale, and Multiplex Networks. Science. 2010; 328(5980):876–8. https://doi.org/10.1126/science.118481.
Vladymyrov M, CarreiraPerpiñán MÁ. Partialhessian strategies for fast learning of nonlinear embeddings. In: Proceedings of the 29th International Conference on Machine Learning (ICML12).2012. p. 345–352.
Vladymyrov M, CarreiraPerpinan M. Entropic affinities: Properties and efficient numerical computation In: Dasgupta S, McAllester D, editors. Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28. Atlanta, Georgia, USA: PMLR: 2013. p. 477–85. http://proceedings.mlr.press/v28/vladymyrov13.html.
Karaiskos N, Wahle P, Alles J, Boltengagen A, Ayoub S, Kipar C, Kocks C, Rajewsky N, Zinzen RP. The Drosophila embryo at singlecell transcriptome resolution. Science. 2017; 358(6360):194–9. https://doi.org/10.1126/science.aan323.
Kapp AV, Tibshirani R. Are clusters found in one dataset present in another dataset?. Biostatistics. 2007; 8(1):9–31. https://doi.org/10.1093/biostatistics/kxj02.
Kageyama R, Niwa Y, Isomura A, Gonzalez A, Harima Y. Oscillatory gene expression and somitogenesis. Wiley Interdiscip Rev Dev Biol. 2012; 1(5):629–41. https://doi.org/10.1002/wdev.4.
Wylie AD, Fleming JAGW, Whitener AE, Lekven AC. Posttranscriptional regulation of wnt8a is essential to zebrafish axis development. Dev Biol. 2014; 386(1):53–63. https://doi.org/10.1016/j.ydbio.2013.12.00.
Morrow ZT, Maxwell AM, Hoshijima K, Talbot JC, Grunwald DJ, Amacher SL. tbx6l and tbx16 are Redundantly Required for Posterior Paraxial Mesoderm Formation During Zebrafish Embryogenesis. Dev Dyn. 2017; 246(10):759–69. https://doi.org/10.1002/DVDY.2454.
Hagey DW, Muhr J. Sox2 Acts in a DoseDependent Fashion to Regulate Proliferation of Cortical Progenitors. Cell Rep. 2014; 9(5):1908–20. https://doi.org/10.1016/j.celrep.2014.11.01.
Rulands S, Lee HJ, Clark SJ, Angermueller C, Smallwood SA, Krueger F, Mohammed H, Dean W, Nichols J, RuggGunn P, Kelsey G, Stegle O, Simons BD, Reik W. GenomeScale Oscillations in DNA Methylation during Exit from Pluripotency. Cell Syst. 2018; 7(1):63. https://doi.org/10.1016/j.cels.2018.06.01.
Acknowledgements
We are grateful to Xiangqi Bai and Ziwei Chen for their helpful discussions.
Funding
Publication of this article was sponsored by NSFC grants (Nos.11571349, 91630314, 81673833), the Strategic Priority Research Program of Chinese Academy of Sciences (CAS) (XDB13050000), the NCMIS of CAS, the LSC of CAS, and the Youth Innovation Promotion Association of CAS. LW would like to thank the Mathematical Biosciences Institute (MBI) at Ohio State University for partially supporting this research. MBI receives its funding through NSF grant DMS 1440386.
Availability of data and materials
The data used in the current study are all publicly available. HPE data can be obtained at https://www.ebi.ac.uk/arrayexpress/experiments/EMTAB3929/, while Zebrafish data in UMI count format can be assessed at NCBI GEO with accession no.GSE106587, and the scRNAseq dataset of Drosophila embryos is available at https://shiny.mdcberlin.de/DVEX/.
The TSEE code is available at https://github.com/ShaokunAn/TSEE, and the oscillation genes discovered by TSEE among the genes related to embryogenesis can be found at https://github.com/ShaokunAn/TSEE/tree/Additionalfiles.
About this supplement
This article has been published as part of BMC Genomics Volume 20 Supplement 2, 2019: Selected articles from the 17th Asia Pacific Bioinformatics Conference (APBC 2019): genomics. The full contents of the supplement are available online at https://bmcgenomics.biomedcentral.com/articles/supplements/volume20supplement2.
Author information
Authors and Affiliations
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Author’s contributions
SA participated in method design and data analysis, wrote the program, and drafted the manuscript. LM conceived the project and participated in method design and writing. LW conceived and designed the study and participated in method design and writing. All authors read and approved the manuscript.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
An, S., Ma, L. & Wan, L. TSEE: an elastic embedding method to visualize the dynamic gene expression patterns of time series singlecell RNA sequencing data. BMC Genomics 20 (Suppl 2), 224 (2019). https://doi.org/10.1186/s1286401954778
Published:
DOI: https://doi.org/10.1186/s1286401954778
Keywords
 Nonlinear dimensionality reduction
 Elastic embedding
 Visualization
 Singlecell RNA sequencing
 Time series
 Cell fate decisions
 Gene expression pattern
 Oscillation
 Ingroup proportion