We start from an n×J matrix of normalized expression measures (e.g., read counts) for n single cells and J genes or features. Slingshot assumes that the n cells have been partitioned into K clusters, potentially corresponding to distinct cellular states. Although Slingshot can in principle be applied directly to the normalized expression measures, we strongly recommend a dimensionality reduction step before pseudotemporal reconstruction, as Slingshot’s curve-fitting step uses Euclidean (or related) distances, which can misbehave in high-dimensional spaces (cf. curse of dimensionality). Dimensionality reduction can also strengthen signal in the data and help with visualization. We denote the dimension of the reduced space by J′.
Before detailing Slingshot’s two main steps, we introduce some notation. First, denote by X=(X
ij
) the n×J′ reduced-dimensional matrix of gene expression measures, for cells i∈{1,…,n} and dimensions j∈{1,…,J′}. Let \(\left \{\mathcal {C}_{1},\ldots,\mathcal {C}_{K}\right \}\) denote the K cell clusters or states, i.e., disjoint subsets of cells, typically obtained by clustering the cells based on their gene expression measures. We then define a lineage as an ordered set of clusters and let L denote the total number of lineages. For a particular lineage, \(\mathcal {L}_{l}\), denote its length (i.e., the number of clusters in the lineage) by K
l
and the kth cluster by \(\mathcal {C}^{l}_{k}\), for l∈{1,…,L} and k∈{1,…,K
l
}. In particular, \(\mathcal {C}^{l}_{1}\) and \(\mathcal {C}^{l}_{K_{l}}\) correspond, respectively, to the initial and terminal states for the lth lineage. It is important to note that a cluster can belong to multiple lineages and that the ordering of the clusters within a lineage does not strictly determine the final relative orderings of cells in those clusters.
As a given cluster can belong to multiple lineages, so can a cell. We therefore allow cells to have distinct pseudotime values for each lineage they are a part of. The pseudotime value for cell i in lineage l is denoted by \(t^{l}_{i} \in \mathbb {R}_{\geq 0}\); if cell i does not belong to lineage l, i.e., \(i \notin \cup _{k=1}^{K_{l}} \mathcal {C}^{l}_{k}\), then set \(t^{l}_{i} = \emptyset \). The vector of pseudotime values for lineage l is denoted by \(\mathbf {t}^{l} = \left (t^{l}_{i}: i=1,\ldots,n\right)\).
Identification of cluster-based lineages
In its first step, Slingshot identifies lineages by treating clusters of cells as nodes in a graph and drawing a minimum spanning tree (MST) between the nodes, similar to the work of [7, 10]. Lineages are then defined as ordered sets of clusters created by tracing paths through the MST, starting from a given root node. Our method differs however in a number of important respects from those of [7, 10] including the distance measure used for drawing the tree and the (optional) incorporation of biologically meaningful supervision.
Shape-sensitive distance measure between cell clusters
Constructing an MST involves specifying a distance measure between nodes (in this case, cell clusters). We have found that a Mahalanobis-like distance, i.e., a covariance-scaled Euclidean distance, that accounts for cluster shape, works well in practice, but users have the option of specifying any type of distance measure (e.g., Euclidean, Manhattan). Specifically, the pairwise distance between clusters i and j, \(d(\mathcal {C}_{i},\mathcal {C}_{j})\), is defined as
$$ d^{2}(\mathcal{C}_{i},\mathcal{C}_{j}) \equiv (\bar{X}_{i} - \bar{X}_{j})^{T} (S_{i} + S_{j})^{-1} (\bar{X}_{i} - \bar{X}_{j}), $$
(1)
where \(\bar {X}_{i}\) represents the center (mean) of cluster i and S
i
its empirical covariance matrix in the reduced-dimensional space. This is essentially a multivariate t-statistic. By default, Slingshot uses the full covariance matrix of each cluster, allowing us to draw trees that are better covered by and representative of the cells in a dataset. However, in the presence of small clusters, the matrix S
i
+S
j
may not be invertible and we may replace the full covariance matrix with the corresponding diagonal covariance matrix.
Some clustering algorithms return probabilities of cluster membership rather than hard assignments. In these cases, the cluster membership probabilities can be naturally and readily incorporated as weights in most distance measures. For instance, for the Mahalanobis-like distance of Eq. 1, we would compute weighted means and covariance matrices.
Biologically meaningful supervision
Slingshot allows two forms of supervision during lineage identification: initial state and terminal states specification. Like other methods (TSCAN, Waterfall, Monocle 2), Slingshot requires the user to identify the initial cluster or root node. Subsequently, every direct path from this node to a leaf node (i.e., a cluster with only one edge) will be called a lineage. Indeed, all existing lineage inference methods explicitly or implicitly make the assumption that a starting state can be identified by the user: Monocle and Embeddr construct orderings for which the user must select the correct direction and Wishbone and DPT require the user to select an initial cell or group of cells. In the simple case where the MST constructed by Slingshot has only two leaf nodes and one is specified as the root, this results in a single lineage. If an interior (non-leaf) node is specified as the origin, this results in two lineages, one terminating in each leaf node. Clusters with more than two edges will create bifurcations and produce additional lineages.
Additionally, Slingshot optionally allows the user to provide further supervision in the inference of the lineages by selecting clusters known to represent terminal cell states, imposing a local constraint on the MST algorithm. The constrained MST is obtained by first constructing the MST on all non-selected clusters and then connecting each selected cluster to its nearest non-selected neighbor. Such local supervision results in more biologically meaningful lineages for situations where the data can be explained by many possible lineage structures. Identified lineages are by construction consistent with known biology and provide improved stability over less supervised methods. Although terminal state supervision is not required, in many settings researchers do have knowledge of the cell types present in their data and systematically incorporating this knowledge can provide more accurate and stable inference. Importantly, this type of local supervision does not prevent the discovery of novel lineages; it allows the incorporation of specific knowledge of cell clusters, without imposing restrictions on the global branching structure. Ultimately, detecting multiple lineages based on gene expression data is a difficult problem that benefits from such guidance, as we demonstrate in the “Results and discussion” section.
Identification of individual cell pseudotimes
The second stage of Slingshot is concerned with assigning pseudotimes to individual cells. For this purpose, we make use of principal curves [13] to draw a path through the gene expression space of each lineage. As we show in the “Results and discussion” section, principal curves give very robust pseudotimes when there is a single lineage. Multiple lineages demand more care and are handled using the simultaneous principal curves method proposed below. Indeed, just as clusters in the MST may belong to one or more lineages, the cells which constitute these clusters may be assigned to one or more lineages. In principle, we could construct standard principal curves for each lineage separately to arrive at pseudotimes. However, there is no guarantee that these curves would agree with each other in the neighborhood of clusters shared between lineages, so cells belonging to multiple lineages could be assigned very different pseudotime orderings by each curve. Since we assume a smooth differentiation process, this is potentially a violation and may be problematic in downstream analysis.
We therefore introduce a method of simultaneously fitting the principal curves of each lineage, which shrinks the curves to a consensus path in areas where lineages share many common cells, but allows the curves to separate as they share fewer and fewer cells. This ensures smooth bifurcations of the paths. We call the resulting curves simultaneous principal curves, as they are fit by an iterative procedure based on the principal curves algorithm of [13]. When there is only a single lineage (L=1), the pseudotimes of Slingshot are found by the standard principal curves algorithm, except that the initial curve is based on the lineage’s path through the MST found in the first stage (see below for details), rather than the first principal component. Additional file 1: Figure S19 illustrates the main steps in the simultaneous principal curves algorithm.
Standard principal curves algorithm. We first review the standard principal curves algorithm of [13] (for a single curve) in order to be clear about how we adapt it for simultaneous principal curves. After specification of an initial curve, the algorithm iteratively follows these steps:
-
1. Project all data points onto the curve and calculate the arc length from the beginning of the curve to each point’s projection. Setting the lowest value to zero, this produces pseudotimes.
-
2. For each dimension j, j∈{1,…,J′}, use the cells’ pseudotimes to predict their coordinates, typically with a smoothing spline. This produces a set of J′ functions which collectively map pseudotime values in \(\mathbb {R}_{\geq 0}\) into \(\mathbb {R}^{J'}\), thereby defining a smooth curve in J′ dimensions.
-
3. Repeat this process until convergence. We use the sum of squared distances between cells’ actual coordinates and their projections on the curves to determine convergence.
We note that these curves use the unit-speed parameterization, meaning that a principal curve defined by \(\mathbf {c}(t):\mathbb {R}_{\geq 0} \rightarrow \mathbb {R}^{J^{\prime }}\) will satisfy ||c′(t)||=1 at all points t in the domain of c. This property ensures the equivalence between arc length and pseudotime mentioned in Step 1.
In order to characterize multiple branching lineages, we modify the iterative principal curves algorithm in two ways: by incorporating cell weights representing their assignment to particular lineages and by adding a shrinkage procedure to ensure smooth branching events. The cell weights are added in Step 1, with each cell’s weight for a given lineage being based on its projection distance to the curve representing that lineage. The shrinkage is performed in Step 2, by first recursively constructing an average curve for each branching event, then recursively shrinking the branching lineage curves toward this average. Thus, as with the individual lineage curves, each average curve is a function of pseudotime and can, itself, be averaged and shrunk.
In the case of branching lineages, where L is the total number of lineages (i.e., terminal states), our goal is to infer, for each lineage l∈{1,…,L}, a vector of pseudotime values, \(\mathbf {t}^{l} = \left (t^{l}_{i}: i=1,\ldots,n\right)\), and a vector-valued function, \(\mathbf {c}_{l}: \mathbb {R}_{\geq 0} \rightarrow \mathbb {R}^{J^{\prime }}\), for the associated curve in the low-dimensional space.
Average curve construction for each branching event. Average curves are constructed in a recursive manner, from the latest branching events to the earliest (i.e., from the leaves of the MST to the root). Consider a branching event involving M curves (typically M=2), \(\mathbf {c}_{m}: \mathbb {R}_{\geq 0} \rightarrow \mathbb {R}^{J'}\), m=1,…,M, which may either be individual lineage curves (for branching events leading to leaf nodes) or average curves (for lineages differentiating in later branching events). The average curve is simply defined as
$$ \mathbf{c}_{\text{avg}}(t) \equiv \frac{1}{m} \sum_{m=1}^{M} \mathbf{c}_{m}(t), $$
(2)
for values of t in the domains of each curve being averaged. Because all lineages share the same root cluster, we ensure that the starting point of all average curves (cavg(0)) will be identical, as will the starting point of the resulting shrunken curves. For branching events leading to leaf nodes, the curves being averaged will be individual lineage curves, whereas earlier branching events may also involve averaging average curves.
This recursive procedure ensures that the average curves constructed for early branching events are blind to the number of lineages ultimately produced by each branch. Without this condition, an early bifurcation event between a lineage that ends in a single terminal state and another that gives rise to multiple terminal states would produce an average curve that was strongly biased toward the latter branch.
Shrinkage for each branching event. Next, we perform shrinkage for each branching event, bringing the branching curves into better agreement in regions of shared cells. Unlike the construction of the average curves, this recursive process starts with the root and works out toward the leaves, meaning that the earliest branching event is the first to be shrunk. Let cavg denote the average curve for a given branching event and let c
m
denote one of the M curves to be shrunk at this event. Again, it may be the case that the curves being shrunk together at this step represent single lineages or averages of other curves. To determine how much each curve is shrunk toward the average, we construct curve-specific weighting functions \(w_{m}: \mathbb {R}_{\geq 0} \rightarrow [0,1]\), as detailed below, with the constraint that w
m
must be non-increasing. Additionally, by specifying that w
m
(0)=1 (representing the maximum amount of shrinkage), we ensure that diverging curves always share the same initial point. These weighting functions allow us to shrink the diverging curves toward the average curve with the additional update in Step 2:
$$ \mathbf{c}_{m}^{\text{new}}(t) \equiv w_{m}(t)\mathbf{c}_{\text{avg}}(t) + (1-w_{m}(t))\mathbf{c}_{m}(t). $$
(3)
If all the weighting functions are smooth, this shrinkage step ensures that the final curves will follow a tree structure with smooth branching events.
Weighting functions for each branching event. Slingshot’s default weighting functions satisfy these conditions, i.e., are smooth, non-increasing, and take on a value of one at the origin. They are based on the distribution of pseudotimes for cells shared between the lineages corresponding to the branching curves. Specifically, for a branching event involving lineages {1,…,H}, we define a set of shared cells \(\left \{i: t_{i}^{1} \neq \emptyset,\ldots, t_{i}^{H} \neq \emptyset \right \}\) and, for one of the M curves c
m
to be shrunk at this event, we let \(t_{min}^{m}\) and \(t_{max}^{m}\) denote, respectively, the lowest and highest non-outlier pseudotimes for these cells along the curve (where outliers are defined by the 1.5 IQR rule of boxplots, where IQR stands for inter-quartile range). The weighting function for curve c
m
is then defined as:
$$ w_{m}(t) \equiv \left\{\begin{array}{ll} 1, & 0 \leq t < t_{min}^{m} \\ 1 - F_{K}\left(\frac{t}{t_{max}^{m}-t_{min}^{m}}-\frac{1}{2}\right), & t_{min}^{m} \leq t \leq t_{max}^{m} \\ 0, & t > t_{max}^{m} \end{array}\right., $$
(4)
where F
K
is the cumulative distribution function of a standard cosine kernel with a bandwidth of \(\frac {1}{6}\) (which places weight only on values between \(-\frac {1}{2}\) and \(\frac {1}{2}\)). The final curves are highly robust to the choice of kernel function (Additional file 1: Figure S20).
In both the single and branching lineage cases, final pseudotime values are derived from each point’s orthogonal projection onto the curves. In the latter case, assignment of cells to lineages is determined by cell weights, which are calculated in Step 1 of the algorithm, using a cell’s projection distance to each lineage curve. Cell weights may be useful in downstream analyses, such as the identification of genes that are differentially expressed along and between lineages.
Cells belonging to multiple lineages will have multiple pseudotime values, but these values will agree quite well for cells positioned before the lineage bifurcation. This is because all curves share a common point of origin, c
m
(0), and the weighting functions w
m
, which determine the amount of shrinkage, assign unit weight to the origin (complete shrinkage) and decrease smoothly throughout the neighborhood of shared cells. Therefore, in the region prior to the bifurcation, shrinkage forces the curves to closely follow their average curve and the pseudotimes obtained by projection onto these curves will be highly similar. Additional file 1: Figure S19e shows the improved agreement between simultaneous principal curves as compared to separate, standard principal curves.
Initialization of simultaneous principal curves algorithm. As mentioned above, we initialize the algorithm using the MST from the first stage. Specifically, for each lineage, we start with the piecewise linear path through the centers of the clusters contained in the lineage (in contrast, standard principal curves are initialized by the first principal component over all cells). Starting with the path through the cluster centers allows us to leverage the prior knowledge that went into lineage identification as well as to improve the speed and stability of the algorithm, though in practice, the two procedures typically converge to very similar curves.
Datasets
We demonstrate the performance of Slingshot by applying it to three previously published single-cell RNA-Seq datasets, each with a different number of terminal cell types.
HSMM dataset. The first dataset is a subset of the data used in [3], which assayed 271 human skeletal muscle myoblasts (HSMM) in order to study their development into mature myotubes. This is an example of data with only a single lineage. In their analysis, [3] identify a population of contaminating interstitial mesenchymal cells, which we omit from the dataset. This results in a sample of 212 cells believed to form a single, continuous developmental lineage. For our analysis, we used the cluster labels generated by Monocle as well as a set of labels obtained via k-means clustering for a range of values of k, and, as in the original paper, we represented the data in two dimensions obtained by ICA. The normalized data were downloaded from the NCBI GEO database (accession GSE52529).
qNSC dataset. The second dataset comes from [10], who assayed 132 hippocampal quiescent neural stem cells (qNSC) and their immediate progeny from adult mice, cells known to be involved in neurogenesis. Their goal was to assess cellular heterogeneity among this population and analyze continuous-time developmental dynamics. After removing a few outliers, their analysis focuses on 101 cells believed to represent the development of qNSCs into intermediate progenitor cells (IPC), a transitional state between qNSCs and mature neurons. However, they note an additional cluster of 23 cells branching off of this lineage, potentially representing an alternative terminal cell type. As in the original paper, we used the hierarchical clustering labels and the first two principal components as the reduced-dimensional space. Rather than focus solely on the IPC lineage though, we characterized the developmental trajectory of both proposed cell fates. The normalized data and code for preliminary analysis were downloaded from GEO (accession GSE71485).
OE dataset. The third dataset is that of [26], featuring 616 cells from the adult mouse olfactory epithelium (OE), tracing the development of quiescent stem cells into three unique terminal cell fates. The primary lineage maps the development of horizontal basal cells (HBC) into mature olfactory sensory neurons (mOSN) via a series of intermediate states. The secondary lineage gives rise to the support sustentacular (mSus) cells of this system and features fewer identifiable intermediates. A third lineage, which appears to split from the neuronal path, leads to a cluster of microvillous (MV) cells. Again, we relied on the cluster labels of the authors, who used a resampling-based sequential ensemble clustering (RSEC) approach [22], and represent cells by their coordinates along the first five principal components. The normalized data and cluster labels are available from GEO (accession GSE95601).
Simulation study
Simulation parameters. In order to examine the performance of Slingshot and other methods in a wide range of scenarios, we performed a simulation study using the Bioconductor R package splatter [28] to produce artificial single-cell RNA-Seq datasets. Many parameters can potentially be tuned to generate these datasets, including parameters determining the distribution of mean gene expression, library size, outlier expression, drop-out, and the biological coefficient of variation. In order to make our simulation study as realistic as possible, we used a published dataset [3] to learn properties of the marginal distributions of the expression measures for both genes and samples.
In the first part of the study, simulated datasets consisted of two branching lineages (Fig. 4a). The number of cells n was varied from 120 to 1500, by increments of 60 cells. Additionally, we adjusted the signal-to-noise ratio by varying the probability of a gene being differentially expressed (DE) along a path between 0.1 (weak signal) and 0.5 (strong signal), by increments of 0.1. For each combination of sample size and DE proportion, we simulated 10 datasets, for a total of 1,200. In the second part, simulated datasets consisted of five branching lineages (Fig. 4c). The number of cells n was varied between 220 and 1,320, by increments of 220. The DE proportion was varied between 0.1 and 0.5, as in the two-lineage setting. Since all methods under consideration can accommodate non-linear paths, the probability of non-linear DE patterns was set to 0.5, meaning that half of all DE genes’ true average expression level varied according to a non-linear function of pseudotime.
Clustering. We examined Slingshot’s robustness to the choice of clustering method by performing hierarchical clustering, k-means clustering, and Gaussian mixture modeling (GMM), to obtain K=3 to 10 clusters on the three-dimensional representation of each simulated dataset obtained by PCA. Fixing the dimensionality reduction technique allows us to focus on the effects of the clustering method for the dimensionality reduction technique used. In order to alleviate the potential impact of outliers, whenever any method identified a cluster consisting of 4 cells or fewer, that cluster was removed and the method was re-run on the remaining cells.
For the purpose of comparing Slingshot with other lineage inference methods, we again used the top three principal components and set the clustering technique to be the Gaussian mixture model which minimizes the Bayesian information criterion (BIC). This is the default behavior of the mclust R package [32] and similar to the approach taken by TSCAN, which uses a variable number of principal components inferred from the data.
Evaluation. Methods were evaluated according to the agreement between inferred and true pseudotime variables for each lineage, as measured by the Kendall rank correlation coefficient. The Kendall rank correlation coefficient assesses the ordinal association between inferred pseudotimes and true pseudotimes, making it more robust to outliers and non-linearity than the Pearson correlation coefficient. We use a slight variant of this measure designed to reflect errors in the assignment of cells to lineages. Defining \(\mathcal {S}_{0}\) as the set of cells along a true lineage and \(\mathcal {S}_{1}\) as the set of cells along an inferred lineage, we calculate:
$$ \tau \equiv \frac{(\text{\# of concordant pairs})-(\text{\# of discordant pairs})}{{|\mathcal{S}_{0} \cup \mathcal{S}_{1}| \choose 2}}, $$
(5)
where concordant and discordant pairs are defined strictly, not allowing for ties. Hence, only cells belonging to both the true and inferred lineages (i.e., in \(\mathcal {S}_{0} \cap \mathcal {S}_{1}\)) contribute to the numerator. Cells which are along the true lineage (i.e., elements of \(\mathcal {S}_{0}\)) and not assigned a pseudotime by the inferred lineage (not in \(\mathcal {S}_{1}\)) will only contribute to the denominator, bringing τ closer to 0. Similarly for extraneous cells which are included in \(\mathcal {S}_{1}\) but not in \(\mathcal {S}_{0}\).
For each true lineage, we take the maximum τ over all inferred lineages and average these values within a single dataset. This produces a bias in favor of methods that identify many, potentially spurious lineages, as there will be more values over which to take the maximum. We do not correct for this bias, but simply note that Monocle 2 and DPT-Full are the methods which seem to benefit the most from it.