Dictionary learning allows model-free pseudotime estimation of transcriptomic data

Background Pseudotime estimation from dynamic single-cell transcriptomic data enables characterisation and understanding of the underlying processes, for example developmental processes. Various pseudotime estimation methods have been proposed in recent years. Typically, these methods start with a dimension reduction step, because the low-dimensional representation is usually easier to analyse. Approaches such as PCA, ICA, and t-SNE are among the most widely used dimension reduction methods in pseudotime estimation. However, these methods usually make assumptions about the derived dimensions, which can cause important dataset properties to be missed. In this paper, we suggest a new dictionary learning based approach, dynDLT, for dimension reduction and pseudotime estimation of dynamic transcriptomic data. Dictionary learning is a matrix factorisation approach that does not restrict the dependence of the derived dimensions. To evaluate the performance, we conduct a large simulation study and analyse 8 real-world datasets.

Results The simulation studies reveal, firstly, that dynDLT preserves the simulated patterns in the low-dimensional representation and that the pseudotimes can be derived from it. Secondly, the results show that dynDLT is suitable for detecting genes that exhibit the simulated dynamic patterns, thereby facilitating the interpretation of the compressed representation and thus of the dynamic processes. For the real-world data analysis, we select datasets with samples taken at different time points throughout an experiment. The pseudotimes found by dynDLT correlate highly with the experimental times. We compare the results to other approaches used in pseudotime estimation, or approaches that are methodologically closely connected to dictionary learning: ICA, NMF, PCA, t-SNE, and UMAP. DynDLT has the best overall performance for the simulated and real-world datasets.
Conclusions We introduce dynDLT, a method suitable for pseudotime estimation. Its main advantages are: (1) it is a model-free approach, meaning that it does not restrict the dependence of the derived dimensions; (2) genes that are relevant to the detected dynamic processes can be identified from the dictionary matrix; (3) by restricting the dictionary entries to positive values, the dictionary atoms are highly interpretable. Supplementary Information The online version contains supplementary material available at (10.1186/s12864-021-08276-9).
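The core factorisation described above can be sketched with an off-the-shelf dictionary learner. The snippet below is a minimal illustration, not the authors' implementation: it uses scikit-learn's `DictionaryLearning` on random stand-in data, with `positive_dict=True` mirroring the restriction of dictionary entries to positive values mentioned in point (3). All dimensions and parameter values here are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

# Toy stand-in for a normalised expression matrix: 30 samples x 50 genes.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 50))

# Factorise X into a sparse code (samples x atoms) and a dictionary
# (atoms x genes); positive_dict keeps the dictionary entries
# non-negative, so atoms can be read as additive gene programmes.
dl = DictionaryLearning(
    n_components=5,               # number of derived dimensions (atoms)
    fit_algorithm="cd",
    transform_algorithm="lasso_cd",
    positive_dict=True,           # restrict dictionary entries to >= 0
    random_state=0,
)
codes = dl.fit_transform(X)       # sparse code: (30, 5)
atoms = dl.components_            # dictionary:  (5, 50)
```

In a pseudotime setting, the columns of `codes` would then be inspected for dimensions whose values order the samples, while large entries in the corresponding row of `atoms` point to the genes driving that dimension.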

Supplementary information: Dictionary learning allows model-free pseudotime estimation of transcriptomic data

S1 Simulated data evaluations

Figure S1 shows visualisations of the correlations for the ICA results.

S2.1 Data details
In the paper we present an analysis of 8 real-world time course datasets from different organisms. The datasets stem from Gene Expression Omnibus [3] and ArrayExpress [1]. They include samples from bulk and single-cell experiments (for details see Table 2 in the main paper). 6 datasets contain samples from different subtypes (for details see Table S1).
To obtain datasets with the same number of samples per type, samples are randomly selected for each type such that the number of samples per type is the same for all types and maximal given the data.

Figure S1: Spearman correlation for the sparse code values from ICA for each parameter. Shown are results for five simulation (sub-)pattern/perturbation combinations in rows, for three values of the number of simulated genes exhibiting the pattern (|g_sim|) in columns. The dataset labelled 'Increasing perturbed' is the one with the high-noise-and-zero-counts perturbation; for the other datasets, only a noise perturbation is conducted. For the dataset with an increasing and a fluctuating pattern, the correlations for each subpattern are shown separately; the respective subpattern is given in the row description in italics. The x-axis of each plot shows the number of dimensions and the y-axis the dimension ID. Unlike for DiL, once a high correlation is reached for a certain number of dimensions, it rarely remains high as the number of dimensions increases.

S2.2 Outlier detection and normalisation
Note that this section is taken from our previous publication [4].
In [2], Cleary et al. suggest normalising the data by removing genes for which the sum of counts is above the 99.5th percentile, to "avoid performance statistics that are skewed by few genes with extremely high expression". We adopt this normalisation and perform the same filtering for the samples. Additionally, to avoid a bias between different experiments, each sample is normalised by dividing it by the sum of all counts for this sample.
Subsequently, to provide numerical stability and allow greater interpretability of the dictionary entries, variables are centred to zero mean and scaled to a standard deviation of one. This normalisation cannot be used for the NMF analysis, as it produces negative as well as positive values; for NMF, the values are instead rescaled to the interval [0, 1].
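The preprocessing described in this section can be summarised in a short sketch. The helper below is hypothetical (the function name and the genes x samples orientation are assumptions, not from the paper), but the steps follow the text: percentile filtering of genes and samples, library-size normalisation, per-gene z-scoring, and the [0, 1] rescaling used for NMF.

```python
import numpy as np

def preprocess(counts):
    """Sketch of the normalisation above; `counts` is a hypothetical
    genes x samples matrix of raw counts."""
    # Remove genes (rows) whose total count exceeds the 99.5th percentile.
    gene_sums = counts.sum(axis=1)
    counts = counts[gene_sums <= np.percentile(gene_sums, 99.5)]
    # Apply the same filter to samples (columns).
    sample_sums = counts.sum(axis=0)
    counts = counts[:, sample_sums <= np.percentile(sample_sums, 99.5)]
    # Library-size normalisation: divide each sample by its total count.
    counts = counts / counts.sum(axis=0, keepdims=True)
    # Centre each gene to zero mean and scale to unit standard deviation.
    z = (counts - counts.mean(axis=1, keepdims=True)) \
        / counts.std(axis=1, keepdims=True)
    # Alternative for NMF: rescale to [0, 1] instead of z-scoring.
    nmf_input = (counts - counts.min()) / (counts.max() - counts.min())
    return z, nmf_input
```

Note that z-scoring assumes no gene is constant across samples; a production version would guard against zero standard deviations.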

S2.3 Merging of correlations for data with different subtypes
In the second part of the real-world data evaluation, data with samples from different subtypes is analysed. To obtain one value for each type from the correlation of each subtype with the experimental times, the following steps are performed:
1. For each subtype, the correlation of the experimental times and the sparse code values is computed over all samples belonging to the subtype.
2. To merge all subtype values for each feature into a single value, each of these correlations is scaled by the percentage of samples that belong to the subtype.
3. The resulting values are summed for each feature.
To be considered for this assessment, the following restrictions have to be fulfilled:
• For the feature to be considered:
1a) More than half of the subtype levels have to be measured at 3 or more time points of the entire experiment.
1b) The number of subtypes must not be larger than half the number of total samples.
• For the subtype to be considered:
2a) The number of samples in the subtype is 5 or more.
2b) The number of time points for the subtype is 3 or more.
The reasons for these restrictions are: 1a) in order to measure correlations, the data needs to span several time points; 1b) only features for which there are several subtypes with multiple samples are considered; 2a) only subtypes for which several measurements exist are considered; 2b) only subtypes measured at several time points are considered.
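The merging steps and the subtype-level restrictions 2a/2b can be sketched as follows. The helper is hypothetical (the function name is ours, and weighting by the sample share among the *kept* subtypes is an assumption, since the weighting base is not fully specified above); the feature-level restrictions 1a/1b would be checked before calling it.

```python
import numpy as np
from scipy.stats import spearmanr

def merged_correlation(times, values, subtypes):
    """Hypothetical helper for steps 1-3 above, for one feature.
    times, values: 1-D arrays, one entry per sample;
    subtypes: subtype label per sample."""
    kept = []
    for lab in np.unique(subtypes):
        mask = subtypes == lab
        # Restriction 2a: >= 5 samples; restriction 2b: >= 3 time points.
        if mask.sum() >= 5 and len(np.unique(times[mask])) >= 3:
            kept.append(mask)
    n_kept = sum(m.sum() for m in kept)
    merged = 0.0
    for mask in kept:
        rho, _ = spearmanr(times[mask], values[mask])  # step 1
        merged += (mask.sum() / n_kept) * rho          # step 2: weight
    return merged                                      # step 3: sum

# Two subtypes, five samples each, both perfectly ordered in time:
times = np.tile(np.arange(5.0), 2)
values = np.concatenate([np.arange(5.0), 2.0 * np.arange(5.0)])
subtypes = np.array(["a"] * 5 + ["b"] * 5)
print(merged_correlation(times, values, subtypes))  # 1.0
```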

S2.4 Comparison of pseudotimes among methods
To compare the methods among each other, we performed a correlation analysis of the estimated pseudotimes across all evaluated methods. As for the evaluation of the pseudotimes against the experimental times, the Spearman correlation is used. Figure S2 shows the respective results. Pseudotimes overlap especially for the datasets in which samples stem from one phenotype (E-MTAB-2565 and GSE122380). For the other datasets, a high correlation (e.g. > 0.7) appears only for some methods. One can also notice a difference between the linear methods (dynDLT, ICA, PCA, NMF) and the non-linear methods (t-SNE, UMAP) in some datasets.

Figure S2: Visualisation of the correlation of the estimated pseudotimes among all methods. Shown, for each analysed dataset, are the correlations of the estimated pseudotimes among all methods evaluated. The correlations are highlighted based on their value. Results for the two datasets with samples from one phenotype are shown at the top. Pseudotimes are more similar for these two datasets than for the datasets with samples from multiple phenotypes, when assessed by correlation.
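The pairwise comparison underlying Figure S2 amounts to a Spearman correlation matrix over the per-method pseudotime vectors. A minimal sketch, with purely illustrative numbers (the pseudotime values below are invented for the example, not taken from any dataset):

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical pseudotime estimates for the same four samples
# from three methods (illustrative numbers only).
pseudotimes = {
    "dynDLT": np.array([0.1, 0.4, 0.5, 0.9]),
    "PCA":    np.array([0.2, 0.3, 0.6, 0.8]),
    "t-SNE":  np.array([0.9, 0.5, 0.4, 0.1]),
}

# Pairwise Spearman correlation matrix over all method pairs.
methods = list(pseudotimes)
corr = np.zeros((len(methods), len(methods)))
for i, a in enumerate(methods):
    for j, b in enumerate(methods):
        corr[i, j], _ = spearmanr(pseudotimes[a], pseudotimes[b])
```

Because the direction of a pseudotime axis is arbitrary, a correlation of -1 (as between "t-SNE" and the other two methods here) indicates the same sample ordering, merely reversed.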