Methanol fixation does not affect nucleic acid integrity and preserves cell-to-cell similarities consistent with scRNA-seq technical variability
First, we wanted to determine whether there is any obvious degradation of RNA or changes to the transcriptomic profile caused by methanol fixation. To do so, we performed methanol fixation on two cell lines, HCT-116 and HepG2, such that any cell-type specific fixation effects can also be observed and compared across cell types (Fig. 1 A; for within cell-type comparisons, the result of the HCT-116 cell line is shown here for illustrative purposes. Results are consistent for both cell lines studied (Supplementary Figures 1, 2, 3, 4, 5, 6, 7). For both cell lines, we prepared RNA-seq libraries from live cells, as well as from fixed cells that were stored in methanol for one-week. We measured the size of single-cell cDNA libraries (Fig. 1B) and noted that although no significant change in fragment size distribution was observed for fixed cells, there is a slight decrease in the quantity of cDNA in the 1500-2000 bp. This result shows that fixation can largely preserve the RNA integrity such that high-quality cDNA can be obtained without severe degradation. After sequencing, raw data from both live and fixed cells were shown to be of high quality and suitable for further analysis; a small but observable reduction in mapping rates was observed in some of the fixed cells compared to live cells, but the mapping rate for all libraries are well within the typical range (Supplementary Figure 8 A-B).
Next, we performed a more detailed bioinformatic analysis to compare the transcriptomic profile between those samples. Since the cells subject to fixation were harvested from the same culture as the non-fixed cells, biological variation between the two datasets is expected to be small. If the methanol fixation indeed does not result in any significant changes to the RNA profile, then the correlation between the live and fixed transcriptomic datasets should be high, and comparable to within-dataset correlations. To validate this hypothesis, we first randomly selected three cells from each of the live and fixed datasets and made scatter plots to visualize the pairwise similarity between single cells at the gene level (Fig. 1 C). Indeed, scatter plots look as expected, with high expression genes between single cells correlating closely while low expression genes are more broadly dispersed, with generally good correlation across all genes [24]. We also calculated Pearson correlation coefficients for each pair. As expected, the r2 values are consistently high for both cells compared within live or fixed datasets, and between live and fixed datasets. These r2 values are also comparable to those found in other published single-cell cross-correlation analyses [25]. To further confirm these results, we then calculated the pairwise correlation for all the cells we profiled, visualizing the results in a heatmap (Fig. 1D). Overall, the correlation between all cells is high, between 0.7 and 0.9. The annotation bar indicates the label of each cell, live or fixed, and the intermixing of labels indicates that the degree of correlation is not clustered by sample type, suggesting that the methanol fixed cells do not show a major difference from the live cells. These results show preliminarily that methanol fixation does not result in any obvious changes to the transcriptomic profile of single cells.
Methanol fixation does not affect cell-type identification, clustering, and biological inference
We found that methanol fixation does not dramatically change single-cell RNA transcriptomic profiles, but scRNA-seq is most commonly used to perform cell-type identification and clustering, therefore we further explored our data using classification methods to ensure fixation does not affect these types of analyses and downstream biological inferences. Principal component analysis (PCA) is a commonly used technique in single-cell RNA-seq analysis [26]. It identifies the coordinate system that represents the greatest variance in the data, and projecting data points in this new coordinate system, thus is able to visualize the differences between groups of data points and cluster similar data points together. To see whether single cells could be grouped by their fixation treatment, which would indicate that there is the variance between the two treatment groups, we applied PCA on our data and checked the first several principal components (PCs) for separation between groups of cells. We found that the top three PCs show meaningful separations (Fig. 2 A): The first PC, which represents the greatest degree of variance, separates cells according to their cell type; the second PC appears to correlate with the cell cycle phase of each cell, and after normalizing for cell cycle effects, we observe that cells become clustered by their treatment condition (Fig. 2 A middle and bottom rows). This suggests that among all the factors for cell classification, inherent differences in cell type remain the most prominent, and when performing cell clustering analysis, any significant biological differences between cell types are unlikely to be obscured by the effects caused by fixation.
To determine the specific genes and possible pathways that are responsible for the separation between live and fixed cells, we performed PCA on each cell type separately, and as expected in this analysis PC1 showed separation between cells according to cell cycle phase whereas PC2 was by treatment conditions (Fig. 2B). We then extracted the top 500 highly variable genes from PC1 and PC2 in each cell line and performed Gene Ontology (GO) Analysis [27] on these genes (Fig. 2 C) (Supplementary Table S1). High contribution genes from PC1 correspond to biological pathways involved in cell cycle processes and control for both cell types analysed, which is expected based on our previous analysis. Genes that are heavily loaded in PC2, which separate the cells by their fixation treatment, did not correspond to any known biological pathways in GO. This result suggests that the separation between live and fixed cells is likely not regulated by any specific biological mechanisms, but rather by technical factors.
The biological complexity of true tissue samples is much greater than a cell line, thus, to verify our findings, we also re-analysed published live and fixed scRNA-seq data generated from primary peripheral blood mononuclear cell (PBMC) [28]. This published PBMC dataset consists of cells under different conditions: live, fixed for 3 h and fixed for three weeks. Our re-analysis of this dataset shows data generated after fixation is able to preserve the gene features for all subtypes recognized in live cell data (Supplementary Figure 9). In addition, for all subtypes of cells in PBMC, the proportion of each cell type is consistent across data generated from all conditions, which indicates that methanol fixation does not alter cell capture efficiencies in a cell-type specific manner (Supplementary Figure 10).
Genes that drive live and fixed separation show greater variation in expression level
To explore the PCs with the strongest variation in more detail, we studied the statistical features of the top 500 loading genes in PC1 and PC2. Two sets of genes from both PCs were extracted and their relative expression abundances were studied. Specifically looking at those genes with high loading in PC2 that are responsible for the separation of live and fixed groups in this PC, we compared their average expression between live and fixed cells and found that the key difference is that low-expression genes are generally less detected or less expressed in the fixed cells (Fig. 3 A). We do not observe this phenomenon with genes from PC1 (cell cycle), indicating that this is unlikely to be caused by any technical limit of detection (LOD) – a compromised LOD would affect all low-expression genes in the sample and therefore would appear in both PCs, which is not the case. In addition to the changes to the mean expression level of low-expression genes, we also observe differences in the variability of the gene expression level when comparing the genes from the two different PCs (Fig. 3B). The coefficient of variation (CV) across cells of the gene expression level for genes contributing to PC1 (cell cycle) is comparable between the fixed and live groups, suggesting that cell-cycle related genes are detected with similar consistency in each cell population regardless of the treatment condition. Genes contributing to PC2 (fixation effect), however, show notably higher variation in fixed cells than in live cells (Fig. 3B bottom panel). These results suggest that the effect of methanol fixation could be specific to those genes. The interpretation of this is that methanol fixation does not result in consistent signal lost for the whole transcriptome, but rather stochastically across all cells for genes involved in PC2 separation. Therefore, genes separating PC2 may share common features that make them specifically affected once fixed. Thus, the discrepancy between live and fixed cells is likely not due to any biological process of the cell that is induced by methanol treatment.
Since scRNA-seq is known to exhibit so-called “dropout” events in gene detection [2, 24, 25, 29,30,31], we wondered if fixation exaggerates this phenomenon. To better evaluate the dropout frequency over the entire transcriptome, we set a series of increasing gene expression level thresholds for defining detected genes. For each threshold, we used boxplots to visualize the number of genes with expression levels greater than this threshold (Fig. 3 C). As expected, when the threshold for gene filtering is low, live cells have more genes detected overall; but somewhat surprisingly, as the gene expression threshold gradually increases, a greater number of genes is detected in fixed cells. This result shows that fixed cells tend to have more dropout events for low expression genes but retain higher expression genes more robustly. We further illustrate this by extracting genes with either high or low expressions (gene expression (TPM) > 30 high or < 5 low), and for each group, visualizing the relative correlation between the mean expression level for each gene (Fig. 3D). The result shows low expression genes are more abundant in the live group than the fixed. The inset graph shows the quantitative comparison of gene numbers above or below the diagonal line. The trend was reversed for highly expressed genes that their expressions are more abundant in fixed cells. Based on these results, we concluded that the frequency of dropout and the relative quantitative expression are different between live and fixed cells. And the methanol treatment differentially affects genes with different expression levels.
Longer and higher GC transcripts are more severely affected by fixation
We sought to find features that are shared among those genes that are most affected, however, features other than abundance can only be described for transcripts, not genes. Abundance measurements at the gene level represent the contribution from multiple transcripts, potentially of widely varying lengths and sequence properties. Therefore, subsequent analyses used transcript level abundances to shed light on potential molecular features or mechanisms that lead to certain types of transcript molecules being affected more by methanol fixation.
First, we observed that the overall GC content of sequenced reads in the fixed cells was significantly higher than in live cells (Supplementary Figure 8 C). There have been no reports of direct methanol-induced conversion of adenosine and thymine to guanosine and cytosine, therefore it is unlikely that this increase in GC content is due to direct amination. Second, we noticed that the peak sizes of the cDNA libraries slightly differed between live and fixed samples and wondered whether fixation could be causing 3’ degradation of RNA molecules leading to changes in length. Thus, we performed further comparative analyses of transcript length and GC content between different treatment conditions. To visualize the GC and length level of specific transcripts against the rest of transcriptome, we sorted all transcripts by their length and GC content and made rank-order plots. In these plots, each dot can be located by a gene’s feature and its corresponding rank, in the increasing order. In the GC content plot, we highlighted top contribution genes from PC1 (cell cycle) and PC2 (fixation effect) using coloured dots, while remaining transcripts are plotted in grey (Fig. 4 A). Compared with PC2, PC1 genes have more even distribution along with the line plot compared to those from PC2. Most PC2 transcripts are restricted to the higher GC content part, which indicates that transcripts separating fixed cells from live ones have higher GC base-pairs in the sequence in general. A similar pattern was revealed when the same analysis was done for transcript length (Fig. 4B). To compare the length and GC content of transcripts from both groups, p-value was calculated for each using T-test, and a statistically significant difference was found between genes contributing to PC1 and those contributing to PC2 (Fig. 4 C). The fixation effect is more prominent for long and high GC transcripts, which are features of transcripts that are causing non-biological separation between live and fixed cells.
To visualize how transcript features correspond with the fixation effect an individual receives, we compared relative expression level and transcript detection number. For abundance comparison, we separated transcripts into 16 groups with equal size according to length (6 plots with increasing order of length were selected) (Fig. 4D, Supplementary Figure 6). We compared relative expression by correlation plot, and the comparison pattern differs as transcript length varies. Then for each group (16 in total), we counted transcript number above or below the diagonal line, which stands for if a transcript holds higher expression in live or fixed cells, to compare the number of transcripts that are enriched in either group (Fig. 4E). The gradually changing trend illustrates that shorter transcripts are more enriched in the fixed group, yet longer transcripts have more equal expressions for both groups. The transcripts detection number shows similar variations (Fig. 4 F). For shorter fragments, more transcripts are detected in fixed cells, whereas longer fragments are detected more frequently in live cells. These analyses suggest that transcripts receive different degrees of effects result from different lengths and GC contents. And transcripts with longer lengths and higher GC contents are more likely to be affected by methanol fixation.
Transcripts that are both long and GC-rich are the most affected by fixation
Thus far, we have shown that fixation differentially affects transcripts with different GC content and length, based on analysis of transcript abundance quantification, which is calculated from aligned read counts. To further shed light on the potential mechanism behind this effect, we next assessed transcript coverage for clues. Since Smart-seq2 is based on template switching after poly-A tail selection, we postulated that if the fixation effect is due to non-enzymatic RNA degradation, random nucleophilic attack to the 2’ hydroxyl of the RNA should result in random termination of transcripts, and we should observe loss of coverage from the 5’ end in fixed samples as compared to live samples. Thus, we separated transcripts into 10 groups with equal size based on length and performed mapping coverage analysis across the length of the transcript (Fig. 5 A, Supplementary Figure 7). Overall, the coverage for shorter transcripts is very similar between fixed and live samples; but in the longest transcripts, the coverage of bases that are greater than 1500 bp away from 3’ is dramatically lower in fixed samples (Fig. 5 A bottom right).
Next, we directly compared the coverage in transcript groups with different lengths or GC content and studied how the variance changes from 3’ to 5’ end. We computed the difference of coverage depth (coveragelive – coveragefixed) between live and fixed cells along the length of the transcript and plotted this difference by the transcript GC content (Fig. 5B, top) and by transcript length (Fig. 5B, bottom). We observed longer transcripts to have more discrepancy in coverage pattern once fixed (Fig. 5B, top). The same effect is observed for GC content (Fig. 5B, bottom). Except for in the longest transcripts, we did not observe the severe 3’ bias that is expected in degraded RNA. In conclusion, the coverage patterns of transcripts are affected by fixation, and this effect is stronger in transcripts that are longer and higher in GC content.
To further quantify the mapping completeness, we counted per-base mapping depth for each individual transcript and calculated a mapping ratio ranging from 0 to 1, and then plotted these ratios on a scatter plot to compare this ratio between live and fixed samples (Fig. 5 C). By highlighting the top 10 % longest and shortest transcripts, we can see that the shortest transcripts have a high degree of correlation between the live and fixed, whereas the longer transcripts show a skew towards the live axis, indicating higher mapping depth in the live group. The GC content plot also shows that high GC content transcripts have higher mapping depth in live cells, while low GC content transcripts seem to have higher mapping depth in fixed cells. To determine if these two factors co-occur, we get a quotient of mapping percentile for each transcript from live and fixed groups and binned those numbers into 10,000 groups by 100 × 100 combinations of transcript length and GC content. We plotted a heatmap using a relative mapping ratio, in which each axis is arranged by increasing length or GC content (Fig. 5D). The corner representing transcripts that are both long and are high in GC content shows higher mapping percentiles in live cells, which again indicates that the effect of methanol fixation effect is more severe for these transcripts that are both long and GC-rich.
Although the fixation effect can be revealed in PCA clustering, it is presented as the second largest variation. Since we concluded that the fixation separation was mainly caused by long and GC-rich transcripts, we then explored whether the fixation effect exhibited in the first PC could be revealed by only selecting those long or GC-rich transcripts. To do so, we binned all transcripts by their GC-content into five bins with increasing GC-richness; we also binned all transcripts by their length into five bins with increasing length. We then performed PCA using increasingly higher threshold cut-off for both GC and length, meaning the set of transcripts used for performing PCA were increasingly restricted to those that are long and have very high GC content. As the threshold for GC and length increased, the amount of variation between live and fixed groups that is explained by PC1 also increased (Fig. 6 A), compared to when all transcripts are included, PC1 predominantly shows variation arising from differences in the cell cycle. This is presumably due to the increased weight of transcripts with long length and high GC content that contributes to the separation of cells caused by fixation. To show this more clearly, from each PCA performed using different transcript length-GC selection thresholds, we plot PC1 loadings annotated by treatment condition, in which we observed that although the magnitude of separation is reduced when fewer high length-GC transcripts were used, the separation of fixed and live cells becomes less ambiguous in PC1 (Fig. 6B). We then identified the top 100 most highly loaded transcripts in PC1s and visualized their length and GC content. As the threshold for GC and length increased, the length and GC content of the top loaded transcripts also gradually increased (Fig. 6 C). This series of analyses illustrate that longer and higher GC transcripts indeed separate cells based on fixation, indicating that such types of transcript molecules are more likely to be affected by fixation during sample preparation.
The methanol fixation effect is less prominent if only sequencing the 3’ end
Since Smart-seq2 produces full-length cDNA libraries, the uneven influence received in different transcripts can be observed in mapping coverage and gene quantification. On the other hand, protocols preserving only the 3’ end of cDNA library such as Drop-seq may be less affected by methanol fixation [32], since only one end of the transcript is counted during quantification and analysis. Therefore, even if the RT and amplification are hindered by changed mRNA structure, the fixation effect will be less significant than that seen in Smart-seq2 data.
To validate our hypothesis, we first simulated Drop-seq data by mapping reads generated in Smart-seq2 to the 3’ end of transcriptome reference. By doing so we can produce a dataset resembling the features of Drop-seq since both methods are based on template switching strategy but Drop-seq only preserving 3’ end of the library. In the PCA plotted with 3’ end mapped data, although we still observe the distinction between cells processed with different treatments, the separation is much less clear (Fig. 7 A), and the two clusters appear merged into one. Compared with full-length Smart-seq2 data, the simulated 3’ biased data shows little fixation effect. To examine if the merging of two clusters is caused by different alignment procedures, we also mapped original reads to the 5’ end of the transcriptome. When performing PCA with 5’ end mapped data, the distance between two clusters enlarged and the separation is distinct (Fig. 7B). The result shows data similarity between live and fixed cells differs between 3’ and 5’ end of the transcriptome.
We also used published data to explore the fixation effect in a real Drop-seq experiment [14]. Three sets of data generated from the HEK cell line were analysed, which included live cells, fixed cells, fixed cells with three-week storage. In the PCA including live and fixed cells, although the separation still exists, two clusters are partially merged, which indicates the fixation effect is much weaker in Drop-seq compared with Smart-seq2 (Fig. 7 C). When we performed PCA using cell groups that are both fixed but with different storage duration, we saw once cells are fixed before sequencing, the similarity of their transcriptomes is high (Fig. 7D). This result indicates fixation effects are consistent and not affected by the cell storage time.
This result validates our hypothesis that fixation can affect data differently according to the methods used for library construction. For protocols utilizing template switching strategy, 3’ end data are less affected by fixation compared with full-length data.