The effect of methanol fixation on single-cell RNA sequencing data

Background Single-cell RNA sequencing (scRNA-seq) has led to remarkable progress in our understanding of tissue heterogeneity in health and disease. Recently, the need for scRNA-seq sample fixation has emerged in many scenarios, such as when samples need long-term transportation, or when experiments need to be temporally synchronized. Methanol fixation is a simple and gentle method that has been routinely applied in scRNA-sEq. Yet, concerns remain that fixation may result in biases which may change the RNA-seq outcome. Results We adapted an existing methanol fixation protocol and performed scRNA-seq on both live and methanol fixed cells. Analyses of the results show methanol fixation can faithfully preserve biological related signals, while the discrepancy caused by fixation is subtle and relevant to library construction methods. By grouping transcripts based on their lengths and GC content, we find that transcripts with different features are affected by fixation to different degrees in full-length sequencing data, while the effect is alleviated in Drop-seq result. Conclusions Our deep analysis reveals the effects of methanol fixation on sample RNA integrity and elucidates the potential consequences of using fixation in various scRNA-seq experiment designs. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-021-07744-6.


Background
Since its emergence, single-cell RNA-seq (scRNA-seq) has revolutionized many biological fields due to its high resolution in deciphering tissue heterogeneity [1]. The mRNA input from one cell is quite little, thus it leads to more dropout in gene detection compared with bulk RNA-seq [2]. During single-cell library preparation, the reversetranscription (RT) step is crucial since any RNA molecules not captured in this step will forever be lost, and any biases in this step will be amplified downstream, severely affecting the inference of biological signal. For these reasons, it is of utmost importance to preserve the biological sample as much as possible to yield a high-quality transcriptome and a successful scRNA-seq experiment.
For projects including long-distance transportation of samples, cells or tissues may suffer the loss of viability from physical impact during transport or improper storage conditions. In some cases, sample preservation methods are required to allow more flexible experimental designs; specifically, it can help to store samples collected from different experimental conditions or time points and enable them to be consolidated [3]. Besides, researchers may also be interested in specific biological states that in some tissues may become altered as specific pathways can be activated by in vitro processing [4].
Fixation has been widely utilized for the preservation of biological samples from post-mortem decay. Various fixation protocols that use different chemicals have been developed for different purposes and applications, each method having their pros and cons, partially due to their different fixation mechanisms [5][6][7]. To preserve the desired biological features of tissues or cells, different fixatives play different roles depending on the desired features to be preserved. Crosslinking fixatives, such as formaldehyde, work by creating covalent chemical bonds between proteins in tissues, thereby stopping all enzymatic and macromolecular function in the tissue. This causes a complete arrest of all cellular activity, including cell apoptosis and molecular degradation; most macromolecules are even locked in the spatial position they were in at the time of fixation so that spatial relationships within the cell are also preserved. Formaldehyde specifically fixes tissues by cross-linking primarily the residues of the basic amino acid lysine in proteins and is an ideal fixative for immunohistochemistry (IHC) [8]; as all macromolecules are cross-linked, this kind of fixation offers the benefit of long-term storage and allows good tissue penetration by dyes and other small molecule chemicals required for downstream processing in IHC [9]. Another cross-linking fixative, PFA, can anchor soluble proteins to the cytoskeleton and lends additional rigidity to the tissue [10]. The FRISCR protocol based on PFA fixation can even integrate fluorescent dye staining, which allows researchers to apply fluorescence-activated cell sorting (FACS) analysis on this type of fixed sample and sort specific cellular subpopulations for further sequencing analysis [11]. This protocol is not, however, suitable for adaptation to high throughput scRNA-seq as it requires a reverse crosslinking step that can only be performed in tubes and is not compatible with most microfluidic scRNA-seq library preparation workflows.
Alcohol fixatives, such as ethanol and methanol, work by dehydration, causing proteins to denature and precipitate in-situ [12]. As such, the cellular structure will be damaged since the dehydrated environment changes protein conformation. Therefore, alcohol fixation alone is not ideal for preserving samples for imaging, but it is useful for nucleic acid preservation. Compared with fixation approaches used in histology, nucleic acid preserving methods for sequencing do not require the integrity of structural proteins, instead, they aim to prevent DNA or RNA from degradation. Methanol fixation has been widely utilized for its ease of operation and robust performance in preserving nucleic acids [13,14]. The dehydration effect can be reversed with a single, simple rehydration step, which can easily be incorporated into scRNA-seq workflows at the sample preparation step, with subsequent processing steps for cDNA library construction carried out normally without any additional changes [15]. Although methanol can be largely removed by PBS buffer washing to avoid contamination of downstream reactions, substantial changes occur in cells upon fixation due to dehydration. The cellular structure becomes damaged and normal cell functions are compromised due to loss of normal lipid and protein structure; how these changes affect the transcriptome and whether they will influence the sequencing profile remains understudied. In this study, we comprehensively evaluated the effect of methanol fixation on single-cell RNAseq results. We performed the analysis at gene and transcript levels and observed both similarities and inconsistencies between the transcriptomic profiles of live and fixed cells. Although it is often assumed that fixationassociated RNA degradation is the main reason for the discrepancies between live and fixed transcriptomic profiles, our results indicate the incomplete reverse transcription of mRNAs with more complex secondary structures during the library preparation step may be another important cause of the observed discrepancies.

Results
Methanol fixation does not affect nucleic acid integrity and preserves cell-to-cell similarities consistent with scRNA-seq technical variability First, we wanted to determine whether there is any obvious degradation of RNA or changes to the transcriptomic profile caused by methanol fixation. To do so, we performed methanol fixation on two cell lines, HCT-116 and HepG2, such that any cell-type specific fixation effects can also be observed and compared across cell types ( Fig. 1 A; for within cell-type comparisons, the result of the HCT-116 cell line is shown here for illustrative purposes. Results are consistent for both cell lines studied ( Supplementary Figures 1, 2, 3, 4, 5, 6, 7). For both cell lines, we prepared RNA-seq libraries from live cells, as well as from fixed cells that were stored in methanol for one-week. We measured the size of singlecell cDNA libraries (Fig. 1B) and noted that although no significant change in fragment size distribution was observed for fixed cells, there is a slight decrease in the quantity of cDNA in the 1500-2000 bp. This result shows that fixation can largely preserve the RNA integrity such that high-quality cDNA can be obtained without severe degradation. After sequencing, raw data from both live and fixed cells were shown to be of high quality and suitable for further analysis; a small but observable reduction in mapping rates was observed in some of the fixed cells compared to live cells, but the mapping rate for all libraries are well within the typical range (Supplementary Figure 8 A-B).
Next, we performed a more detailed bioinformatic analysis to compare the transcriptomic profile between those samples. Since the cells subject to fixation were harvested from the same culture as the non-fixed cells, biological variation between the two datasets is expected to be small. If the methanol fixation indeed does not result in any significant changes to the RNA profile, then the correlation between the live and fixed transcriptomic datasets should be high, and comparable to within-dataset correlations. To validate this hypothesis, we first randomly selected three cells from each of the live and fixed datasets and made scatter plots to visualize the pairwise similarity between single cells at the gene level ( Fig. 1 C). Indeed, scatter plots look as expected, with high expression genes between single cells correlating closely while low expression genes are more broadly dispersed, with generally good correlation across all genes [24]. We also calculated Pearson correlation coefficients for each pair. As expected, the r 2 values are consistently high for both cells compared within live or fixed datasets, and between live and fixed datasets. These r 2 values are also comparable to those found in other published single-cell cross-correlation analyses [25]. To further confirm these results, we then calculated the pairwise correlation for all the cells we profiled, visualizing the results in a heatmap (Fig. 1D). Overall, the correlation between all cells is high, between 0.7 and 0.9. The annotation bar indicates the label of each cell, live or fixed, and the intermixing of labels indicates that the degree of correlation is not clustered by sample type, suggesting that the methanol fixed cells do not show a major difference from the live cells. These results show preliminarily that methanol fixation does not result in any obvious changes to the transcriptomic profile of single cells.
Methanol fixation does not affect cell-type identification, clustering, and biological inference We found that methanol fixation does not dramatically change single-cell RNA transcriptomic profiles, but scRNA-seq is most commonly used to perform cell-type identification and clustering, therefore we further fixed (R 2 > 0.7). The mixed annotation bar indicates the transcriptome similarities do not distinguish cell treatments during sample preparation explored our data using classification methods to ensure fixation does not affect these types of analyses and downstream biological inferences. Principal component analysis (PCA) is a commonly used technique in singlecell RNA-seq analysis [26]. It identifies the coordinate system that represents the greatest variance in the data, and projecting data points in this new coordinate system, thus is able to visualize the differences between groups of data points and cluster similar data points together. To see whether single cells could be grouped by their fixation treatment, which would indicate that there is the variance between the two treatment groups, we applied PCA on our data and checked the first several principal components (PCs) for separation between groups of cells. We found that the top three PCs show meaningful separations (Fig. 2 A): The first PC, which represents the greatest degree of variance, separates cells according to their cell type; the second PC appears to correlate with the cell cycle phase of each cell, and after normalizing for cell cycle effects, we observe that cells become clustered by their treatment condition ( Fig. 2 A middle and bottom rows). This suggests that among all the factors for cell classification, inherent differences in cell type remain the most prominent, and when performing cell clustering analysis, any significant biological differences between cell types are unlikely to be obscured by the effects caused by fixation.
To determine the specific genes and possible pathways that are responsible for the separation between live and fixed cells, we performed PCA on each cell type separately, and as expected in this analysis PC1 showed separation between cells according to cell cycle phase whereas PC2 was by treatment conditions (Fig. 2B). We then extracted the top 500 highly variable genes from PC1 and PC2 in each cell line and performed Gene Ontology (GO) Analysis [27] on these genes (  Table S1). High contribution genes from PC1 correspond to biological pathways involved in cell cycle processes and control for both cell types analysed, which is expected based on our previous analysis. Genes that are heavily loaded in PC2, which separate the cells by their fixation treatment, did not correspond to any known biological pathways in GO. This result suggests that the separation between live and fixed cells is likely not regulated by any specific biological mechanisms, but rather by technical factors.
The biological complexity of true tissue samples is much greater than a cell line, thus, to verify our findings, we also re-analysed published live and fixed scRNA-seq data generated from primary peripheral blood mononuclear cell (PBMC) [28]. This published PBMC dataset consists of cells under different conditions: live, fixed for 3 h and fixed for three weeks. Our re-analysis of this dataset shows data generated after fixation is able to preserve the gene features for all subtypes recognized in live cell data (Supplementary Figure 9). In addition, for all subtypes of cells in PBMC, the proportion of each cell type is consistent across data generated from all conditions, which indicates that methanol fixation does not alter cell capture efficiencies in a cell-type specific manner (Supplementary Figure 10).

Genes that drive live and fixed separation show greater variation in expression level
To explore the PCs with the strongest variation in more detail, we studied the statistical features of the top 500 loading genes in PC1 and PC2. Two sets of genes from both PCs were extracted and their relative expression abundances were studied. Specifically looking at those genes with high loading in PC2 that are responsible for the separation of live and fixed groups in this PC, we compared their average expression between live and fixed cells and found that the key difference is that lowexpression genes are generally less detected or less expressed in the fixed cells ( Fig. 3 A). We do not observe this phenomenon with genes from PC1 (cell cycle), indicating that this is unlikely to be caused by any technical limit of detection (LOD)a compromised LOD would affect all low-expression genes in the sample and therefore would appear in both PCs, which is not the case. In addition to the changes to the mean expression level of low-expression genes, we also observe differences in the variability of the gene expression level when comparing the genes from the two different PCs (Fig. 3B). The coefficient of variation (CV) across cells of the gene expression level for genes contributing to PC1 (cell cycle) is comparable between the fixed and live groups, suggesting that cell-cycle related genes are detected with similar consistency in each cell population regardless of the treatment condition. Genes contributing to PC2 (fixation effect), however, show notably higher variation in fixed cells than in live cells ( Fig. 3B bottom panel). These results suggest that the effect of methanol fixation could be specific to those genes. The interpretation of this is that methanol fixation does not result in consistent signal lost for the whole transcriptome, but rather stochastically across all cells for genes involved in PC2 separation. Therefore, genes separating PC2 may share common features that make them specifically affected once fixed. Thus, the discrepancy between live and fixed cells is likely not due to any biological process of the cell that is induced by methanol treatment.
Since scRNA-seq is known to exhibit so-called "dropout" events in gene detection [2,24,25,[29][30][31], we wondered if fixation exaggerates this phenomenon. To better evaluate the dropout frequency over the entire transcriptome, we set a series of increasing gene expression level thresholds for defining detected genes. For each threshold, we used boxplots to visualize the number of genes with expression levels greater than this threshold ( Fig. 3 C). As expected, when the threshold for gene filtering is low, live cells have more genes detected overall; but somewhat surprisingly, as the gene expression threshold gradually increases, a greater number of genes is detected in fixed cells. This result shows that fixed cells tend to have more dropout events for low expression genes but retain higher expression genes more robustly. We further illustrate this by extracting genes with either high or low expressions (gene expression (TPM) > 30 high or < 5 low), and for each group, visualizing the relative correlation between the mean expression level for each gene (Fig. 3D). The result shows low expression genes are more abundant in the live group than the fixed. The inset graph shows the quantitative comparison of gene numbers above or below the diagonal line. The trend was reversed for highly expressed genes that their expressions are more abundant in fixed cells. Based on these results, we concluded that the frequency of dropout and the relative quantitative expression are different between live and fixed cells. And the methanol treatment differentially affects genes with different expression levels.
Longer and higher GC transcripts are more severely affected by fixation We sought to find features that are shared among those genes that are most affected, however, features other than abundance can only be described for transcripts, not genes. Abundance measurements at the gene level represent the contribution from multiple transcripts, potentially of widely varying lengths and sequence properties. Therefore, subsequent analyses used transcript level abundances to shed light on potential molecular features or mechanisms that lead to certain types of transcript molecules being affected more by methanol fixation. First, we observed that the overall GC content of sequenced reads in the fixed cells was significantly higher than in live cells (Supplementary Figure 8 C). There have been no reports of direct methanol-induced conversion of adenosine and thymine to guanosine and cytosine, therefore it is unlikely that this increase in GC content is due to direct amination. Second, we noticed that the peak sizes of the cDNA libraries slightly differed between live and fixed samples and wondered whether fixation could be causing 3' degradation of RNA molecules leading to changes in length. Thus, we performed further comparative analyses of transcript length and GC content between different treatment conditions. To visualize the GC and length level of specific transcripts against the rest of transcriptome, we sorted all transcripts by their length and GC content and made rankorder plots. In these plots, each dot can be located by a gene's feature and its corresponding rank, in the increasing order. In the GC content plot, we highlighted top contribution genes from PC1 (cell cycle) and PC2 (fixation effect) using coloured dots, while remaining transcripts are plotted in grey (Fig. 4 A). Compared with PC2, PC1 genes have more even distribution along with the line plot compared to those from PC2. Most PC2 transcripts are restricted to the higher GC content part, which indicates that transcripts separating fixed cells from live ones have higher GC base-pairs in the sequence in general. A similar pattern was revealed when the same analysis was done for transcript length (Fig. 4B). To compare the length and GC content of transcripts from both groups, p-value was calculated for each using T-test, and a statistically significant difference was found between genes contributing to PC1 and those contributing to PC2 (Fig. 4 C). The fixation effect is more prominent for long and high GC transcripts, which are features of transcripts that are causing non-biological separation between live and fixed cells.
To visualize how transcript features correspond with the fixation effect an individual receives, we compared relative expression level and transcript detection number. For abundance comparison, we separated transcripts into 16 groups with equal size according to length (6 plots with increasing order of length were selected) (Fig. 4D, Supplementary Figure 6). We compared relative expression by correlation plot, and the comparison pattern differs as transcript length varies. Then for each group (16 in total), we counted transcript number above or below the diagonal line, which stands for if a transcript holds higher expression in live or fixed cells, to compare the number of transcripts that are enriched in either group (Fig. 4E). The gradually changing trend illustrates that shorter transcripts are more enriched in the fixed group, yet longer transcripts have more equal expressions for both groups. The transcripts detection number shows similar variations (Fig. 4 F). For shorter fragments, more transcripts are detected in fixed cells, whereas longer fragments are detected more frequently in live cells. These analyses suggest that transcripts receive different degrees of effects result from different lengths and GC contents. And transcripts with longer lengths and higher GC contents are more likely to be affected by methanol fixation.
Transcripts that are both long and GC-rich are the most affected by fixation Thus far, we have shown that fixation differentially affects transcripts with different GC content and length, based on analysis of transcript abundance quantification, which is calculated from aligned read counts. To further shed light on the potential mechanism behind this effect, we next assessed transcript coverage for clues. Since Smart-seq2 is based on template switching after poly-A tail selection, we postulated that if the fixation effect is due to non-enzymatic RNA degradation, random nucleophilic attack to the 2' hydroxyl of the RNA should result in random termination of transcripts, and we should observe loss of coverage from the 5' end in fixed samples as compared to live samples. Thus, we separated transcripts into 10 groups with equal size based on length and performed mapping coverage analysis across the length of the transcript (Fig. 5 A, Supplementary Figure 7). Overall, the coverage for shorter transcripts is very similar between fixed and live samples; but in the longest transcripts, the coverage of bases that are greater than 1500 bp away from 3' is dramatically lower in fixed samples (Fig. 5 A bottom right). Next, we directly compared the coverage in transcript groups with different lengths or GC content and studied how the variance changes from 3' to 5' end. We computed the difference of coverage depth (coverage livecoverage fixed ) between live and fixed cells along the length of the transcript and plotted this difference by the transcript GC content (Fig. 5B, top) and by transcript length (Fig. 5B, bottom). We observed longer transcripts to have more discrepancy in coverage pattern once fixed (Fig. 5B, top). The same effect is observed for GC content (Fig. 5B, bottom). Except for in the longest transcripts, we did not observe the severe 3' bias that is expected in degraded RNA. In conclusion, the coverage patterns of transcripts are affected by fixation, and this effect is stronger in transcripts that are longer and higher in GC content.
To further quantify the mapping completeness, we counted per-base mapping depth for each individual transcript and calculated a mapping ratio ranging from 0 to 1, and then plotted these ratios on a scatter plot to compare this ratio between live and fixed samples (Fig. 5 C). By highlighting the top 10 % longest and shortest transcripts, we can see that the shortest transcripts have a high degree of correlation between the live and fixed, whereas the longer transcripts show a skew towards the live axis, indicating higher mapping depth in the live group. The GC content plot also shows that high GC content transcripts have higher mapping depth in live cells, while low GC content transcripts seem to have higher mapping depth in fixed cells. To determine if these two factors co-occur, we get a quotient of mapping percentile for each transcript from live and fixed groups and binned those numbers into 10,000 groups by 100 × 100 combinations of transcript length and GC content. We plotted a heatmap using a relative mapping ratio, in which each axis is arranged by increasing length or GC content (Fig. 5D). The corner representing transcripts that are both long and are high in GC content shows higher mapping percentiles in live cells, which again indicates that the effect of methanol fixation effect is more severe for these transcripts that are both long and GC-rich.
Although the fixation effect can be revealed in PCA clustering, it is presented as the second largest variation. Since we concluded that the fixation separation was mainly caused by long and GC-rich transcripts, we then explored whether the fixation effect exhibited in the first PC could be revealed by only selecting those long or GC-rich transcripts. To do so, we binned all transcripts by their GC-content into five bins with increasing GCrichness; we also binned all transcripts by their length into five bins with increasing length. We then performed (See figure on previous page.) Fig. 3 Differences in statistical features of genes with the top contribution in driving variation between live and fixed cells. (A) Comparison of relative expression of 500 genes with the top contribution in PC1 and PC2 between live and fixed cells. Expression of PC1 genes correlated well while in PC2 the trend was incoherent for genes with different expressions level, which indicates genes heavily loaded in PC2 may be responsible for the separation between two groups of cells. (B) Comparison of expression variation of genes with top contribution from PC1 and PC2. In the top panels of Fig. 3B, we take genes that are heavily loaded in PC1 respect for live cells and fixed cells. Then, we computed the coefficient of variation (CV) of each gene across all cells. The CVs for each gene are then plotted against that gene's mean expression level, separately for live (blue) and fixed (orange) cells. Genes with the top contribution in PC2 holds much higher variation compared with PC1 genes. (C) Comparison of gene detection number after expression filtering. A series of thresholds were set up for different sensitivity requirements. The detection number in fixed cells gradually surpass live cells once the threshold increased (nsP > 0.05, *P < 0.05,**P < 0.01,****P < 0.0001). (D) Relative abundances of genes with low (< 5 TPM) or high (> 30 TPM) expression, the inset bar charts compare the quantities of genes which have higher expression in either live (blue) and fixed (orange) cells. For low expression genes, they are generally more abundant in live cells. Genes with higher expression are more abundant in fixed cells PCA using increasingly higher threshold cut-off for both GC and length, meaning the set of transcripts used for performing PCA were increasingly restricted to those that are long and have very high GC content. As the threshold for GC and length increased, the amount of variation between live and fixed groups that is explained by PC1 also increased ( Fig. 6 A), compared to when all transcripts are included, PC1 predominantly shows variation arising from differences in the cell cycle. This is presumably due to the increased weight of transcripts with long length and high GC content that contributes to the separation of cells caused by fixation. To show this more clearly, from each PCA performed using different transcript length-GC selection thresholds, we plot PC1 loadings annotated by treatment condition, in which we observed that although the magnitude of separation is reduced when fewer high length-GC transcripts were used, the separation of fixed and live cells becomes less ambiguous in PC1 (Fig. 6B). We then identified the top 100 most highly loaded transcripts in PC1s and visualized their length and GC content. As the threshold for GC and length increased, the length and GC content of the top loaded transcripts also gradually increased (Fig. 6 C). This series of analyses illustrate that longer and higher GC transcripts indeed separate cells based on fixation, indicating that such types of transcript Transcripts are sorted and grouped by length and GC content; each unit represents an average ratio for those transcripts. In the corner containing transcripts with longer and GC-rich transcripts, live cells are shown to have more complete mapping compared with fixed molecules are more likely to be affected by fixation during sample preparation.

The methanol fixation effect is less prominent if only sequencing the 3' end
Since Smart-seq2 produces full-length cDNA libraries, the uneven influence received in different transcripts can be observed in mapping coverage and gene quantification. On the other hand, protocols preserving only the 3' end of cDNA library such as Drop-seq may be less affected by methanol fixation [32], since only one end of the transcript is counted during quantification and analysis. Therefore, even if the RT and amplification are hindered by changed mRNA structure, the fixation effect will be less significant than that seen in Smart-seq2 data.
To validate our hypothesis, we first simulated Dropseq data by mapping reads generated in Smart-seq2 to the 3' end of transcriptome reference. By doing so we can produce a dataset resembling the features of Dropseq since both methods are based on template switching strategy but Drop-seq only preserving 3' end of the library. In the PCA plotted with 3' end mapped data, although we still observe the distinction between cells processed with different treatments, the separation is much less clear (Fig. 7 A), and the two clusters appear merged into one. Compared with full-length Smart-seq2 data, the simulated 3' biased data shows little fixation effect. To examine if the merging of two clusters is caused by different alignment procedures, we also mapped original reads to the 5' end of the transcriptome. When performing PCA with 5' end mapped data, the distance between two clusters enlarged and the separation is distinct (Fig. 7B). The result shows data similarity between live and fixed cells differs between 3' and 5' end of the transcriptome.
We also used published data to explore the fixation effect in a real Drop-seq experiment [14]. Three sets of data generated from the HEK cell line were analysed, which included live cells, fixed cells, fixed cells with three-week storage. In the PCA including live and fixed cells, although the separation still exists, two clusters are partially merged, which indicates the fixation effect is much weaker in Drop-seq compared with Smart-seq2 (Fig. 7 C). When we performed PCA using cell groups that are both fixed but with different storage duration, we saw once cells are fixed before sequencing, the similarity of their transcriptomes is high (Fig. 7D). This result indicates fixation effects are consistent and not affected by the cell storage time.
This result validates our hypothesis that fixation can affect data differently according to the methods used for library construction. For protocols utilizing template switching strategy, 3' end data are less affected by fixation compared with full-length data.

Discussion
To elucidate the effect of methanol fixation on single-cell RNA-seq data, we performed a series of comparative analyses and proposed a potential mechanism for how the fixation effect occurs. For all comparisons carried out at the gene level, the results show that fixation data is capable of revealing biological insights that are typically sought in scRNA-seq experiments. Although subtle discrepancies were observed in part of our analysis, they did not obscure the key biological features of cells and key biological differences between cell types. We further investigated these effects at transcript-level to identify the source of the observed differences, using both expression abundance data and raw sequencing data to uncover more in-depth insights to account for the observed discrepancies. Using transcript-level information, we showed that length and GC are key properties that correlate with the degree of fixation effect each transcript receives. Specifically, longer fragments with higher GC content are shown to be more affected by fixation in quantification and mapping integrity. Based on structural considerations, we hypothesize that transcripts that are long and high in GC are more likely to have complex higher-order structures, which makes them more difficult to fully recover from methanol fixation even after rehydration [33]. Unlike mRNA with simple structures, those with more complex structures may be altered and hinder downstream reactions such as RT and amplification. An important insight from our results is that for users of scRNA-seq who wish to investigate differences in splice isoforms, live samples will be more reliable since the full-length information of transcriptome is better preserved; methanol fixation will result in skewed abundance readouts from those transcripts with high GC and long length. In addition, we also observed lower %GC in the quality control report, which indicates that GC-rich transcript may not be preserved well by methanol fixation (Supplementary Figure 8 C). In addition, fixation effects are more obvious in Smart-seq2 data compared with Drop-seq. In both simulated and real Drop-seq data (Fig. 7A C), we observed less separation between live and fixed cells compared with Smart-seq2, therefore illustrating that the fixation effect is not observable in 3' end sequencing. It is possible that since the oligo-dT primers bind to the poly-A tail at the 3' end to initiate the RT, the integrity of the 3' end is more likely to be protected and captured, whereas the subsequent template switching step that occurs at the 5' end is more likely to be affected. An inefficient template switch then leads to incomplete DNA elongation and finally affecting data quantification in fixed cells. Based on our observations and mechanistic hypothesis, methanol fixation influences the data by introducing barriers to RT, which will occur more often in transcripts with complex secondary structure, therefore the fixation effect will not be observed if sequencing only one end of the transcript.
Besides the loss of mRNA with more complex secondary structures, we also observed a general trend of more dropout of low expression genes in the fixed group. By checking the sequence features of the corresponding transcripts, we found the low expression is not related to GC content and length. Instead, this may indicate the dropout detection of low expression genes is exaggerated by fixation. For research focusing on transcripts with low abundance, methanol fixation can lead to reduced or even complete loss of the target, even though we and others have shown that this does not affect cell-type clustering results [14,28]. We also found that although the discrepancy between live and fixed sample is not primarily caused by RNA degradation, it is crucial to include protective reagents such as RNase inhibitors to all buffers and reagents during the methanol fixation and rehydration steps to prevent degradation. Methanol fixation can also lead to cell aggregation, thus, to avoid capturing many doublets or cell aggregates in the scRNA-seq experiment, proper gating conditions should be established during single cell sorting to remove these unwanted aggregates (Supplementary Figure 8DE).
Fixation induced expression changes are also seen when comparing relative abundance between live and fixed cells (for example in Figs. 3D and 4D). When using Smart-seq2 data for such analysis, some transcripts show higher expression in fixed cells rather than in live cells. This begs the question of whether some genes are enriched due to fixation. Although the elevated expression of some mRNA transcripts in fixed cells in some analyses seems contradictory, this phenomenon can be explained from the perspective of library construction and data normalization. During sequencing library construction and single-cell library pooling, the protocol aims to collect equal amounts of products from individual samples. Therefore, even though there are differences in cDNA yield after pre-amplification, it can be removed by sequencing library preparation since the equal quantity of cDNA were used for downstream processing. Moreover, after data normalization (e.g., calculation of TPM or CPM), the resulting library size will be equalized for individual cells. In fixed cells, transcripts affected by fixation have more dropout during library preparation, therefore loss of those transcripts means remaining transcripts are more likely to be captured, which could then be reflected as a higher expression level of these transcripts in fixed cells.
Our work determines the feasibility level of methanol fixation in different usage scenarios such as basic scRNA-seq compared with those focusing on transcript isoforms level studies, and we inform users of the types of biases that occur with methanol fixation in different experimental scenarios. Knowing how fixation affects the RNA-seq data is beneficial for researchers to make reasonable and appropriate experimental plans using methanol fixation.

Conclusions
In conclusion, we evaluated the effect of methanol fixation in single-cell RNA-seq from various aspects and observed largely similar results between data generated from live and fixed cells, while in-depth transcript level analyses revealed important differences between live and fixed cells in GC and length biases that could affect specific types of biological studies. This confirms the feasibility of using methanol fixation for sample preserving in most applications, while there remain subtle differences in transcript detection and coverage with increasing transcript length and GC content. Moreover, we found that the effect of methanol fixation is also different in distinct library construction strategies. These findings suggest that methanol fixation is generally a good method for sample fixation, but certain types of biological analyses that require consideration of full-length transcripts, or that may be influenced by the GC-content or transcript length need to use caution when interpreting the results from fixed data, to avoid misleading conclusions due to fixation. Overall, this study provides an in-depth investigation of the effects of methanol fixation on scRNA-seq data, and offers users additional guidance in selecting the appropriate sample preservation strategy for single-cell transcriptomic studies.

Cell line preparation and fixation
The HCT-116 and HepG2 cell lines used in this study were purchased from ATCC (HepG2, cat# HB-8065; HCT-116, cat# CCL-247). HCT-116 and HepG2 cells were cultured with Dulbecco's Modified Eagle's medium (DMEM) (Thermo Fisher Scientific, cat# 12,100,046) supplemented with 1 % Penicillin-Streptomycin (Thermo Fisher Scientific, cat# 15,070,063) and 10 % Fetal Bovine Serum (Thermo Fisher Scientific, cat# 16,000,044). We harvested cells at 70-80 % confluence, dissociated for 2 min at 37°C using 0.25 % trypsin-EDTA (Invitrogen, cat# 25,200,072), and quenched with growth medium. To prepare cells for different treatments, the cell suspension was separated into two parts with equal volume. We centrifuged at 300 g for 5 min to wash the cells. The supernatant was removed, and the cell pellet was resuspended using phosphate-buffered saline (PBS) (Invitrogen, cat# 10,010,023). The resuspension volume was chosen such that the final cell concentration is 10 6 to 10 7 cell/mL. A portion of the cells were taken for processing immediately to generate the "live cells" sequencing library; the rest of the cells were kept for fixation.
Fixation and rehydration steps were performed following a previously published protocol [14]. Briefly, ice cold 20 % PBS was added to resuspend the cell pellet and 80 % pre-chilled methanol was added dropwise, until the final cell concentration reached 10 6 to 10 7 cell/mL. After mixing PBS and methanol by gently pipetting up and down, the tube with the fixed cells was placed on ice for 20 min and then transferred to -80°C for longer storage. Fixed samples were kept for one week before rehydrating, cell sorting, and library construction.
The HepG2 experiments were performed in two separate replicates, while the HCT-116 experiments were performed once.
Cell rehydration, FACS sorting, library construction, and sequencing Cells stored in 80 % methanol was transferred from − 80°C to ice and centrifuged at 1500 g for 3 min. We discarded supernatant and collected the cell pellet. PBS in presence with 0.01 % Bovine Serum Albumin Fraction V (BSA, Thermo Fisher Scientific, cat# 15,260,037) and 1U/ul RNase OUT (Thermo Fisher Scientific, cat# 10, 777,019) was used for resuspending and washing the cell pellet. Fixed cells were washed with the same washing buffer twice to remove methanol thoroughly. After washing, fixed cells were kept in the washing buffer and ready for sorting. For live cell sorting, cells were stained with propidium iodide (PI, Sigma-Aldrich, car# P4864-10ML) solution at room temperature for 10 min. Both live and fixed cells were filtered through 70 μm filter to prevent cell clumping.
Both live cells and fixed cells were sorted into 96 well plates containing cell lysis buffer using BD Aria IIIu sorter (BD Biosciences). FSC parameters were used for singlet selection (Supplementary Figure 8D). PE-Cy5 signal was used for removing dead cells. Prior to sorting, we visually observed more and larger cell aggregates in fixed cell samples under the cell counter, as compared to the live cells. By visual inspection, sorting was able to remove the aggregates (Supplementary Figure 8E). The plates with sorted cells were vortexed and spun down at 4°C. Single-cell cDNA library was constructed using Smart-seq2 protocol [16]. After obtaining cDNA libraries, cDNA library size was checked using Fragment Analyzer HS NGS Fragment Kit (1-6000 bp) (Agilent, formerly Advanced Analytical, cat# DNF-474-1000). Concentrations were quantified by Qubit 3.0 fluorometer (Thermo Fisher Scientific, cat# Q33216). High quality cDNA libraries were used for sequencing library construction.

Data processing and analysis
Raw sequencing data were demultiplexed with adapters trimmed on Basespace (Illumina). Quality of all raw *fastq.gz files was checked using Fastqc [17]. Reads are mapped to ENSEMBL human reference genome GRCh38 using Kallisto [18] for quantification. The integration of single-cell data was done by tximport [19]. Cells with more than 4000 genes detected were preserved for downstream processing, resulting in 151 cells for HepG2, and 183 cells for HCT-116. For data normalization, we logged the TPM result provided by Kallisto. Matrices of gene and transcript abundance were obtained by adjusting "tx2gene" parameter in tximport.
The correlation matrix was plotted with "chart.Correlation" implemented in the "PerformanceAnalytics" package. Cell-cycle related analyses were done using Seurat R package [20]. We performed Principal Components Analysis using FactomineR package and visualizations were done using either R built-in functions, ggplot2, or ggpubr. We performed Gene Ontology analysis using Gene Ontology website(http://geneontology. org). The remaining analyses used custom scripts, downloadable at https://github.com/wangxinlei217/Effect-ofmethanol-fixation-on-single-cell-RNA-sequencing-data.
Length and GC information of individual transcript was calculated according to GRCh38 transcriptome sequence using a custom R script. For all transcript separation based on either length or GC content, the thresholds were set to make sure that all groups have the equal number of transcripts. In the analysis focusing on transcript coverage, raw reads were mapped to GRCh38 by STAR [21]. To analyze mapping coverage for the individual transcript, Aligned *bam files were converted to depth files with samtools depth function in SAMtools [22], which shows the number of reads mapped to each base pair on each chromosome. Reads in depth file were then assigned to specific transcripts using bedtools intersect function in BEDTools [23]. Downstream analysis and visualization were performed in R.
To map Smart-seq2 reads to the 3' end and 5'end of the transcriptome, we first trimmed the GRCh38_RNA_ latest.fna to build the reference, then we performed alignment and quantification with kallisto. Gene expression matrices of Drop-seq data were downloaded from GEO (accession numbers GSM2359002, GSM2359003, GSM 2,359,005).  Figure 6. Expression correlation of transcripts with different lengths. Transcripts are equally grouped to 16 according to the length. IDs in each plot represent the transcripts group included in that plot. With the increase of the ID number, the average length of the transcript group also increased. Both results of HCT-116 (top) and HepG2 (bottom) are shown. Supplementary Figure 7. Comparison of mapping coverage between live and fixed cells. Transcripts are equally grouped to 10 according to the length. IDs in each plot represent the transcripts group included in that plot. With the increase of the ID number, the average length of that transcript group also increased. Both results of HCT-116 (top) and HepG2 (bottom) are shown.Supplementary