Generation of a gene expression matrix for ddSEQ-based data
In order to use the pre-processing pipeline, the user needs to provide a reference genome with an annotation and the raw FASTQ files (Read 1 and Read 2) generated with BioRad’s ddSeq or 10x Genomics protocol. For this, a dedicated directory structure is used, that can be obtained from the git repository (Figure S1). After preparing the input data, the Snakemake pipeline can be started by a single command line, with the two mandatory parameters being the number of cores to be used and the scRNA-seq protocol.
Following the Snakemake run, the result data is visualized using the dedicated Shiny application. At the beginning, the user has to select the “results directory” generated with the Snakemake workflow. Subsequently, a “sample summary” page is presented (Fig. 3a), showing an overview of quality metrics for each processed sample. These quality metrics are separated into three blocks showing the information as a table and a stacked barplot: 1) The first block contains information about the number of reads with and without a valid barcode and the number of extracted barcodes. 2) The next block shows information regarding the STAR mapping results, such as uniquely mapped reads, unassigned reads, etc. 3) The last block shows information about featureCounts results, such as the number of reads assigned to a single feature, reads assigned to multiple features, etc. A menu on the left side of the web pages allows the user to navigate through other result pages:
-
FastQC Report:
This page provides users with the quality control results generated with FastQC. To provide an overview of the sequence quality of the samples, graphics generated with FastQC are shown on this page.
-
Mapping Rates (Fig. 3b):
This page shows the fraction of uniquely mapped reads, unmapped reads, and reads mapped to multiple features for each cell using stacked bar plots. By changing the amount of cells to analyze, the plots are dynamically re-calculated. Additionally, users can browse through the data corresponding to each barcode in tabular form for the mapping statistics of uniquely mapped reads, reads mapped to multiple loci, and unmapped reads.
-
Gene Counts:
Similar to the “Mapping Rates”, this page shows fractions of the read data assigned to a feature or to different categories of unassigned reads. Again, users can dynamically re-calculate the stacked barplot by changing the number of cells to analyze or by selecting the categories of reads to be shown. At the bottom of the page, users can browse identified barcodes in tabular form.
-
Valid Cells (Fig. 3c):
As mentioned before (“Inspection of pre-processing results and estimation of valid barcodes”, Implementation), a major challenge in Drop-Seq-based protocols is the discrimination of barcodes added to free mRNA fragments from barcodes added to mRNA fragments of a whole cell. This page shows the ‘knee plot’ and the automatically computed cutoff. Additionally, pie charts below the knee plot show the distribution of mapped reads and identified genes in the selected barcodes. When the number of valid barcodes or expected number of cells is changed, these plots are updated. Also, users can browse through the barcodes in tabular form to get a detailed view of identified UMIs and genes per barcode.
After selecting a number of barcodes to be used for further analysis, users can simply download the gene expression matrix and a list of used parameters by clicking a button (Figure S2). The gene expression matrix can then be used with the post-processing WASP application for additional analysis.
Customizable analysis and dynamic visualization of scRNA-seq post-processing
To make the WASP software also applicable for datasets which were created using a protocol not supported by the pre-processing pipeline, the post-processing was designed as a stand-alone module. The post-processing Shiny application can be started either as a Docker container or as a standalone application on a Windows-based PC. A main advantage of the latter is that it does not require the installation of any additional software or packages, only a standard web browser is needed. After the application is started, the user is provided with a web page similar to the Shiny pre-processing page. This page also contains examples of how the input data has to be structured. As a first step, the user has to upload the gene expression matrix file. Where applicable, it is possible to upload an additional annotation file containing all cell barcodes and their assigned type. This can be useful if data from several experiments are combined to label each cell with its origin. Another option is to add already known cell type information, e.g. if cells have previously been selected from specific tissues. When an annotation file is provided, WASP shows users the added cell information in plots allowing to check for e.g. batch effects. After the upload is completed, users can choose between an automatic and a manual analysis mode. The first mode was integrated to allow less experienced users to gain easy insight into their data. For this mode, we evaluated parameters from previous analyses to select default values suitable for most data sets. In the automatic analysis mode, WASP calculates all analysis steps and presents the graphical results on a new web page. The manual mode however allows users to change parameters before each step. Therefore, WASP calculates the results step-by-step and presents the user only with the results of the current step, such that the user can decide which parameters to use for the next step. To simplify the usage, default parameters will be used if the user continues to the next step without changing the values. Similar to the automatic mode, WASP presents all generated visualizations on one page after all processing steps have been executed. A menu on the left side allows users to select each analysis step and to re-calculate the analysis with new parameters from this step on. Since WASP uses the newest version 3.1 of the Seurat R package internally for many analyses, its workflow is highly similar to the Seurat workflow. In addition to Seurat, WASP uses the R packages ‘SingleCellExperiment‘and ‘scater‘to provide users with additional analyses and information about their data set. For scientists without a certain level of expertise in R programming the usage of the three mentioned packages would be challenging, even more so if they wanted to combine the results of several packages to enhance their analysis results. The R Shiny GUI-based web interface of WASP offers users an easy-to-use tool to perform post-processing scRNA-seq analysis (Fig. 4).
The analysis steps cover a typical scRNA-seq workflow, beginning with quality control to remove low quality cells from the data, e.g. cells with a very low amount of UMIs or detected genes (Fig. 4a). To account for technical noise and focus on biological variability, the data is normalized, scaled, and genes with a high variation are used for dimensionality reduction of the data set via principal component analysis. During the manual workflow, a user has to decide how many of the calculated components will be used in the following analyses. This step is important, as selecting a value too low could negatively affect the results and might cause a loss of important data for subsequent clustering. However, in some cases a too large value could also lead to an increase of noise in the data. Several plots aid the user in this decision, primarily Seurat’s so-called elbow plot, which shows the standard deviation of each principal component. Additional to Seurat, a cutoff is calculated in WASP by fitting a line to the elbow plot curve connecting the points with the highest and lowest standard deviation, identifying the perpendicular line for each point to the connecting line and selecting the point farthest to the connecting line. This cutoff is detected to recommend how many dimensions should be included in the following analyses (Fig. 4b). This calculation is also used to select a cutoff during the automatic workflow. Based on the selected value, cells are clustered according to Seurat’s graph-based clustering algorithm. The clustered cells are shown utilizing a variety of sophisticated visualizations including t-SNE (Fig. 4c) and UMAP. As an additional feature, WASP combines the clustering information with data about the cell size. Finally, genes that are differentially expressed between clusters are identified. During the manual workflow, users are able to select custom p- and log2foldchange-values for this analysis. WASP includes a variety of visualizations such as t-SNE, UMAP, violin plots, ridge plots, dot plots and heatmaps, allowing users to dynamically check the expression levels of genes between clusters. Users simply need to type in the name of a gene of interest to calculate plots, showing its expression among the different clusters. Furthermore, WASP provides an interactive table containing all genes identified as differentially expressed in a cluster with corresponding p-values and log2foldchange-values, allowing users also to search genes by name. On the last page, users are provided with all generated visualizations including the option to download these along with the calculated normalization table, the list of marker genes for each cluster and a summary table containing the used parameters to support reproducible analyses.
Successful processing of ddSEQ and 10x genomics data sets
To assess the performance of WASP, we performed a test run with a ddSEQ-based data set of roughly 283 million reads separated into three samples (ddSEQ5, ddSEQ6, ddSEQ7) originating from Mus musculus cells. The raw data is available at the NCBI Short Read Archive (SRA) repository under the SRA accession number SRP149565. For the processing we provided 12 CPU cores of the type Intel Xeon E5–4650 running with 2.70 GHz each. The peak memory usage monitored was 28 GB RAM, and the total processing time was approximately 3.5 h for the previously mentioned steps of FastQC, mapping, feature extraction and UMI counting. Subsequently, results were processed using WASP’s pre-processing Shiny application to select a number of cells to be used for generation of the gene expression matrix. Using this method, WASP automatically selected 346, 399 and 176 valid barcodes for ddSEQ5, ddSEQ6 and ddSEQ7, respectively. We combined these cells to further evaluate the post-processing application.
Subsequently, the resulting gene expression matrix was analyzed with the post-processing module of WASP. This test run was performed with the Windows standalone version of the Shiny app on a laptop with 8 GB RAM and 2 CPU cores (Intel Core i5-4510H) running with 2.90 GHz each. The total processing time of the data for all 935 cells from initial import of the gene expression matrix to generating cluster results including visualizations such as t-SNE and UMAP and differentially expressed gene tables was less than 3 min. The results of this analysis are discussed in detail in Vazquez-Armendariz et al., EMBO J. 2020 [19].
We performed a second test to assess the performance of WASP for 10x Genomics data sets as well. In order to validate the recently implemented pre-processing of 10x barcoded single cell data we’ve used a publicly available data set of roughly 132 million reads combined from two sequencing runs originating from Ciona robusta larval brain cells. The raw data set is accessible under the SRA accession number SRX4938158. Using the same hardware as mentioned above in the ddSeq test, the pre-processing had a total run time of approximately 2 h with a peak memory usage of 5.3 GB RAM. Following the run, WASP’s Shiny application detected 2816 valid barcodes. In a second test, we’ve validated WASP’s performance on an externally generated expression matrix based on the 10x protocol. For this test, we used a gene expression matrix containing 2700 peripheral blood mononuclear cells. This data set is publicly available under: https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/pbmc3k. For this test, we used the same setup as mentioned above with the Windows standalone version. The total processing time was 6 min, which took longer than the ddSEQ data set as the 10x data set is about 2.8 times larger than the ddSEQ test data set. Following the analysis, we compared the results to an exemplary analysis of the same data set on the Seurat homepage. WASP was able to identify similarly composed clusters with the corresponding marker genes.