ENCoRE software for NGS processing
A typical CRISPR/Cas9 library screening workflow involves three basic steps: introduction of the pooled sgRNA library into cells stably expressing the Cas9 nuclease, followed by a phenotypic selection and finally sequencing of the lentiviral guide sequences from the genomic DNA of selected cells (Fig. 1).
The final output is generally PCR-amplified sgRNA sequence performed on isolated DNA with NGS technology specific adapters. As the guide sequences in the selected cells are presumed to contain mutations in the corresponding gene, sequencing of sgRNAs from the cells provides a simple proxy to identify candidate genes in the selected process. Most libraries contain several unique guide sequences per gene to increase the likelihood of mutation and statistical reliability. Following sequencing, FASTQ data files generally comprising several gigabytes of data are produced.
In order to retrieve valid sequences the data files must be manipulated and filtered prior to matching against a reference library. Several programs use a console-based input and are therefore cumbersome for inexperienced users. ENCoRE addresses this problem by serving up a slim and convenient graphical user interface (GUI), mimicking the look and feel of the current operating system (Fig. 2). Two separate modules are embedded: a FASTQ Filter module (FFM) for sequence manipulation and quality control, and a CRISPR Report module (CRM) that compares the filtered sequences to a reference file and displays the results.
Filtering and curating large FASTQ datasets represents the core preprocessing module of ENCoRE. The module provides extensive filters for sequences like crop, cut, replace, search and quality analysis. These filters can be arranged in a pipeline to sequentially alter every dataset in large FASTQ files. To facilitate the assembly of a filter workflow, ENCoRE provides the user with a live preview of the produced output sequences for every workflow step (Fig. 2). To ensure consistent filtering workstreams for multiple FASTQ files, workflow settings can be exported for use on other datasets. Possibly the most useful function is a generalized string search that allows for rapid identification and cropping of library anchor and priming sequences. Thus, primer and library tag sequences can be efficiently identified for removal while barcodes can be recognized to deconvolute multiplexed experiments. The final processing is done using a multithreaded worker-consumer pattern that scales to the local machine and fully exploits its processing power. This way it is possible to process 15 gigabyte data files within a few minutes on a typical desktop. The program runs fully localized as required for sensitive data (i.e., involving human subjects or intellectual property) without external databases. The output is saved to a filtered FASTQ file for further use, e.g., quality control or processing in the report module.
After appropriate filters are applied, the CRISPR Report Module is able to match the resulting sgRNA ‘tags’ to a corresponding reference file for a given sgRNA library. These files unfortunately lack a standardized format. However, the CRM has the power to adapt to any present reference file prepared in a (CSV) format including headers. In this way, the CRM is compatible with virtually all present sgRNA libraries but also with future CRISPR iterations using tailored sgRNA libraries or ones not based on Streptococcus pyogenes Cas9 [24,25,26]. To display numerical proportion of genes and the ratio of successful sgRNAs, a display chart can be generated using the guide information of a filtered FASTQ file. The CRM enables the user to show basic information with a pie chart representing all sequenced reads and their corresponding absolute number of isolated guide sequences (Fig. 2). Inputs derive from reference file column headers and allow for visualization by all criteria in the reference library (i.e., guide sequence, gene name (aggregated or unaggregated), chromosomal position). A slider allows for selective data zooming and automatically emphasizes enriched hits with relevant count information. The import and processing are performed in a separate thread, keeping the GUI active and enabling the user to interact with the chart even though data is still processed. An export function allows for saving the read counts linked with the corresponding guide in a CSV file for further analysis.
The relative change in sgRNA frequency (sgRNAexperimental/sgRNAcontrol) found for a given gene corresponds to its likelihood of being involved in the tested process [4]. Therefore, CRM combines the fold change for all guides corresponding to a given gene and compares those to an unselected control sample. To compare experimental frequencies (counts) with control counts specific for several guides, it is convenient to assume negative binomially distributed counts and to carry out a generalized regression, which estimates and tests the main treatment effects adjusted for the dummy variables representing the guides. This is essentially equivalent to a paired comparison, similar to the well-known paired t-test. However, a straightforward alternative and approximate solution to this is a normal z-test directly applied to the means of the log2[fold change] of aggregated guides per gene. Comparison of this method to analysis software MAGeCK [22] and edgeR [20] led to comparable overlaps of top ranked sgRNAs for published datasets, and particularly for data from this study (Additional file 3, Fig. 3b). To compensate for missing or underperforming guides, either due to incomplete NGS data or failure for the guide to generate a phenotypic change, a median imputation strategy was defined prior to the z-test. This means that missing counts are replaced by the median of the counts. The median imputation is less susceptible to outlier impact as compared to mean imputation. The CRM finally displays a simple graphical output in form of a volcano plot of −log2[p-value] where p-values are derived from a z-test versus the log2[fold change]. The threshold for a Bonferroni correction considering about 20,000 genes (i.e. aggregated guides) can be calculated by 0.05/20000 = 0.0000025. In order to highlight also guides that show clear significance a Bonferroni threshold was chosen by red dots in the volcano plot. This graph allows simple visualization of genes with increased relevance and is useful when relative sgRNA increases are small, particularly under imperfect selection conditions such as for TNFa (Fig. 3; below). This allows the user to quickly examine expected results and define candidates for further tests also with the individual aggregated fraction of total guides. Mouseover, zoom and capture are also possible for data presentation. Finally, after processing and importation, the identifier data with statistical output are stored in a CSV file for further analysis in other software. ENCoRE supports multiple group inputs, enabling studies with repeated measures to be analyzed and exported as with single experimental studies.
Extrinsic apoptosis pathway validation
In order to demonstrate the power of ENCoRE, we conducted a CRISPR screen using a library of ~88,000 sgRNAs [4] in mouse fibroblasts to identify members of the extrinsic apoptosis pathway induced by TNFa, a well-characterized extrinsic apoptosis inducing agent. In our hands, TNFa incompletely induces apoptosis in a cell population, thus substantial background sgRNAs would be expected to confound results. Indeed, we observed only approximately 60% cell death at the highest concentration tested (40 ng/mL) following treatment, shown in [Additional file 4]. Therefore, we sought to test if meaningful results could be inferred from a screen conducted under imperfect selection conditions.
Using this methodology, we sought to validate ENCoRE with the identification of canonical members of the extrinsic apoptosis pathway. Immortalized mouse fibroblasts expressing stable Cas9 [27] were infected with a lentiviral genome-wide CRISPR guide library directed against coding sequences with up to five unique guides per gene [4]. After TNFa (20 ng/mL) treatment was applied, mutant cell pools were harvested and genomic DNA was extracted. Genomic DNA from unselected cells served as a control population as well as pure plasmid preparation of the sgRNA library. Primers were designed to bind to the lentiviral vector backbone and amplify a small (194 bp) DNA fragment including the variable guide region to generate a library suitable for next generation sequencing. ENCoRE was then used to process the resulting library file and extract gene information as described below.
ENCoRE uses a classical method to represent significance versus the frequency of gene mutations (log2[p-value] versus –log2[fold change]; Fig. 3a). This metric allows for simple classification of productive sgRNA hits and produces a traditional volcano plot result by highlighting p-values that are significant (Padj < 0.05) after Bonferroni correction. ENCoRE successfully identified canonical members within the top genes, including Tnfrsf1a (0.20% of all sgRNA sequences, Padj < 0.0000001), Casp8 (0.91% guides, Padj < 0.0000001), Bid (0.17% guides, Padj = 0.00000006) and Tnfrsf1b (0.54% guides, Padj = 0.02277213) (Fig. 3a). Comparison to MAGeCK and edgeR showed substantial overlap among the top 20 discovered genes (12 and 6 genes, respectively) supporting the role of these genes in cell death signaling and ENCoRE methodology (Fig. 3b). The discovery of other apoptosis-promoting factors was not unexpected, for example the heterogeneous nuclear ribonucleoprotein F protein (Hnrnpf; Padj = 0.0000279) that regulates splicing of the B-cell lymphoma/leukemia-2 (Bcl-2)-related family member Bcl-x (RNA) [28]. However, the data also revealed a surprising result for two new genes previously uncharacterized in extrinsic apoptosis: the nonsense mediated mRNA decay factor, suppressor with morphological defects in genitalia 7, (Smg7; 0.33% guides, Padj = 0.00356165), and the carboxylesterase Ces2a (0.11% guides, Padj = 0.0041534), which should not to be confused with the ces-2 transcription factor from C. elegans also involved in cell death. We directed our attention to these genes in further experiments to determine their roles in regulating cell death signaling.
To directly test whether Smg7 and Ces2a contribute to extrinsic apoptosis, we generated independent indel null mutations each of the above genes using CRISPR, shown in [Additional file 5], in parallel with mutations in Tnfrsf1a and Hnrnpf as positive controls. Following characterization of clones, we tested individual cell lines containing mutations in both alleles for resistance to TNFa challenge (Fig. 4). Mutations in all cell lines showed significant differences (P < 0.001, two-way, repeated measures ANOVA) from the parental cell line upon exposure to a range of TNFa concentrations with Tnfrsf1a −/− cells showing even an increase in cell number, consistent with a separate role for TNFa in survival and proliferation when apoptosis is not activated [29]. Smg7 −/− and Ces2a −/− cells also displayed robust protection compared to other cell lines at all concentrations tested. The control pan-caspase inhibitor z-VAD-FMK (zVAD) showed significant protection against TNFa challenge, demonstrating that the four discovered genes are bona fide factors that promote caspase-dependent extrinsic apoptosis.
Next, to test whether the Smg7 and Ces2a specifically function in regulation of extrinsic apoptosis we induced cell death by other modalities. First, we tested for resistance to the ferroptosis inducer, Erastin [30]. Smg7 −/− and Tnfrsf1a −/− cells showed significant protection against Erastin (P < 0.001) at concentrations up to 200 nM whereas Ces2a −/− and Hnrnpf −/− were more sensitive than control cells at this concentration. Minor protection against doxorubicin induced cell death was also observed in Smg7 −/− cells at 20 μM and Smg7 −/− and Ces2a −/− cells were both found to be partially resistant to staurosporine at low doses. However, Smg7 −/− cells were more sensitive than control cells to vinblastine and Taxol, and both cell lines were sensitive to low doses of Deoxycholic acid (DCA). Hnrnpf −/− cells showed significant resistance at several concentrations to Taxol (200, 800 nM; P < 0.001) likely resulting from protein interaction with HNRNPA2, a prognosticator for Taxol resistance in ovarian cells [31, 32]. Cumulatively, these results show that mutation of either Smg7 or Ces2a robustly protects against extrinsic apoptosis induced by TNFa challenge. In addition, mutation of Smg7 can also partially protect against cell death induced by ferroptosis and doxorubicin, whereas both Smg7 and Ces2a mutant cell lines showed partial resistance against staurosporine at low concentrations but not against other drugs used in oncology.
Next, we tested if mutations in Smg7 and Ces2a activate the protective NF-κB transcriptional Complex I and thereby increase survival. Basal levels as well as induced levels of nuclear p65 compared to the cytosolic pool following TNFa addition did not significantly differ from the parental line except for Tnfrsf1a −/− cells which showed as expected no response to TNFa (Fig. 5a). The transcriptional response as indicated by an NF-κB reporter assay was also unchanged (Fig. 5b). We then sought to determine if mutant cells could be re-sensitized to TNFa. Cycloheximide (CHX) is thought to sensitize cells to TNFa by translational inhibition of the short-lived c-FLIP (CFLAR) protein and subsequent activation of caspases, in particular Caspase-8. Following CHX treatment, all cell lines except Tnfrsf1a −/− were sensitive to TNFa with Smg7 −/− showing significantly less sensitivity than the parental cell line (Fig. 5c). Thus susceptibility to apoptotic challenge can be restored indicating that that the TNFa cell death pathway is not irreversibly blocked in Smg7 −/− and Ces2a −/− cells and that increased caspase activity can overcome this inhibition.
We also investigated the effect of TNFa treatment on core components of apoptosis in mutant cells. p53 (TRP53) is activated following DNA damage and SMG7 was recently reported to stabilize p53 following doxorubicin treatment [33]. We evaluated p53 levels in cells treated with TNFa and found that compared to the parental cell line Smg7 −/− cells showed a dramatic increase in p53 levels (Fig. 5d). Ces2a −/− cells also showed an increase in basal p53 levels that did not increase following TNFa treatment, while Tnfrsf1a −/− cells showed a decrease in basal p53 levels. Since elevated p53 is associated with cell death and Smg7 is reportedly associated with p53 in the cytoplasm we investigated if p53 localization may be affected in mutant cells. Following TNFa induction, the fraction of p53 in the nucleus compared to the cytoplasm increases significantly in control cells, whereas this ratio was constant in Smg7 −/− cells (Fig. 5e, f). Together these results show that p53 levels are increased in Ces2a −/− and Smg7 −/− cells and in the latter that p53 is decreased in the nucleus compared to controls.
Further applications
ENCoRE was also tested for processing publicly available data. Bassett and colleagues [21] performed a CRISPR viability screen in Drosophila S2R+ cells and reported a lack of enrichment of essential genes in their dataset. Upon re-examination with ENCoRE, at least two highly scoring genes emerged that implicate cell cycle control (Z600, FANCI, see [Additional file 6] for volcano plots) in the enriched fraction, suggesting cell cycle dysregulation in the resulting mutant cells and accumulation of the associated sgRNAs. Other enriched or depleted guides have minimal overlap with the reported genes and could be retested on a candidate basis for viability impact. The differences in the analysis results of ENCoRE and DESeq2 [34] used by Bassett et al. may be due to different methods and filtering options applied to the data. DESeq2 was used to identify statistically significant changes in sgRNA counts for genes ≥3 sgRNAs based on a negative binomial distribution of the mean sgRNA counts. In contrast, ENCoRE performs a median imputation strategy with a subsequent z-test. The resulting ordered p-values correspond to a conservative ranking from a paired t-test. Especially the usage of the mean of sgRNA counts per gene can be problematic if outliers are present that cause artificial low or high mean values.
In addition, we tested the open-source software MAGeCK [22] and compared its negatively and positively selected genes from a CRISPR/Cas9 screening on mouse embryonic stem cells versus plasmid conditions with the analysis results of ENCoRE in [Additional file 6]. After median-normalization and estimation of the variance of the read counts MAGeCK tests whether sgRNAs differ significantly between treatment and control assuming a negative binomial distribution. Then MAGeCK ranks sgRNAs based on the calculated p-values and uses a modified robust ranking aggregation (RRA) algorithm afterwards. Although different methods are employed, ENCoRE identifies the same significantly negatively selected ribosomal genes Rps5, GTF2B, KIF18B and Rpl19. For positively selected genes Zfp945 was identified by both programs, however ENCoRE and MAGeCK differed on their identification of Trp53 which was found to be significant only by the latter in [Additional file 6, with step 5 of the MAGeCK documentation webpage]. We attribute the difference in these selected genes to the more stringent filtering performed by the ENCoRE FFM, which used a specific search string to identify guide sequences.
In summary, comparison of analyses of publicly available datasets and established software tools demonstrates that ENCoRE is able to identify essential genes from genome-scale CRISPR/Cas9 knockout screens by a relatively simple statistical model. In contrast to command line based programs like MAGeCK or web tools based on Z-scores such as ATARIS [35], ENCoRE serves up a simple and intuitive graphical user interface designed concisely for CRISPR screens. In addition, the processing of primary sequencing data is included in the same package and is very accurate due to the search function to identify search strings (e.g., barcodes) leading to a compact process throughput.