OffScan: a universal and fast CRISPR off-target sites detection tool

Background The Type II clustered regularly interspaced short palindromic repeats (CRISPR) and CRISPR-associated proteins (Cas) is a powerful genome editing technology, which is more and more popular in gene function analysis. In CRISPR/Cas, RNA guides Cas nuclease to the target site to perform DNA modification. Results The performance of CRISPR/Cas depends on well-designed single guide RNA (sgRNA). However, the off-target effect of sgRNA leads to undesired mutations in genome and limits the use of CRISPR/Cas. Here, we present OffScan, a universal and fast CRISPR off-target detection tool. Conclusions OffScan is not limited by the number of mismatches and allows custom protospacer-adjacent motif (PAM), which is the target site by Cas protein. Besides, OffScan adopts the FM-index, which efficiently improves query speed and reduce memory consumption.

tools cannot consider the PAM sequence when searching mismatched sites. The off-target sites returned by alignment tools contain a number of sequences that are not adjacent to PAM. Accordingly, we have to filter the returned off-target sites and remove those false off-targets [15].
CasOT [16] is a Perl script that searches for potential off-target sites in any given genome, with user-specified PAM. It allows mismatches in the seed and non-seed regions and provides both a single-gRNA searching mode and a paired-gRNA searching mode. CasOT can identify potential off-target sites in an acceptable period of time.
Cas-OFFinder [17] can search for potential off-target sites in a given genome or set of user-defined sequences. It is not limited by the number of mismatches and allows various PAM sequences. The tool is partly written in OpenCL, enabling operations using an accelerator such as GPU, which can significantly speed up the searching process.
Perez et al. constructed a trie to store all candidate sgRNA (k-mer) in K, and thus to search off-target sites by traversing the trie rather than the entire genome [18]. It is necessary to traverse the trie once for each k-mer to detect its off-target sites within several mismatches; consequently, this involves a large amount of redundant computations. In addition, a trie is a space-consuming data structure, and the number of k-mers adjacent to PAM in the genome is usually huge [15]. Detecting off-target sites for large genomes will thus consume hundreds of GB of memory.
Accordingly, in this paper, we present OffScan, a universal and fast CRISPR/Cas off-target site detection tool. OffScan is not limited by a number of mismatches and/or PAM. It adopts FM-index [19] to assist in off-target searching, which efficiently reduces memory footprint and query time and enables the design of highly specific sgRNA.

CRISPR target sites scan
The PAM scan model is used to generate candidate sgRNAs. We tested the PAM scan model of OffScan on four genomes: hg38 (human), mm10 (mouse), danRer7 (zebrafish), and ce10 (C. elegans). Since GuideScan [18] does not include a PAM scan model, we used the results of a popular sgRNA design tool, CRISPR-DO [15], for comparison. The total number of k-mers in the millions of target sequences identified by 5′-NGG-3′ is shown in Table 1. OffScan can find a comparable number of candidate target sites compared with CRISPR-DO.

Off-target detection module
We test the performance of OffScan in hg38 with 1, 3, and 10 mismatches and mm10 with 3 mismatches. The sgRNA length is 20 bp. We arbitrarily chose 1000 SpCas9 targets and run OffScan to detect offtarget sites. The CPU version is E7-8890 v3. As shown in Table 2, the number of mismatches and genome size will affect the time taken to detect offtarget sites.
We compared the performance of OffScan with GuideScan under different number of mismatches on four genomes used in the previous test. As shown in Fig. 2, when the number of mismatches is small, the performance is similar. As the number of mismatches increases, the performance gap between the two becomes larger and larger. In the fuzzy matching part, a bounded traversal strategy is adopted.  The lower bound of the mismatch number of the query string Q and the original string X are estimated before the traversal, so that the program can return the traversal of the branch in advance, effectively reducing the search space in suffix tree. The depth of the downward extension increases the efficiency of the comparison. In addition, since OffScan is designed for general sgRNA design, the program behaves similarly on different species.

Conclusions
OffScan enables off-target site detection in any genome, without limitation of the PAM sequence or number of mismatches. Moreover, OffScan utilizes the FM-index to support fast, memory-efficient, and highly specific sgRNA design. The sgRNA design and off-target site detection can be made more convenient using OffScan. In future, we will incorporate more features into OffScan and provide useful tools for the research community.

CRISPR target sites scan
To facilitate sgRNA design, we implemented a PAM scan module in OffScan. The Cas proteins require a PAM sequence to bind to, so the PAM scan is implemented to scan the entire genome in order to find the sequences with this PAM (e.g. 5′-NGG-3′ for the Cas9 enzyme). The module supports the use of a custom PAM sequence, which can be used for general-purpose sgRNA design. In the majority of situations, we simply need to scan the entire genome once to find the candidate sgRNA with PAM; accordingly, we did not use the FM-index in this module.

Off-target sites detection
The off-target site detection problem can be divided into two problems: namely, off-target sites with and without mismatches. We adopt the backward search algorithm [20] to find sites without mismatch, which is a pattern matching problem. Let ∑ be an alphabet. A sentinel symbol $ is not present in it and is lexicographically smaller than all the symbols in ∑. A string X = a 0 a 1 ...a n − 1 is terminated with symbol $ (i.e. a n − 1 = $) and this symbol only appears at the end. The length of string X is |X| = n. X[i] = a i is the i-th symbol of X and X[i,j] is the substring a i …a j .
The suffix array data structure is a succinct representation of the lexographic ordering of all the suffixes of a string [21]. The suffix array of string X, denoted as SA(X), is actually the permutation of the indices {1, 2, …, n-1} of X that iif X[j,n-1] is the i-th lexographically smallest suffix of X.
For example, if X = banana$, the SA(X) = [6,5,3,1,0,4,2], as shown in Fig. 3a. Since the suffixes in SA are sorted in lexographic order, the start positions of all the instances of a pattern Q in X should be an interval of SA, that is called a suffix array interval, denoted as a pair of integers [sp,ep]. Thus, the pattern matching of Q in X is equivalent to find the suffix array interval of Q in X.
Ferragina and Manzini developed a data structure called FM-index [19], which can determine the suffix array interval [sp,ep] of pattern Q in O(|Q|) time and requires much less memory than a suffix array. FM-index is a careful combination of a compression algorithm Burrows-Wheeler transform (BWT) [22] and suffix array. The BWT of X, denoted as B(X), is a permutation of the string that That is to say, B[i] is the symbol preceding the first symbol of the suffix starting at SA[i]. For example, if X = banana$, the B(X) = annb$aa, as presented in Fig. 3a.
To enable pattern matching with FM-index, Ferragina and Manzini added two more data structures: C array and Occ matrix. As illustrated in Fig. 3b, for a symbol a, C[a] is the number of occurrences of symbols that are lexographically smaller than a. Occ[a,i] records the number of occurrences of the character a in B[0,i]. For a query Q whose suffix array interval in X is [sp,ep], the interval of string aQ can be calculated with C and Occ arrays as follows: From the above eq. (3), we notice that the pattern matching should start from the last character of Q, that is backward search algorithm. Algorithm 1 presents the details of searching procedure. Firstly, the suffix array interval of the last symbol of Q is calculated from C array. Then the interval is calculated iteratively based on Eq. (3), as shown in Fig. 3c. For the returned values of backward search algorithm, if sp < ep, Q has more than one instances in X; if sp = ep, Q has just one instance in X; if sp > ep, Q is not included in X. The backward search algorithm updates the suffix array interval at most |Q| times, so the time complexity is O(|Q|). The time complexity of backward search is linear of query string length |Q|, equal to the exact matching in a trie; however, the FM-index is more space-efficient than a trie. Besides, the backward search algorithm actually realizes a top-down traversal of a trie.
Since sgRNA allows several mismatches when binding to DNA, an off-target detection algorithm supporting only exact match is not sufficient. OffScan also implements search with mismatches. A naïve method is to traverse the trie to search sgRNA with mismatches, but this process is rather time-consuming. To avoid unnecessary comparisons, we introduce a bounded-search for mismatches in OffScan. For a query of Q in X, we define LB[i] as the lower bound of the number of differences between Q[i, |Q|-1] and X. If the current allowing number of mismatches is smaller than LB[i], the search process will stop and backtrack to other branches. Algorithm 2 presents the procedure of calculating LB[i].

FM-index construction
The off-target site detection module used in OffScan supports any number of mismatches, including no mismatches for exact matching. In terms of the general procedure, we have to detect off-target sites for each kmer. To avoid the redundant computations involved in scanning of the entire genome, we detect off-targets only within the candidate sgRNA set K.
To improve the query efficiency and reduce space requirements, we adopt the FM-index to aid in off-target site detection. While this data structure is originally designed for single-string queries, we want to extend it to multiple sequence queries. To solve this problem, we add a "$" character to the end of each k-mer and concatenate all k-mers as a long string. We then construct an FM-index for the concatenation.