Our algorithm ECC consists of three stages: construction of the distance matrix for the chromosome sequences, clustering of the chromosome sequences, and compression of the chromosome sequences. A schematic diagram of the method is shown in Fig. 1.
Construction of distance matrix for a set of chromosome sequences
Let \(\mathcal {S} = \{S_{1}, S_{2}, \cdots, S_{n}\}\) be a collection of genomic sequences (i.e., a genome database or a chromosome database). We use a MinHash toolkit called Mash [22] to compute the pairwise sketch distances of the sequences and form a distance matrix. With Mash, a sequence Si is first transformed into the set of its constituent k-mers, and all the k-mers are then mapped to distinct 32-bit or 64-bit hash values by a hash function. Denote the set of hash values of the constituent k-mers of Si as H(Si), and denote the set of q minimal hash values as Sk(H(Si),q), which is a size-reduced representative of H(Si) called a sketch of H(Si). For two hash-value sets A and B, the Jaccard index of A and B is defined as \(J(A,B) = \frac {|A \cap B|}{|A \cup B|}\), and it can be estimated by \(J^{\prime }(A,B) = \frac {|Sk(A \cup B,q) \cap Sk(A,q) \cap Sk(B,q)|}{|Sk(A\cup B,q)|}\). The sketch distance dsk between two sequences Si and Sj is defined as
$$ d_{sk}(S_{i}, S_{j}) = -\frac{1}{k}\ln{\frac{2\cdot J^{\prime}(H(S_{i}),H(S_{j})) }{1+J^{\prime}(H(S_{i}),H(S_{j}))}} $$
(1)
where the Jaccard index between Si and Sj is approximately computed using the sketches of H(Si) and H(Sj). We construct a distance matrix M for sequence set \(\mathcal {S}\) with size n. M is a square matrix with dimension n×n that contains all the pairwise sketch distances between these genomic sequences. The elements of M are defined as:
$$ M_{ij} = \left\{\begin{array}{ll} 0 & i=j \\ d_{sk}(S_{i}, S_{j}) & i \neq j \end{array}\right. \qquad i,j \in [1,n] $$
(2)
It is clear that M is a symmetric matrix (i.e., Mij=Mji). Computing the sketch distance between two long sequences is also much more efficient than directly comparing their full k-mer feature vectors, and this efficiency gain becomes especially significant in the construction of the whole distance matrix M.
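To make the construction concrete, the sketch distance of Eq. (1) can be illustrated with a minimal Python sketch. The helper names here are hypothetical, and a generic 64-bit hash stands in for Mash's internal hashing; in practice the Mash toolkit itself is used:

```python
import hashlib
from math import log

def kmers(seq, k):
    """The set of constituent k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def hash_set(seq, k):
    """H(S): map each k-mer to a distinct 64-bit hash value
    (a stand-in for Mash's hashing)."""
    return {int.from_bytes(hashlib.blake2b(m.encode(), digest_size=8).digest(), "big")
            for m in kmers(seq, k)}

def sketch(hashes, q):
    """Sk(H(S), q): the q minimal hash values."""
    return set(sorted(hashes)[:q])

def jaccard_est(A, B, q):
    """J'(A, B): the sketch-based Jaccard estimate from the text."""
    s_union = sketch(A | B, q)
    shared = s_union & sketch(A, q) & sketch(B, q)
    return len(shared) / len(s_union)

def sketch_distance(si, sj, k=21, q=1000):
    """d_sk(S_i, S_j) per Eq. (1); infinite when the sketches share nothing."""
    j = jaccard_est(hash_set(si, k), hash_set(sj, k), q)
    if j == 0:
        return float("inf")
    return -(1.0 / k) * log(2 * j / (1 + j))
```

Filling the matrix M then amounts to evaluating `sketch_distance` for every pair, exploiting symmetry to halve the work.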
Clustering of chromosomes from the distance matrix
Clustering is the process of grouping a set of samples into a number of subgroups such that similar samples are placed in the same subgroup. Here, our clustering aims to ensure a high similarity within each reference-target pair so as to achieve outstanding compression performance. An important step in clustering is to determine the number of clusters in the data. We take a subtractive clustering approach [31, 32] to decide the number of clusters from the distance matrix M, and then use the K-medoids clustering method [33] to group the n genomic sequences into K clusters.
Subtractive clustering to determine the number of clusters K
Most clustering algorithms require the number of clusters as a parameter. However, the cluster number for a set of genomic sequences is normally unknown. We utilize a modified subtractive clustering algorithm to specify the cluster number.
Subtractive clustering is an extension of the Mountain method [34]. It estimates cluster centroids based on the density of points in the data space. We apply the exponential function for the mountain value calculation. Given a sequence set \(\mathcal {S}\), the corresponding sketch distance matrix M with dimension n×n and a threshold percentage ε∈(0,1), the process to determine the number of clusters is:
1. Create the empty cluster centroid set \(\mathcal {O}\). Compute the mountain value of each sample Si: \(Mt(S_{i}) = \sum _{j=1}^{n} e^{-M_{ij}}\)
2. Let \(o= \text {argmax}_{i=1}^{n} Mt(S_{i})\), and add So to \(\mathcal {O}\).
3. Update the mountain value of each remaining sequence by: \(Mt(S_{i}) = Mt(S_{i}) - e^{-M_{io}}\)
4. Repeat steps 2 and 3 until Mt(Si)<εMtmax or \(|\mathcal {O}|\geq \sqrt {n}\).
5. Return the centroid set \(\mathcal {O}\) and the cluster number K\(=|\mathcal {O}|\).
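The steps above can be sketched in Python as follows. This is a hypothetical minimal version that works directly on a precomputed distance matrix; Mtmax is taken here to be the mountain value of the first selected centroid, which is an assumption on our part:

```python
import math

def subtractive_cluster_count(M, eps=0.5):
    """Estimate the cluster number K and initial centroid indices from an
    n x n sketch-distance matrix M (list of lists), following steps 1-5."""
    n = len(M)
    # Step 1: mountain value Mt(S_i) = sum_j exp(-M_ij)
    mt = [sum(math.exp(-M[i][j]) for j in range(n)) for i in range(n)]
    centroids, mt_max = [], None
    while len(centroids) < math.sqrt(n):           # |O| >= sqrt(n) cap (step 4)
        # Step 2: the remaining sample with the largest mountain value
        o = max((i for i in range(n) if i not in centroids), key=lambda i: mt[i])
        if mt_max is None:
            mt_max = mt[o]                         # assumed definition of Mt_max
        if mt[o] < eps * mt_max:                   # threshold test (step 4)
            break
        centroids.append(o)
        # Step 3: suppress the mountain value of every sequence near the
        # chosen centroid
        for i in range(n):
            mt[i] -= math.exp(-M[i][o])
    # Step 5: return the cluster number K and the centroid indices
    return len(centroids), centroids
```

On a matrix with two tight groups, the procedure picks one peak per group and stops once the remaining mountain values fall under the ε threshold.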
K-medoids clustering of the collection of n genomic sequences
K-medoids is a partition-based cluster analysis method. It iteratively finds the K centroids and assigns every sample to its nearest centroid [33], similarly to K-means [35] but more robust to outliers. K-medoids divides the data set \(\mathcal {S}\) into K non-overlapping subgroups \(\mathcal {C}\) that together contain every element of \(\mathcal {S}\), and selects a centroid sequence Oi from each subgroup:
Definition 1
For a set of sequences \(\mathcal {S} = \{S_{1},\cdots,S_{n}\}\), the corresponding cluster set \(\mathcal {C}=\{C_{1},C_{2},\cdots,C_{K}\}\) and centroid sequence set \(\mathcal {O}=\{O_{1},O_{2},\cdots,O_{K}\}\) satisfy the following requirements: \(C_{i} \subseteq \mathcal {S}\), \(C_{1} \cup C_{2} \cup \cdots \cup C_{K} = \mathcal {S}\), \(C_{i} \cap C_{j} = \emptyset \) for i≠j, and Oi∈Ci.
The cluster set \(\mathcal {C}\) is determined via minimizing the cost function λ as follows:
$$\lambda(\mathcal{S}) = \sum_{i=1}^{K}\sum_{S_{a} \in C_{i}}d_{sk}({S}_{a},{O}_{i}) $$
Though K-medoids is efficient, it has a notable drawback: the clustering result depends strongly on the choice of the initial centroids. To improve the stability and quality of the clustering result, instead of arbitrarily selecting the initial centroids as in standard K-medoids, we use the centroid set \(\mathcal {O}\) computed by subtractive clustering in the previous section.
Given a sequence set \(\mathcal {S}\), sketch distance matrix M, cluster number K and centroid sequence set \(\mathcal {O}\), the K-medoids proceeds by the following steps:
1. Set \(\mathcal {O}\) as the initial centroid sequence set.
2. Assign each Si to the centroid Oj with the minimum sketch distance; this also assigns Si to cluster Cj.
3. Recalculate the new centroid of each cluster based on its elements:
$$O_{j} = \underset{S_{a} \in C_{j}}{\text{argmin}} \sum_{S_{b} \in C_{j}}d_{sk}(S_{a},S_{b}) $$
4. Repeat steps 2 and 3 until \(\mathcal {C}\) and \(\mathcal {O}\) no longer change or a pre-set number of iterations is reached.
5. Return the cluster set \(\mathcal {C}\) and the cluster centroid set \(\mathcal {O}\).
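The iteration can be sketched as follows; this is a hypothetical minimal Python version that operates on the distance matrix and is seeded with the centroid indices produced by subtractive clustering:

```python
def k_medoids(M, init_centroids, max_iter=100):
    """Partition n samples into K clusters given an n x n distance matrix M
    and initial centroid indices (step 1)."""
    n = len(M)
    centroids = list(init_centroids)
    for _ in range(max_iter):
        # Step 2: assign every sample to the cluster of its nearest centroid
        clusters = {o: [] for o in centroids}
        for i in range(n):
            clusters[min(centroids, key=lambda o: M[i][o])].append(i)
        # Step 3: recompute each centroid as the member minimizing the
        # summed distance to the other members of its cluster
        new_centroids = [min(members, key=lambda a: sum(M[a][b] for b in members))
                         for members in clusters.values()]
        # Step 4: stop when the centroids (and hence clusters) are stable
        if set(new_centroids) == set(centroids):
            break
        centroids = new_centroids
    # Step 5: return the clusters and their centroid indices
    return list(clusters.values()), centroids
```

Starting from off-center seeds, one reassignment pass is typically enough on well-separated data for the medoids to move to the true cluster centers.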
Compression
The chromosome sequence set \(\mathcal {S}\) is compressed based on the cluster set \(\mathcal {C}\) and the centroid set \(\mathcal {O}\) computed by K-medoids. First, Oi is used as the reference sequence for the other sequences in cluster Ci. Then a final reference R is selected from the centroid set as the reference for the other centroid sequences:
$$R = \underset{O_{i} \in \mathcal{O}}{\text{argmin}} \sum_{O_{j} \in \mathcal{O}}d_{sk}(O_{i},O_{j}) $$
In detail, all the sequences in cluster Ci are compressed using Oi as the reference sequence, except Oi itself. Then all the centroid sequences except R are compressed using R as the reference sequence. The final reference R can be compressed by the block-sorting compression (bsc) algorithm (http://libbsc.com/) or by other reference-free compression algorithms.
All non-centroid sequences are compressed with their centroid sequence as the reference, and the centroid sequences (except R) are compressed with R as the reference; only the final reference sequence R is not compressed against any reference. It is clear that the same number of sequences is compressed in ECC as in the straightforward approach.
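This two-level reference assignment can be sketched as a hypothetical helper that returns, for each sequence index, which sequence it is compressed against; the reference-based compressor itself is abstracted away:

```python
def reference_plan(clusters, centroids, M):
    """Given clusters (lists of sequence indices), their centroid indices and
    the sketch-distance matrix M, return the final reference R and the list
    of (target, reference) compression pairs."""
    # Final reference R: the centroid with minimum total distance to the others
    R = min(centroids, key=lambda oi: sum(M[oi][oj] for oj in centroids))
    pairs = []
    # Level 1: every non-centroid member is compressed against its centroid
    for members, o in zip(clusters, centroids):
        pairs.extend((s, o) for s in members if s != o)
    # Level 2: every other centroid is compressed against R;
    # R itself is left for reference-free compression (e.g., bsc)
    pairs.extend((o, R) for o in centroids if o != R)
    return R, pairs
```

The plan always contains exactly n−1 pairs, matching the observation that ECC reference-compresses the same number of sequences as the straightforward approach.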
All reference-based compression algorithms can take this clustering approach to compress a set of genomic sequences. The pseudo-code of our compression method is presented in Algorithm 1.
Decompression
The decompression process is the reverse of the compression process. All the sequences except R require a reference for decompression. First, R is decompressed; then the centroid sequence of each cluster is decompressed using R; finally, all the remaining sequences in each cluster are decompressed using the centroid sequence of their cluster. As the process is invertible, the compression scheme is lossless as long as the underlying reference-based compression algorithm is lossless.
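The ordering constraint (a sequence can only be decompressed once its reference is available) can be illustrated with a hypothetical sketch that recovers a valid decompression order from the compression pairs:

```python
def decompression_order(R, pairs):
    """Return an order in which the sequences can be decompressed: the final
    reference R first, then any sequence whose reference is already available.
    `pairs` is a list of (target, reference) index pairs."""
    available = {R}
    order = [R]
    remaining = list(pairs)
    while remaining:
        ready = [(t, r) for (t, r) in remaining if r in available]
        if not ready:                      # would indicate a broken plan
            raise ValueError("no decompressible sequence; reference missing")
        for t, _ in ready:
            available.add(t)
            order.append(t)
        remaining = [(t, r) for (t, r) in remaining if t not in available]
    return order
```

With the two-level scheme, this always terminates in two rounds after R: centroids first, then cluster members.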
Data
To assess the performance of our proposed method ECC, we compare the compression ratio achieved by ECC with that of the reference-fixed compression approach on multiple genome databases.
These include: a set of 60 human genome sequences (denoted by dataset-60) from National Center for Biotechnology Information (NCBI) with a file size of 171 GB, a set of 1152 human genome sequences (dataset-1152) from the 1000 Genomes Project [2] and NCBI having a file size of 3128 GB, and a set of 2818 rice genomes (dataset-2818) from the 3000-rice project [36] having a file size of 1012 GB.