ML-DSP: Machine Learning with Digital Signal Processing for ultrafast, accurate, and scalable genome classification at all taxonomic levels

Background Although software tools abound for the comparison, analysis, identification, and classification of genomic sequences, taxonomic classification remains challenging due to the magnitude of the datasets and the intrinsic problems associated with classification. The need exists for an approach and software tool that addresses the limitations of existing alignment-based methods, as well as the challenges of recently proposed alignment-free methods.
Results We propose a novel combination of supervised Machine Learning with Digital Signal Processing, resulting in ML-DSP: an alignment-free software tool for ultrafast, accurate, and scalable genome classification at all taxonomic levels. We test ML-DSP by classifying 7396 full mitochondrial genomes at various taxonomic levels, from kingdom to genus, with an average classification accuracy of >97%. A quantitative comparison with state-of-the-art classification software tools is performed on two small benchmark datasets and one large dataset of 4322 vertebrate mtDNA genomes. Our results show that ML-DSP overwhelmingly outperforms the alignment-based software MEGA7 (alignment with MUSCLE or CLUSTALW) in terms of processing time, while having comparable classification accuracies for small datasets and superior accuracies for the large dataset. Compared with the alignment-free software FFP (Feature Frequency Profile), ML-DSP has significantly better classification accuracy, and is overall faster. We also provide preliminary experiments indicating the potential of ML-DSP to be used for other datasets, by classifying 4271 complete dengue virus genomes into subtypes with 100% accuracy, and 4710 bacterial genomes into phyla with 95.5% accuracy. Lastly, our analysis shows that the “Purine/Pyrimidine”, “Just-A” and “Real” numerical representations of DNA sequences outperform ten other such numerical representations used in the Digital Signal Processing literature for DNA classification purposes.
Conclusions Due to its superior classification accuracy, speed, and scalability to large datasets, ML-DSP is highly relevant in the classification of newly discovered organisms, in distinguishing genomic signatures and identifying their mechanistic determinants, and in evaluating genome integrity.


Background
Of the estimated 8.7 million (±1.3 million) species existing on Earth [1], only around 1.5 million distinct eukaryotes have been catalogued and classified so far [2], leaving 86% of existing species on Earth and 91% of marine species still unclassified. To address the grand challenge of all species identification and classification, a multitude of techniques have been proposed for genomic sequence analysis and comparison. These methods can be broadly classified into alignment-based and alignment-free. Alignment-based methods and software tools are numerous, and include, e.g., MEGA7 [3] with sequence alignment using MUSCLE [4], or CLUSTALW [5,6]. Though alignment-based methods have been used with significant success for genome classification, they have limitations [7] such as the heavy time/memory computational cost for multiple alignment in multigenome scale sequence data, the need for continuous homologous sequences, and the dependence on a priori assumptions on, e.g., the gap penalty and threshold values for statistical parameters [8]. In addition, with next-generation sequencing (NGS) playing an increasingly important role, it may not always be possible to align many short reads coming from different parts of genomes [9]. To address situations where alignment-based methods fail or are insufficient, alignment-free methods have been proposed [10], including approaches based on Chaos Game Representation of DNA sequences [11][12][13], random walk [14], graph theory [15], iterated maps [16], information theory [17], category-position-frequency [18], spaced-words frequencies [19], Markov-model [20], thermal melting profiles [21], word analysis [22], among others. Software implementations of alignment-free methods also exist, among them COMET [23], CASTOR [24], SCUEAL [25], REGA [26], KAMERIS [27], and FFP (Feature Frequency Profile) [28]. 
While alignment-free methods address some of the issues faced by alignment-based methods, [7] identified the following challenges they face: (i) Lack of software implementation: Most of the existing alignment-free methods are still exploring technical foundations and lack software implementation, which is necessary for methods to be compared on common datasets. (ii) Use of simulated sequences or very small real-world datasets: The majority of the existing alignment-free methods are tested using simulated sequences or very small real-world datasets. This makes it hard for experts to pick one tool over the others. (iii) Memory overhead: Scalability to multigenome data can cause memory overhead in word-based methods, especially when long k-mers are used.
To overcome these challenges, we propose ML-DSP, a novel combination of supervised Machine Learning with Digital Signal Processing of the input DNA sequences, as a general-purpose alignment-free method and software tool for genomic DNA sequence classification at all taxonomic levels.
The main contribution of ML-DSP is the feature vector that we propose to be used by the supervised learning algorithms. Given a genomic DNA sequence, its feature vector consists of the pairwise Pearson Correlation Coefficient (PCC) between (a) the magnitude spectrum of the Discrete Fourier Transform (DFT) of the digital signal obtained from the given sequence by some suitable numerical encoding of the letters A, C, G, T into numbers, and (b) the magnitude spectra of the DFT of all the other genomic sequences in the training set. The use of this new feature vector, which has not previously been used in conjunction with machine learning algorithms, allows ML-DSP to significantly outperform existing methods in terms of speed, while achieving an average classification accuracy of > 97%. This substantial performance improvement allows ML-DSP to scale up and successfully classify much larger datasets than existing studies. Indeed, in contrast with previous benchmark datasets, each comprising fewer than fifty sequences, this study accurately classifies thousands of genomes from a variety of species: eukaryotic (7396 complete mitochondrial genomes), viral (4271 genomes), and bacterial (4710 genomes). In addition, this study provides the first comprehensive analysis and comparison of all thirteen one-dimensional numerical representations of DNA sequences used in the Genomic Signal Processing (GSP: digital signal processing applied to genomes) literature for classification purposes. We conclude that the "Purine/Pyrimidine (PP)", "Just-A", and "Real" numerical representations are the top three performers in terms of classification accuracy of ML-DSP for our main dataset. This is surprising given that these three numerical representations do not appear to contain sufficient biological information for the accuracy attained.
For example, the numerical representation "Just-A" (encoding A as "1", and G, C, T as "0") retains the incidence and spacing for A, but not individually for the other three nucleotides.

Numerical representations of DNA sequences
Digital Signal Processing (DSP) can be employed in the context of comparative genomics because genomic sequences can be numerically represented as discrete numerical sequences and hence treated as digital signals. Several numerical representations of DNA sequences that use numbers assigned to individual nucleotides have been proposed in the literature [29], e.g., based on a fixed mapping of each nucleotide to a number, without biological significance; using mappings of nucleotides to numerical values deduced from their physico-chemical properties; or using numerical values deduced from the doublets or codons that the individual nucleotide was part of [29,30]. In [31,32] three physico-chemical based representations of DNA sequences (atomic, molecular mass, and Electron-Ion Interaction Potential, EIIP) were considered for genomic analysis, and the authors concluded that the choice of numerical representation did not have any effect on the results. A recent study comparing different numerical representation techniques on a small dataset [33] concluded that multi-dimensional representations (such as Chaos Game Representation) yielded better genomic comparison results than some one-dimensional representations. However, in general there is no agreement on whether or not the choice of numerical representation for DNA sequences makes a difference on the genome comparison results, or on which numerical representations are best suited for analyzing genomic data. We address this issue by providing a comprehensive analysis and comparison of thirteen one-dimensional numerical representations, for suitability in genome analysis.

Digital signal processing
Following the choice of a suitable numerical representation for DNA sequences, DSP techniques can be applied to the resulting discrete numerical sequences, and the whole process has been termed Genomic Signal Processing (GSP) [30]. DSP techniques have previously been used for DNA sequence comparison, e.g., to distinguish coding regions from non-coding regions [34][35][36], to align genomic signals for classification of biological sequences [37], for whole genome phylogenetic analysis [38], and to analyze other properties of genomic sequences [39]. In our approach, genomic sequences are represented as discrete numerical sequences, treated as digital signals, transformed via DFT into corresponding magnitude spectra, and compared via Pearson Correlation Coefficient (PCC) to create a pairwise distance matrix.

Supervised machine learning
Machine learning has been used in small-scale genomic analysis studies [40][41][42], and classification analyses associated with microarray gene expression data [43][44][45]. In this vein, ML-DSP focusses on the use of the primary DNA sequence data for taxonomic classification, and is based on a novel combination of supervised machine learning with feature vectors consisting of the pairwise distances between the magnitude spectrum of the DFT obtained from the digital signal generated from a DNA sequence, and the magnitude spectra of the DFT of the digital signals generated from all other sequences in the training set. The taxonomic labels of sequences are provided for training purposes. Six supervised machine learning classifiers (Linear Discriminant, Linear SVM, Quadratic SVM, Fine KNN, Subspace Discriminant, and Subspace KNN) are trained on these pairwise distance vectors, and then used to classify new sequences. Independently, classical MultiDimensional Scaling (MDS) generates a 3D visualization, called Molecular Distance Map (MoDMap) [46], of the interrelationships among all sequences.
For our computational experiments, we used a large dataset of 7396 complete mtDNA sequences, and six different classifiers, to compare one-dimensional numerical representations for DNA sequences used in the literature for classification purposes. For this dataset, we concluded that the "PP", "Just-A", and "Real" numerical representations were the best numerical representations. We analyzed the performance of ML-DSP in classifying the aforementioned genomic mtDNA sequences, from the highest level (domain into kingdoms) to lower level (family into genera) taxonomical ranks. The average classification accuracy of ML-DSP was > 97% when using the "PP", "Just-A", and "Real" numerical representations.
To evaluate our method, we compared its performance (accuracy and speed) on three datasets: two previously used small benchmark datasets [47], and a large real-world dataset of 4322 complete vertebrate mtDNA sequences. We found that ML-DSP had significantly better accuracy scores than the alignment-free method FFP on all datasets. When compared to the state-of-the-art alignment-based method MEGA7 (with alignment using MUSCLE or CLUSTALW), ML-DSP achieved similar accuracy but superior processing times (2250 to 67,600 times faster) for the small benchmark dataset of 41 mammalian genomes. The contrast in running time was even more extreme for the large dataset of 4322 mtDNA genomes, where ML-DSP took 28 s, while MEGA7(MUSCLE/CLUSTALW) could not complete the computation after 2 h/6 h and had to be terminated.
Lastly, we provide preliminary computational experiments that indicate the potential of ML-DSP to successfully classify viral genomes (4271 complete dengue virus genomes into four subtypes) and bacterial genomes (4710 complete bacterial genomes into three phyla).

Methods and implementation
The main idea behind ML-DSP is to combine supervised machine learning techniques with digital signal processing, for the purpose of DNA sequence classification. More precisely, for a given set $S = \{S_1, S_2, \ldots, S_n\}$ of $n$ DNA sequences, ML-DSP uses:
- DNA numerical representations to obtain a set $N = \{N_1, N_2, \ldots, N_n\}$ where $N_i$ is a discrete numerical representation of the sequence $S_i$, $1 \le i \le n$.
- Discrete Fourier Transform (DFT) applied to the length-normalized digital signals $N_i$, to obtain the frequency distribution; the magnitude spectrum $M_i$ of this frequency distribution is then obtained.
- Pearson Correlation Coefficient (PCC) to compute the distance matrix of all pairwise distances for each pair of magnitude spectra $(M_i, M_j)$, where $1 \le i, j \le n$.
- Supervised Machine Learning classifiers which take the pairwise distance matrix for a set of sequences, together with their respective taxonomic labels, in a training set, and output the taxonomic classification of a new DNA sequence. To measure the performance of such a classifier, we use the 10-fold cross-validation technique.
- Independently, Classical Multidimensional Scaling (MDS) takes the distance matrix as input and returns an $(n \times q)$ coordinate matrix, where $n$ is the number of points (each point represents a unique sequence from set $S$) and $q$ is the number of dimensions. The first three dimensions are used to display a MoDMap, which is the simultaneous visualization of all points in 3D-space.

DNA numerical representations
To apply digital signal processing techniques to genomic data, genomic sequences are first mapped into discrete numerical representations of genomic sequences, called genomic signals [48]. In our analysis of various numerical representations for DNA sequences (Table 1), we considered only 1D numerical representations, that is, those which produce a single output numerical sequence, called also indicator sequence, for a given input DNA sequence.
We did not consider other numerical representations, such as binary [29], or nearest dissimilar nucleotide [49], because those generate four numerical sequences for each genomic sequence, and would thus not be scalable to classifications of thousands of complete genomes.
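To make the notion of a 1D numerical representation concrete, the following minimal sketch maps a DNA string to an indicator sequence. The Integer mapping is taken from the worked CGAT example in this paper and the "Just-A" mapping from its description (A as 1; G, C, T as 0). The sign convention used here for "PP" (purines A/G as -1, pyrimidines C/T as +1) is an assumption, and the names `to_signal`, `INTEGER`, `JUST_A`, and `PP` are hypothetical helpers, not part of the ML-DSP code.

```python
# Illustrative nucleotide-to-number mappings producing one numerical
# sequence per DNA sequence.
# INTEGER matches the worked example in the text (CGAT -> 1, 3, 2, 0).
# JUST_A matches the description "A as 1, and G, C, T as 0".
# PP is an ASSUMED sign convention: purines (A, G) -> -1, pyrimidines (C, T) -> +1.
INTEGER = {"T": 0, "C": 1, "A": 2, "G": 3}
JUST_A = {"A": 1, "C": 0, "G": 0, "T": 0}
PP = {"A": -1, "G": -1, "C": 1, "T": 1}


def to_signal(sequence, mapping):
    """Map a DNA string to a list of numbers.

    Letters "N" are deleted first, as described for the datasets in this
    paper. Any other unexpected letter raises KeyError.
    """
    cleaned = sequence.upper().replace("N", "")
    return [mapping[base] for base in cleaned]


for name, mapping in [("Integer", INTEGER), ("Just-A", JUST_A), ("PP", PP)]:
    print(name, to_signal("CGAT", mapping))
```

Because each mapping is a plain dictionary, trying another one-dimensional representation only requires supplying a different dictionary.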

Discrete Fourier Transform (DFT)
Our alignment-free classification method of DNA sequences makes use of the DFT magnitude spectra of the discrete numerical sequences (discrete digital signals) that represent DNA sequences. In some sense, these DFT magnitude spectra reflect the nucleotide distribution of the originating DNA sequences.
To start with, assuming that all input DNA sequences have the same length $p$, for each DNA sequence $S_i = (S_i(0), S_i(1), \ldots, S_i(p-1))$, where $1 \le i \le n$ and $S_i(k) \in \{A, C, G, T\}$ for $0 \le k \le p-1$, the corresponding discrete numerical representation (digital signal) $N_i$ is defined as
$$N_i = \big(f(S_i(0)), f(S_i(1)), \ldots, f(S_i(p-1))\big),$$
where $f(S_i(k))$ is the value under the numerical representation $f$ of the nucleotide in the position $k$ of the DNA sequence $S_i$.
Then, the DFT of the signal $N_i$ is computed as the vector $F_i$ where, for $0 \le k \le p-1$,
$$F_i(k) = \sum_{j=0}^{p-1} f(S_i(j)) \, e^{-2\pi \mathrm{i} k j / p},$$
with $\mathrm{i}$ denoting the imaginary unit. The magnitude vector corresponding to the signal $N_i$ can now be defined as the vector $M_i$ where, for each $0 \le k \le p-1$, the value $M_i(k) = |F_i(k)|$ is the absolute value of $F_i(k)$. The magnitude vector $M_i$ is also called the magnitude spectrum of the digital signal $N_i$ and, by extension, of the DNA sequence $S_i$.
Table 1 Numerical representations of DNA sequences analyzed for usability in genomic classification with ML-DSP. The second column lists the numerical representation name, the third column describes the rule it uses, and the fourth is the output of this rule for the input DNA sequence $S_1$ = CGAT. For the nearest-neighbor based doublet representation and codon representation, the DNA sequence is considered to be wrapped (the last position is followed by the first).
For example, if the numerical representation $f$ is Integer (row 1 in Table 1), then for the sequence $S_1$ = CGAT, the corresponding numerical representation is $N_1 = (1, 3, 2, 0)$, the result of applying DFT is $F_1 = (6, -1-3\mathrm{i}, 0, -1+3\mathrm{i})$ and its magnitude spectrum is $M_1 = (6, 3.1623, 0, 3.1623)$. Figure 1a shows the discrete digital signal (using the "PP" numerical representation, row 6 of Table 1) of the DNA sequence consisting of the first 100 bp of the mtDNA genome of Branta canadensis (Canada goose, NCBI accession number NC_007011.1), and of the DNA sequence consisting of the first 100 bp of the mtDNA genome of Castor fiber (European beaver; NCBI accession number NC_028625.1). Figure 1b shows the DFT magnitude spectra of the same two signals/sequences. As can be seen in Fig. 1b, these mtDNA sequences exhibit different DFT magnitude spectrum patterns, and this can be used to distinguish them computationally by using, e.g., the Pearson Correlation Coefficient, as described in the next subsection. Other techniques have also been used for genome similarity analysis, for example comparing the phase spectra of the DFT of digital signals of full mtDNA genomes, as seen in Fig. 2 and [50,51].
Note that, with the exception of the example in Fig. 1, all of the computational experiments in this paper use full genomes.
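The CGAT example in this subsection can be checked with a few lines of code. The sketch below uses a naive $O(p^2)$ DFT written with the standard library only, so that it stays self-contained; for full genomes an FFT routine would be the practical choice. The function names `dft` and `magnitude_spectrum` are hypothetical.

```python
import cmath


def dft(signal):
    """Naive O(p^2) DFT: F(k) = sum_j signal[j] * exp(-2*pi*i*k*j / p)."""
    p = len(signal)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * j / p) for j, x in enumerate(signal))
        for k in range(p)
    ]


def magnitude_spectrum(signal):
    """Return the absolute value |F(k)| for every frequency index k."""
    return [abs(value) for value in dft(signal)]


# Integer representation of CGAT, as in the worked example in the text.
n1 = [1, 3, 2, 0]
m1 = magnitude_spectrum(n1)
# The rounded values can be compared with M_1 given in the text.
print([round(v, 4) for v in m1])
```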

Pearson Correlation Coefficient (PCC)
Consider two variables $X$ and $Y$ (here $X$ and $Y$ are the magnitude spectra $M_i$ and $M_j$ of two signals), each of length $p$, that is, $X = (X_1, \ldots, X_p)$ and $Y = (Y_1, \ldots, Y_p)$. The Pearson Correlation Coefficient $r_{XY}$ between $X$ and $Y$ is the ratio of their covariance (a measure of how much $X$ and $Y$ vary together) to the product of their standard deviations [52,53], that is,
$$r_{XY} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y} = \frac{\sum_{k=1}^{p} (X_k - \bar{X})(Y_k - \bar{Y})}{\sqrt{\sum_{k=1}^{p} (X_k - \bar{X})^2}\,\sqrt{\sum_{k=1}^{p} (Y_k - \bar{Y})^2}},$$
where $\bar{X} = \frac{1}{p}\sum_{k=1}^{p} X_k$ and $\bar{Y} = \frac{1}{p}\sum_{k=1}^{p} Y_k$. The Pearson Correlation Coefficient between $X$ and $Y$ is a measure of their linear correlation, and has a value between +1 (total positive linear correlation) and −1 (total negative linear correlation); 0 is no linear correlation. We normalized the results, by taking $(1 - r_{XY})/2$, to obtain distance values between 0 and 1 (value 0 for identical signals, and 1 for negatively correlated signals). For our data sets, the PCC values between any two digital signals of DNA sequences ranged between 0 and 0.6.
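The normalized distance described above can be sketched as follows. This is a plain-Python illustration of the definition, not the Fathom Toolbox routine used by ML-DSP; the names `pearson`, `pcc_distance`, and `distance_matrix` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("inputs must have the same length")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant input")
    return cov / math.sqrt(var_x * var_y)


def pcc_distance(x, y):
    """Normalized distance (1 - r) / 2, which lies between 0 and 1."""
    return (1.0 - pearson(x, y)) / 2.0


def distance_matrix(spectra):
    """Symmetric matrix of pcc_distance over all pairs of magnitude spectra."""
    n = len(spectra)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = pcc_distance(spectra[i], spectra[j])
    return d
```

Only the upper triangle is computed and then mirrored, since the distance is symmetric and the diagonal is zero by definition.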
For each pairwise distance calculation, the Pearson Correlation Coefficient requires the input variables (that is, the magnitude spectra of the two sequences) to have the same length. The length of a magnitude spectrum is equal to the length of corresponding numerical digital signal, which in turn is equal to the length of the originating DNA sequence. Given that genome sequences are typically of different lengths, it follows that their corresponding digital signals need to be length-normalized, if we are to be able to use the Pearson Correlation Coefficient. Hoang et al. avoided normalization and considered only the first few mathematical moments constructed from the power spectra for comparison, after applying DFT [54]. The limitation of this method is that one loses information that may be necessary for a meaningful comparison. This is especially important when the genomes compared are very similar to each other.
Different methods for length-normalizing digital signals were tested: down-sampling [55], up-sampling to the maximum length using zero padding [30], even scaling extension [56], periodic extension, symmetric padding, or anti-symmetric padding [57]. For example, zero-padding, which adds zeroes to all of the sequences shorter than the maximum length, was used in [30], e.g., for taxonomic classifications of ribosomal S18 subunit genes from twelve organisms. While this method may work for datasets of sequences of similar lengths, it is not suitable for datasets of sequences of very different lengths (in our study, the fungi mtDNA genomes dataset ranges from 1364 bp to 235,849 bp, the plant mtDNA genomes dataset from 12,998 bp to 1,999,595 bp, and the protist mtDNA genomes dataset from 5882 bp to 77,356 bp). In such cases, zero-padding acts as a tag and may lead to inadvertent classification of sequences based on their length rather than based on their sequence composition. Thus, we employed instead anti-symmetric padding, whereby, starting from the last position of the signal, boundary values are replicated in an anti-symmetric manner. We also considered two possible ways of employing anti-symmetric padding: normalization to the maximum length (where shorter sequences are extended to the maximum sequence length by anti-symmetric padding) vs. normalization to the median length (where shorter sequences are extended by anti-symmetric padding to the median length, while longer sequences are truncated after the median length).
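Normalization to the median length can be sketched as below. The text does not spell out the exact anti-symmetric extension rule, so this example assumes the half-point variant, in which the signal is mirrored at its right boundary and negated ($x_1, \ldots, x_n \mid -x_n, \ldots, -x_1$), repeated when one mirror image is not long enough. The function names are hypothetical.

```python
import statistics


def antisymmetric_extend(signal, target_length):
    """Extend a signal on the right by mirroring and negating its values.

    ASSUMPTION: half-point anti-symmetric extension, x1..xn | -xn..-x1,
    applied repeatedly if a single mirror image is too short. Signals longer
    than target_length are truncated.
    """
    if not signal:
        raise ValueError("signal must be non-empty")
    extended = list(signal)
    while len(extended) < target_length:
        extended += [-value for value in reversed(extended)]
    return extended[:target_length]


def normalize_to_median(signals):
    """Pad shorter signals and truncate longer ones to the median length."""
    median_length = int(statistics.median(len(s) for s in signals))
    return [antisymmetric_extend(s, median_length) for s in signals]


print(normalize_to_median([[1, 2], [1, 2, 3, 4], [1, 2, 3, 4, 5, 6]]))
```

Unlike zero-padding, the padded region here is derived from the signal's own values, so it does not introduce a constant run whose size encodes the original length.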

Supervised machine learning
In this paper we used the Linear discriminant, Linear SVM, Quadratic SVM, Fine KNN, Subspace discriminant and Subspace KNN classifiers from the Classification Learner application of MATLAB (Statistics and Machine Learning Toolbox). The default MATLAB parameters were used.
To assess the performance of the classifiers, we used 10-fold cross validation. In this approach, the dataset is randomly partitioned into 10 equal-size subsets. The classifier is trained using 9 of the subsets, and the accuracy of its prediction is tested on the remaining subset. As part of the supervised learning, taxonomic labels are supplied for the DNA sequences in the 9 subsets used for training. The process is repeated 10 times, and the accuracy score of the classifier is then computed as the average of the accuracies obtained in the 10 separate experiments. The standard algorithms were modified so that no information about sequences in the testing set (that is, no distance matrix entries containing distances to/from any sequence in the testing set to any other sequence) was available during the training stage.
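The leakage-free evaluation described above can be illustrated with a small sketch. In each fold, the feature vector of a sequence contains only its distances to the training sequences, so no matrix entry involving another held-out sequence is used. The 1-nearest-neighbor rule below is a stand-in chosen to keep the example dependency-free; it is not one of the MATLAB classifiers used in this paper, and all function names are hypothetical.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Randomly partition indices 0..n-1 into k near-equal folds."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return [order[i::k] for i in range(k)]


def nearest_neighbor_label(test_row, train_rows, train_labels):
    """Stand-in 1-NN classifier (squared Euclidean distance on feature vectors)."""
    best = min(
        range(len(train_rows)),
        key=lambda r: sum((a - b) ** 2 for a, b in zip(test_row, train_rows[r])),
    )
    return train_labels[best]


def cross_validated_accuracy(dist, labels, k=10, seed=0):
    """Average accuracy over k folds, using only distances to training items."""
    accuracies = []
    for test_idx in k_fold_indices(len(labels), k, seed):
        if not test_idx:
            continue
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        # Feature vector of item i = its distances to the TRAINING items only,
        # so no column refers to a sequence in the held-out fold.
        train_rows = [[dist[i][j] for j in train_idx] for i in train_idx]
        train_labels = [labels[i] for i in train_idx]
        correct = 0
        for i in test_idx:
            row = [dist[i][j] for j in train_idx]
            if nearest_neighbor_label(row, train_rows, train_labels) == labels[i]:
                correct += 1
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)
```

Here `dist` is a full pairwise distance matrix (a list of lists) and `labels` the corresponding taxonomic labels; the same column restriction applies whichever classifier replaces the stand-in.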

Classical multidimensional scaling (MDS)
Classical multidimensional scaling takes a pairwise distance matrix (n × n matrix, for n input items) as input, and produces n points in a q-dimensional Euclidean space, where q ≤ n − 1. More specifically, the output is an n × q coordinate matrix, where each row corresponds to one of the n input items, and that row contains the q coordinates of the corresponding item-representing point [11]. The Euclidean distance between each pair of points is meant to approximate the distance between the corresponding two items in the original distance matrix.
These points can then be simultaneously visualized in a 2- or 3-dimensional space by taking the first 2, respectively 3, coordinates (out of q) of the coordinate matrix. The result is a Molecular Distance Map [46], and the MoDMap of a genomic dataset represents a visualization of the simultaneous interrelationships among all DNA sequences in the dataset.
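For reference, classical MDS is commonly computed by double centering the squared distances and taking an eigendecomposition. The formulation below is the standard textbook one, given as a sketch; the symbols $D^{(2)}$, $J$, $B$, $V$, and $\Lambda$ are introduced here for illustration, and the paper does not state which variant its implementation uses.

```latex
\begin{aligned}
D^{(2)} &= \left[ d_{ij}^{2} \right]_{i,j=1}^{n}, \qquad
J = I_n - \tfrac{1}{n}\,\mathbf{1}\mathbf{1}^{\top}, \\
B &= -\tfrac{1}{2}\, J\, D^{(2)} J = V \Lambda V^{\top}, \\
X &= V_q\, \Lambda_q^{1/2}.
\end{aligned}
```

Here $d_{ij}$ is the distance between items $i$ and $j$, $\Lambda_q$ is the diagonal matrix of the $q$ largest positive eigenvalues of $B$, and $V_q$ holds the corresponding eigenvectors as columns. Row $i$ of the $n \times q$ matrix $X$ gives the coordinates of item $i$, and keeping only the first three columns yields the 3D points shown in a MoDMap.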

Software implementation
The algorithms for ML-DSP were implemented using the software package MATLAB R2017A, license no. 964054, as well as the open-source toolbox Fathom Toolbox for MATLAB [58] for distance computation. All software can be downloaded from https://github.com/grandhawa/MLDSP. The user can use this code to reproduce all results in this paper, and also has the option to input their own dataset and use it as training set for the purpose of classifying new genomic DNA sequences.
All experiments were performed on an ASUS ROG G752VS computer with 4 cores (8 threads) of a 2.7 GHz Intel Core i7 6820HK processor and 64 GB DDR4 2400 MHz SDRAM.

Datasets
All datasets in this paper can be found at https://github.com/grandhawa/MLDSP in the "DataBase" directory. The mitochondrial dataset comprises all of the 7396 complete reference mtDNA sequences available in the NCBI Reference Sequence Database RefSeq on June 17, 2017. We performed computational experiments on several different subsets of this dataset. The bacteria dataset comprises all 4710 complete bacterial genomes with lengths between 20,000 bp and 500,000 bp, available in the aforementioned NCBI database on the same date. The dengue virus dataset contained all 4271 dengue virus genomes available in the NCBI database on August 10, 2017. Note that any letters "N" in these DNA sequences were deleted.
For the performance comparison between ML-DSP and other alignment-free and alignment-based methods we also used the benchmark datasets of 38 influenza virus sequences, and 41 mammalian complete mtDNA sequences from [47].

Results and discussion
Following the design and implementation of the ML-DSP genomic sequence classification tool prototype, we investigated which type of length-normalization and which type of distance were most suitable for genome classification using this method. We then conducted a comprehensive analysis of the various numerical representations of DNA sequences used in the literature, and determined the top three performers. Having set the main parameters (length-normalization method, distance, and numerical representation), we tested ML-DSP's ability to classify mtDNA genomes at taxonomic levels ranging from the domain level down to the genus level, and obtained average levels of classification accuracy of > 97%. Finally, we compared ML-DSP with other alignment-based and alignment-free genome classification methods, and showed that ML-DSP achieved higher accuracy and significantly higher speeds.

Analysis of distances and of length normalization approaches
To decide which distance measure and which length normalization method were most suitable for genome comparisons with ML-DSP, we used nine different subsets of full mtDNA sequences from our dataset. These subsets were selected to include most of the available complete mtDNA genomes (Vertebrates dataset of 4322 mtDNA sequences), as well as subsets containing similar sequences, of similar length (Primates dataset of 148 mtDNA sequences), and subsets containing mtDNA genomes showing large differences in length (Plants dataset of 174 mtDNA sequences).
The classification accuracy scores obtained using the two considered distance measures (Euclidean and Pearson Correlation Coefficient) and two different length-normalization approaches (normalization to maximum length and normalization to median length) on several datasets are listed in Table 2. The classification accuracy scores are slightly higher for PCC, but sufficiently close to those obtained when using the Euclidean distance to be inconclusive.
In the remainder of this paper we chose the Pearson Correlation Coefficient because it is scale independent (unlike the Euclidean distance, which is, e.g., sensitive to the offset of the signal, whereby signals with the same shape but different starting points are regarded as dissimilar [59]), and the length-normalization to median length because it is economical in terms of memory usage.
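The scale and offset argument can be made explicit with a short derivation. As an illustration (not taken from the paper), suppose one signal is a positively scaled and shifted copy of the other, $Y_k = aX_k + b$ with $a > 0$, for sequences $X = (X_1, \ldots, X_p)$ and $Y = (Y_1, \ldots, Y_p)$ with means $\bar{X}$ and $\bar{Y}$, and assume $X$ is not constant.

```latex
\begin{aligned}
\bar{Y} &= a\bar{X} + b
  \quad\Longrightarrow\quad Y_k - \bar{Y} = a\,(X_k - \bar{X}), \\
r_{XY} &= \frac{a \sum_{k=1}^{p} (X_k - \bar{X})^2}
  {\sqrt{\sum_{k=1}^{p} (X_k - \bar{X})^2}\;
   \sqrt{a^{2} \sum_{k=1}^{p} (X_k - \bar{X})^2}}
  = \frac{a}{|a|} = 1
  \quad\Longrightarrow\quad \frac{1 - r_{XY}}{2} = 0, \\
\lVert X - Y \rVert_2 &= \sqrt{\sum_{k=1}^{p} \big( (1-a)\,X_k - b \big)^{2}},
  \qquad\text{which equals } |b|\sqrt{p} \text{ when } a = 1 .
\end{aligned}
```

The PCC-based distance therefore treats the two signals as identical in shape, whereas the Euclidean distance grows with the offset $b$ and with the signal length $p$.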
The datasets used to compare the numerical representations of DNA sequences were the same as those in Table 2. The supervised machine learning classifiers used for this analysis were the six classifiers listed in the Methods and implementation section, except that for the datasets with more than 2000 sequences two of the classifiers (Subspace Discriminant and Subspace KNN) were omitted as being too slow. The results and the average accuracy scores for all these numerical representations, classifiers and datasets are summarized in Table 3.
As can be observed from Table 3, for all numerical representations, the table average accuracy scores (last row: average of averages, first over the six classifiers for each dataset, and then over all datasets) are high. Surprisingly, even using a single nucleotide numerical representation, which treats three of the nucleotides as being the same, and singles out only one of them ("Just-A"), results in an average accuracy of 91.9%. The best accuracy, for these datasets, is achieved when using the "PP" representation, which yields an average accuracy of 92.3%.
For subsequent experiments we selected the top three representations in terms of accuracy scores: "PP", "Just-A", and "Real" numerical representations.

ML-DSP for three classes of vertebrates
As an application of ML-DSP using the "PP" numerical representation for DNA sequences, we analyzed the set of vertebrate mtDNA genomes (median length 16,606 bp). The MoDMap, i.e., the multi-dimensional scaling 3D visualization of the genome interrelationships as described by the distances in the distance matrix, is illustrated in Fig. 3. The dataset contains 3740 complete mtDNA genomes: 553 bird genomes, 2313 fish genomes, and 874 mammalian genomes. Quantitatively, the classification accuracy score obtained by the Quadratic SVM classifier was 100%.

Classifying genomes with ML-DSP, at all taxonomic levels
We tested the ability of ML-DSP to classify complete mtDNA sequences at various taxonomic levels. For every dataset, we tested using the "PP", "Just-A", and "Real" numerical representations. The starting point was domain Eukaryota (7396 sequences), which was classified into kingdoms, then kingdom Animalia was classified into phyla, etc. At each level, we picked the cluster with the highest number of sequences and then classified it into the next taxonomic level sub-clusters. The lowest level classified was family Cyprinidae (81 sequences) into its six genera. For each dataset, we tested all six classifiers, and the maximum of these six classification accuracy scores for each dataset are shown in Table 4.
[Fig. 3 caption, partially recovered: The accuracy of the ML-DSP classification into three classes, using the Quadratic SVM classifier, with the "PP" numerical representation, and PCC between magnitude spectra of DFT, was 100%.]
Note that, at each taxonomic level, the maximum classification accuracy scores (among the six classifiers) for each of the three numerical representations considered are high, ranging from 91.4% to 100%, with only three scores under 95%. As this analysis also did not reveal a clear winner among the top three numerical representations, the question then arose whether the numerical representation we use mattered at all. To answer this question, we performed two additional experiments, that exploit the fact that the Pearson correlation coefficient is scale independent, and only looks for a pattern while comparing signals. For the first experiment we selected the top three numerical representations ("PP", "Just-A", and "Real") and, for each sequence in a given dataset, a numerical representation among these three was randomly chosen, with equal probability, to be the digital signal that represents it. The results are shown under the column "Random3" in Table 4: The maximum accuracy score over all the datasets is 96%. This is almost the same as the accuracy obtained when one particular numerical representation was used (1% lower, which is well within experimental error). We then repeated this experiment, this time picking randomly from any of the thirteen numerical representations considered. The results are shown under the column "Random13" in Table 4, with the table average accuracy score being 88.1%.
Overall, our results suggest that all three numerical representations "PP", "Just-A", and "Real" have very high classification accuracy scores (average >97%), and even a random choice of one of these representations for each sequence in the dataset does not significantly affect the classification accuracy score of ML-DSP (average 96%).
We also note that, in addition to being highly accurate in its classifications, ML-DSP is ultrafast. Indeed, even for the largest dataset in Table 2, subphylum Vertebrata (4322 complete mtDNA genomes, average length 16,806 bp), the distance matrix computation (which is the bulk of the classification computation) lasted under 5 s. Classifying a new primate mtDNA genome took 0.06 s when trained on 148 primate mtDNA genomes, and classifying a new vertebrate mtDNA genome took 7 s when trained on the 4322 vertebrate mtDNA genomes. The result was updated with an experiment whereby QSVM was trained on the 4322 complete vertebrate genomes in Table 2, and queried on the 694 new vertebrate mtDNA genomes uploaded on NCBI between June 17, 2017 and January 7, 2019. The accuracy of classification was 99.6%, with only three reptile mtDNA genomes mis-classified as amphibian genomes: Bavayia robusta (robust forest bavayia, a species of gecko; NC_034780), Mesoclemmys hogei (Hoge's toadhead turtle; NC_036346), and Gonatodes albogularis (yellow-headed gecko; NC_035153).

MoDMap visualization vs. ML-DSP quantitative classification results
The hypothesis tested by the next experiments was that the quantitative accuracy of the classification of DNA sequences by ML-DSP would be significantly higher than suggested by the visual clustering of taxa in the MoDMap produced with the same pairwise distance matrix.
As an example, the MoDMap in Fig. 4a visualizes the distance matrix of mtDNA genomes from family Cyprinidae (81 genomes) with its genera Acheilognathus (10 genomes), Rhodeus (11 genomes), Schizothorax (19 genomes), Labeo (19 genomes), Acrossocheilus (12 genomes), Onychostoma (10 genomes); only the genera with at least 10 genomes are considered. The MoDMap seems to indicate an overlap between the clusters Acheilognathus and Rhodeus, which is biologically plausible as these genera belong to the same subfamily Acheilognathinae. However, when zooming in by plotting a MoDMap of only these two genera, as shown in Fig. 4b, one can see that the clusters are clearly separated visually. This separation is confirmed by the fact that the accuracy score of the Quadratic SVM classifier for the dataset in Fig. 4b is 100%. The same quantitative accuracy score for the classification of the dataset in Fig. 4a with Quadratic SVM is 91.8%, which intuitively is much better than the corresponding MoDMap would suggest. This is likely due to the fact that the MoDMap is a three-dimensional approximation of the positions of the genome-representing points in a multi-dimensional space (the number of dimensions is (n − 1), where n is the number of sequences).
This being said, MoDMaps can still serve for exploratory purposes. For example, the MoDMap in Fig. 4a suggests that species of the genus Onychostoma (subfamily listed "unknown" in NCBI) (yellow) may be genetically related to species of the genus Acrossocheilus (subfamily Barbinae) (magenta). Upon further exploration of the distance matrix, one finds that indeed the distance between the centroids of these two clusters is lower than the distance from either of these two cluster centroids to any of the other cluster centroids. This supports the hypothesis, based on morphological evidence [60], that genus Onychostoma belongs to the subfamily Barbinae, as well as the hypothesis that the genera Onychostoma and Acrossocheilus are closely related [61]. Note that this exploration, suggested by MoDMap and confirmed by calculations based on the distance matrix, could not have been initiated based on ML-DSP alone (or other supervised machine learning algorithms), as ML-DSP only predicts the classification of new genomes into one of the taxa that it was trained on, and does not provide any additional information.
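The centroid comparison described above can be sketched in Python as follows. This is only an illustration under stated assumptions: each genome is assumed to be available as a coordinate vector (for instance its MoDMap coordinates), the cluster names and coordinates are made up, and the helper names are hypothetical rather than part of ML-DSP.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a list of equal-dimension points."""
    n = len(points)
    dim = len(points[0])
    return [sum(p[d] for p in points) / n for d in range(dim)]

def euclidean(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def centroid_distances(clusters):
    """Distance between the centroids of every pair of clusters."""
    cents = {name: centroid(pts) for name, pts in clusters.items()}
    names = sorted(cents)
    return {
        (a, b): euclidean(cents[a], cents[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }

def closest_pair(clusters):
    """Pair of clusters whose centroids are nearest to each other."""
    dists = centroid_distances(clusters)
    return min(dists, key=dists.get)

# Made-up 3D coordinates for three hypothetical genera.
clusters = {
    "genus_A": [(0.9, 1.0, 0.1), (1.1, 1.2, 0.0)],
    "genus_B": [(1.3, 0.9, 0.2), (1.5, 1.1, 0.1)],
    "genus_C": [(-2.0, 0.5, 1.0), (-2.2, 0.3, 1.2)],
}
print(closest_pair(clusters))
```

A pair of clusters whose centroid distance is smaller than the distance from either centroid to all remaining centroids is a candidate for the kind of relatedness hypothesis discussed in the text.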
As another comparison point between MoDMaps and supervised machine learning outputs, Fig. 5a shows the MoDMap of the superorder Ostariophysi with its orders Cypriniformes (643 genomes), Characiformes (31 genomes) and Siluriformes (107 genomes). The MoDMap shows the clusters as overlapping, but the Quadratic SVM classifier that quantitatively classifies these genomes has an accuracy of 99%. Indeed, the confusion matrix in Fig. 5b shows that Quadratic SVM mis-classifies only 8 sequences out of 781 (recall that, for m clusters, the m × m confusion matrix has its rows labelled by the true classes and columns labelled by the predicted classes; the cell (i, j) shows the number of sequences that belong to the true class i, and have been predicted to be of class j). This indicates that when the visual representation in a MoDMap shows cluster overlaps, this may only be due to the dimensionality reduction to three dimensions, while ML-DSP actually provides a much better quantitative classification based on the same distance matrix.
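The relation between a confusion matrix and the accuracy score can be checked with a short Python sketch. The matrix below is hypothetical: it only respects the class sizes quoted above (643, 31, and 107 genomes) and the total of 8 misclassified sequences, and it is not the actual matrix of Fig. 5b.

```python
def accuracy(cm):
    """Fraction of sequences on the diagonal (true class == predicted class)."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

def misclassified(cm):
    """Number of sequences off the diagonal."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return total - correct

# Rows: true class; columns: predicted class.
# Order: Cypriniformes, Characiformes, Siluriformes (hypothetical counts).
cm = [
    [640, 1, 2],
    [2, 28, 1],
    [2, 0, 105],
]
print(misclassified(cm), f"{100 * accuracy(cm):.1f}%")
```

With 773 of 781 sequences on the diagonal, the accuracy is 773/781, which rounds to the 99% reported for this dataset.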

Applications to other genomic datasets
The two experiments in this section indicate that the applicability of our method is not limited to mitochondrial DNA sequences. The first experiment, Fig. 6a

Comparison of ML-DSP with state-of-the-art alignment-based and alignment-free tools
The computational experiments in this section compare ML-DSP with three state-of-the-art alignment-based and alignment-free methods: the alignment-based tool MEGA7 [3] with alignment using MUSCLE [4] and CLUSTALW [5,6], and the alignment-free method FFP (Feature Frequency Profiles) [28].
For this performance analysis we selected three datasets. The first two datasets are benchmark datasets used in other genetic sequence comparison studies [47]: The first dataset comprises 38 influenza viral genomes, and the second dataset comprises 41 mammalian complete mtDNA sequences. The third dataset, of our choice, is much larger, consisting of 4,322 vertebrate complete mtDNA sequences, and was selected to compare scalability.
For the alignment-based methods, we used the distance matrix calculated in MEGA7 from sequences aligned with either MUSCLE or CLUSTALW. For the alignment-free FFP, we used the default value of k = 5 for k-mers (a k-mer is any DNA sequence of length k; any increase in the value of the parameter k, for the first dataset, resulted in a lower classification accuracy score for FFP). For ML-DSP we chose the Integer numerical representation and computed the average classification accuracy over all six classifiers for the first two datasets, and over all classifiers except Subspace Discriminant and Subspace KNN for the third dataset. Table 5 shows the performance comparison (classification accuracy and processing time) of these four methods. The processing time included all computations, starting from reading the datasets to the completion of the distance matrix, which is the common element of all four methods. The listed processing times do not include the time needed for the computation of phylogenetic trees, MoDMap visualizations, or classification.
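To make the k-mer definition used for FFP concrete, the following Python sketch builds a relative-frequency profile of overlapping k-mers for a single sequence. It is an illustration only and not the FFP implementation; the function name is hypothetical, and the step in which profiles of different sequences are compared is omitted.

```python
from collections import Counter

def kmer_profile(seq, k=5):
    """Relative frequencies of all overlapping k-mers in a DNA string."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: count / total for kmer, count in counts.items()}

# Toy sequence: 10 bases give 10 - 5 + 1 = 6 overlapping 5-mers.
profile = kmer_profile("ACGTACGTAC", k=5)
for kmer, freq in sorted(profile.items()):
    print(kmer, round(freq, 3))
```

Since there are 4^k possible k-mers over the alphabet {A, C, G, T} (1024 for k = 5), the size of a full profile grows exponentially with k.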
As seen in Table 5 (columns 3, 4, and 6) ML-DSP overwhelmingly outperforms the alignment-based software MEGA7(MUSCLE/CLUSTALW) in terms of processing time. In terms of accuracy, for the smaller virus and mammalian benchmark datasets, the average accuracies of ML-DSP and MEGA7(MUSCLE/CLUSTALW) were comparable, probably due to the small size of the training set for ML-DSP. The advantage of ML-DSP over the alignment-based tools became more apparent for the larger vertebrate dataset, where the accuracies of ML-DSP and the alignment-based tools could not even be compared, as the alignment-based tools were so slow that they had to be terminated. In contrast, ML-DSP classified the entire set of 4322 vertebrate mtDNA genomes in 28 s, with average classification accuracy 98.3%. This indicates that ML-DSP is significantly more scalable than the alignment-based MEGA7(MUSCLE/CLUSTALW), as it can speedily and accurately classify datasets which alignment-based tools cannot even process. As seen in Table 5 (columns 5 and 6), ML-DSP significantly outperforms the alignment-free software FFP in terms of accuracy (average classification accuracy 98.3% for ML-DSP vs. 48.3% for FFP, for the large vertebrate dataset), while at the same time being overall faster.
This comparison also indicates that, for these datasets, both alignment-free methods (ML-DSP and FFP) have an overwhelming advantage over the alignment-based methods (MEGA7 (MUSCLE/CLUSTALW)) in terms of processing time. Furthermore, when comparing the two alignment-free methods with each other, ML-DSP significantly outperforms FFP in terms of classification accuracy.
As another angle of comparison, Fig. 7 displays the MoDMaps of the first benchmark dataset (38 influenza virus genomes) produced from the distance matrices generated by FFP, MEGA7 (MUSCLE), MEGA7 (CLUSTALW), and ML-DSP respectively. Figure 7a shows that with FFP it is difficult to observe any visual separation of the dataset into subtype clusters. Finally, Figs. 8 and 9 display the phylogenetic trees generated by each of the four methods considered. Figure 8a, the tree generated by FFP, has many misclassified genomes, which was expected given the MoDMap visualization of its distance matrix in Fig. 7a. Figure 9a displays the phylogenetic tree generated by MEGA7, which was the same for both MUSCLE and CLUSTALW: it has only one incorrectly classified H5N1 genome, placed in the middle of the H1N1 genomes. Figures 8b and 9b display the phylogenetic tree generated using the distance produced by ML-DSP (shown twice, in parallel with the other trees, for ease of comparison). ML-DSP classified all genomes correctly.

Discussion
The computational efficiency of ML-DSP is due to the fact that it is alignment-free.
ML-DSP is not without limitations. We anticipate that the need for equal-length sequences and the use of length normalization could introduce issues when examining small fragments of larger genome sequences. Genomes usually vary in length, and thus length normalization always results in adding (up-sampling) or losing (down-sampling) some information. Although the Pearson Correlation Coefficient can distinguish the signal patterns even in small sequence fragments, and we did not find any considerable disadvantage when considering complete mitochondrial DNA genomes with their inevitable length variations, length normalization may cause issues when dealing with fragments of genomes and with the much larger nuclear genome sequences.
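The effect of length normalization can be illustrated with a minimal Python sketch that resamples every signal to a common length. Both the choice of target length (the median of the input lengths) and the use of linear interpolation are assumptions made for this example, not necessarily the scheme used by ML-DSP, and the function names are hypothetical.

```python
def resample_linear(signal, target_len):
    """Resample a numeric signal to target_len points by linear interpolation."""
    n = len(signal)
    if n == 0 or target_len <= 0:
        raise ValueError("signal and target length must be non-empty")
    if n == 1 or target_len == 1:
        return [signal[0]] * target_len
    out = []
    for j in range(target_len):
        pos = j * (n - 1) / (target_len - 1)   # position in the original signal
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append(signal[lo] * (1 - frac) + signal[hi] * frac)
    return out

def normalize_lengths(signals):
    """Bring all signals to the median length (upper median for even counts)."""
    lengths = sorted(len(s) for s in signals)
    target = lengths[len(lengths) // 2]
    return [resample_linear(s, target) for s in signals]

up = resample_linear([0.0, 2.0, 4.0], 5)               # up-sampling adds points
down = resample_linear([0.0, 1.0, 2.0, 3.0, 4.0], 3)   # down-sampling drops points
normalized = normalize_lengths([[1.0, 2.0, 3.0], [1.0] * 5, [0.0, 1.0, 0.0, 1.0]])
print(up, down, [len(s) for s in normalized])
```

Up-sampling inserts interpolated values that were never in the sequence, while down-sampling discards original values, which is why the concern is greatest for short fragments, where each position carries a larger share of the signal.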
Lastly, ML-DSP has two drawbacks, inherent in any supervised machine learning algorithm. The first is that ML-DSP is a black-box method which, while producing a highly accurate classification prediction, does not offer a (biological) explanation for its output. The second is that it relies on the existence of a training set from which it draws its "knowledge", that is, a set consisting of known genomic sequences and their taxonomic labels. ML-DSP uses such a training set to "learn" how to classify new sequences into one of the taxonomic classes that it was trained on, but it is not able to assign a new sequence to a taxon that it has not been exposed to.

Conclusions
Fig. 7 MoDMaps of the influenza virus dataset from Table 5, based on the four methods. The points represent viral genomes of subtypes H1N1 (red, 13 genomes), H2N2 (black, 3 genomes), H5N1 (blue, 11 genomes), H7N3 (magenta, 5 genomes), H7N9 (green, 6 genomes); MoDMaps are generated using distance matrices computed with (a) FFP; (b) MEGA7(MUSCLE); (c) MEGA7(CLUSTALW); (d) ML-DSP
We proposed ML-DSP, an ultrafast and accurate alignment-free supervised machine learning classification method based on digital signal processing of DNA sequences (and its software implementation). ML-DSP successfully addresses the limitations of alignment-free methods identified in [7], as follows:
(iii) Memory overhead: ML-DSP uses neither k-mers nor any compression algorithms. Thus, scalability does not cause an exponential memory overhead, and a high classification accuracy is preserved with large datasets.
In addition, we provided a comprehensive quantitative analysis of all 13 one-dimensional numerical representations of DNA sequences used in the Genomic Signal Processing literature and found that, on average, the "PP", "Just-A", and "Real" representations performed better than others. We also showed that the classification accuracy of ML-DSP was significantly higher than the corresponding MoDMap visualizations of the dataset would indicate, likely due to the inherent dimensionality limitations of the latter. Lastly, we showed the potential for ML-DSP to be used for classifications of other DNA sequence genomic datasets, such as large datasets of complete viral or bacterial genomes.