Table 1 Numerical representations of DNA sequences

From: ML-DSP: Machine Learning with Digital Signal Processing for ultrafast, accurate, and scalable genome classification at all taxonomic levels

| # | Representation | Rules | Output for S1 = CGAT |
|---|---|---|---|
| 1 | Integer | T=0, C=1, A=2, G=3 | [1 3 2 0] |
| 2 | Integer (other variant) | T=1, C=2, A=3, G=4 | [2 4 3 1] |
| 3 | Real | T=−1.5, C=0.5, A=1.5, G=−0.5 | [0.5 −0.5 1.5 −1.5] |
| 4 | Atomic | T=6, C=58, A=70, G=78 | [58 78 70 6] |
| 5 | EIIP (electron-ion interaction potential) | T=0.1335, C=0.1340, A=0.1260, G=0.0806 | [0.1340 0.0806 0.1260 0.1335] |
| 6 | PP (purine/pyrimidine) | T/C=1, A/G=−1 | [1 −1 −1 1] |
| 7 | Paired numeric | T/A=1, C/G=−1 | [−1 −1 1 1] |
| 8 | Nearest-neighbor based doublet | 0–15 for all possible doublets | [14 8 1 7] |
| 9 | Codon | 0–63 for all 64 possible codons | [2 35 22 44] |
| 10 | Just-A | A=1, rest=0 | [0 0 1 0] |
| 11 | Just-C | C=1, rest=0 | [1 0 0 0] |
| 12 | Just-G | G=1, rest=0 | [0 1 0 0] |
| 13 | Just-T | T=1, rest=0 | [0 0 0 1] |

Numerical representations of DNA sequences analyzed for usability in genomic classification with ML-DSP. The second column lists the name of each numerical representation, the third column describes the rule it uses, and the fourth column gives the output of this rule for the input DNA sequence S1 = CGAT. For the nearest-neighbor based doublet and codon representations, the DNA sequence is considered to be wrapped (the last position is followed by the first).
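
The per-nucleotide rules in the table amount to simple lookup tables, which the following minimal Python sketch illustrates. It is not the ML-DSP implementation: the names `RULES`, `encode`, and `wrapped_kmers` are hypothetical and introduced only for illustration. The table does not list which index (0–15 or 0–63) is assigned to each doublet or codon, so the sketch only extracts the wrapped windows and does not assign indices. It also assumes that one window starts at every position of the sequence, which is consistent with the four-element outputs shown for S1 = CGAT.

```python
# Per-nucleotide lookup tables transcribed from the "Rules" column of Table 1.
# The dictionary keys (e.g. "paired_numeric") are illustrative names only.
RULES = {
    "integer": {"T": 0, "C": 1, "A": 2, "G": 3},
    "integer_variant": {"T": 1, "C": 2, "A": 3, "G": 4},
    "real": {"T": -1.5, "C": 0.5, "A": 1.5, "G": -0.5},
    "atomic": {"T": 6, "C": 58, "A": 70, "G": 78},
    "eiip": {"T": 0.1335, "C": 0.1340, "A": 0.1260, "G": 0.0806},
    "pp": {"T": 1, "C": 1, "A": -1, "G": -1},
    "paired_numeric": {"T": 1, "A": 1, "C": -1, "G": -1},
}

# Just-A, Just-C, Just-G, Just-T: indicator of a single nucleotide.
for target in "ACGT":
    RULES[f"just_{target.lower()}"] = {b: int(b == target) for b in "ACGT"}


def encode(seq, rule):
    """Map each nucleotide of seq to its numerical value under the named rule."""
    table = RULES[rule]
    return [table[base] for base in seq.upper()]


def wrapped_kmers(seq, k):
    """Return the k-mer starting at every position, treating seq as circular.

    Assumption: one window per position, with the last position followed by
    the first. Index assignment (0-15 or 0-63) is not specified in the table
    and is deliberately left out.
    """
    extended = seq + seq[: k - 1]  # append the first k-1 bases to wrap around
    return [extended[i : i + k] for i in range(len(seq))]


if __name__ == "__main__":
    s1 = "CGAT"
    print(encode(s1, "integer"))         # [1, 3, 2, 0]
    print(encode(s1, "paired_numeric"))  # [-1, -1, 1, 1]
    print(wrapped_kmers(s1, 2))          # ['CG', 'GA', 'AT', 'TC']
    print(wrapped_kmers(s1, 3))          # ['CGA', 'GAT', 'ATC', 'TCG']
```

Keeping each rule as a plain dictionary makes it straightforward to add or compare representations without changing the encoding function.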