Table 2. Statistics of the synthetic training data sets. To construct a synthetic data set, a specified number of random template sequences is synthesized. The length of each template is chosen at random between the minimum and maximum lengths. A random number (between the minimum and maximum cluster sizes) of mutated copies is generated from each template. All clusters in the same data set have the same minimum identity score. For example, the members of the clusters in the Short-97 data set are 97.00–99.99% identical to the templates from which they were generated. Identity scores among templates in the same data set are at most 10% less than the provided minimum identity score. Length is measured in base pairs (bp).

From: MeShClust v3.0: high-quality clustering of DNA sequences using the mean shift algorithm and alignment-free identity scores
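
The construction described in the caption can be illustrated with a minimal sketch. It is not the authors' generator: the function names, the parameter values, and the substitution-only mutation model are assumptions made here for illustration, and identity is approximated as the fraction of unchanged positions. The constraint on identity scores among templates is not enforced.

```python
import random

ALPHABET = "ACGT"


def random_template(min_len, max_len, rng):
    """Draw a random DNA template whose length lies in [min_len, max_len]."""
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def mutate(template, min_identity, rng):
    """Return a copy of the template with point substitutions only.

    Assumption: the target identity is drawn uniformly from
    [min_identity, 0.9999]; the source does not specify the mutation model.
    """
    identity = rng.uniform(min_identity, 0.9999)
    # Rounding down keeps the realized identity at or above the target.
    n_mut = int((1.0 - identity) * len(template))
    seq = list(template)
    for pos in rng.sample(range(len(template)), n_mut):
        # Replace the base with a different one so every mutation counts.
        seq[pos] = rng.choice([b for b in ALPHABET if b != seq[pos]])
    return "".join(seq)


def make_data_set(n_templates, min_len, max_len,
                  min_size, max_size, min_identity, seed=0):
    """Build a list of (template, members) clusters."""
    rng = random.Random(seed)
    clusters = []
    for _ in range(n_templates):
        template = random_template(min_len, max_len, rng)
        size = rng.randint(min_size, max_size)
        members = [mutate(template, min_identity, rng) for _ in range(size)]
        clusters.append((template, members))
    return clusters


# Hypothetical parameters, loosely modeled on a small "Short-97"-like set.
clusters = make_data_set(n_templates=5, min_len=200, max_len=400,
                         min_size=5, max_size=20, min_identity=0.97)
print(sum(len(members) for _, members in clusters), "sequences generated")
```

Restricting mutations to substitutions keeps every member the same length as its template, so identity can be checked position by position. A generator that also inserts or deletes bases would need an alignment-based or alignment-free identity score instead.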

| Data set | Template avg. length (bp) | Template min. length (bp) | Template max. length (bp) | Cluster avg. size | Cluster min. size | Cluster max. size | Cluster count | Sequence count |
|---|---|---|---|---|---|---|---|---|
| Short-97 | 288 | 202 | 396 | 202 | 12 | 400 | 100 | 20,195 |
| Short-95 | 307 | 200 | 400 | 177 | 5 | 400 | 100 | 17,734 |
| Short-90 | 298 | 204 | 399 | 199 | 9 | 400 | 100 | 19,877 |
| Short-80 | 302 | 200 | 400 | 204 | 6 | 392 | 100 | 20,423 |
| Short-70 | 299 | 205 | 400 | 202 | 7 | 395 | 100 | 20,230 |
| Short-60 | 304 | 200 | 399 | 195 | 9 | 395 | 100 | 19,539 |
| Medium-97 | 1,394 | 752 | 1,998 | 192 | 13 | 390 | 100 | 19,215 |
| Medium-95 | 1,358 | 750 | 1,968 | 203 | 7 | 396 | 100 | 20,315 |
| Medium-90 | 1,405 | 759 | 1,977 | 194 | 5 | 400 | 100 | 19,393 |
| Medium-80 | 1,434 | 760 | 2,000 | 222 | 14 | 398 | 100 | 22,208 |
| Medium-70 | 1,345 | 768 | 1,999 | 212 | 8 | 398 | 100 | 21,184 |
| Medium-60 | 1,387 | 771 | 1,993 | 202 | 13 | 398 | 100 | 20,211 |
| Long-97 | 2,677 | 1,520 | 3,983 | 210 | 5 | 398 | 100 | 20,994 |
| Long-95 | 2,611 | 1,508 | 3,959 | 206 | 10 | 400 | 100 | 20,565 |
| Long-90 | 2,677 | 1,530 | 3,969 | 196 | 5 | 400 | 100 | 19,622 |
| Long-80 | 2,859 | 1,528 | 3,990 | 194 | 5 | 398 | 100 | 19,424 |
| Long-70 | 2,830 | 1,512 | 3,993 | 224 | 19 | 399 | 100 | 22,396 |
| Long-60 | 2,630 | 1,519 | 3,977 | 207 | 7 | 398 | 100 | 20,699 |
| Numerous-97 | 272 | 171 | 372 | 203 | 5 | 400 | 5,000 | 1,012,543 |
| Numerous-95 | 272 | 171 | 372 | 203 | 5 | 400 | 5,000 | 1,012,528 |
| Numerous-90 | 271 | 171 | 372 | 204 | 5 | 400 | 5,000 | 1,018,681 |
| Numerous-80 | 271 | 171 | 372 | 203 | 5 | 400 | 5,000 | 1,016,997 |
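
The size columns of Table 2 appear to be related by simple arithmetic: for each data set, the sequence count divided by the cluster count, rounded to the nearest integer, matches the reported average cluster size. The sketch below checks this for a few rows copied from the table. The relation is an observation about the listed numbers, not a statement made in the source, and the helper names are hypothetical.

```python
# (data set, cluster avg. size, cluster count, sequence count) from Table 2
ROWS = [
    ("Short-97", 202, 100, 20195),
    ("Medium-80", 222, 100, 22208),
    ("Long-70", 224, 100, 22396),
    ("Numerous-97", 203, 5000, 1012543),
]


def implied_avg_size(cluster_count, sequence_count):
    """Average cluster size implied by the two count columns."""
    return sequence_count / cluster_count


def is_consistent(avg_size, cluster_count, sequence_count):
    """True if the reported average equals the rounded implied average."""
    return round(implied_avg_size(cluster_count, sequence_count)) == avg_size


for name, avg, count, total in ROWS:
    implied = round(implied_avg_size(count, total), 2)
    print(name, implied, is_consistent(avg, count, total))
```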