Table 4 Top features used in the classifiers

From: Machine learning approaches to predict the Plant-associated phenotype of Xanthomonas strains

Top features enriched in pathogens

|   | Domain | Description | RF | Lasso | CART | P | NP | Enrichment |
|---|---|---|---|---|---|---|---|---|
|   | PF13855 | Leucine-rich repeat | 100.0 | 74.9 | 100.0 | 0.62 | 0.04 | 2.14e-08 |
|   | PF09613 | Type III secretion system, HrpB1/HrpK | 52.1 | 14.3 | 75.6 | 0.83 | 0.30 | 3.83e-06 |
| * | PF05932 | Tir chaperone protein (CesT) family |  |  |  | 0.82 | 0.28 | 3.96e-06 |
| * | PF09483 | Type III secretion protein HpaP |  |  |  | 0.83 | 0.30 | 3.83e-06 |
| * | PF09486 | Type III secretion protein HrpB7 |  |  |  | 0.83 | 0.30 | 3.83e-06 |
| * | PF09487 | Type III secretion protein HrpB2 |  |  |  | 0.83 | 0.30 | 3.83e-06 |
| * | PF09502 | Type III secretion protein HrpB4 |  |  |  | 0.83 | 0.30 | 3.83e-06 |
| * | PF05819 | NolX |  |  |  | 0.69 | 0.30 | 3.34e-03 |
|   | PF09994 | Domain of unknown function DUF2235 | 19.0 | 13.0 | 33.3 | 0.54 | 0.09 | 8.84e-05 |
|   | PF13276 | HTH-like domain | 23.0 | 8.3 | 26.8 | 0.91 | 0.49 | 1.82e-04 |
|   | PF13333 | Integrase, catalytic core | 13.4 | 3.6 | 12.2 | 0.51 | 0.09 | 2.39e-04 |
|   | PF13579 | Glycosyltransferase subfamily 4-like | 16.0 | 100.0 | 4.4 | 0.88 | 0.53 | 2.89e-03 |
|   | PF14341 | Type 4 fimbrial biogenesis protein PilX | 15.2 | 18.3 | 4.5 | 0.85 | 0.49 | 4.25e-03 |
|   | PF01382 | Avidin/streptavidin | 6.3 | 33.6 | 0.0 | 0.26 | 0.04 | 2.74e-02 |
|   | PF10117 | 5-methylcytosine restriction system component | 17.0 | 77.6 | 3.8 | 0.32 | 0.08 | 3.30e-02 |
|   | PF12161 | N6 adenine-specific DNA methyltransferase | 5.5 | 27.7 | 0.1 | 0.88 | 0.62 | 4.15e-02 |
| * | PF01420 | Restriction endonuclease, type I, HsdS |  |  |  | 0.85 | 0.60 | 5.74e-02 |

Top features enriched in non-pathogens

|   | Domain | Description | RF | Lasso | CART | P | NP | Enrichment |
|---|---|---|---|---|---|---|---|---|
|   | PF12840 | Helix-turn-helix domain | 46.7 | 67.3 | 70.1 | 0.25 | 0.75 | 1.87e-05 |
|   | PF13570 | Pyrrolo-quinoline quinone-like domain | 12.7 | 4.3 | 18.2 | 0.60 | 0.98 | 1.01e-04 |
|   | PF03552 | Cellulose synthase | 15.5 | 0.0 | 25.5 | 0.35 | 0.81 | 1.80e-04 |
| * | PF03170 | Cellulose synthase BcsB, bacterial |  |  |  | 0.37 | 0.83 | 1.01e-04 |
| * | PF05420 | Cellulose synthase operon C, C-terminal |  |  |  | 0.38 | 0.81 | 4.28e-04 |
| * | PF01270 | Glycoside hydrolase, family 8 |  |  |  | 0.37 | 0.79 | 7.33e-04 |
|   | PF13424 | Tetratricopeptide repeat | 16.9 | 26.0 | 26.1 | 0.51 | 0.91 | 4.28e-04 |
| * | PF12823 | Domain of unknown function DUF3817 |  |  |  | 0.52 | 0.92 | 2.85e-04 |
|   | PF06629 | MltA-interacting MipA | 14.5 | 47.7 | 4.6 | 0.54 | 0.85 | 1.57e-02 |
|   | PF00656 | Caspase domain | 6.1 | 0.5 | 19.6 | 0.58 | 0.85 | 4.45e-02 |
|   | PF13391 | HNH nuclease | 8.5 | 55.6 | 1.4 | 0.08 | 0.30 | 5.35e-02 |
|   | PF10013 | Uncharacterised conserved protein UCP037205 | 4.0 | 27.1 | 0.3 | 0.31 | 0.57 | 8.03e-02 |

Domain: Pfam accession number; RF: Random Forest scaled variable importance, aggregated over all nested CV outer-loop models; Lasso: aggregated scaled variable importance for Lasso; CART: aggregated scaled variable importance for CART; P: domain persistence in pathogens; NP: domain persistence in non-pathogens; Enrichment: p-value for domain enrichment, based on a two-sided Fisher exact test with Benjamini-Hochberg multiple testing correction; * (shown with a left square bracket in the original table): highly correlated domains removed in matrix optimisation.
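
The Enrichment column is described as a two-sided Fisher exact test followed by Benjamini-Hochberg correction. The sketch below shows how such a value could be computed for one domain from a 2x2 presence/absence table, using only the Python standard library. It is an illustration under stated assumptions, not the authors' implementation: the function names and the example counts are hypothetical, and the strain counts behind the table are not given here, so the code does not reproduce the reported p-values.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    a: pathogens with the domain,     b: pathogens without it,
    c: non-pathogens with the domain, d: non-pathogens without it.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability that x pathogens carry the domain
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one
    # (small relative tolerance guards against floating-point ties)
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical counts (NOT from the paper):
# domain -> (P with, P without, NP with, NP without)
example_counts = {
    "domain_A": (8, 2, 1, 9),
    "domain_B": (5, 5, 5, 5),
    "domain_C": (2, 8, 7, 3),
}
raw_p = [fisher_exact_two_sided(*t) for t in example_counts.values()]
adj_p = benjamini_hochberg(raw_p)
for name, p, q in zip(example_counts, raw_p, adj_p):
    print(f"{name}: raw p = {p:.3g}, BH-adjusted p = {q:.3g}")
```

In practice a statistics library would normally be used for both steps; the hand-written versions here only make the two calculations explicit.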