
Table 4 The top 25 words in Introns

From: The word landscape of the non-coding segments of the Arabidopsis thaliana genome

 

| Word | S (unmasked) | ES (unmasked) | O (unmasked) | EO (unmasked) | SlnSES (unmasked) | S (masked) | ES (masked) | O (masked) | EO (masked) | SlnSES (masked) | RevComp (unmasked) | RC_Pos (unmasked) | Pal (unmasked) | PValues (unmasked) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TTTTTGTT | 10048 | 9365.74 | 11094 | 10679.8 | 706.524 | 9819 | 9103.26 | 10783 | 10355.3 | 743.17 | TTTTTGTT | 10048 | 9365.74 | 3.44E-05 |
| TTTTTCTT | 9144 | 8495.68 | 10021 | 9609.91 | 672.454 | 8939 | 8293.57 | 9751 | 9363.74 | 669.915 | TTTTTCTT | 9144 | 8495.68 | 1.58E-05 |
| CTTTTTTC | 2764 | 2170.42 | 2821 | 2314.32 | 668.224 | 2713 | 2187.97 | 2767 | 2333.43 | 583.515 | CTTTTTTC | 2764 | 2170.42 | 8.88E-16 |
| GTTTTTGA | 2673 | 2105.13 | 2742 | 2243.33 | 638.372 | 2631 | 2056.65 | 2696 | 2190.66 | 647.973 | GTTTTTGA | 2673 | 2105.13 | -2.22E-16 |
| TTTTGCAG | 3505 | 2959.4 | 3523 | 3179.19 | 593.06 | 3452 | 2920.63 | 3470 | 3136.4 | 577.016 | TTTTGCAG | 3505 | 2959.4 | 1.07E-09 |
| TTTTTTGT | 7618 | 7067.97 | 8198 | 7889.79 | 570.901 | 7400 | 6823.86 | 7922 | 7600.06 | 599.8 | TTTTTTGT | 7618 | 7067.97 | 0.000286 |
| TTTTTTGG | 3765 | 3238.3 | 3942 | 3487.94 | 567.378 | 3635 | 3124.76 | 3795 | 3362.05 | 549.804 | TTTTTTGG | 3765 | 3238.3 | 2.62E-14 |
| TTTTCTTT | 9256 | 8733.23 | 10299 | 9900.39 | 538.109 | 9041 | 8500.1 | 9994 | 9615.3 | 557.761 | TTTTCTTT | 9256 | 8733.23 | 3.48E-05 |
| TGTTTTTT | 7487 | 6984.58 | 8028 | 7790.67 | 520.072 | 7254 | 6759.65 | 7750 | 7524.05 | 512 | TGTTTTTT | 7487 | 6984.58 | 0.003768 |
| CTCTCTTT | 3193 | 2716.79 | 3289 | 2911.9 | 515.697 | 3086 | 2625.01 | 3165 | 2811.09 | 499.291 | CTCTCTTT | 3193 | 2716.79 | 3.97E-12 |
| ATTTTTTA | 2508 | 2044.78 | 2645 | 2177.76 | 512.128 | 2383 | 2003.78 | 2486 | 2133.28 | 413.027 | ATTTTTTA | 2508 | 2044.78 | 3.33E-16 |
| TTTTTTCC | 3166 | 2702.47 | 3253 | 2896.16 | 501.186 | 3086 | 2616.31 | 3161 | 2801.55 | 509.528 | TTTTTTCC | 3166 | 2702.47 | 4.13E-11 |
| TGTTTCAG | 2215 | 1790.21 | 2239 | 1902.05 | 471.614 | 2153 | 1745.3 | 2177 | 1853.55 | 451.987 | TGTTTCAG | 2215 | 1790.21 | 3.01E-14 |
| GGTTTTTG | 2029 | 1611.17 | 2092 | 1708.92 | 467.851 | 1997 | 1584.97 | 2058 | 1680.71 | 461.47 | GGTTTTTG | 2029 | 1611.17 | 1.11E-16 |
| TTTTGTTT | 12142 | 11689.3 | 13879 | 13619.2 | 461.327 | 11843 | 11368.1 | 13438 | 13205.7 | 484.659 | TTTTGTTT | 12142 | 11689.3 | 0.013306 |
| TTTGTTTT | 11017 | 10569.9 | 12527 | 12188.1 | 456.39 | 10729 | 10259.7 | 12106 | 11796.5 | 479.827 | TTTGTTTT | 11017 | 10569.9 | 0.00113 |
| CTTTTTTA | 2234 | 1828.76 | 2282 | 1943.72 | 447.149 | 2178 | 1816.31 | 2220 | 1930.26 | 395.524 | CTTTTTTA | 2234 | 1828.76 | 4.17E-14 |
| AATATATT | 2022 | 1642.55 | 2143 | 1742.72 | 420.253 | 1925 | 1679.14 | 2019 | 1782.16 | 263.038 | AATATATT | 2022 | 1642.55 | 4.44E-16 |
| ATTTTTCA | 2411 | 2030.35 | 2467 | 2162.1 | 414.291 | 2349 | 1971.89 | 2398 | 2098.68 | 411.073 | ATTTTTCA | 2411 | 2030.35 | 7.51E-11 |
| ATTTTTTC | 2810 | 2425.9 | 2881 | 2592.99 | 413.021 | 2736 | 2412.96 | 2800 | 2578.85 | 343.758 | ATTTTTTC | 2810 | 2425.9 | 1.43E-08 |
| CAATTTTT | 2402 | 2023.84 | 2481 | 2155.04 | 411.472 | 2320 | 1952.98 | 2388 | 2078.19 | 399.534 | CAATTTTT | 2402 | 2023.84 | 3.73E-12 |
| TTTTTTCT | 7674 | 7280.17 | 8254 | 8142.69 | 404.295 | 7476 | 7074.7 | 8001 | 7897.8 | 412.475 | TTTTTTCT | 7674 | 7280.17 | 0.109849 |
| TGTTGCAG | 1922 | 1563.72 | 1933 | 1657.84 | 396.507 | 1891 | 1543.21 | 1902 | 1635.78 | 384.332 | TGTTGCAG | 1922 | 1563.72 | 2.42E-11 |
| TTTCATTT | 4636 | 4258.39 | 4840 | 4630.74 | 393.879 | 4538 | 4169.05 | 4731 | 4529.8 | 384.813 | TTTCATTT | 4636 | 4258.39 | 0.001152 |
| TTTTTATT | 5647 | 5276.08 | 6142 | 5792.21 | 383.658 | 5417 | 5037.47 | 5842 | 5517.96 | 393.481 | TTTTTATT | 5647 | 5276.08 | 2.72E-06 |

  1. Top 25 overrepresented words in the introns of Arabidopsis thaliana. Word is the short nucleotide sequence of a putative word. S and ES are the number of sequences in which the word occurs and the number of sequences in which it was expected to occur, respectively; O and EO are the total number of occurrences and the expected total number of occurrences. The SlnSES score measures statistical coverage of the sequences analyzed in the set and is based on a Markov chain background model. Each set of attributes was computed for both the masked and the unmasked version of the segment. The emphasis is on the unmasked version: the table is sorted by the unmasked SlnSES score.
  2. Further information on each word is given by its reverse complement (RevComp), the position of that reverse complement in the set of results (RC_Pos), and a flag indicating whether the word is a genomic palindrome (Pal).
  3. Finally, PValues gives the p-value assigned to each word, which helps determine whether the word is statistically relevant or was flagged as interesting only by chance.
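
The attributes described in the notes above can be illustrated with a minimal Python sketch. It rests on two assumptions that the notes do not state explicitly: that SlnSES is computed as S × ln(S / ES), which the column name suggests and which agrees with the listed values (for example, S = 10048 and ES = 9365.74 give about 706.5), and that a genomic palindrome is a word equal to its own reverse complement. The function names `sln_ses`, `reverse_complement`, and `is_palindrome` are hypothetical and are not taken from the authors' software.

```python
import math

# Assumption: SlnSES = S * ln(S / ES). The formula is inferred from the
# column name and the tabulated values; it is not stated in the notes.
def sln_ses(s: int, es: float) -> float:
    """Score a word from its observed (S) and expected (ES) sequence counts."""
    return s * math.log(s / es)

# Translation table that maps each DNA base to its complement.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(word: str) -> str:
    """Complement every base, then reverse the result."""
    return word.translate(_COMPLEMENT)[::-1]

def is_palindrome(word: str) -> bool:
    """Assumed definition: a word identical to its reverse complement."""
    return word == reverse_complement(word)

# S and ES of the top-ranked word, unmasked version.
score = sln_ses(10048, 9365.74)
print(round(score, 1))

print(reverse_complement("TTTTGCAG"))
print(is_palindrome("AATATATT"), is_palindrome("TTTTTGTT"))
```

Because ES is reported to only a few significant digits, a score recomputed this way can differ from the listed SlnSES in the last decimal places.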