Table 8 The top 25 words in the entire genome

From: The word landscape of the non-coding segments of the Arabidopsis thaliana genome

| Word | Unmasked S | Unmasked ES | Unmasked O | Unmasked EO | Unmasked OlnOEO | Masked S | Masked ES | Masked O | Masked EO | Masked OlnOEO | Unmasked RevComp | Unmasked RC_Pos | Unmasked Pal | Unmasked PValues |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AAAAAAAA | 5 | 5 | 128631 | 119310 | 9675.67 | 5 | 5 | 101229 | 95334 | 6073.66 | TTTTTTTT | 1 | No | 0 |
| TTTTTTTT | 5 | 5 | 126533 | 117302 | 9585.11 | 5 | 5 | 98883 | 93091.2 | 5968.36 | AAAAAAAA | 0 | No | 1.67E-15 |
| TATATATA | 5 | 5 | 58215 | 49385.7 | 9575.32 | 5 | 5 | 29264 | 27159.9 | 2183.54 | TATATATA | 2 | Yes | 3.89E-15 |
| ATATATAT | 5 | 5 | 59429 | 53453 | 6298.28 | 5 | 5 | 30192 | 29596.8 | 601.111 | ATATATAT | 3 | Yes | 3.00E-15 |
| TAAAAAAT | 5 | 5 | 14823 | 11276.3 | 4053.8 | 5 | 5 | 11492 | 9148.23 | 2621.21 | ATTTTTTA | 5 | No | 4.44E-16 |
| ATTTTTTA | 5 | 5 | 14743 | 11385.1 | 3810.52 | 5 | 5 | 11392 | 9219.87 | 2409.99 | TAAAAAAT | 4 | No | 3.33E-16 |
| GAAGAAGA | 5 | 5 | 30102 | 26908.7 | 3375.68 | 5 | 5 | 22784 | 20523.6 | 2380.53 | TCTTCTTC | 7 | No | 0 |
| TCTTCTTC | 5 | 5 | 30267 | 27090.3 | 3356.11 | 5 | 5 | 23044 | 20902.7 | 2247.42 | GAAGAAGA | 6 | No | 0 |
| TTTTAAAA | 5 | 5 | 29354 | 26314.9 | 3208.24 | 5 | 5 | 19409 | 17519.9 | 1987.46 | TTTTAAAA | 8 | Yes | 2.55E-15 |
| AATATATT | 5 | 5 | 14170 | 11353.5 | 3140.06 | 5 | 5 | 11168 | 10179.5 | 1035.06 | AATATATT | 9 | Yes | 1.11E-16 |
| TTTTCTTT | 5 | 5 | 31066 | 28174.8 | 3034.69 | 5 | 5 | 26876 | 24423.6 | 2571.58 | AAAGAAAA | 11 | No | 0 |
| AAAGAAAA | 5 | 5 | 31033 | 28187.3 | 2984.8 | 5 | 5 | 26861 | 24502.1 | 2469 | TTTTCTTT | 10 | No | 1.11E-16 |
| AGAGAGAG | 5 | 5 | 19376 | 16630.5 | 2960.63 | 5 | 5 | 12615 | 11397.8 | 1280.05 | CTCTCTCT | 16 | No | 1.11E-16 |
| TCTCTCTC | 5 | 5 | 19179 | 16519.7 | 2862.73 | 5 | 5 | 12912 | 11634.1 | 1345.64 | GAGAGAGA | 14 | No | 4.44E-16 |
| GAGAGAGA | 5 | 5 | 20064 | 17413.4 | 2842.81 | 5 | 5 | 13136 | 11970.7 | 1220.21 | TCTCTCTC | 13 | No | 1.89E-15 |
| AAGAAGAA | 5 | 5 | 32397 | 29731.9 | 2781.12 | 5 | 5 | 24352 | 23296.2 | 1079.35 | TTCTTCTT | 19 | No | 0 |
| CTCTCTCT | 5 | 5 | 18513 | 15956.1 | 2751.61 | 5 | 5 | 12312 | 11212.7 | 1151.45 | AGAGAGAG | 12 | No | 1.11E-16 |
| AGAAGAAG | 5 | 5 | 26477 | 24049.7 | 2545.91 | 5 | 5 | 19161 | 18013.6 | 1183.17 | CTTCTTCT | 20 | No | 8.88E-16 |
| TTATATAA | 5 | 5 | 11402 | 9138.11 | 2523.66 | 5 | 5 | 9262 | 8518.12 | 775.46 | TTATATAA | 18 | Yes | 1.11E-15 |
| TTCTTCTT | 5 | 5 | 32333 | 29910 | 2518.58 | 5 | 5 | 24550 | 23579.9 | 989.811 | AAGAAGAA | 15 | No | 0 |
| CTTCTTCT | 5 | 5 | 26463 | 24183.9 | 2383.23 | 5 | 5 | 19432 | 18332.3 | 1132.03 | AGAAGAAG | 17 | No | 0 |
| TTTTTCTT | 5 | 5 | 30561 | 28331 | 2315.57 | 5 | 5 | 26516 | 24717.1 | 1862.84 | AAGAAAAA | 22 | No | 0 |
| AAGAAAAA | 5 | 5 | 30461 | 28234.7 | 2311.9 | 5 | 5 | 26488 | 24756.8 | 1790.32 | TTTTTCTT | 21 | No | 4.44E-16 |
| TTTGTTTT | 5 | 5 | 32141 | 29931 | 2289.6 | 5 | 5 | 27813 | 26102.2 | 1765.71 | AAAACAAA | 36 | No | 8.88E-16 |
  1. Top 25 overrepresented words for the entire genome of Arabidopsis thaliana. The Word column gives the short nucleotide sequence of each putative word. S and ES give the number of chromosomes in which the word occurs and the number in which it was expected to occur, respectively; O and EO give the total number of occurrences and the expected total number of occurrences. The OlnOEO score measures the statistical overrepresentation of the word in the genome and is based on a Markov chain background model. Each set of attributes was computed for both the masked and the unmasked version of the corresponding segment, with emphasis on the unmasked version (the table is sorted by the unmasked OlnOEO score).
  2. Further information on each word is given by its reverse complement (RevComp), the position of that reverse complement in the set of results (RC_Pos), and a flag indicating whether the word is a genomic palindrome, i.e. identical to its own reverse complement (Pal).
  3. Finally, PValues gives a p-value for each word, which helps to judge whether the word is genuinely relevant or appeared interesting only by random chance.
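
The derived columns described in the notes above can be recomputed from the raw values. The following is a minimal sketch under stated assumptions, not the authors' implementation. The score formula O * ln(O / EO) is inferred from the column name and agrees with the tabulated numbers, but the notes do not state it explicitly. The zero-based reading of RC_Pos is likewise inferred from the first rows of the table. The function names `oln_oeo`, `reverse_complement`, `is_palindrome` and `rc_position` are hypothetical. The p-value is not reproduced, since its computation is not described here.

```python
import math

# Complement map for the four DNA bases that appear in the table.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def oln_oeo(observed, expected):
    """Assumed form of the score: O * ln(O / EO)."""
    return observed * math.log(observed / expected)


def reverse_complement(word):
    """Complement every base, then reverse the word."""
    return word.translate(_COMPLEMENT)[::-1]


def is_palindrome(word):
    """A genomic palindrome equals its own reverse complement."""
    return word == reverse_complement(word)


def rc_position(ranked_words, word):
    """Zero-based rank of the reverse complement, or -1 if it is absent."""
    rc = reverse_complement(word)
    return ranked_words.index(rc) if rc in ranked_words else -1


# First four rows of the table, in ranked order.
top_words = ["AAAAAAAA", "TTTTTTTT", "TATATATA", "ATATATAT"]

score = oln_oeo(128631, 119310)  # unmasked O and EO for AAAAAAAA
print(round(score, 2), reverse_complement("TAAAAAAT"), is_palindrome("TATATATA"))
print([rc_position(top_words, w) for w in top_words])  # [1, 0, 2, 3]
```

For the first row the computed score lands close to the tabulated 9675.67; small differences are expected because EO is rounded in the table. The positions printed for the first four words match their RC_Pos entries.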