The word landscape of the non-coding segments of the Arabidopsis thaliana genome

Lichtenberg, Jens; Yilmaz, Alper; Welch, Joshua D; Kurz, Kyle; Liang, Xiaoyu; Drews, Frank; Ecker, Klaus; Lee, Stephen S; Geisler, Matt; Grotewold, Erich; Welch, Lonnie R

doi:10.1186/1471-2164-10-463

BMC Genomics

Table 4 The top 25 words in Introns

From: The word landscape of the non-coding segments of the Arabidopsis thaliana genome

	Unmasked					Masked					Unmasked
Word	S	ES	O	EO	SlnSES	S	ES	O	EO	SlnSES	RevComp	RC_Pos	Pal	PValues
TTTTTGTT	10048	9365.74	11094	10679.8	706.524	9819	9103.26	10783	10355.3	743.17	TTTTTGTT	10048	9365.74	3.44E-05
TTTTTCTT	9144	8495.68	10021	9609.91	672.454	8939	8293.57	9751	9363.74	669.915	TTTTTCTT	9144	8495.68	1.58E-05
CTTTTTTC	2764	2170.42	2821	2314.32	668.224	2713	2187.97	2767	2333.43	583.515	CTTTTTTC	2764	2170.42	8.88E-16
GTTTTTGA	2673	2105.13	2742	2243.33	638.372	2631	2056.65	2696	2190.66	647.973	GTTTTTGA	2673	2105.13	-2.22E-16
TTTTGCAG	3505	2959.4	3523	3179.19	593.06	3452	2920.63	3470	3136.4	577.016	TTTTGCAG	3505	2959.4	1.07E-09
TTTTTTGT	7618	7067.97	8198	7889.79	570.901	7400	6823.86	7922	7600.06	599.8	TTTTTTGT	7618	7067.97	0.000286
TTTTTTGG	3765	3238.3	3942	3487.94	567.378	3635	3124.76	3795	3362.05	549.804	TTTTTTGG	3765	3238.3	2.62E-14
TTTTCTTT	9256	8733.23	10299	9900.39	538.109	9041	8500.1	9994	9615.3	557.761	TTTTCTTT	9256	8733.23	3.48E-05
TGTTTTTT	7487	6984.58	8028	7790.67	520.072	7254	6759.65	7750	7524.05	512	TGTTTTTT	7487	6984.58	0.003768
CTCTCTTT	3193	2716.79	3289	2911.9	515.697	3086	2625.01	3165	2811.09	499.291	CTCTCTTT	3193	2716.79	3.97E-12
ATTTTTTA	2508	2044.78	2645	2177.76	512.128	2383	2003.78	2486	2133.28	413.027	ATTTTTTA	2508	2044.78	3.33E-16
TTTTTTCC	3166	2702.47	3253	2896.16	501.186	3086	2616.31	3161	2801.55	509.528	TTTTTTCC	3166	2702.47	4.13E-11
TGTTTCAG	2215	1790.21	2239	1902.05	471.614	2153	1745.3	2177	1853.55	451.987	TGTTTCAG	2215	1790.21	3.01E-14
GGTTTTTG	2029	1611.17	2092	1708.92	467.851	1997	1584.97	2058	1680.71	461.47	GGTTTTTG	2029	1611.17	1.11E-16
TTTTGTTT	12142	11689.3	13879	13619.2	461.327	11843	11368.1	13438	13205.7	484.659	TTTTGTTT	12142	11689.3	0.013306
TTTGTTTT	11017	10569.9	12527	12188.1	456.39	10729	10259.7	12106	11796.5	479.827	TTTGTTTT	11017	10569.9	0.00113
CTTTTTTA	2234	1828.76	2282	1943.72	447.149	2178	1816.31	2220	1930.26	395.524	CTTTTTTA	2234	1828.76	4.17E-14
AATATATT	2022	1642.55	2143	1742.72	420.253	1925	1679.14	2019	1782.16	263.038	AATATATT	2022	1642.55	4.44E-16
ATTTTTCA	2411	2030.35	2467	2162.1	414.291	2349	1971.89	2398	2098.68	411.073	ATTTTTCA	2411	2030.35	7.51E-11
ATTTTTTC	2810	2425.9	2881	2592.99	413.021	2736	2412.96	2800	2578.85	343.758	ATTTTTTC	2810	2425.9	1.43E-08
CAATTTTT	2402	2023.84	2481	2155.04	411.472	2320	1952.98	2388	2078.19	399.534	CAATTTTT	2402	2023.84	3.73E-12
TTTTTTCT	7674	7280.17	8254	8142.69	404.295	7476	7074.7	8001	7897.8	412.475	TTTTTTCT	7674	7280.17	0.109849
TGTTGCAG	1922	1563.72	1933	1657.84	396.507	1891	1543.21	1902	1635.78	384.332	TGTTGCAG	1922	1563.72	2.42E-11
TTTCATTT	4636	4258.39	4840	4630.74	393.879	4538	4169.05	4731	4529.8	384.813	TTTCATTT	4636	4258.39	0.001152
TTTTTATT	5647	5276.08	6142	5792.21	383.658	5417	5037.47	5842	5517.96	393.481	TTTTTATT	5647	5276.08	2.72E-06

Top 25 overrepresented words in the introns of Arabidopsis thaliana. The Word column gives the short nucleotide sequence of a putative word. S and ES give the number of sequences in which a word occurs and the number of sequences in which it was expected to occur, respectively; O and EO give the total number of occurrences and the expected total number of occurrences. The SlnSES score describes the statistical coverage of the sequences analyzed in the set and is based on a Markov chain background model. Each set of attributes was computed for both the masked and the unmasked version of the segment, with emphasis on the unmasked version (i.e., the table is sorted by the unmasked SlnSES score).
Further information about each word is provided by its reverse complement (RevComp), the position of the reverse complement in the result set (RC_Pos), and a flag indicating whether the word is a genomic palindrome (Pal).
Finally, PValues gives a p-value for each word, which helps determine whether the word is relevant or was flagged as interesting by random chance.
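
As a reading aid for the columns described above, here is a minimal Python sketch. It rests on two assumptions that the caption does not state outright. First, the SlnSES score is taken to be S · ln(S / ES); this is inferred from the column name and is consistent with the tabulated values (for example, S = 10048 and ES = 9365.74 give roughly 706.5). Second, a "genomic palindrome" is taken to mean a word that equals its own reverse complement. The function names are illustrative and are not taken from the authors' software.

```python
import math

# Translation table for complementing DNA bases (A<->T, C<->G).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def sln_ses(s, es):
    """Assumed form of the SlnSES score: S * ln(S / ES).

    s  -- number of sequences in which the word occurs (S)
    es -- number of sequences in which it was expected to occur (ES)
    """
    return s * math.log(s / es)


def reverse_complement(word):
    """Complement every base, then reverse the string."""
    return word.translate(COMPLEMENT)[::-1]


def is_palindrome(word):
    """Assumed meaning of Pal: the word equals its own reverse complement."""
    return word == reverse_complement(word)


if __name__ == "__main__":
    # First table row, unmasked columns: S = 10048, ES = 9365.74.
    print(round(sln_ses(10048, 9365.74), 1))
    print(reverse_complement("TTTTTGTT"))  # AACAAAAA
    print(is_palindrome("AATATATT"))       # True
```

Because ES is reported to only a few significant digits, scores recomputed this way can differ from the tabulated ones in the last decimal places.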


ISSN: 1471-2164
