Improving alignment accuracy on homopolymer regions for semiconductor-based sequencing technologies

BMC Genomics

Table 1 Identification errors of homopolymer length with different methods

No	Nt	Pos	Count	KNN errors (%)	Torrent suite errors (%)	Bayesian errors (%)	Reference errors (%)	Proposed approach weight	Proposed approach errors (%)
1	A	1–75	144230	7.002	1.119	2.296	0.298	0.28	0.250
2	A	76–150	112776	12.121	1.651	4.722	0.489	0.34	0.453
3	A	151–225	97568	18.733	2.926	8.150	0.423	0.14	0.421
4	A	226–300	48033	22.292	4.655	10.259	0.535	0.24	0.510
5	C	1–75	88732	6.534	1.843	2.779	0.034	0.14	0.027
6	C	76–150	77650	10.382	2.489	4.595	0.556	0.36	0.121
7	C	151–225	63658	18.581	3.187	6.383	0.545	0.28	0.542
8	C	226–300	35736	17.910	4.600	6.159	0.926	0.30	0.923
9	G	1–75	97493	4.141	1.422	1.826	0.609	0.30	0.376
10	G	76–150	78192	14.874	1.623	3.864	0.322	0.32	0.152
11	G	151–225	64680	16.868	2.273	5.683	1.062	0.14	1.062
12	G	226–300	34116	18.754	2.492	7.985	0.147	0.12	0.147
13	T	1–75	156550	5.186	1.106	2.504	0.076	0.14	0.054
14	T	76–150	152034	11.446	1.571	5.780	0.342	0.30	0.297
15	T	151–225	111090	14.720	2.331	7.290	0.419	0.32	0.362
16	T	226–300	68448	13.912	3.315	8.240	0.723	0.28	0.599

“Count” is the number of homopolymers in each class. “KNN” is the K-nearest-neighbors method. “Reference” means that only reference information is used in the designed model (Weight = 0).
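Because the 16 classes differ widely in size, comparing methods across the whole table may be easier with a count-weighted mean of the per-class error rates. The sketch below is only an illustration of that arithmetic: the weighted mean is not a figure reported in the table, and the names `ROWS`, `METHODS`, and `weighted_mean_errors` are hypothetical. The numbers are transcribed from Table 1, with the "Proposed" column taken from the proposed approach's error column.

```python
# Rows transcribed from Table 1:
# (nucleotide, position range, count, KNN, Torrent suite, Bayesian,
#  Reference, Proposed), with all error rates in percent.
ROWS = [
    ("A", "1-75", 144230, 7.002, 1.119, 2.296, 0.298, 0.250),
    ("A", "76-150", 112776, 12.121, 1.651, 4.722, 0.489, 0.453),
    ("A", "151-225", 97568, 18.733, 2.926, 8.150, 0.423, 0.421),
    ("A", "226-300", 48033, 22.292, 4.655, 10.259, 0.535, 0.510),
    ("C", "1-75", 88732, 6.534, 1.843, 2.779, 0.034, 0.027),
    ("C", "76-150", 77650, 10.382, 2.489, 4.595, 0.556, 0.121),
    ("C", "151-225", 63658, 18.581, 3.187, 6.383, 0.545, 0.542),
    ("C", "226-300", 35736, 17.910, 4.600, 6.159, 0.926, 0.923),
    ("G", "1-75", 97493, 4.141, 1.422, 1.826, 0.609, 0.376),
    ("G", "76-150", 78192, 14.874, 1.623, 3.864, 0.322, 0.152),
    ("G", "151-225", 64680, 16.868, 2.273, 5.683, 1.062, 1.062),
    ("G", "226-300", 34116, 18.754, 2.492, 7.985, 0.147, 0.147),
    ("T", "1-75", 156550, 5.186, 1.106, 2.504, 0.076, 0.054),
    ("T", "76-150", 152034, 11.446, 1.571, 5.780, 0.342, 0.297),
    ("T", "151-225", 111090, 14.720, 2.331, 7.290, 0.419, 0.362),
    ("T", "226-300", 68448, 13.912, 3.315, 8.240, 0.723, 0.599),
]

# Method names in the same order as the error columns of each row.
METHODS = ["KNN", "Torrent suite", "Bayesian", "Reference", "Proposed"]


def weighted_mean_errors(rows):
    """Return the count-weighted mean error (%) for each method.

    This aggregation is an illustrative assumption, not a statistic
    reported in the source table.
    """
    total = sum(row[2] for row in rows)
    return {
        name: sum(row[2] * row[3 + i] for row in rows) / total
        for i, name in enumerate(METHODS)
    }


summary = weighted_mean_errors(ROWS)
for name in METHODS:
    print(f"{name}: {summary[name]:.3f}%")
```

Weighting by "Count" keeps the small late-position classes (226–300) from influencing the summary as much as the much larger early-position classes.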

ISSN: 1471-2164