Microbial comparative pan-genomics using binomial mixture models

BMC Genomics

Table 1 Effect of gene predictions

Data set	Observed gene families	ORFans	Chao estimate	Binomial mixture estimate
Original NCBI	12599	5438	26614	42640
Reduced 10%	11273	4470	22549	32528
Reduced 50%	9336	3272	17083	27456
Easygene	9211	3121	17041	29818

Number of observed gene families in each data set, number of ORFans (gene families found in only one genome), and Chao and binomial mixture estimates of pan-genome size, for the original E. coli data and for the reduced data sets. "Reduced 10%" means that the shortest 10% of hypothetical proteins were removed from the original data set; "Reduced 50%" is defined correspondingly. "Easygene" is a new data set with genes predicted by the Easygene gene prediction tool.
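To make the table easier to compare across rows, the following minimal sketch computes two derived quantities from the Table 1 values: the share of observed gene families that are ORFans, and the number of not-yet-observed gene families implied by each estimate (estimate minus observed). The names `TABLE_1`, `summarize` and the dictionary keys are illustrative assumptions, not part of the original analysis, and the sketch does not reproduce the Chao or binomial mixture estimators themselves.

```python
# Rows copied from Table 1:
# (data set, observed gene families, ORFans, Chao estimate, binomial mixture estimate)
TABLE_1 = [
    ("Original NCBI", 12599, 5438, 26614, 42640),
    ("Reduced 10%", 11273, 4470, 22549, 32528),
    ("Reduced 50%", 9336, 3272, 17083, 27456),
    ("Easygene", 9211, 3121, 17041, 29818),
]


def summarize(row):
    """Derive simple comparison figures for one row of the table.

    Hypothetical helper for illustration only.
    """
    name, observed, orfans, chao, bin_mix = row
    return {
        "data_set": name,
        # Fraction of observed gene families found in one genome only.
        "orfan_share": orfans / observed,
        # Gene families each estimate implies are still unobserved.
        "unseen_chao": chao - observed,
        "unseen_bin_mix": bin_mix - observed,
    }


summaries = [summarize(row) for row in TABLE_1]

for s in summaries:
    print(
        f"{s['data_set']:<14} "
        f"ORFan share {s['orfan_share']:.3f}  "
        f"unseen (Chao) {s['unseen_chao']:>6}  "
        f"unseen (bin. mix.) {s['unseen_bin_mix']:>6}"
    )
```

Read this way, the ORFan share falls from about 43% in the original NCBI data to about 34% in the Easygene data, and in every row the binomial mixture estimate exceeds the Chao estimate.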

ISSN: 1471-2164