Effect of a growing E. coli data set. Observed sample (black) and estimated population (red and blue) pan-genome sizes for E. coli, as a function of the number of genomes sampled. The blue curve is our mixture-model estimate, the red curve is the Chao lower-bound estimate, and the black curve is the observed size. All values are averages over 22 data sets. Note that the estimates vary more when few genomes are sampled, because there are many more ways to draw a small number of genomes from the pool of 22 genomes; at the other end of the scale, the 22 possible combinations of 21 genomes are very similar to each other.
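The subset counts behind the remark on variability can be checked with a few lines of Python. This is a minimal sketch, not the authors' code: the function names `subset_count` and `chao_lower_bound` are hypothetical, and the Chao lower bound is written in its standard textbook form (observed size plus the squared singleton count over twice the doubleton count), which is assumed here because the caption does not give the formula.

```python
from math import comb

POOL_SIZE = 22  # number of genomes in the pool, as stated in the caption


def subset_count(k, n=POOL_SIZE):
    """Number of distinct ways to choose k genomes out of a pool of n."""
    return comb(n, k)


def chao_lower_bound(observed, singletons, doubletons):
    """Standard Chao lower bound on the population size (assumed form).

    observed   -- number of gene families seen in the sample
    singletons -- gene families found in exactly one genome
    doubletons -- gene families found in exactly two genomes
    """
    if doubletons == 0:
        raise ValueError("the bound is undefined when there are no doubletons")
    return observed + singletons ** 2 / (2 * doubletons)


if __name__ == "__main__":
    # Print how many distinct subsets exist at a few sample sizes.
    for k in (2, 5, 11, 21):
        print(k, subset_count(k))
```

With a pool of 22 genomes, `subset_count(21)` returns 22, matching the 22 combinations of 21 genomes mentioned in the caption, while `subset_count(2)` already returns 231. The count peaks at 11 genomes rather than at the smallest sizes, so the number of possible subsets is only part of the picture: small subsets also share few genomes with one another.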