These analyses extend those in Chen et al.
 in three ways: first, by using a larger sample, so that the analyses cover almost all taxonomic virus genera; second, by making the data more comprehensive, because genome size varies greatly, ranging from 1682 bp (S170-(−)ssRNA-31, Hepatitis delta virus, NC_001653) to 407339 bp (S42-dsDNA-42, Emiliania huxleyi virus 86, NC_007346) (
Additional file 1); and third, by applying tests of statistical significance. These extensions made it possible to investigate the relationship between the repetitiveness of microsatellites and genome size more fully.
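One common way to quantify such a relationship is the coefficient of determination (R²) of a simple linear regression of SSR count on genome size. The following is a minimal sketch only: the function name `r_squared` and the sample values are hypothetical, and this is not the software or the data used in this study.

```python
def r_squared(xs, ys):
    """Coefficient of determination for a simple linear regression of ys on xs.

    For a single predictor this equals the squared Pearson correlation.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross-products around the means
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("R^2 is undefined when either variable is constant")
    return (sxy * sxy) / (sxx * syy)


if __name__ == "__main__":
    # Hypothetical, made-up values for illustration only (not from this study)
    genome_sizes_bp = [2000, 10000, 50000, 150000, 400000]
    ssr_counts = [3, 14, 70, 190, 520]
    print(f"R^2 = {r_squared(genome_sizes_bp, ssr_counts):.3f}")
```

A value close to 1 indicates that genome size explains most of the variance in SSR count; the associated P value would come from a separate significance test, which this sketch does not perform.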
The previous analysis
 simply considered the correlation between microsatellites and genome size, based on a relatively small sample of 54 complete Hepatitis C virus (HCV) genomes, and found that the number of SSRs is weakly correlated with genome size. We believe that the result of Chen et al. lacks statistical significance because of the relatively small sample size and the uniform genome length. Here, a sample of 257 representative virus genome sequences was designed to investigate the relationship between SSRs and genome size across viruses as a whole. Our data showed a very strong and significant positive relationship between genome size and both the occurrence and the length of SSRs, with R² = 0.919, P < 0.001 (Figure
2A) and R² = 0.915, P < 0.001 (Figure
3A), respectively. That is, the longer the virus genome sequence, the more SSRs it contains. Hancock
[15, 35, 36] confirmed that simple sequence repeats are positively and significantly correlated with genome size in both archaea and eubacteria, and that SSRs accumulate preferentially in organisms with larger genomes. Moreover, there is evidence that short SSRs (1–4 bp in length) occur in reduced genomes, whereas long SSRs (5–11 bp in length) occur in larger genomes in prokaryotes
. The overall level of repetition in genomes is related to genome size and to the degree of repetition, and the entire genome accepts simple sequences in a concerted manner when its size increases
[36, 37]. A relative scarcity of repeating DNA is a major factor in causing the relatively compact size of the avian genome
[38, 39]. Furthermore, differences in genome size account for approximately 10% of the variance in genomic repetition in archaea and eubacteria
, suggesting that other factors can also play important roles. DNA structure and base-stacking determined the number and length distributions of microsatellites in vertebrate genomes over evolutionary time
. Hosts also account for some of the variance in SSR content. For example, at similar genome sizes, viruses infecting vertebrates and invertebrates tend to have a higher SSR content, relative abundance and relative density than viruses infecting bacteria (
Additional file 15). This can be explained as follows. Repeats are estimated to make up about 30-50% of reptile genomes, 15-20% of bird genomes
[40, 41], 26.1% of the Mus musculus genome
[42, 43], and 44.9% of the human genome
[44, 45]. By contrast, SSR tracts make up 2.4% of the E. coli genome
, significantly less than in vertebrates. SSRs have been reported to be hot spots for recombination as well as sites for random integration
[25, 26]. Thus, the increase in viral SSR content may be due to the incorporation of partial host genome sequences during the infection of vertebrates and invertebrates. Hosts have evolved a number of defense systems in response to the challenge from parasites, and parasites have in turn evolved multiple counter-defense mechanisms under selection pressure from their hosts. Bacteria have developed the CRISPR/Cas (CRISPR, clustered regularly interspaced short palindromic repeats; Cas, CRISPR-associated) immune system to defend against bacteriophages by cleaving their DNA
. Antagonistic coevolution between bacteria and their ubiquitous parasites, bacteriophages (phages), is well known
[48, 49]. The genomic regions of CRISPR/Cas are hot spots of recombination, and CRISPR/Cas modules have undergone rapid evolution in natural environments because of the recurrent selection pressure exerted by coevolving viruses
. Meanwhile, viruses may incorporate partial CRISPR/Cas sequences in response to bacterial defenses. Therefore, it is no coincidence that SSR content is high both in viruses that infect vertebrates and invertebrates and in these hosts themselves. Such recombination enhances, to a certain extent, the ability of the virus to infect its host and to evade host immunity. In evolutionary terms, it is the result of selection arising from the interaction between viruses and hosts. It has been proposed that reduced genome size represents an adaptation to the high rate of oxidative metabolism in birds, which results primarily from the demands of flight, and that the relatively small genome size of birds in general may reflect selective pressure to minimize the amount of repetitive DNA.
Overall, the longer the genome sequence, the greater its capacity to hold long SSRs. Each type of repeat unit is distributed within a certain range of genome lengths. Mono- and di- SSRs were observed in almost all analyzed virus genomes; tri- SSRs were also widely distributed across virus genomes, but their number is clearly lower than that of mono- and di- SSRs; tetra- SSRs are a common component of genomes larger than 100 kb (94.4% of the genomes in the > 100 kb group contain tetra- SSRs), but are relatively rare in genomes smaller than 100 kb; and no more than 50% of the genomes in the < 100 kb group contain penta- and hexa- SSRs. Moreover, the number of tetra-, penta- and hexa- SSRs is very small (Table
1). Dinucleotide and trinucleotide SSRs were observed in all analyzed HIV genomes (genome size approximately 9 kb), but almost no tetra-, penta- and hexanucleotide SSRs were found
. Tetranucleotide SSRs are contained in 26.7% of the analyzed Potyvirus genomes (genome size approximately 10 kb), but the number of tetranucleotide SSRs is small
. Tetra-, penta- and hexanucleotide SSRs are also rare in Mycoplasma, but they are relatively abundant in bacterial
[46, 55], fungal
[39, 41] and human
[58, 59] genomes. These results confirm that the distribution of SSRs is closely related to genome size. The accumulation of simple sequence repeats can be attributed to selection during evolution. It is well known that viruses such as influenza virus, hepatitis virus and human immunodeficiency virus (HIV) have high mutation rates during replication and/or recombination, which help them resist drugs and vaccines; this is one of the reasons why flu, hepatitis and acquired immunodeficiency syndrome (AIDS) are difficult to cure. Moreover, viruses lack complete repair mechanisms. Therefore, long SSRs are rarely found in viruses. In the opinion of Mrázek et al.
, small genomes are under strong negative selection against long SSRs because of their strong constraints against expansion.
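The motif-length classes discussed above (mono- to hexanucleotide SSRs) can be illustrated with a short scanner for perfect repeats. This is a sketch under stated assumptions, not the extraction software used in this study: the function name `find_ssrs` and, in particular, the minimum repeat thresholds in `MIN_REPEATS` are hypothetical values chosen for illustration.

```python
import re
from collections import Counter

# ASSUMED minimum repeat counts per motif length (1 = mono ... 6 = hexa);
# these thresholds are illustrative and are not taken from the study.
MIN_REPEATS = {1: 6, 2: 3, 3: 3, 4: 3, 5: 3, 6: 3}


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return perfect SSRs as (motif_length, motif, start, repeat_count) tuples."""
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in sorted(min_repeats.items()):
        # A motif of unit_len bases followed by at least (min_rep - 1) copies
        pattern = re.compile(r"([ACGTU]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that are themselves repeats of a shorter unit
            # (e.g. "AA" is a mononucleotide repeat, not a dinucleotide one)
            if any(
                motif == motif[:k] * (unit_len // k)
                for k in range(1, unit_len)
                if unit_len % k == 0
            ):
                continue
            repeats = len(match.group(0)) // unit_len
            hits.append((unit_len, motif, match.start(), repeats))
    return hits


if __name__ == "__main__":
    # Toy sequence for illustration only
    ssrs = find_ssrs("GGAAAAAAAGGCACACACTT")
    per_class = Counter(unit_len for unit_len, _, _, _ in ssrs)
    print(ssrs)
    print(dict(per_class))
```

Counting the hits per motif length for each genome, and grouping genomes by size (for example, above and below 100 kb), would give the kind of per-class distribution summarized in Table 1. Different thresholds would change the counts, so any comparison across studies should use the same settings.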