Relative entropy in chromosomes, plasmids, phages and GIs
Chromosomes were, on average, the most biased sequences (i.e. least similar to a random sequence) and therefore presumably the most subject to selective pressure of the sequences examined here. In terms of DKL there was a small but significant difference between GIs and chromosomes. Such a small difference is expected, since GIs are found within chromosomes and ameliorate over time, that is, their base composition tends towards that of the host chromosome [27]. A number of studies indicate that GIs consist of horizontally acquired mobile genetic fragments [22, 28], but our data do not identify what type of vector brought these GIs to their respective chromosomes.
The difference in DKL between phages and plasmids was small but statistically significant. In contrast to phages, plasmids exist independently of the host chromosome and are generally non-lethal [29]. When the phenotypic features of the plasmid are not required for bacterial survival, the plasmid will exist only in a small minority of the total microbial population [30]. In this way the selective pressure on plasmids is reduced compared with that on the host chromosome. Phages also exist independently of bacterial chromosomes but rely on the bacterial machinery for replication [29, 30]. However, those phages that are lytic will be under greater selective pressure than plasmids. Which particular features of phages result in their reduced information content remains to be clarified.
It should be noted that the comparisons were made between all deposited DNA sequences, which means that the results reflect the distributions of chromosomes, GIs, phages and plasmids that were originally selected and sequenced for a purpose. The effect of this bias is not clear.
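To make "least similar to a random sequence" concrete, the following is a minimal sketch of how a relative entropy of this kind could be computed. It assumes that the reference ("random") model is a zero-order model built from the sequence's own base frequencies and that overlapping k-mers are compared; the function name, the choice of k and the use of bits are illustrative assumptions, and the exact definition used in this study is the one given in the Methods section.

```python
import math
from collections import Counter


def relative_entropy(seq: str, k: int = 4) -> float:
    """Kullback-Leibler divergence (in bits) between observed k-mer
    frequencies and those expected from base composition alone.

    Assumption: the null model is zero-order (independent bases with the
    sequence's own base frequencies). This is an illustrative choice.
    """
    seq = seq.upper()
    # Observed base frequencies define the null ("random") model.
    base_counts = Counter(b for b in seq if b in "ACGT")
    total_bases = sum(base_counts.values())
    base_freq = {b: base_counts[b] / total_bases for b in "ACGT"}

    # Count overlapping k-mers that contain unambiguous bases only.
    kmer_counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= set("ACGT")
    )
    total_kmers = sum(kmer_counts.values())

    d_kl = 0.0
    for kmer, count in kmer_counts.items():
        p = count / total_kmers                     # observed frequency
        q = math.prod(base_freq[b] for b in kmer)   # expected under the null
        d_kl += p * math.log2(p / q)
    return d_kl
```

Under these assumptions a sequence whose k-mer usage matches its base composition scores close to zero, while a strongly patterned sequence scores higher, which is the sense in which a high value indicates a more biased sequence.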
Association between DKL and AT content
Figure 2 shows that decreased relative entropy (DKL) is associated with increasing AT content. An example of this is shown in Figure 3, where the more AT rich M. leprae was found to have lower DKL in genes that are shared with the more GC rich M. tuberculosis.
Although the coefficient of determination, R², varied between GIs, phages, plasmids and chromosomes, Figure 2 shows that the trend held for all DNA sequences examined. Phages showed a surprisingly high coefficient of determination, R² = 0.56, implying that relative entropy was more closely linked to changes in AT content in phages than in the other sequence types.
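As a reminder of what the coefficient of determination measures here, the sketch below fits a least-squares line to DKL as a function of AT content and reports the fraction of variance explained. The numbers are hypothetical values chosen for illustration only and are not data from this study; the function name is likewise an assumption.

```python
import numpy as np


def r_squared(x, y) -> float:
    """Coefficient of determination for a least-squares line y = a + b*x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)   # coefficients, highest degree first
    residuals = y - (intercept + slope * x)
    ss_res = np.sum(residuals ** 2)          # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)     # total variation
    return 1.0 - ss_res / ss_tot


# Hypothetical values for illustration only (not data from the study).
at_content = [0.35, 0.42, 0.50, 0.58, 0.66, 0.71]
d_kl = [0.120, 0.105, 0.098, 0.080, 0.074, 0.060]
print(f"R2 = {r_squared(at_content, d_kl):.2f}")
```

An R² of 0.56 would mean that just over half of the variation in DKL is accounted for by a linear dependence on AT content, with the remainder attributable to other factors.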
DKL variation within chromosomes
The DKL profile of the B. cereus chromosome suggests that areas of low relative entropy (low DKL) may indicate genetic regions especially prone to rearrangement. This propensity for rearrangement may be due to the increased stacking energy, position preference and number of quasi-palindromes observed in the region, all of which are determinants of genomic rearrangement. The relatively high occurrence of both palindromes and quasi-palindromes in the low relative entropy region of B. cereus may indicate that the mechanisms leading to quasi-palindrome correction have not been operating properly in this region compared with the chromosome in general [31], possibly also resulting in a higher number of accumulated mutations [17]. A similar region has been found in all sequenced members of the B. cereus group, which implies that the region has been selected and retained, possibly due to some unknown advantage. As can be seen from Figure 6, the region is predominantly gene coding. Since the genomes of the B. cereus group are relatively large compared with that of the distantly related B. subtilis, it can be speculated that the region is an acquired phage or plasmid.
Connections between DNA sequence and structure
Although relative entropy has some mathematical associations with thermodynamics, the two concepts are, in general, independent of each other [18]. However, it is known that more energy is required to melt GC rich sequences than AT rich sequences [32]. Since our results showed a negative correlation between DKL and AT content, it is possible that DNA structure energetics and DNA sequence relative entropy are connected, which would provide a link between DNA structure and sequence. This is supported by the findings shown in Figure 6, where a genetic region of low relative entropy was found to have more intrinsic DNA structural curvature, increased stacking energies and higher position preference. Hence, our findings may point to DNA structural differences between bacterial chromosomes, plasmids and phages that could have implications for how these biological entities are integrated into their hosts.
Phylogenetic influences on relative entropy
Our measure of relative entropy revealed that approximately 21% of the variation in DKL could be explained by close phylogenetic relationship. This value compares well with the 22% of variation that is explained by GC content. Thus, DKL appears to be as much influenced by phyla as GC content is, while almost 80% of the variation is accounted for by other factors. Using a method that is strongly associated with relative entropy (OUV, oligonucleotide usage variance), 55% of the variance could be explained by environment, phylum and AT content [17]. If non-coding regions were excluded, 67% of the variance could be explained by the same three factors. The above-mentioned study also discusses possible influences of environmental factors, and possible implications of high and low OUV for a number of microbes, which is relevant to the present discussion. The difference between OUV and relative entropy is explained in the Methods section.
Relation between relative entropy and DNA sequence size
Although a possible link between plasmid size and ecology has been reported [29], and a correlation between microbial chromosome size and GC content has been established previously, to the best of our knowledge no such correlation has been reported between plasmid size and GC content. It can also be seen from Figure 7 that plasmid sizes vary considerably more with respect to AT content than chromosome sizes do, which could indicate that the DNA sequences of plasmids are less stable and more prone to genetic exchange than the DNA sequences of chromosomes.
Lack of correlation between relative entropy and DNA sequence size
Although a correlation between DNA sequence size and DKL in bacterial chromosomes and plasmids could be expected, given that both factors correlate with genomic AT content, no such correlation was found. This may imply that the relation between genomic AT content and DNA sequence size is independent of the relation between genomic AT content and relative entropy. In other words, genomic AT content may be related differently to DNA sequence size than to relative entropy in bacterial chromosomes and plasmids (no correlation was found between AT content and DNA sequence size in GIs and phages). This claim was further strengthened by a linear regression analysis, which indicated that the variance explained increased additively when DNA sequence size and relative entropy were added as predictors. Hence, our models indicate that the mechanisms connecting AT content with DNA sequence size are distinct from, and unrelated to, the mechanisms linking AT content with relative entropy.
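The additivity argument can be illustrated with a small simulation. The sketch below is not the regression performed in this study; it uses synthetic, standardised variables (all names, coefficients and the noise level are hypothetical) to show that when two predictors of AT content are uncorrelated with each other, the variance explained by the joint model is approximately the sum of the variance explained by each predictor alone.

```python
import numpy as np


def r_squared(predictors, response) -> float:
    """R2 of an ordinary least-squares fit that includes an intercept."""
    X = np.column_stack([np.ones(len(response)), predictors])
    coef, *_ = np.linalg.lstsq(X, response, rcond=None)
    residuals = response - X @ coef
    return 1.0 - residuals.var() / response.var()


# Synthetic data, for illustration only (not data from the study).
rng = np.random.default_rng(0)
n = 5000
size = rng.normal(size=n)   # standardised sequence size (hypothetical)
d_kl = rng.normal(size=n)   # standardised relative entropy (hypothetical)
# AT content depends on both predictors, which are independent of each other.
at = -0.5 * size - 0.5 * d_kl + rng.normal(scale=0.7, size=n)

r2_size = r_squared(size, at)
r2_dkl = r_squared(d_kl, at)
r2_both = r_squared(np.column_stack([size, d_kl]), at)
print(r2_size, r2_dkl, r2_both)
```

If the two predictors were instead strongly correlated with each other, the joint R² would fall well short of the sum, so near-additive behaviour is what one would expect when size and relative entropy capture separate components of the variation in AT content.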
Connections to other studies
Using BLAST and graph/network analyses, it has been found that the different groups, i.e. chromosomes, plasmids and phages, in the majority of cases share DNA within their own group. In other words, chromosomes share DNA with chromosomes, plasmids share DNA with plasmids and phages share DNA with phages [5]. Variation among bacterial chromosomes, however, is predominantly mediated by genetic exchange from plasmids and only transiently by phages [5]. Our results indicated that plasmids, on average, had significantly lower DKL than any of the other types of DNA sequences. This could mean that plasmids are more tolerant of genetic alterations, something that may be crucial for maximizing host range [33]. A previous study has reported a correlation between plasmid-host similarity and GC content, i.e. the more similar plasmids and hosts were in terms of genomic signatures, the more GC rich they tended to be [9]. Phages have been found to have a narrow host range, even more so than plasmids [5], in spite of their larger numbers (estimates go as high as 5-10 phages for each bacterium on earth [34–36]). This may indicate that they have been subjected to increased selective pressures resulting, in turn, in significantly higher DKL than for plasmids. Due to the possible link between relative entropy and DNA sequence mutations, it can be speculated that phages are more vulnerable to genetic rearrangements than plasmids, resulting in higher DKL, on average, in phages.