Comprehensive prediction of chromosome dimer resolution sites in bacterial genomes
© Kono et al; licensee BioMed Central Ltd. 2011
Received: 24 September 2010
Accepted: 11 January 2011
Published: 11 January 2011
Skip to main content
© Kono et al; licensee BioMed Central Ltd. 2011
Received: 24 September 2010
Accepted: 11 January 2011
Published: 11 January 2011
During the replication process of bacteria with circular chromosomes, an odd number of homologous recombination events results in concatenated dimer chromosomes that cannot be partitioned into daughter cells. However, many bacteria harbor a conserved dimer resolution machinery consisting of one or two tyrosine recombinases, XerC and XerD, and their 28-bp target site, dif.
To study the evolution of the dif/ XerCD system and its relationship with replication termination, we report the comprehensive prediction of dif sequences in silico using a phylogenetic prediction approach based on iterated hidden Markov modeling. Using this method, dif sites were identified in 641 organisms among 16 phyla, with a 97.64% identification rate for single-chromosome strains. The dif sequence positions were shown to be strongly correlated with the GC skew shift-point that is induced by replicational mutation/selection pressures, but the difference in the positions of the predicted dif sites and the GC skew shift-points did not correlate with the degree of replicational mutation/selection pressures.
The sequence of dif sites is widely conserved among many bacterial phyla, and they can be computationally identified using our method. The lack of correlation between dif position and the degree of GC skew suggests that replication termination does not occur strictly at dif sites.
In bacteria, replication fork arrest is mainly repaired by homologous recombination . When such a recombination event occurs an odd number of times in one DNA replication event of circular chromosomes, the replicated chromosome is not properly segregated into two daughter chromosomes but instead produces a concatenated dimer [2, 3]. Therefore, many bacteria harbor highly conserved chromosome dimer resolution (CDR) machinery to separate the dimer chromosome into two monomer daughter chromosomes.
In Escherichia coli, chromosome dimers are resolved by two tyrosine recombinases, XerC and XerD, by the addition of a crossover at a specific 28-bp sequence called the dif site, which is located in the replication termination region of the chromosome [4, 5]. The dif sequence contains a pair of palindromic sequence motifs that correspond to the binding domains of XerC and XerD. The reaction is coordinated to the last stages of cell division by an essential cell division protein, FtsK, which functions as a septum-located DNA translocase [6–10]. FtsK moves along the chromosome unidirectionally towards the dif sequence, thanks to polar and orientated sequences, the KOPS [11–13]. CDR is initiated when FtsK reaches dif and its extreme C-terminal domain directly interacts with the C-terminal domain of XerD [14–18]. The dif/XerCD chromosome dimer resolution system seems widely conserved. In vivo experimental evidence for its conservation has been obtained in Xanthomonas campestris, Caulobacter crescentus and Vibrio cholerae [19–21]. In vitro characterization of Xer recombinases and dif sites has also been carried in Haemophilus influenzae and Bacillus subtilis [22, 23]. However, the importance of dif/XerCD for the fitness of bacteria has only been demonstrated in E. coli and V. cholerae [20, 24]. In some other bacteria, like Lactococci and Streptococci, chromosome dimer resolution is resolved by single tyrosine recombinases that act at specific dif site [25, 26]. In this case, dimer resolution still depends on FtsK and dif is still located opposite the origin of replication between oriented polar sequences . Several filamentous phages are known to hijack this site-specific recombination machinery of dif/XerCD for their integration into the host chromosome, containing pseudo- dif sequences within these phage genomes [28–34]. However, the dif sequence remains intact during such recombination process to ensure the integrity of chromosome dimer resolution machinery [35, 36]. The dif -like sequences in phages often contain more variable central region that is longer than the canonical 6 bp [31, 33, 34], and the XerD binding arm is considerably degenerate .
Because there is only one origin of replication on bacterial circular chromosomes, replication generally terminates in a specific region of the chromosome. This can be followed by the existence of a GC skew on the two replichore arms of the chromosomes with a shift-point opposite the origin of replication . Based on the observation that dif sites are generally located at or near the GC skew shift-point, Hendrickson and Lawrence proposed that replication might generally terminate at dif, which coordinate replication and chromosome dimer resolution . In E. coli, the replication process usually terminates at a narrow region that includes approximately 5% of the genome length and is located directly opposite the replication origin [39–41]. This is partly due to the existence of the Tus/Ter replication fork trap . dif is located within the replication fork trap but termination occurs precisely at the Tus site, not at dif  and dif is active when displaced outside of the replication termination region if it is still within the zone where KOPS converge . Nevertheless, the lack of universal conservation of the Tus protein may suggest that replication terminated at dif sites until the relatively recent takeover by the Tus-Ter system . We reasoned therefore that the comprehensive identification of dif sites and of their location with respect to the GC skew shift-point in hundreds of complete genomes might provide clues to the evolution of the CDR machinery and its possible link with the replication termination mechanism in bacterial species.
Prediction of the dif sequences has been reported by several groups with different approaches. Hendrickson and Lawrence showed that sequence skew can be used to predict the location of dif sites, and they identified putative dif sequences in 25 bacteria based on sequence similarity . Le Bourgeois and colleagues reported a new type of tyrosine recombinase, named XerS, which is responsible for CDR in Streptococci and Lactococci and this recombinase targets a 31-bp sequence element named dif SL . For comparison, they predicted dif sequences in 22 Firmicutes based on their similarity to that of B. subtilis with Megablast  and on the fact that the dif sequence occurs only once per genome. Val and colleagues identified that V. cholerae chromosome II, whose many features are plasmid-like, has an original dif sequence independently, and therefore it has FtsK-dependent CDR . For this purpose, they predicted dif sequences in five α-Proteobacteria and ten β-Proteobacteria that harbour multiple chromosomes, and discussed a conserved FtsK-dependent CDR on multiple chromosomes based on the close relative distance of the position of dif sequences and the GC skew shift-points. Their prediction method is based on a HMMER  score (<1.0e-05) with a profile built from 27 aligned dif sequences in the largest chromosomes of γ-Proteobacteria species, with manual checking for 6-bp spacing between two XerC and XerD binding motifs.
Carnoy and Roten reported the most comprehensive predictions to date, identifying putative dif sequences in 204 chromosomes in 137 Proteobacteria strains, discussing the high conservation of dif/ XerCD systems and the possible loss of dif sequences in endosymbionts, with suggestions for other CDR mechanisms . Here, the prediction was based on BLAST searches and YASS alignment  with the dif sequences of E. coli and B. subtilis, and candidates were selected based on their proximity to the GC skew shift-points and a single occurrence per chromosome. Previous predictions were therefore limited to three bacterial phyla: Proteobacteria, Firmicutes and Actinobacteria.
To this end, we describe comprehensive predictions for dif sequences based on a machine learning approach, tracing the phylogenetic conservation patterns of XerCD recombinases and using an iterative hidden Markov modeling method. Furthermore, we observed the relationship between predicted dif sequence positions and GC skew shift-points, and investigated whether replication termination occurs at the dif site.
Prediction result overview
% (chr %)
All of these predictions resulted in unique hits above the threshold, and their validity was further confirmed through leave-one-out cross-validation. On the other hand, predictions below the threshold (score < 10 and e-value > 1.0E-04) often resulted in multiple candidates with insufficient scores. When the initial prediction using the strict threshold failed, we manually checked the predicted sequences for the conservation of palindromic structure in the 7-12-bp and 17-22-bp positions, and candidates that were located close to the origin of replication were removed because the displacement of a dif sequence near the origin significantly reduces the growth rate .
In Proteobacteria, fuzzy matching in 28 Escherichia strains based on the dif sequence of E. coli K12 for the creation of an initial seed profile hidden Markov model yielded a unique dif sequence in each of the 28 strains. Iterated HMM using this seed profile resulted in unique predictions over the validation threshold in 306 genomes. An additional 137 chromosomes in 69 genomes were predicted with iterated HMM separated by classes, and 10 distant genomes were predicted using an alternative seed profile created with the 3 most similar genomes. The predicted dif sequences totaled 482 in 414 organisms, with a prediction rate of 98.61% for single-chromosome strains and 95.00% for multiple-chromosome strains. Predictions failed in eight organisms and ten chromosomes, namely, Agrobacterium tumefaciens str. C58, Paracoccus denitrificans PD1222 chromosome I, II (α-Proteobacteria), Burkholderia phytofirmans PsJN chromosome I, Burkholderia sp. 383 chromosome I, III, Nitrosospira multiformis ATCC 25196 (β-Proteobacteria), Desulfotalea psychrophila LSv54 (δ-Proteobacteria), Sulfurimonas denitrificans DSM 1251 and Nitratiruptor sp. SB155-2 (ε-Proteobacteria).
For Firmicutes, fuzzy matching in 17 Bacillus strains (based on the dif sequence of B. subtilis str. 168 for the creation of the initial seed profile hidden Markov model) yielded a unique dif sequence in each of the 17 strains. Iterated HMM using this seed profile resulted in unique prediction over the validation threshold for 79 chromosomes in 79 genomes. The dif sequences are predicted in a total of 97 organisms, with a prediction rate of 97.00%. Prediction failed in three genomes, namely, Clostridium perfringens str. 13, C. beijerinckii NCIMB 8052 (Clostridia), and Lactobacillus helveticus DPC 4571 (Lactobacillales).
Although no experimentally confirmed dif sequence is available for Actinobacteria, that of F. alni is suggested to be 5'-CACGCCGATAATGCACATTATGTCAAGT-3' . Therefore, we used this sequence for fuzzy matching in two genomes, Nocardia farcinica IFM 10152 and Mycobacterium avium subsp. paratuberculosis K-10, whose XerCD amino acid sequences were most similar to those of F. alni. Iterated HMM using this seed profile resulted in successful predictions above the validation threshold in all 66 genomes.
In Chlorobi, an initial seed profile was created with predicted dif sequences in Chlorobaculum parvum NCIB 8327 and Prosthecochloris aestuarii DSM 271 that scored above the validation thresholds using the Firmicutes profile, which resulted in the highest scores compared to those of Proteobacteria and Actinobacteria. Likewise, the profile of Firmicutes yielded the highest scores in Chlamydiae, where the initial seed profile was created from predicted dif sequences in Chlamydophila pneumoniae CWL029 and Protochlamydia amoebophila UWE25, which were below the validation thresholds, but contained palindromic structure and were located within 0.01-1.48 degrees from the shift-points of GC skew. Using these seed profiles, iterated HMM successfully predicted dif sequences in all 11 genomes in Chlorobi and 14 genomes in Chlamydiae.
Because the number of genomes is very small in all of the other phyla, we utilized the profiles of Proteobacteria, Firmicutes, Actinobacteria, Chlorobi, and Chlamydia that were created thus far instead of applying iterated HMM based on specific seed profiles, and all of the following candidates were confirmed based on scores, palindromic structure, and position. In Elusimicrobia and Tenericutes, all profiles showed high HMMER scores, and predictions using the profiles of Firmicutes and Chlamydiae predicted identical dif sequences. Similarly, the profiles of Firmicutes, Chlamydiae, and Proteobacteria predicted identical dif sequences in Nitrospirae, and predictions based on the profiles of Proteobacteria and Chlorobi were identical in Gemmatimonadetes.
In Spirochaetes, predictions using the profiles of Firmicutes, Chlamydiae and Proteobacteria profiles resulted in unique dif sequences in species with single chromosomes, and the profiles of Firmicutes were used for the predictions of 12 chromosomes in 6 species with multiple chromosomes, all with HMMER scores above the validation thresholds. The most suitable profiles varied among species in other phyla. In Acidobacteria, the dif sequence of Acidobacterium capsulatum ATCC 51196 was predicted by the profiles of Firmicutes, Chlamydiae, and Chlorobi dif sequences, and other species were predicted using the profile of Firmicutes only. In Verrucomicrobia, profiles based on Proteobacteria, Firmicutes and Chlorobi predicted Methylacidiphilum infernorum V4, and that of Proteobacteria and Firmicutes predicted Opitutus terrae PB90-1 and Akkermansia muciniphila ATCC BAA-835. In Chloroflexi, the Chlorobi profile was suitable for Dehalococcoides sp. BAV1 and Dehalococcoides sp. CBDB1, and that of Actinobacteria was used in D. ethenogenes 195 dif sequences. dif sequences were predicted in 14 Bacteroidetes strains using the profile of Proteobacteria, and those in five strains were predicted using alternative profiles created with the three most similar genomes. In this way, we successfully predicted dif sequences in most phyla, although the prediction failed in the phyla Cyanobacteria and Planctomycetes.
To further investigate whether replication terminates at the dif site, by observing the overall contribution of the genomic selection/mutation pressures of the replication machinery to the collinearity of the dif sequence positions and GC skew shift-points, we plotted the distances between them against the GC Skew Index (GCSI) of genomes to quantify the degree of replicational mutation/selection pressures. GCSI is an index that quantifies the degree of GC skew of a given genome, which can be used as a comparative measure of the accumulated replicational mutation/selection pressures . Since the strength of the GC skew is speculated to partly correlate with the growth rate of bacteria , high replication mutation/selection rate indicated by GCSI implies a greater number of replication events in these organisms. Therefore, if the replication terminates at or around the dif site, even allowing for statistical fluctuations, we can assume that the increasing number of replication events should shape GC skew shift-points closer to the dif site by the central limit theorem and by the law of large numbers. Hence, genomes with higher GCSI should have closer relative distance between the GC skew shift-points and dif sites, if replication terminates at the dif site. However, as depicted in Figure 2B, we observed no correlation between these two variables (Spearman rank correlation coefficients in Proteobacteria and Firmicutes: ρ = -0.046 and 0.112, respectively).
In this study, we first demonstrated that the conservation of XerCD genes follows phylogenetic conservation patterns that are specific to each bacterial phylum (Figure 1). Based on this principle, we comprehensively predicted the dif sequences in hundreds of completely sequenced genomes using a recursive strategy that iteratively models and predicts these sequences using profile hidden Markov models. As a result, we obtained unique candidate dif sequences in 715 chromosomes in 641 strains that were validated through multiple means, resulting in the largest collection of predicted dif sequences assembled to date. In comparison to previous work by Carnoy and Roten, which predicted dif sequences in 228 genomes, our predictions coincided with their results in 208 genomes and we added 507 genomes, including Aromatoleum aromaticum str. EbN1, which Carnoy and Roten reported to lack the dif/XerCD system. Excluding strains or chromosomes we could not predict, namely, A. tumefaciens str. C58, Burkholderia sp. 383 chromosome I, II, D. psychrophila LSv54, N. multiformis ATCC 25196, P. denitrificans PD1222 chromosome I, II and S. denitrificans DSM 1251, the predicted dif sequences in this study differed in 12 chromosomes in comparison to the results of Carnoy and Roten: C. crescentus CB15, Granulibacter bethesdensis CGDNIH1, Pseudoalteromonas haloplanktis TAC125 chromosome II, Ralstonia eutropha H16 chromosome II, Rhodobacter sphaeroides 2.4.1 chromosome I, R. sphaeroides 2.4.1 chromosome II, Rickettsia bellii OSU 85-389, R. conorii, R. felis URRWXCal2, R. prowazekii, R. typhi Wilmington, and Shewanella sp. ANA-3. For R. eutropha H16 chromosome II and P. haloplanktis TAC125 chromosome II, both studies predicted positions that were symmetric from the origin of replication, and although experimental confirmation is required to confirm which candidates function in vivo, the palindromic structures of the XerCD binding sites are more conserved in the candidates predicted by our method. Therefore, overall, our results were identical with those of Carnoy and Roten for 92% of the genome analyzed (208/228), and 11/12 mismatch resulted in candidates with more conserved XerCD binding sites, with the addition of 507 genomes among numerous phyla. Carnoy and Roten noted that some Vibrio species contain two dif sites both located at the vicinity of the GC skew shift-points. Therefore, we further tested whether the predicted dif sites in multiple chromosomes are all located near the GC skew shift-points. Using 5% genomic distance as a threshold, 45 out of 54 strains with two chromosomes, including Vibrio species, and 6 out of 9 strains with three chromosomes showed such agreement of the positions, (Additional file 2, Table S1).
There are four factors that may explain the advantages of our results. First, the selection of bacterial strains in the study by Carnoy and Roten was limited to genomes harboring XerCD that were identified by their similarity to those of E. coli, whereas we used all genomes with XerCD orthologs as identified by the KEGG Orthology database. While there is a little time-delay until the sequences are annotated and incorporated into the KEGG Orthology database, use of this database provides a more generic and comprehensive starting point. Second, similarity searches using software tools such as BLAST are not suitable for short sequence motifs that undergo mutation, and the difficulty in identifying only those dif sequences with sequence similarity has been shown for C. crescentus  and several classes of Proteobacteria . Third, dif sequences require two binding motifs of XerC and XerD to be functional ; therefore, the conservation of palindromic structure at the 7-12-bp and 17-22-bp positions should be confirmed for each predicted candidate. Finally, the use of iterated HMM allowed dif sequence prediction using the profiles of closely related species for each iteration, following the phylogenetic conservation pattern of XerCD.
The high predictability shown in this study suggests that the dif/ XerCD system of chromosome dimer resolution is highly conserved among bacterial species and that dif sequences are almost always conserved when XerCD is present within the genome. In fact, according to the KEGG Orthology database, XerC and XerD are conserved in approximately 60-70% of bacterial species, which is a higher percentage than is found for the replication termination protein Tus  and for universal genes such as the SOS response repressor LexA . In light of the remarkable conservation of the dif/ XerCD system, although it is beyond the scope of this study, explorations of alternative CDR machinery in species that lack the dif/ XerCD machinery would be an interesting area of future research. Chromosome dimer resolution pathways are suggested to be present in species that lack the dif/ XerCD system, and several alternative pathways have been reported and suggested. Le Bourgeois et al. reported an unconventional CDR pathway involving only one recombinase (XerS) in Streptococci and Lactococci, along with a 31-bp dif sequence . Similarly, through computational analysis, Carnoy and Roten suggested the existence of another pathway, termed XerH, in ε-Proteobacteria in place of XerCD and XerS and discussed the likelihood of the existence of dif analogues in these species [26, 46]. The basic strategy of iterated HMM should be applicable in predicting dif analogues in these species when defined seed sequences and detailed positions of recombinase binding sites are elucidated.
Although dif sequences are expected to be located near the shift-point of the GC skew, we did not use this feature to predict and validate dif sequences with iterated HMM; therefore, using the comprehensively predicted dif sequences across numerous phyla, we were able to directly compare the positions of predicted dif sequences with those of the GC skew shift-points to analyze their relationships. As expected, these two positions are highly correlated in terms of genomic loci, confirming a previous work . In this respect, because GC skew is the cumulative result of replicational selection/mutations, the degree of conservation of the CDR machinery is presumably in concordance with the degree of replication selection/mutation pressures (i.e. GC skew), which is partly characterized by the difference in the replication machinery and partly characterized by the growth rate . On the other hand, as shown in Figure 2B, the differences in the positions of the GC skew shift-point and the strength of the GC skew, as quantified by GCSI, were not correlated. If replication termination occurs at the dif site, as proposed by Hendrickson and Lawrence , a stronger GC skew that is generated by a larger number of replication events and/or a higher mutation rate should statistically bring the GC skew shift-point closer to the dif site by the central limit theorem and law of large numbers. In fact, the overall correlation of these loci leads to the proposal that the dif site is the replication termination point. However, because a stronger degree of replication mutation/selection pressures does not bring these two loci closer to each other, they are not in a causal relationship. Therefore, although the dif sequence is located near the replication termination site for efficient CDR, the replication termination site is suggested to be at a site other than the dif site, as was recently shown in vivo . On the other hand, the dif sequences in Firmicutes are more conserved in various phyla because the profile of Firmicutes was the best suited as the initial profile of iterated HMM in Chlorobi, Acidobacteria, Gemmatimonadetes, Nitrospirae, Elusimicrobia, Tenericutes, and Spirochaetes, where initial seed sequences were not available, and those in Proteobacteria were more variable, as shown by the requirement to predict by iterated HMM in classes instead of phyla. Tus proteins, which are shown to terminate replication in vivo, are more conserved in Proteobacteria and are not widely conserved in other, partly supporting the possible change in replication termination mechanism by a relatively recent takeover by the Tus-Ter system . On the other hand, to the best of our knowledge, Tus analogues have not been comprehensively searched in other phyla, and therefore further analysis is required in order to fully support this hypothesis.
By taking the phylogenetic iterated HMM approach and validating predicted candidates through a combination of HMMER score thresholds, conservation of palindromic structure, and cross-validation, we achieved a comprehensive identification of unique dif candidates in hundreds of genomes. As the result, we obtained unique candidate dif sequences in 715 chromosomes in 641 strains that were validated through multiple means, resulting in the largest collection of predicted dif sequences assembled to date. All of the predicted dif sequences described in this study, as well as visualizations of dif locations on circular genome maps, are freely available in an online database at http://www.g-language.org/data/repter/. The locations of dif sequences can be useful for studies of the regions surrounding the replication terminus, for phylogenetic studies of the replication termination and chromosome dimer resolution mechanisms, and can serve as supporting evidence for GC skew analyses.
Furthermore, we compared the positions of predicted dif sequences with those of the GC skew shift-points to understand the relationship between dif sequence and replication terminus using GCSI. As the result, although these two positions were highly correlated in terms of genomic loci, the differences in the positions of the GC skew shift-point and the GCSI were not correlated. Therefore, despite the dif sequence is located near the replication termination site for efficient CDR, the replication termination site is suggested to be at a site other than the dif site.
All analyses in this study were conducted using programs written in Perl with the G-language Genome Analysis Environment, version 1.8.10 [55–57]. Hidden Markov Modeling and searching was conducted with HMMER, version 2.3.2 . The dif sequence is the binding site of the XerCD recombinase; therefore, we first selected 734 circular bacterial chromosomes among 658 species/strains according to their conservation of XerCD using the KEGG (Kyoto Encyclopedia of Genes and Genomes) Orthology database (KO; ). We obtained these sequences from the NCBI FTP Repository . The following experimentally confirmed (E. coli and B. subtilis) or computationally predicted (F. alni) dif sequences were used as seed sequences for subsequent searches and machine learning:
E. coli 5'-GGTGCGCATAATGTATATTATGTTAAAT-3' 
B. subtilis 5'-ACTTCCTAGAATATATATTATGTAAACT-3' 
F. alni 5'-CACGCCGATAATGCACATTATGTCAAGT-3' 
XerCD conservation does not immediately imply dif sequence conservation . Therefore, to determine the phylogenetic conservation patterns of XerCD, we first aligned all XerCD amino acid sequences in the 734 genomes analyzed in this work with those in organisms with the above-mentioned dif sequences using ClustalW . The average of the distances of XerC and XerD sequences that were calculated from this alignment were used to infer phylogenetic conservation patterns among phyla.
Based on the phylogenetic conservation patterns of XerCD, we iteratively created the hidden Markov models (HMM) for the accurate prediction of dif sequences, seeded with the previously described dif sequences (Figure 4A). Iterated HMM is shown to be able to build a more diverse and potentially more sensitive models than regular HMM, by incorporating distant homologous sequences while avoiding the contamination of non-homologous sequences into the model , and thus iterative HMM has been frequently utilized in bioinformatics and computational biology [63–66]. In this work, the first profile hidden Markov model was created from the dif sequences identified in genomes belonging in the same genus as the genome harboring the seed sequence. For example, in Proteobacteria, the seed sequences came from E. coli; therefore, the dif sequences were searched in 28 genomes belonging to the genus Escherichia by means of fuzzy matching with the seed sequences of E. coli K12 using Perl module String::Approx 3.26 . For fuzzy matching, the maximum numbers of insertions, deletions, and substitutions were previously determined to be 0-bp, 0-bp, and 8-bp, respectively . Likewise, initial profiles were created for Firmicutes based on 24 genomes in the genus Bacillus and for Actinobacteria based on two genomes in the genus Frankia. Based on these initial profile hidden Markov models, dif sequences were predicted in the genomes of the closest genus to the seed genus according to the amino acid sequences of XerCD proteins.
In the case of Proteobacteria, an initial profile was created using genomes belonging to the genus Escherichia, and this profile was used to predict dif sequences in the genus Shigella. Subsequently, a new profile was created using the previous profile and the newly predicted dif sequences, and this new profile was used to predict the second nearest genus (in the case of Proteobacteria, Salmonella). In this way, profile creation and dif sequence prediction were iterated in decreasing order of similarity of XerCD from the seed sequences; thus, iterated HMM was conducted for each phylum. Because no dif seed sequences were available for phyla other than the three described above, the three profile hidden Markov models obtained by iterated HMM in Proteobacteria, Firmicutes, and Actinobacteria were used as the initial profiles. At each iterated HMM, predicted candidates were validated according to the following criteria: 1) HMMER score ≥10 and E-value < 1.0e-04, 2) leave-one-out cross-validation using the new profiles, and 3) conservation of the palindromic structure. For cross-validation, each time a new profile was created in the iterated HMM, we tested the validity of the training set by leaving out one of the dif sequences from the accumulated set of dif sequences and checking that the prediction of the left-out sequence by training with all of the other dif sequences is always above the threshold for all dif sequences collected up to that iteration. For the palindromic structure, positions 7-12-bp and 17-22-bp of dif sequences, corresponding to the binding sites of XerC and XerD, were checked for complementarities. For example, the palindromic structure of E. coli dif sequences in bracket notation is "--(--- ((((((-()-)))))) ---)--", and the conservation threshold is set to more than four pairs of complementarities within the 7-12-bp and 17-22-bp positions of the predicted dif sequences.
Although iterated HMM is based on phyla, this taxonomic unit is sometimes too diverse to accurately follow phylogeny with recursive means. Therefore, prediction was separately conducted in classes instead of phyla for 60 strains, harboring 130 chromosomes for classes α-, β- and γ-Proteobacteria. Similarly, sometimes, a species is highly phylogenetically distant from the seed organism, making it the case that utilization of profile hidden Markov models from other phyla is more suitable than own phyla's profile. When iterated HMM fails in such cases, an alternative seed profile is created using the dif sequences from the top three genomes with the closest XerCD sequences, as determined by alignment using ClustalW (Figure 4B).
GC skew's shift-point, calculated as (C - G)/(C + G), was computed using the find_ori_ter function of the G-language GAE, based on the cumulative GC skew  at 1-bp resolution. Although GC skew is widely observed in bacterial species, a number of genomes do not exhibit notable compositional bias [48, 69]. To determine the presence of genomic nucleotide compositional bias, the GC skew Index (GCSI) was calculated for all genomes, and GCSI ≥ 0.05 was used as the threshold [48, 70]. GCSI quantifies the degree of GC skew using the compositional distance between the leading and lagging strands and the spectral amplitude of 1 Hz signal of GC skew graph using Fast Fourier Transform. In this study, the replication origin is defined based on the cumulative GC skew at 1-bp resolution using the G-language GAE .
Conservation quantity was calculated based on the nucleotide variance in each position of dif sequences in Figure 5. Firstly, we calculated the position-specific base composition of all dif sequences in a group (phylum or class). Subsequently, variance of the most frequent base in that position is calculated from the base composition. For example, when a group with 100 dif sequences has nth base composition of (A, T, G, C = 100, 0, 0, 0) or (A, T, G, C = 25, 25, 25, 25), the variance is 2500 or 0, respectively. Hence, if the position-specific base composition is biased toward any one base, its high variance indicates high degree of conservation. These values are normalized to percentages for comparison with other groups in Figure 5. In the case of multiple chromosomes, since these conservation quantities were calculated in each strain, the average value was used for normalization.
The authors would like to thank the anonymous reviewers for thoughtful comments and suggestions. This research was supported by a Grant-in-Aid for JSPS Fellows and funds from the Yamagata Prefectural Government and Tsuruoka City.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.