Preparation of genomic data
We used the Microbial Genome Database for Comparative Analysis (MBGD) [14, 72] in order to construct the orthologous groups (OGs) of the specified sets of related genomes. In this work, we used two sets of genomes belonging to the families Bacillaceae and Enterobacteriaceae (Table 1), as per the NCBI Taxonomy database for taxonomic classification.
Construction of a core genome alignment by ordering the OGs
Our procedure, CoreAligner, for constructing an alignment of the core structure basically consists of finding the best conserved ordering of a pre-identified set of OGs. To construct the OGs, we used the DomClust algorithm implemented on the MBGD server, which can handle both inparalogous relationships and domain fusion/fission events in the process of ortholog grouping [24], although, in principle, other programs or databases can also be used for this purpose. Figure 10A gives an overview of the CoreAligner procedure, which consists of the following steps: 1) extraction of the conserved neighborhood relationships and construction of an initial conserved neighborhood graph, G0; 2) assignment of a consistent orientation based on the minimum spanning tree of G0 and construction of a directed graph, G1; 3) conversion of G1 into a triplet graph, G2, in which each edge represents the arrangement of three genes rather than two genes; 4) elimination of the loops from G2 and construction of a directed acyclic graph, G3; 5) identification of the paths of maximum weight on G3 using the dynamic programming algorithm; 6) construction of an alignment by restoring the original graph. In the following, we describe the procedure step by step.
Step 0: Extraction of the conserved OGs
Before the procedure, we retained only those OGs that are conserved among at least a given ratio (CONS_RATIO) of the total number of genomes. Throughout this procedure, the genomes that are close enough to each other are grouped so that the genomes in each group are counted only once in order to avoid bias; here, we used the species groups shown in Table 1, which were determined through visual inspection of the pairwise comparison data (see Additional file 1). Let F denote the resulting set of conserved OGs, and let G be the set of genomes compared. We consider each genome, G ∈ G, as an ordered gene list, L
G
= [g1, g2,..., g
N
], where each gene, g
i
, in L
G
is ordered according to the genomic position on G and belongs to one of the conserved OGs, F
j
∈ F. Let order
G
(g) denote the ordinal position of the gene, g, in the list L
G
. We define the distance between genes g1 and g2 on the circular genome, G, as Dist(g1, g2) = min [Diff(g1, g2),| L
G
|-Diff(g1, g2)], where Diff(g1, g2) = |order
G
(g1)-order
G
(g2)|.
Step 1: Generation of the initial conserved neighborhood graph
In this step, we extract the OG pairs that are located within MAX_GAP genes in at least NBR_CONS_RATIO (= CONS_RATIO, by default) of the total number of genomes and construct a conserved neighborhood graph. First, let us define a neighboring OG pair. A gene pair, g0, g1 ∈ G, is a neighboring gene pair if Dist(g0, g1) ≤ MAX _GAP, and an OG pair, F0, F1 ∈ F, is a neighboring OG pair if there exists a neighboring gene pair g0, g1 such that g0 ∈ F0, g1 ∈ F1. Let Ng(F0, F1) denote the set of neighboring gene pairs between F0 and F1. We also consider a relative direction for each neighborhood OG pair. The relative direction between two neighboring genes, denoted by Rdir(g0, g1), is either (L, L), (L, R), (R, L) or (R, R), where L and R signify the left and right end of each gene, respectively, and the relative direction between two neighboring OGs, denoted by Rdir(F0, F1), is determined by majority vote among (g0, g1) ∈ Ng(F0, F1) (Figure 10B). In this majority vote, each neighboring gene pair is weighted as a function of the reciprocal of distance, weight(g0, g1) = [1/Dist(g0, g1)]α, with the parameter α. Throughout this work, we simply set α = 1.
Let c be the number of neighboring gene pairs in the majority direction, i.e., |{(g0, g1)|Rdir(g0, g1) = Rdir(F0, F1)}|. The pair (F0, F1) is considered to be a conserved OG pair if c satisfies the conservation criterion (c/N ≥ NBR _CONS _RATIO)). Among the conserved OG pairs, we further extract the proximal OG pairs, which are defined as follows: a conserved OG pair F0, F1 is a proximal conserved OG pair if there exists a pair of neighboring genes, g0 ∈ F0, g1 ∈ F1, between which there is no gene g2 ∈ F2, such that either (F0, F2) or (F2, F1) is a conserved OG pair. Finally, the conserved neighborhood graph, G0 = (V0, E0), is constructed with vertices V0 = F and edges E0 ⊆ V0 × V0 that are the set of proximal conserved OG pairs. On each edge, an edge weight is assigned as the sum of the weights of the neighboring gene pairs: .
Step 2: Conversion to a directed graph (direction assignment)
In this step, we construct a minimum spanning tree (MST), T
M
, of the conserved neighboring graph, G0, and assign a consistent orientation to each vertex and edge based on the relative direction on each edge of the tree, T
M
; thus the procedure converts the undirected graph, G0, into a directed graph, G1. To construct the MST, a weighted distance is assigned to each edge (F0, F1) as Wdist(F0, F1) = 1/Weight(F0, F1), so that highly weighted edges are likely to be incorporated into the MST. For each F ∈ F, a vertex orientation, Dir(F) = {1, -1}, is determined based on the MST T
M
as follows: starting from an arbitrary vertex, F0 ∈ F, for which we set Dir(F0) = 1, the orientation is determined along each edge, (F
i
, F
j
) ∈ T
M
, by Dir(F
j
) = Dir(F
i
) × sgn [Rdir(F
i
, F
j
)], where sgn(rdir) = 1 if rdir ∈ {(R, L), (L, R)} and sgn(rdir) = -1 otherwise (see Figure 10B; in Figure 10A2, the node colors and edge colors represent Dir(F
i
) and sgn [Rdir(F
i
, F
j
)], respectively, where red = 1 and blue = -1). In addition, the orientation of each edge, (F
i
, F
j
) ∈ T
M
, is determined by Dir(F
i
, F
j
) = Dir(F
i
) × side(F
i
, F
j
), where side(F
i
, F
j
) = 1 if F
j
is on the right side of F
i
and side(F
i
, F
j
) = -1 otherwise (Dir(F
i
, F
j
) is represented by an arrow in Figure 10A2). Here, the edge orientation represents the consensus gene order, and the node orientation represents the consensus gene orientation along the virtual genome. The orientation can be consistently assigned as long as it is done along a tree. In addition, for each remaining edge, (F
i
, F
j
) ∈ {e|e ∈ E0, e ∉ T
M
} (broken lines in Figure 10A2), the edge orientation is calculated by the same formula, Dir(F
i
, F
j
) = Dir(F
i
) × side (F
i
, F
j
), but the orientation is assigned only when the values calculated from both directions is consistent, i.e., Dir(F
i
, F
j
) = -Dir(F
j
, F
i
); otherwise, the edge is removed from the converted graph, G1 (X mark in Figure 10A2). Finally, a directed graph, G1 = (V1, E1), with vertices V1 = F and edges E1 ⊆ V1 × V1 is constructed by connecting the vertices from F
i
to F
j
if Dir(F
i
, F
j
) = 1 and from F
j
to F
i
if Dir(F
i
, F
j
) = -1.
Step 3: Conversion to a triplet graph
In this step, the directed graph, G1, is converted into a triplet graph, G2, in which each edge represents the order of three OGs (F
i
, F
j
, F
k
), rather than two OGs (F
i
, F
j
), as in each edge of the graph G1. For this purpose, a neighborhood OG list, Nf(g, side) = {F'|Dist(g, g') ≤ MAX _GAP, side(g, g') = side, g' ∈ F'}, is prepared for each g ∈ F and side ∈ {-1,1} (-1 and 1 signify the left and right sides, respectively). To convert G1 = (V1, E1) into G2 = (V2, E2), we set V2 = E1, and connect each pair of edges in G1 incident to a node, F
j
∈ V1, i.e., (F
i
, F
j
), (F
j
, F
k
) ∈ E1 = V2, in G2 in this direction if there exist some genes, g ∈ F
j
, such that F
i
∈ Nf(g,-1) ∧ F
k
∈ Nf(g,1). This conversion guarantees that the triplet order (F
i
, F
j
, F
k
) represented by each edge of the graph G2 will actually appear in at least some genomes (the parameter NBR_CONS_NUM2; default 2). Figure 10C shows a simple example illustrating the effect of this conversion. Here, the OG represented by node 2 contains two inparalogous subgroups, due to which the graph contains a loop (Figure 10C1). However, the triplet graph conversion linearizes the graph (Figure 10C2), since the path of 5-2-3 in the original graph, which is responsible for the loop formation, does not exist in the actual sequences.
Step 4: Conversion to a DAG (loop elimination)
Although the triplet graph conversion at the previous step can eliminate some trivial loops as described above, generally there still remain loops in the directed graph, G2. In this step, we eliminate all loops by removing some vertices from the graph G2 (corresponding to the edges of the graph G1) to make a directed acyclic graph (DAG), G3. Since the problem of finding the minimum number of vertices to cut in a directed graph required for making a DAG (the minimum feedback vertex set problem) is NP-hard, we used a heuristic method called a "contraction algorithm" [73], which returns a correct answer when the graph is completely contractible through a series of contraction operations defined in the algorithm; otherwise, the algorithm still returns a valid answer within a reasonable time period, although its optimality is not guaranteed. Fortunately, it turned out that all cases treated here were contractible and hence the algorithm could solve the problem correctly. We conjecture that the graph G2 is contractible in most cases if the compared genomes are sufficiently close.
Step 5: Extraction of the set of paths of maximum weight
In this step, we find the longest paths in the DAG, G3 = (V3, E2), using a DP algorithm. For this purpose, we assign a weight to each vertex, v ∈ V3 = E1, which actually corresponds to a neighboring relationship (F
i
, F
j
), as Weight(v) = Weight(F
i
, F
j
). The basic recursion formula for finding the path of maximum weight is as follows:
where v
n
, v
m
∈ V3 and Out(v
m
) is the set of vertices in which the out-edges of v
m
enter. The pair of vertices in the above formula is saved as an arc (v
m
, v
n
) of a directed graph (a set of trees, or forest), T
P
, which can be backtracked to find the path of maximum weight. The extracted maximum path, which we here call a "cluster," is added to the core structure if it meets the following criteria: (1) it contains at least a given number of OGs (the parameter MIN_CLUSTER; default 10), and (2) at least a given proportion of OGs are present in each genome (the parameter SP_COVER; default 0.5). The latter criterion is required to eliminate lineage-specific clusters, which are likely to belong to phages or genomic islands rather than the genomic core.
In some cases, a "local version" of this latter criterion is needed to eliminate problematical cases. In fact, during experimental applications of the CoreAligner procedure to the Enterobacteriaceae dataset, we encountered such a case where an apparent genomic island was included in the core structure (Additional file 8; see also Discussion). To eliminate such regions, we first identified locally sparse regions on each genome by finding a maximal scoring segment with a scoring system of absence = +1 and presence = -1, and if this score exceeded a certain cutoff point (here we used 20) in at least one genome, the resulting region was subjected to the test with the above SP_COVER criterion to determine whether the region should be removed or not.
To find out the path of the next maximum weight, we used a similar strategy to that for finding non-intersecting suboptimal sequence alignments [74]: each vertex of the maximum path is marked as deleted and assigned a bad score (-∞), and all of its descents are once removed from T
P
. Step 5 is repeated and the scores of the removed nodes are recalculated by DP and T
P
is reconstructed to find the next maximum path. The procedure is repeated until no (undeleted) vertex remains.
Step 6: Construction of an alignment from the paths of maximum weight
The set of paths extracted in the previous step is actually a triplet graph, which should be converted back to the original graph. This conversion can be done straightforwardly: for example, a path of 1:2-2:3 can be simply converted into 1-2-3. However, an interesting situation arises when an OG contains some inparalogous lineages, as in the example in Figure 10C, which may cause multiple appearances of the same node in the alignment path. Even in such a case, a genuine ortholog can typically be found among the inparalogs by examining the context of the alignment path (nodes 2.1 and 2.2 in Figure 10C3). To solve this problem, CoreAligner scores each gene according to the matching of the local context around it, and takes the best-scoring gene as a genuine ortholog. Here, the weight of the gene, g0 ∈ F0, is defined as , where Dist
G
and Dist
C
are the distances along the genome, G, that contains g0 and g1 and the consensus alignment, C, respectively, and the summation is taken over all g1 ∈ F1 such that Dist
G
(g0, g1) ≤ MAX _GAP and Dist
C
(F0, F1) ≤ MAX _GAP.
Identification of orthologs between Bacillaceae and Enterobacteriaceae
To examine the correspondence between the core gene sets of Bacillaceae and Enterobacteriaceae, we generated the OGs of the combined dataset of these two families by the DomClust program. Each OG of an individual family, say F
B
of Bacillaceae, is considered to correspond to an OG of the combined dataset, say F
C
, if more than half of the members of F
B
are included in F
C
. Then, we considered that two OGs of Bacillaceae and Enterobacteriaceae, say F
B
and F
E
, respectively, correspond to each other if both F
B
and F
E
correspond to the same OG of the combined dataset, say F
C
.
Essential gene analysis
The sets of B. subtilis essential genes (271 genes) [32] and E. coli essential genes (300 genes) [33] were taken from the original papers. Three genes (yabQ, yafF, yibJ) were eliminated from the E. coli set, since those ORFs are treated as pseudogenes in the current RefSeq database. For the E. coli essential gene set, we also referred to the essential gene set of Kato et al. (302 genes) [35] obtained from the PEC database [75]. By taking the intersection of these two data sets, we obtained a more rigorous essential gene set containing 261 genes.
G+C content of the third codon positions (GC3)
For the calculation of the GC3 values, we used only genes comprising more than 100 codons. We also eliminated from the dataset putative highly expressed (PHX) genes that have specifically biased codon usage patterns. Here, we used a simplified version of the previously proposed procedure to define the PHX genes [43]: a gene, g, is considered to be a PHX gene if the predicted expression level of g defined as E
RP
(g) = B(g|All)/B(g|RP) exceeds 1.05, where B(g|G) is the codon usage difference of g relative to the given gene set, G, according to Karlin and Mrázek [43], and All and RP denote the sets of all genes and ribosomal proteins in a given genome, respectively. This procedure eliminated around 150–300 genes for each genome in our dataset as PHX genes.
Phylogenetic analysis
To simplify the phylogenetic analysis, we used only OGs that have a one-to-one correspondence (i.e., those containing neither duplication nor domain fusion/fission events). For each OG, a multiple sequence alignment was generated by CLUSTALW [76], and from that alignment the conserved alignment blocks suitable for phylogenetic analysis were extracted by the Gblocks program [77]. In order to create the phylogenetic trees of the core structure, we constructed a concatenated sequence alignment of the universally conserved one-to-one core OGs, and created a neighbor joining tree [44] using the CLUSTALW program and a maximum likelihood tree using the MOLPHY program [45] under the JTT model [78]. In addition, we created a phylogenetic network using the Neighbor-Net method [47] implemented in the SplitsTree program [48].
We also conducted phylogenetic analyses for individual core genes; here, maximum likelihood tree construction and the Shimodaira-Hasegawa test were performed using the TREE-PUZZLE program [79] under the JTT model. In this test, the relationships among the closely related species shown in Table 1 were fixed, and all the remaining possible topologies (maximally 945 topologies among the 7 taxa) were generated and tested. Here, we considered the ML topology of the concatenated core sequences (Figure 8A), or an appropriate subtree of it when considering a non-universal OG, as a representative genomic tree, and compared it with the ML topology of the individual OG.
Statistical analyses, including classical multidimensional scaling and Fisher's exact test, were performed using the cmdscale and fisher.test commands in the R package. The Robinson-Foulds topological distances [49] were calculated using the treedist command in the PHYLIP package [80]. For simplicity, we used the value of the resulting symmetric difference (which is always an even number) divided by 2.