Core-genome Scaffold Comparison Reveals the Prevalence That Inversion Events Are Associated with Pairs of Inverted Repeats

Motivation: Genome rearrangement plays an important role in evolutionary biology and has profound impacts on phenotype in organisms ranging from microbes to humans. The mechanisms behind genome rearrangement events remain unclear. Many comparisons have been conducted among different species. To reveal the mechanisms behind rearrangement events, comparison of different individuals/strains within the same species (pan-genomes) is more helpful since they are much closer to each other.
Results: We study the mechanism for inversion events via core-genome scaffold comparison of different strains within the same species. We focus on two kinds of bacteria, Pseudomonas aeruginosa and Escherichia coli, and investigate the inversion events among different strains of the same species. We find that long (longer than 10,000 bp) inversion regions are flanked by a pair of Inverted Repeats (IRs) (with lengths ranging from 385 bp to 27,476 bp), which are often Insertion Sequences (ISs). This mechanism can also explain why breakpoint reuse occurs for inversion events. We study the prevalence of the phenomenon and find that it is a major mechanism for inversions. The other observation is that for other rearrangement events such as transposition and inverted block interchange, the two ends of the swapped regions are also associated with repeats, so that after the rearrangement operations the two ends of the swapped regions remain unchanged. To our knowledge, this is the first time such a phenomenon has been reported for transposition events.
Availability and Implementation: Source codes and examples for our methods are available at


Introduction
Comparative genomics studies show that genome rearrangement events often occur between two genomes. Genome rearrangement events play an important role in speciation. The rearrangement operations include deletion, insertion, inversion, transposition, block interchange, translocation, fission and fusion. The mechanisms behind these rearrangement events are still unclear. Here we study the mechanism for inversion events via core-genome scaffold comparison of different strains within the same species.
By comparing two genomes, we can find candidate rearrangement operations. However, the set of rearrangement operations that transforms one genome into the other is not unique in many cases. Computing the rearrangement operations between two genomes under different assumptions is an active area, where intensive research has been conducted (Li et al., 2006). It has been reported that breakpoints appear more often in repeated regions (Lemaitre et al., 2009; Longo et al., 2009). A summary of the where and wherefore of evolutionary breakpoints is given by Sankoff (2009). The prevalence of short inversions has been studied (Lefebvre et al., 2003). Pevzner and Tesler found extensive breakpoint reuse for inversion events in mammalian evolution when comparing human and mouse genomic sequences (Pevzner and Tesler, 2003a,c).
An interesting problem is to reveal the mechanisms underlying the rearrangement operations. Many hypothetical mechanisms for the rearrangement operations have been reported (Gray, 2000). For example, Chen (2011) discussed mutational mechanisms for genomic rearrangements. To reveal these mechanisms, comparison of different individuals/strains within the same species (pan-genomes) can be more helpful since strains within the same species are closely related.
A pan-genome, or supra-genome, describes the full complement of genes in a clade (typically a species in bacteria and archaea), which can have large variation in gene content among closely related strains. Pan-genomes were first studied by Tettelin et al. (2005) more than a decade ago. Several tools have been developed for pan-genome analysis. For example, GET_HOMOLOGUES (Contreras-Moreira and Vinuesa, 2013) is a customizable and detailed pan-genome analysis platform. BLAST atlas (Jacobsen et al., 2011) visualizes which genes from the reference genome are present in other genomes. Mugsy-Annotator (Angiuoli et al., 2011) identifies syntenic orthologs and evaluates annotation quality using multiple whole genome alignments. Characterization of the core and accessory genomes of Pseudomonas aeruginosa has been done by Ozer et al. (2014). For pan-genome analysis, genomes from different strains of the same species are decomposed into core blocks (in all the strains), dispensable blocks (in two or more strains) and strain-specific blocks (in one strain only). Here we extend the pan-genome analysis by comparing the core-genome scaffolds of different strains of the same species.
We study two types of bacteria, Pseudomonas aeruginosa and Escherichia coli, and investigate the inversion events among different strains of the same species. We find that long (longer than 10,000 bp) inversion regions are flanked by pairs of Inverted Repeats (IRs), which are often Insertion Sequences (ISs). This mechanism also explains why breakpoint reuse occurs for inversion events. We study the prevalence of the phenomenon and find that it is a major mechanism for inversions. The other observation is that for other rearrangement events such as transposition and inverted block interchange, the two ends of the swapped regions are also associated with repeats, so that after the rearrangement operations the two ends of the swapped regions remain unchanged. To our knowledge, this is the first time such a phenomenon has been reported for transposition events.

Methods
We develop a pipeline to generate the core-genome blocks, dispensable blocks and strain-specific blocks based on the multiple sequence alignment produced by Mugsy (Angiuoli and Salzberg, 2011).
We then develop a computer program to generate the scaffolds of the strains from the core-genome blocks by repeatedly merging two consecutive blocks that appear consecutively in all the strains of the same species. In this way, the number of distinct blocks in the core-genome scaffold is reduced dramatically. For example, for Pseudomonas aeruginosa, there are 185 blocks in the core genome of the 25 strains before merging. After merging, the scaffolds contain 69 blocks.
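The merging step can be sketched as follows. This is a minimal illustration, not the authors' program: the function names are hypothetical, each scaffold is assumed to be a list of signed block numbers, and each block is assumed to occur exactly once per strain. Two blocks are merged only if their signed adjacency is present (on either strand) in every scaffold.

```python
def adjacencies(scaffold):
    """Signed adjacencies of a scaffold, recorded for both reading directions."""
    adj = set()
    for a, b in zip(scaffold, scaffold[1:]):
        adj.add((a, b))
        adj.add((-b, -a))  # the same adjacency read on the opposite strand
    return adj


def merge_conserved_blocks(scaffolds):
    """Merge runs of blocks that are consecutive in every scaffold.

    Assumption: every block occurs exactly once in every scaffold.
    The first scaffold is used as the reference for numbering merged blocks.
    """
    ref = scaffolds[0]
    shared = set.intersection(*(adjacencies(s) for s in scaffolds))

    # Cut the reference into chains at every adjacency that is not shared.
    chains = [[ref[0]]]
    for a, b in zip(ref, ref[1:]):
        if (a, b) in shared:
            chains[-1].append(b)
        else:
            chains.append([b])

    # A chain is entered at its first block (forward) or at its negated
    # last block (reverse strand).
    entry = {}
    for new_id, chain in enumerate(chains, start=1):
        entry[chain[0]] = (new_id, len(chain))
        entry[-chain[-1]] = (-new_id, len(chain))

    merged = []
    for scaffold in scaffolds:
        out, i = [], 0
        while i < len(scaffold):
            label, length = entry[scaffold[i]]
            out.append(label)
            i += length  # skip the rest of the merged chain
        merged.append(out)
    return merged


scaffolds = [
    [1, 2, 3, 4, 5],
    [1, 2, -4, -3, 5],
    [-2, -1, 3, 4, 5],
]
merged = merge_conserved_blocks(scaffolds)
print(merged)
```

In this toy input, Blocks 1 and 2 and Blocks 3 and 4 stay adjacent in all three scaffolds, so five blocks collapse into three.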
After that, we compute the inversion distance between two scaffolds. Computing the inversion distance between two scaffolds is a hard combinatorial problem. Several algorithms have been developed. Due to the difficulty of algorithm design, most of the algorithms only consider inversion events. However, a transposition/block-interchange event can be represented as 3 inversion events, and an inverted transposition/block-interchange event can be represented as 2 inversion events. Therefore, some of the computed inversion events may not be real. There are algorithms dealing with inversions and other rearrangement events such as block interchanges simultaneously. However, the weights for different events are different (again due to the difficulty of algorithm design). Thus, those algorithms still suffer from the problem of outputting inversions that are not real.
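The claim that a transposition can be written as 3 inversions and an inverted transposition as 2 inversions can be checked on a small signed permutation. The sketch below uses a hypothetical helper `invert` and 0-based, inclusive indices; the example permutations are illustrative only.

```python
def invert(perm, i, j):
    """Reverse perm[i..j] (inclusive) and flip the sign of every block in it."""
    return perm[:i] + [-x for x in reversed(perm[i:j + 1])] + perm[j + 1:]


identity = [1, 2, 3, 4]

# A transposition of Blocks 2 and 3, undone by three inversions.
transposed = [1, 3, 2, 4]
step1 = invert(transposed, 1, 2)   # [1, -2, -3, 4]
step2 = invert(step1, 1, 1)        # [1, 2, -3, 4]
step3 = invert(step2, 2, 2)        # [1, 2, 3, 4]

# An inverted transposition (one block also changes sign), undone by two.
inv_transposed = [1, 3, -2, 4]
inv_step1 = invert(inv_transposed, 1, 2)   # [1, 2, -3, 4]
inv_step2 = invert(inv_step1, 2, 2)        # [1, 2, 3, 4]

print(step3 == identity, inv_step2 == identity)
```

An inversion-only algorithm applied to `transposed` would therefore report three inversions even though a single transposition explains the difference.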
Our strategy here is to eliminate some obvious transposition, inverted transposition, block interchange, and inverted block interchange events.
For simplicity, we always assume that $G_1 = +1\ +2\ \ldots\ +n$ is the first input scaffold and $G_2 = \pi_1 \pi_2 \ldots \pi_n$ is a signed permutation of the $n$ blocks over the set $N = \{1, 2, \ldots, n\}$ of $n$ distinct blocks, where each integer $i \in N$ appears exactly once in $G_2$ in the form of either $+i$ or $-i$. All the rearrangement operations are on $G_2$.
A transposition swaps the order of two consecutive blocks/regions without changing their signs. A transposition $(i, j, k)$ on regions $\pi_i \ldots \pi_{j-1}$ and $\pi_j \ldots \pi_{k-1}$ transforms the signed permutation $\pi_1 \ldots \pi_{i-1}\,\pi_i \ldots \pi_{j-1}\,\pi_j \ldots \pi_{k-1}\,\pi_k \ldots \pi_n$ into $\pi_1 \ldots \pi_{i-1}\,\pi_j \ldots \pi_{k-1}\,\pi_i \ldots \pi_{j-1}\,\pi_k \ldots \pi_n$. Though an independent transposition swaps two consecutive blocks $\pi_{i+1}$ and $\pi_i$ instead of two regions $\pi_i \ldots \pi_{j-1}$ and $\pi_j \ldots \pi_{k-1}$ as in the definition of a general transposition, a pre-processing step allows us to merge two consecutive blocks if they are consecutive in both input genomes. Thus, we can still handle some cases of swapping two consecutive regions. For example, the genome +1 +2 +6 +7 +3 +4 +5 +8 becomes +1 +2 +4 +3 +5 after merging +6 +7 (represented as +4) and +3 +4 +5 (represented as +3) and renumbering +8 as +5 in the new representation. An independent transposition can change +1 +2 +4 +3 +5 into +1 +2 +3 +4 +5. In terms of the breakpoint graph, the two blocks $\pi_{i+1} \pi_i$ in an independent transposition are involved in a 6-edge cycle, and after the transformation the 6-edge cycle becomes three 2-edge cycles. In other words, the three breakpoints involved in the 6-edge cycle disappear after the transformation. See Figure 1.
An inverted transposition swaps the order of two consecutive blocks/regions with the sign of one of the blocks changed. An inverted transposition $(i, j, k)$ on regions $\pi_i \ldots \pi_{j-1}$ and $\pi_j \ldots \pi_{k-1}$ transforms the signed permutation $\pi_1 \ldots$
A block interchange swaps the locations of two separated blocks without changing their signs. A block interchange $(i, j, k, l)$ on regions $\pi_i \ldots \pi_j$ and $\pi_k \ldots \pi_l$ transforms $\pi_1 \ldots \pi_{i-1}\,\pi_i \ldots \pi_j\,\pi_{j+1} \ldots \pi_{k-1}\,\pi_k \ldots \pi_l\,\pi_{l+1} \ldots \pi_n$ into $\pi_1 \ldots \pi_{i-1}\,\pi_k \ldots \pi_l\,\pi_{j+1} \ldots \pi_{k-1}\,\pi_i \ldots \pi_j\,\pi_{l+1} \ldots \pi_n$ […] $(p+1)\ {-p}$ for $\{q, q+1, q+2\} \subseteq N$ and $\{p, p+1, p+2\} \subseteq N$. Similarly, the two blocks $\pi_k$ and $\pi_i$ are involved in two (interleaving) 4-edge cycles in the breakpoint graph, and after the transformation they become four 2-edge cycles. In other words, there are four breakpoints at the two ends of the two blocks, and after the transformation the four breakpoints disappear. See Figure 2.
An inverted block interchange swaps the locations of two separated blocks with the signs of both blocks changed. An inverted block interchange $(i, j, k, l)$ on regions $\pi_i \ldots \pi_j$ and $\pi_k \ldots \pi_l$ transforms $\pi$ […] $(p+1)\ {-p}$ for $\{q, q+1, q+2\} \subseteq N$ and $\{p, p+1, p+2\} \subseteq N$. Again, there are four breakpoints at the two ends of the two blocks $-\pi_i$ and $-\pi_k$, and after the transformation the four breakpoints disappear.
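The breakpoint counts quoted above (three for an independent transposition, four for an independent block interchange or inverted block interchange) can be reproduced with a short sketch. It assumes the usual convention of framing the signed permutation with 0 and n+1 and counting an adjacency (a, b) as a breakpoint whenever b − a ≠ 1; the function name and the example permutations are illustrative only.

```python
def count_breakpoints(perm):
    """Count breakpoints of a signed permutation relative to +1 +2 ... +n.

    The permutation is framed by 0 and n+1; an adjacency (a, b) is
    conserved exactly when b - a == 1 (this also covers -3 -2, etc.).
    """
    framed = [0] + list(perm) + [len(perm) + 1]
    return sum(1 for a, b in zip(framed, framed[1:]) if b - a != 1)


# Independent transposition: +1 +2 +4 +3 +5 (example from the text).
before_transposition = [1, 2, 4, 3, 5]
after_transposition = [1, 2, 3, 4, 5]

# Independent block interchange: Blocks 4 and 2 are swapped around Block 3.
before_interchange = [1, 4, 3, 2, 5]

# Inverted block interchange: swapped and both signs changed.
before_inverted_interchange = [1, -4, 3, -2, 5]

print(count_breakpoints(before_transposition))         # 3
print(count_breakpoints(after_transposition))          # 0
print(count_breakpoints(before_interchange))           # 4
print(count_breakpoints(before_inverted_interchange))  # 4
```

In each case the single rearrangement event removes all of the listed breakpoints at once, which is what makes such events easy to recognise and eliminate before computing inversion distances.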
After eliminating independent transposition, inverted transposition, block interchange and inverted block interchange events, we use GRIMM-Synteny (Tesler, 2002a,b) to compute the inversion distance between pairs of core-genome scaffolds. We only seriously consider the cases where the rearrangement distance is small. When the rearrangement distance is large, there may be multiple solutions for the inversion history. Thus, in this case, the computed inversion events may not be real.
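For very small scaffolds, the inversion distance can be illustrated by brute force. The sketch below is not the algorithm used by GRIMM-Synteny; it is a plain breadth-first search over signed permutations, with a hypothetical function name, and is only practical for a handful of blocks.

```python
from collections import deque


def inversion_distance(source, target):
    """Minimum number of inversions turning source into target (brute force).

    Breadth-first search over all signed permutations reachable by
    inversions; feasible only for small numbers of blocks.
    """
    source, target = tuple(source), tuple(target)
    if source == target:
        return 0
    n = len(source)
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        perm, dist = queue.popleft()
        for i in range(n):
            for j in range(i, n):
                # Invert perm[i..j]: reverse the order and flip every sign.
                nxt = perm[:i] + tuple(-x for x in reversed(perm[i:j + 1])) + perm[j + 1:]
                if nxt == target:
                    return dist + 1
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, dist + 1))
    return None  # unreachable when both inputs contain the same blocks


identity = [1, 2, 3, 4]
print(inversion_distance([1, -3, -2, 4], identity))  # one real inversion
print(inversion_distance([1, 3, 2, 4], identity))    # a transposition
print(inversion_distance([1, 3, -2, 4], identity))   # an inverted transposition
```

The last two calls show why other events are eliminated first: a single transposition or inverted transposition is otherwise reported as three or two inversions, respectively.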
Finally, we develop a pipeline to compare the sequences at the two ends of each inversion region to see whether a pair of inverted repeats exists.
Once the inverted repeats are found, the pipeline can also search all the strains and record the positions of the repeats in the different strains.
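A simplified version of the end comparison can be sketched as follows. This is an assumption-laden illustration, not the authors' pipeline: it looks for the longest exact match between the left flank and the reverse complement of the right flank using a quadratic dynamic program, whereas a real pipeline would typically use an alignment tool and tolerate mismatches. The function names and toy sequences are hypothetical.

```python
def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def longest_inverted_repeat(left_flank, right_flank):
    """Longest substring of left_flank whose reverse complement occurs in right_flank.

    Exact matching only; implemented as a longest-common-substring search
    between left_flank and the reverse complement of right_flank.
    """
    target = reverse_complement(right_flank)
    best_len, best_end = 0, 0
    prev = [0] * (len(target) + 1)
    for i in range(1, len(left_flank) + 1):
        cur = [0] * (len(target) + 1)
        for j in range(1, len(target) + 1):
            if left_flank[i - 1] == target[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return left_flank[best_end - best_len:best_end]


# Toy flanks: the repeat sits at the left end in forward orientation and
# at the right end as its reverse complement.
repeat = "ACGTTGCAAGGCT"
left = "AAAA" + repeat + "CCCC"
right = "AAAA" + reverse_complement(repeat) + "CCCC"
found = longest_inverted_repeat(left, right)
print(found)
```

The recovered string can then be searched for in every strain to record the positions of both copies.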
We computed the pairwise inversion distance between scaffolds after eliminating other kinds of independent rearrangement events such as transpositions, inverted transpositions, block interchanges, and inverted block interchanges. For each of the 8 scaffolds, we chose a scaffold with the minimum inversion distance (after eliminating other independent rearrangement events) to compare. The purpose was to compare two scaffolds with a small number of inversions so that we can observe real inversions between them. From Table 1, it can be seen that Group 1 is the closest group to all the other groups except for Group 6. The closest group to Group 6 is Group 5, where the inversion distance is 7. In total, there are 13 inversion events among the 7 distinct pairs of scaffolds (Table 1, where the pair of Groups 1 and 2 appears twice). Among the 13 inversion regions, 7 are flanked by a pair of IRs. The remaining 6 inversions, for which no IRs are found at the two ends of the inversion regions, are very short, with lengths from 2100 bp to 7400 bp.
For each of the first three inversions (Table 1, rows 1-4), the length of the inversion region is more than 4 Mbp, and we find a pair of IRs (+A/-A) at the two ends of each of the three long inversion regions. For the pair of Groups 5 and 1, there are three inversions, and the lengths of the three inversions in the core-genome are 5.879 Mbp, 0.597 Mbp, and 6.8 kbp, respectively. Interestingly, we find a repeat B that appears four times in both Scaffold 1 and Scaffold 5, where B appears as −B once and as +B three times in Scaffold 1. The four occurrences of B form a pair of IRs at the two ends of each of the 3 inversion regions (see Figure 3).
For Groups 6 and 5, there exist two independent transpositions and one inverted transposition (see Supplement-1). After eliminating the three independent rearrangement events, there are 7 inversions between Groups 6 and 5, as calculated by GRIMM-Synteny (see Supplement-1), and only one inversion (28,28) is flanked by a pair of IRs (see Table 1). Note that both −56 and −59 appear twice in Scaffold 6. We remove the green blocks in Figure 3 in our comparison. Among these seven inversions, only one inversion (28,28) is longer than 10,000 bp and flanked by a pair of IRs (+O/-O).
Group 1 can be obtained from Group 7 with one independent transposition. A repeat +R appears three times at the ends of the two blocks involved in the transposition (see Figure 3). Those occurrences of +R play an important role in the transposition, and the details will be discussed in Section 2.1.2.
For Groups 8 and 1, there exist two independent transpositions and two independent inverted transpositions (see Supplement-1). After eliminating the four independent rearrangement events, the scaffolds for Groups 8 and 1 are actually the same and the inversion distance between them is zero. Again, both Block 2 and Block 4 appear twice in Group 8. (The physical positions of all the copies of Blocks 2 and 4 in Group 8 are in Supplemental Table S5h.) We remove the green blocks in Figure 3 in our comparison.
For the first inversion between Groups 1 and 2, there are 13 strains in Group 1 and 6 strains in Group 2. All the strains in Group 1 and Group 2 contain Repeats +A and −A as shown in Figure 3. The physical positions as well as the lengths of the repeats differ slightly in different strains. See Supplemental Table S5a. Thus, the inversion (from Block 10 to Block 52) between Scaffold 1 and Scaffold 2 (row 1 in Table 1) is found between the 13 × 6 pairs of strains in these two groups. For the remaining inversions listed in Table 1, the physical positions and the lengths of the repeats and core-genome blocks (at the two ends of an inversion) in different strains are given in Supplemental Tables S5b-e.
In summary, three different pairs of IRs are found, and we use +A/-A, +B/-B and +O/-O to differentiate these three pairs. We also find three copies of +R in the comparison of Groups 1 and 7. The locations of these repeats in the scaffolds are shown in Figure 3. The lengths (in bp), gene products and protein IDs (in the NCBI Protein database) of these repeats are listed in Supplemental Table S9.

Breakpoint reuse
The three inversion steps from Scaffold 1 to Scaffold 5 are shown in Figure 4, where it can be seen that there are one +B and three -Bs in Scaffold 5. The three inversion events are -B-6+B to -B6+B, +B7∼64-B to +B-64∼-7-B, and +B-64-B to +B64-B, and the breakpoint indicated by the black arrow in Figure 4 is used three times.
Fig. 4. Three inversion steps from Scaffold 1 to Scaffold 5. The breakpoint between -6 and 64 in Scaffold 5 is used three times. See the black arrow.
Here +B plays a crucial role in the three inversions and is used three times; each time, +B and -B form a pair of inverted repeats at the two ends of the inversion region. Taking a close look at +B (of length 820 bp), we can see that for the first inversion (-B-6+B to -B6+B), the real cutting points (breakpoints) are at the left end of -B and the right end of +B, while for the other two inversions (+B7∼64-B to +B-64∼-7-B and +B-64-B to +B64-B), the real cutting points (breakpoints) are at the left end of +B and the right end of -B. Here the real cutting point does not seem to be important, and the repetitive element B should be viewed as the breakpoint.
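The observation that the exact cutting point does not matter can be illustrated at the sequence level. In the sketch below, all sequences are made-up toy strings and `invert_region` is a hypothetical helper; an inversion is modelled as replacing a region by its reverse complement. Cutting at the outer ends of the two inverted copies of B gives exactly the same genome as cutting at their inner ends.

```python
def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def invert_region(genome, start, end):
    """Replace genome[start:end] by its reverse complement."""
    return genome[:start] + reverse_complement(genome[start:end]) + genome[end:]


B = "ACGGTCA"            # toy repeat standing in for +B
minus_B = reverse_complement(B)
middle = "GATTACA"       # toy region between the two copies
left, right = "TT", "CC"

# Layout: left | -B | middle | +B | right
genome = left + minus_B + middle + B + right

outer_start = len(left)
inner_start = outer_start + len(minus_B)
inner_end = inner_start + len(middle)
outer_end = inner_end + len(B)

cut_outside = invert_region(genome, outer_start, outer_end)  # repeats included
cut_inside = invert_region(genome, inner_start, inner_end)   # repeats excluded

print(cut_outside == cut_inside)
```

Because the reverse complement of -B … +B again starts with -B and ends with +B, the two ends of the inverted region are unchanged, and only the interior is flipped.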
Another interesting finding is that for Groups 1, 2, 3 and 4, each scaffold contains one -A and three +As (see Figure 3). Theoretically, this -A can be reused three times, once with each of the three +As. However, we did not observe three such breakpoint reuses in a single pairwise scaffold comparison. Instead, we observe that this -A, along with each of the three +As, mediates three different inversion events, which occur between Group 1 and Group 2, Group 1 and Group 3, and Group 1 and Group 4, respectively (Table 1, rows 2-4).

Transposition
Figure 5 gives the detailed scaffolds for Groups 1 and 7. Both Scaffolds 1 and 7 contain four merged core blocks (1∼17), (18∼46), (47∼60), and (61∼69). Moreover, both Scaffolds 1 and 7 contain another two non-core blocks, DS1 and DS2, where the occurrences of DS1 and DS2 in both scaffolds are 100% identical. In addition, there are three occurrences of a repeat +R in both scaffolds. It can be seen that by swapping 47∼60 and DS1 with 18∼46 and DS2, Scaffold 7 is transformed into Scaffold 1. The most interesting finding is that the three occurrences of +R are located at the three breakpoints of the transposition. We believe that these three occurrences of +R play an important role in this transposition event because the repeat +R ensures that the two ends of the two swapped regions remain unchanged before and after the transposition. This is similar to the mechanism in which inversion regions are flanked by a pair of IRs, where after the inversion the two ends of the inversion region remain the same. For reference, the physical positions of the three +Rs, DS1, DS2 and Blocks 47, 60, 18 and 46 in the chromosomes of Groups 7 and 1 are listed in Supplemental Table S5f.
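The role of the three copies of +R can be illustrated with toy sequences. The sketch below does not reproduce the actual layout of Figure 5; the strings `R`, `X` and `Y` and the helper `transpose` are hypothetical. With a copy of R at each of the three breakpoints, swapping the two regions together with their left copy of R or together with their right copy of R yields the same genome, so the ends of the swapped regions look the same before and after the event.

```python
def transpose(genome, i, j, k):
    """Swap the adjacent regions genome[i:j] and genome[j:k]."""
    return genome[:i] + genome[j:k] + genome[i:j] + genome[k:]


R = "ACGT"        # toy repeat standing in for +R
X = "GGGGG"       # first swapped region (toy sequence)
Y = "TTTAAA"      # second swapped region (toy sequence)
left, right = "CC", "GG"

# Layout: left | R | X | R | Y | R | right
genome = left + R + X + R + Y + R + right

p0 = len(left)            # start of first R
p1 = p0 + len(R)          # start of X
p2 = p1 + len(X)          # start of second R
p3 = p2 + len(R)          # start of Y
p4 = p3 + len(Y)          # start of third R
p5 = p4 + len(R)          # end of third R

swap_with_left_repeat = transpose(genome, p0, p2, p4)   # swap [R X] and [R Y]
swap_with_right_repeat = transpose(genome, p1, p3, p5)  # swap [X R] and [Y R]

print(swap_with_left_repeat == swap_with_right_repeat)
```

Both choices of cutting points give left | R | Y | R | X | R | right, which is the sequence-level analogue of the repeat, rather than a single position, acting as the breakpoint.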
After computing the pairwise inversion distances among the 9 Escherichia coli scaffolds, we selected a scaffold with the minimum inversion distance for each of the 9 scaffolds for comparison, as shown in Table 2. From Table 2, it can be seen that Group 1 is the closest group to all the other 8 groups, with inversion distances ranging from 0 to 4. The closest group to Group 1 is Group 2, where the sign of Block 24 is different. In total, there are 17 inversion events among the 8 distinct pairs in Table 2 (the pair of Group 1 and Group 2 appears twice), and the inversion region lengths vary from 0.0075 Mbp to 1.402 Mbp (see Table 2). Among the 17 inversion regions, 12 are found to be flanked by a pair of inverted repeats in the strains of the source groups. For inversion (-5,5) between Group 1 and Group 6 (row 6 in Table 2) and the four inversions between Groups 1 and 8, no pairs of inverted repeats are found at the two ends of the block. The length of inversion (-5,5) (row 6 in Table 2) is short (7.5 kbp). The four computed inversions between Groups 1 and 8 may not be true since there are 6 other rearrangement events between the two scaffolds (row 8 in Table 2).
Fig. 6. Nine groups of scaffolds for the 31 Escherichia coli strains.
For Groups 6 and 1, the rearrangement distance is five (one independent inverted block interchange and a sequence of four inversions). See Table 2. At the breakpoints of this inverted block interchange, we also find IRs, and we will discuss this in Section 2.2.2. For Groups 8 and 1, after eliminating six independent transpositions, there exists a sequence of four inversions (see Supplement-1). Only one of these four inversions is flanked by a pair of IRs. We observe that there are seven copies of Block 45 in Group 8, and we used the -45 next to -46 for comparison. The distance between Group 1 and Group 8 is large (6 transpositions + 4 inversions), and thus our predicted rearrangement history between Group 1 and Group 8 may not be correct. (Again, for reference, the physical positions of these seven copies of Block 45 in the chromosome of Group 8 are in Supplemental Table S7i.) To obtain Group 1 from Group 9, an independent inverted transposition and an inversion (Block -48 in Scaffold 9) are required (see Table 2). The inverted region (Block -48) is flanked by a pair of IRs (+F/-F) in Group 9 (see Figure 6). In addition, we find that this inverted transposition event is also associated with repeats, and we will discuss this in Section 2.2.3. For all the inversions listed in Table 2, the physical positions and the lengths of the repeats and core-genome blocks (at the two ends of the inversions) in different strains are given in Supplemental Tables S7a-g.
We find a total of 12 different types of pairs of inverted repeats and use the letters from +D/-D to +M/-M, together with +S/-S and +Q/-Q, to label and differentiate these 12 pairs of IRs. The locations of these IRs in the scaffolds are shown in Figure 6. The lengths (in bp), gene products and protein IDs (in the NCBI Protein database) of these 12 IRs are listed in Supplemental Table S8. We note that 7 of these 12 pairs of IRs contain genes that encode transposases.

Breakpoint reuse
The three inversion steps from Scaffold 1 to Scaffold 7 are illustrated in Figure 7. From Figure 7, it can be seen that the breakpoint between 41 and 42 in Scaffold 1 is used twice. The corresponding inversion regions are flanked by -L and +L.
It is worth pointing out that the two +Ms in Scaffold 1 form a pair of direct repeats (DRs). After inversion (35,-41), the pair of DRs of M becomes a pair of inverted repeats. This means that a pair of DRs has the potential to mediate inversions.

Inverted Block Interchange
We find an inverted block interchange between Scaffolds 6 and 1, which is illustrated in Figure 8. In Figure 8, Regions +E-27∼-20+S and -S-13∼-11-E in Scaffold 6 are inversely interchanged with each other to obtain Scaffold 1. The existence of two pairs of IRs (+E/-E and +S/-S) ensures that the two ends of the swapped blocks remain unchanged after the inverted block interchange event. The physical positions of +E/-E, +S/-S and Blocks 27, 20, 13 and 11 in Groups 6 and 1 are listed in Supplemental Table S7h. The other explanation is that an inverted block interchange can be replaced by two inversions. Figure 9 shows the two inversions which can replace the inverted block interchange of Blocks -27∼-20 and Blocks -13∼-11. Each of these two inversions is flanked by a pair of IRs (see Figure 9).
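The equivalence between the inverted block interchange and two IR-flanked inversions can be checked on a simplified scaffold. The layout below is an assumption made for illustration: the merged regions are written as single elements, and a placeholder element `X` stands for whatever lies between +S and -S in Scaffold 6. Each element is a (name, sign) pair and `invert` is a hypothetical helper.

```python
def invert(scaffold, i, j):
    """Reverse scaffold[i..j] (inclusive) and flip the sign of each element."""
    flipped = [(name, -sign) for name, sign in reversed(scaffold[i:j + 1])]
    return scaffold[:i] + flipped + scaffold[j + 1:]


# Simplified layout: +E  -27~-20  +S  X  -S  -13~-11  -E
scaffold6 = [
    ("E", +1), ("20~27", -1), ("S", +1), ("X", +1),
    ("S", -1), ("11~13", -1), ("E", -1),
]

# Route 1: one inverted block interchange (swap the two regions, flip both signs).
direct = list(scaffold6)
direct[1], direct[5] = ("11~13", +1), ("20~27", +1)

# Route 2: two inversions, each flanked by a pair of inverted repeats.
step1 = invert(scaffold6, 1, 5)   # region between +E and -E
step2 = invert(step1, 3, 3)       # region between +S and -S

print(step2 == direct)
print([step2[k] for k in (0, 2, 4, 6)])  # repeats keep position and orientation
```

After the first inversion the two copies of S exchange places and flip, so the scaffold still reads +S … -S, and the second inversion restores the interior. The repeats at both pairs of ends are identical before and after the event.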

Fig. 2. The breakpoint graph for an independent block interchange.

Fig. 3. Eight groups of scaffolds for the 25 Pseudomonas aeruginosa strains. Each orange block stands for a merged block, which may represent several consecutive core-genome blocks. The numbers above each orange block indicate the included core-genome blocks; for example, 1∼5 means the orange block includes five core-genome blocks, which are Blocks 1, 2, 3, 4 and 5. Repeats A, B and O are represented by blue, red and purple triangles, respectively. The arrow directions indicate the positive/negative strand.

Fig. 7. Three inversions between Scaffolds 1 and 7. The breakpoint between 41 and 42 in Scaffold 1 is used twice. See the black arrow.

Fig. 9. Two inversions which can replace the inverted block interchange of Regions -27∼-20 and -13∼-11 between Scaffolds 6 and 1. The first inversion is flanked by +E and -E and the second inversion is flanked by +S and -S. The steps from Scaffold 6 to its next scaffold are omitted.

Table 1. Shortest inversion distance for each of the 8 groups of Pseudomonas aeruginosa.
a Column sG is the source scaffold group; Column cG is the closest scaffold group.
b Inv d indicates the inversion distance between sG and cG after eliminating other independent rearrangement events. R d indicates the distance of other independent rearrangement events.
c The two numbers indicate the starting and ending blocks of the inversion in the source scaffold (sG). The rearrangement scenario is calculated from the source group to the closest group.
d l is the length (in Mbp) of the inversion of the core-genome segments.
e Column IR lists which pair of inverted repeats (A, B or O) flanks the inversion. The numeric code: 0 indicates the respective IR was found only in the source group, 1 indicates the IR was found only in the closest group, 2 indicates the IR was found in both groups.

Table 2. Shortest inversion distance for each of the 9 groups of Escherichia coli.
c l = length of Block 38 + length from Block 29 to Block 37 in Group 6.
d l = length of Block 42 + length of Block 35 in Group 7.
e l = length from Block 41 to Block 36 + length of Block 35 in Group 7.