Our method starts with the collection of Computational Synteny Blocks (CSB) - similar to SB associated with coding regions, and CSB also covering non-coding regions. The CSBs are calculated using GECKO-CSB [15] (second step in Fig. 1). Applying linearity and collinearity functions (described in [15]) over the CSB provided by GECKO-CSB we identify LSGR (so far duplications, inversions and translocations). The next step — which is reported in this document- is the precise refinement of the borders of CSBs involved in every detected LSGR (third step in Fig. 1). This refinement is applied to the sequences involved in calculation (namely sequences X and Y) in two independent and separable processes. After that we combine the results to get the final refinement. Figure 1 describes the workflow step by step.
Once an LSGR is detected, we take the two CSBs involved. The repetitions in between them, if any, are also take into account. Then we define a region of interest (ROI) running from the tail of one CSB to the head of the other (step 4 in Fig. 1). This ROI includes an arbitrary offset to force the overlapping between CSBs and repetitions (see Figs. 2 and 12). A virtual CSB (C
S
B
V
) and virtual repetitions are created by extending the borders in order to cover the ROI. Afterwards, these C
S
B
V
and virtual repetitions are aligned using a fast customized implementation of the Needleman and Wunsch [17] global alignment method. The main idea of this process is to force overlapped regions to study the alignments within the ROI.
At this point an identity vector for every aligned C
S
B
V
and all repetitions is computed (step 5 in Fig. 1. See Additional file 1 for more details). Then, a “difference vector” (V
diff
) is calculated (step 7). If we are working with only two C
S
B
V
, the V
diff
contains the normalized absolute difference between the two identity vectors. If besides that we are working with repetitions, we compute the V
diff
taking into account a consensus identity vector from the repetitions (step 6).
The rationale behind the method is the following: The V
diff
vector contains high values when identity vectors are different. In those regions where values are similar in both identity vectors, the values contained in V
diff
will be low. At some point we will observe a transition between high and low values along the V
diff
vector. These transitions will define the BP. A finite-state machine (FSM) was designed to detect these transitions (step 8). At the end of the process, CSB borders are refined based on the BPs detected by the FSM. The method does not force the BP to be a region or a point. This will depend on the transition’s features.
Detection of CSBs repetitions and large-scale genomic rearrangements
CSBs and repetitions are detected using Gecko-CSB [15], an extension of Gecko [18]. This software has demonstrated its capacity to yield HSPs of high-quality beating reference software. In [15] we presented a set of formal definitions describing different levels of linearity and collinearity between CSBs. Using these definitions, a set of rules was defined to identify LSGR in single chromosome species, such a inversions, translocations, reverted translocations and duplications. Once a LSGR is detected, we perform our refining method over those CSBs involved in the LSGR.
After the detection of a LSGR two CSBs (namely A and B) are selected. Optionally, if collinearity between CSB A and CSB B is interrupted by a set of repeats, the repeats will be included in the selection as well. Repeats can be separated in two groups. Those repeats whose coordinates in the sequence X overlap with CSBs A and B are grouped in a collection named repeats-X. In the collection repeats-Y are the equivalents regarding sequence Y.
Refining CSBs
At this point the method splits in two branches. The refinement in the sequence X and Y are complementary and independent. In this document we will describe the refinement for the sequence X branch. The sequence Y branch is the same, but interchanging X by Y.
Calculating the region of interest
The CSBs and repeats define a ROI (see Eq. 5, Figs. 2 and 12). Since our method is focused on finding transitions between CSBs and repetitions, we introduce an offset parameter, which ensures overlapping between the end of CSBs and the beginning of virtual CSBs and the virtual repetitions, guaranteeing that transitions are present. In the worst case, the method will have offset number of nucleotides in both CSBs that share similarity and therefore, they can be aligned with a high value of identity. In other words, the offset parameter stabilises the beginning and the end of the signal (More details in “FSM thresholds selection” in the Additional file 1). The ROI is defined as follows:
$$ {}\begin{aligned} ROI_{xStart} &= min(A_{xEnd}, B_{xStart}, Repeats_{xStart}) - {offset} \\ ROI_{xEnd} &= max(A_{xEnd}, B_{xStart}, Repeats_{xEnd}) + {offset} \\ ROI_{yStart} &= min(A_{yEnd}, B_{yStart}, Repeats._{yStart}) - {offset} \\ ROI_{yEnd} &= max(A_{yEnd},B_{yEnd},Repeats_{yEnd}) + {offset} \end{aligned} $$
(1)
After calculating the ROI, new CSBs named virtual CSBs (C
S
B
V
) are created using the ROI X
Start
and X
End
coordinates. This means that all C
S
B
V
s will start and end at the same point. In this step we are extending or trimming the old CSBs concerning ROI start and end points. New C
S
B
V
s’ Y coordinates will be calculated depending on how much we have trimmed or extended the coordinates in X regarding the old CSB. The equations that describe this process are the following:
$$ \begin{aligned} {CSB}_{V xStart} &= ROI_{xStart} \\ {CSB}_{VxEnd} &= ROI_{xEnd} \\ \alpha_{L} &= CSB_{xStart} - {CSB}_{VxStart} \\ \alpha_{R} &= {CSB}_{VxEnd}- {CSB}_{VxEnd} \\ {CSB}_{VyStart} &= CSB_{yStart} - \alpha_{L} \\ {CSB}_{VyEnd} &= CSB_{yEnd} + \alpha_{R} \end{aligned} $$
(2)
Notice that α takes negative values when trimming and positive when extending. New C
S
B
V
s are aligned using a Needledman and Wunsch implementation.
Calculating identity vectors
After the alignment of C
S
B
V
s, identity vectors (I
V
) are created for every C
S
B
V
. All I
V
s have the same length and they represent the percentage of identity that a certain region of length W has in the alignment. We take a window of length W to calculate that percentage of identity.
First we create a binary vector (V
B
) which represents matches in the alignment. V
B
has the length of the alignment. Since V
B
takes into account gaps, its length can be different from one C
S
B
V
to another. By using a window of length W, we can compute the percentage of identity at any point in V
B
. As long as we are going to compare I
V
from different C
S
B
V
s, identity values from those points in the alignment that represent a gap in sequence X are not stored. This way, all identity vectors from different C
S
B
V
s will have the same length, R
O
I
length
.
Low values in parameter W produce a noisy identity vector corresponding with high frequency changes of identity. On the contrary, high values in parameter W smooth the noise and produce a low frequency signal. The selection of a proper W value is not possible as it might change depending on the C
S
B
V
involved. We could also be interested on changes that happen at different frequencies. Therefore, instead of choosing a fixed W value, which would mean changes at only one frequency, we build a vector containing all frequencies as follows:
$$ I_{V}(x) = \sum\limits_{i=0}^{N} A_{i}I_{i}(x) $$
(3)
where A
i
is the weight of the identity vector at a certain frequency
$$ \sum\limits_{i=0}^{N} A_{i} = 1 $$
(4)
And the Identity vector at a certain frequency is calculated as follows:
$$ I_{i}(x) = \frac{1}{2N+1}\sum\limits_{j=x-N}^{x+N} V_{B}(j) $$
(5)
In this model, N defines the maximum window to compute the percentage of identity and also defines the start and end positions where the values of the vector can be used. From 0 to 2N+1 and from 2N+1−R
O
I
length
to R
O
I
length
the I
V
is uncompleted. Therefore, N cannot be as long as we want. It should be at least lesser than OFFSET. In practice we have observed that a value of 50 is enough to get good results.
Finally, since identity vectors are going to be compared, they must to be normalized.
Calculating consensus identity vector
In the case that a group of repetitions are detected, we use the information of the consensus sequence to improve accuracy of the refinement method.
After repeats have been aligned and the V
B
s have been computed, a Sum Match Vector (V
SM
) is calculated by adding all V
B
s vectors. This vector has a length of R
O
I
length
, so only positions which are not representing a gap are taken into account -as we did in the previous section. Then, we calculate the percentage of repeats that cover one specific position in the V
SM
. To calculate the Consensus Identity Vector (V
CI
), only positions that comply with a given threshold are set to 1, and 0 otherwise. In this implementation the threshold was set to 25 %. This new vector is named Consensus Binary Vector. After this process, we calculate the V
CI
by processing the Consensus Binary Vector as we already described in the previous section.
Vector difference
In order to detect transitions which delimitate the BP, we compute the absolute difference between the C
S
B
V
s identity vector. C
S
B
V
s are extracted from CSBs according to the ROI, using the OFFSET to ensure that similar regions are represented in C
S
B
V
s. As a result, the identity vectors for the C
S
B
V
-A have a high value at the beginning and low value at the end. On the contrary, the identity vectors for the C
S
B
V
-B have a low value at the beginning and high value at the end. This is the reason why the vector difference will start and end with high values. If repetitions are detected, then the difference vector will have high values in the middle as well.
Anyways, transitions will be found in between these high values (see Fig. 2).
Detecting transition points
To detect transitions a Finite-State Machine (FSM) was designed. Figure 3 shows the design. Basically, the FSM detects the coordinates where the vector difference value was the last time at a certain threshold (U1) before reaching the second threshold (U2). As a result, the selected region defined by the coordinates is the transition between high and low values along the vector difference.
We associate these transitions as a candidate for a BP. After this process, the refined SB can be trimmed or extended. The threshold selection is discussed in the next section.