Analysis of local genome rearrangement improves resolution of ancestral genomic maps in plants

Background: Computationally inferred ancestral genomes play an important role in many areas of genome research. We present an improved workflow for the reconstruction from highly diverged genomes such as those of plants.

Results: Our work relies on an established workflow in the reconstruction of ancestral plants, but improves several steps of this process. Instead of using gene annotations for inferring the genome content of the ancestral sequence, we identify genomic markers through a process called genome segmentation. This enables us to reconstruct the ancestral genome from hundreds of thousands of markers rather than the tens of thousands of annotated genes. We also introduce the concept of local genome rearrangement, through which we refine syntenic blocks before they are used in the reconstruction of contiguous ancestral regions. With the enhanced workflow at hand, we reconstruct the ancestral genome of eudicots, a major sub-clade of flowering plants, using whole genome sequences of five modern plants.

Conclusions: Our reconstructed genome is highly detailed, yet its layout agrees well with that reported in Badouin et al. (2017). Using local genome rearrangement, not only the marker-based, but also the gene-based reconstruction of the eudicot ancestor exhibited increased genome content, evidencing the power of this novel concept.

Data preparation. Our workflow takes as input genome data in the form of GenBank files. However, genome data obtained from different databases (e.g. JGI, NCBI, Ensembl) come in different formats, so these had to be converted to standardize the dataset. Further, scaffolds that were present in the data files but not associated with any chromosome were filtered out. In detail:
Sunflower: Removed scaffolds and added locus tag to all CDS entries.
Grape: Removed scaffolds and added locus tag to all CDS entries.
Artichoke: Removed scaffolds and added locus tag to all CDS entries.
Coffee: A GenBank file was built from Fasta and GFF files by using the auxiliary script fa+gff2gbk.py included in the workflow under the data/scripts folder. 12,996 unmapped scaffolds (totaling 204 Mb) were removed.
Lettuce: A GenBank file was built from Fasta and GFF files by using the auxiliary script fa+gff2gbk.py included in the workflow under the data/scripts folder. Chromosome names had to be shortened, since they were too long and were causing errors in BioPython's GenBank writer.

Workflow availability
The workflow implementation, named ANGORA, is publicly available at https://gitlab.ub.uni-bielefeld.de/gi/angora. All steps necessary to download, configure, and run the workflow and its dependencies are described in the provided README.md. The workflow package includes two small sample datasets, one with 3 unichromosomal genomes of simulated species, and one with 3 multichromosomal genomes of real species (Ostreococcus green algae).

LASTZ: local sequence alignment
The tool used for aligning DNA sequences is LASTZ, which can be obtained at https://github.com/lastz/lastz. However, LASTZ has a bug that, depending on the choice of parameters, causes it to output inconsistent data. At the time this study was published, there was no public release correcting the bug. Therefore, our workflow contains a custom LASTZ hotfix version that can be found at https://gitlab.ub.uni-bielefeld.de/gi/lastz-hotfix. The complete list of LASTZ parameters can be found at https://lastz.github.io/lastz/. Parameters and their settings used in our eudicot study are: Values under 6000 for hspthresh resulted in excessive noise for the eudicots dataset. Table 1 shows how to configure LASTZ parameters in the config.yaml file of our workflow.

Table 1.
LASTZ parameter                      Workflow configuration (config.yaml) entry
any parameter --parameter            lastz params: --parameter ...
any parameter --parameter=<value>    lastz params: --parameter=<value> ...

Filtering families
After the genome segmentation, the resulting families pass through a filtering step performed by the script that post-processes the genome segmentation output (atoms2cog.py). In this filtering step, families that are too large or occur only once can be removed. We have filtered out families according to the following rules:
• Families larger than 98% of all the families (--percent 98);
• Families that have a single representative (--ignore0).
Table 3 shows how to configure family filtering parameters in the config.yaml file of our workflow.

Table 3.
atoms2cog.py parameter               Workflow configuration (config.yaml) entry
any parameter --parameter            cog params: --parameter ...
any parameter --parameter <value>    cog params: --parameter <value> ...
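The two filtering rules can be illustrated with a minimal sketch. It is not the code of atoms2cog.py: the function name, the dictionary input, and the nearest-rank percentile used to interpret "larger than 98% of all the families" are assumptions made for illustration.

```python
def filter_families(family_sizes, percent=98, ignore_singletons=True):
    """Return the names of families that survive both filtering rules.

    family_sizes: dict mapping a (hypothetical) family name to its
    number of members. `percent` is expected to be an integer.
    """
    if not family_sizes:
        return set()
    sizes = sorted(family_sizes.values())
    # Nearest-rank percentile (an assumed interpretation of --percent):
    # the smallest size that is >= `percent` % of all family sizes.
    rank = max(1, -(-percent * len(sizes) // 100))  # integer ceiling
    cutoff = sizes[rank - 1]
    kept = set()
    for name, size in family_sizes.items():
        if size > cutoff:
            continue  # rule 1: family is larger than `percent` % of families
        if ignore_singletons and size == 1:
            continue  # rule 2: family has a single representative
        kept.add(name)
    return kept


# Toy data: 98 families of size 2, one singleton, one very large family.
toy = {f"fam{i}": 2 for i in range(98)}
toy["singleton"] = 1
toy["huge"] = 500
kept = filter_families(toy)
print(len(kept), "singleton" in kept, "huge" in kept)
```

In this toy input only the 98 families of size 2 are retained; both the singleton and the oversized family are dropped.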

Gecko3-DCJ: discovering syntenic blocks
Syntenic blocks in our workflow are found by Gecko3-DCJ (https://gitlab.ub.uni-bielefeld.de/gi/gecko-dcj), which finds (reference-based) approximate common intervals and quantifies their structural similarity by means of the local DCJ similarity score. A collection of intervals associated with genome content G is approximate common if the symmetric difference between the genome content of each interval and G is bounded by $\delta_{\mathrm{sum}}$ and, more specifically, the number of excessive (i.e., inserted) markers is bounded by $\delta_{\mathrm{add}}$, and the number of missing markers by $\delta_{\mathrm{loss}}$. The two $\delta$ tables (default and relaxed) used in the eudicots study are shown in Table 4 (-dT <table> parameter). The quorum parameter q was set to 3 using the -q 3 option. All Gecko3-DCJ options used on the command line can also be set through its graphical interface.
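The three bounds in this definition can be checked as in the following minimal sketch. It only illustrates the definition and is not Gecko3-DCJ code: the function names are hypothetical, and genome content is modelled as a plain set of marker families, which is a simplifying assumption.

```python
def within_bounds(reference_content, interval_content,
                  delta_add, delta_loss, delta_sum):
    """Check one interval against the genome content G of the reference."""
    ref = set(reference_content)
    occ = set(interval_content)
    added = len(occ - ref)    # excessive (inserted) markers
    missing = len(ref - occ)  # markers of G absent from the interval
    # added + missing is the size of the symmetric difference.
    return (added <= delta_add
            and missing <= delta_loss
            and added + missing <= delta_sum)


def is_approximate_common(reference_content, intervals,
                          delta_add, delta_loss, delta_sum):
    """True if every interval of the collection satisfies all three bounds."""
    return all(
        within_bounds(reference_content, interval,
                      delta_add, delta_loss, delta_sum)
        for interval in intervals
    )


G = {1, 2, 3, 4, 5}
intervals = [[1, 2, 3, 4, 9], [1, 2, 3, 4, 5]]  # first: 1 inserted, 1 missing
print(is_approximate_common(G, intervals, delta_add=1, delta_loss=1, delta_sum=2))
print(is_approximate_common(G, intervals, delta_add=1, delta_loss=1, delta_sum=1))
```

The first call accepts the collection, while the second rejects it because the first interval has a symmetric difference of 2 with G.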

ANGES: Ancestral genome reconstruction
In this last step of the pipeline, ANGESpy3 (a port of ANGES to Python 3) was used with default parameters. In the main experiments of this work, we provided ANGESpy3 with syntenic block scores calculated as the average local DCJ similarity between the block occurrence in the reference genome (grape) and all block occurrences in the other species. In addition, the heuristic algorithm was used to reconstruct the PQ-tree. ANGESpy3 can be downloaded from https://gitlab.ub.uni-bielefeld.de/gi/angespy3. Table 6 shows how to configure ANGESpy3 behavior in the config.yaml file of our workflow.

In this work, the proportions of genomic markers attributed to each ancestral chromosome were compared to the proportions derived from Badouin et al.'s [1] gene-based reconstruction. Figure 2 shows this comparison of ancestral genome content with respect to coffee and grape chromosomes. The genome architecture of grape is closest to that of the post-γ ancestor; therefore, the layout of the grape genome serves as a proxy for reconstructing the genome of the post-γ ancestor in this work.

The proportions of shared content between the grape genome and one of the other genomes in the reconstructed ancestor are calculated as follows. Given a chromosome C of the other species, let S be the set of syntenic blocks occurring in C that are part of some contiguous ancestral region (CAR). We compute the percentage of blocks in S that also occur in each of the chromosomes of the grape genome. For example, for chromosome 1 of the coffee genome, 385 syntenic blocks are part of some CAR; of these, 233 blocks (60.52%) occur in chromosome 6 and 127 (32.99%) in chromosome 7 of the grape genome. Figure 2 shows these proportions.
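The proportion calculation can be sketched as follows. The function name and the input layout (a set of block identifiers plus a mapping from each block to the grape chromosomes in which it occurs) are assumptions for illustration, not the data structures of the workflow. The toy input is built only to reproduce the coffee chromosome 1 counts given above; the remaining 25 blocks are left without a grape occurrence because their placement is not stated.

```python
from collections import Counter


def grape_proportions(blocks_in_cars, grape_occurrences):
    """Percentage of blocks in S that occur in each grape chromosome.

    blocks_in_cars: the set S of syntenic blocks of chromosome C that
        are part of some CAR.
    grape_occurrences: dict mapping a block to the grape chromosomes
        in which it occurs (blocks without an entry are not counted).
    """
    total = len(blocks_in_cars)
    if total == 0:
        return {}
    counts = Counter()
    for block in blocks_in_cars:
        # A block is counted once per grape chromosome, even if it
        # occurs several times on that chromosome.
        for chrom in set(grape_occurrences.get(block, ())):
            counts[chrom] += 1
    return {chrom: 100.0 * n / total for chrom, n in counts.items()}


# Toy input: 385 blocks, 233 occurring in grape chromosome 6, 127 in 7.
blocks = set(range(385))
occurrences = {b: ["6"] for b in range(233)}
occurrences.update({b: ["7"] for b in range(233, 360)})
proportions = grape_proportions(blocks, occurrences)
print({chrom: round(p, 2) for chrom, p in proportions.items()})
```

Rounded to two decimals, the result matches the 60.52% and 32.99% reported for grape chromosomes 6 and 7.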
Based on these calculated proportions, the difference between the layouts given by our ancestral reconstruction and the ancestral reconstruction by Badouin et al. (as computed in the main manuscript) is used as a measure of how much the two reconstructions differ. Our measure is simply the average absolute difference, over all chromosomes, between our calculated proportions and those reported by Badouin et al.
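One possible reading of this measure is sketched below. It assumes that the average is taken over all pairs of a chromosome C and a grape chromosome, and that a proportion missing from one reconstruction counts as 0; both choices, as well as the function name and the toy numbers, are assumptions for illustration and not values from the study.

```python
def mean_absolute_difference(ours, theirs):
    """Average absolute difference between two sets of proportions.

    ours, theirs: dicts mapping a chromosome of the other species to a
    dict {grape chromosome: percentage}.
    """
    diffs = []
    for chrom in set(ours) | set(theirs):
        a = ours.get(chrom, {})
        b = theirs.get(chrom, {})
        for grape_chrom in set(a) | set(b):
            # A proportion absent from one side is treated as 0 (assumption).
            diffs.append(abs(a.get(grape_chrom, 0.0) - b.get(grape_chrom, 0.0)))
    return sum(diffs) / len(diffs) if diffs else 0.0


# Hypothetical proportions for a single chromosome.
ours = {"chr1": {"6": 60.0, "7": 30.0}}
theirs = {"chr1": {"6": 50.0, "7": 40.0}}
print(mean_absolute_difference(ours, theirs))
```

With these hypothetical inputs both entries differ by 10 percentage points, so the measure is 10.0.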