Getting insight into the pan-genome structure with PangTree

Background The term pan-genome was proposed to denominate collections of genomic sequences jointly analyzed or used as a reference. The constant growth of genomic data intensifies development of data structures and algorithms to investigate pan-genomes efficiently. Results This work focuses on providing a tool for discovering and visualizing the relationships between the sequences constituting a pan-genome. A new structure to represent such relationships – called affinity tree – is proposed. Each node of this tree has assigned a subset of genomes, as well as their homogeneity level and averaged consensus sequence. Moreover, subsets assigned to sibling nodes form a partition of the genomes assigned to their parent. Conclusions Functionality of affinity tree is demonstrated on simulated data and on the Ebola virus pan-genome. Furthermore, two software packages are provided: PangTreeBuild constructs affinity tree, while PangTreeVis presents its result.


Pan-genomes
The amount of genomic data is enormous and its growth rate is increasing constantly. As an example, bacteria genomes availability has risen over 10,000-fold over last 25 years [1]. One of the reasons for this phenomenon is that DNA sequencing technologies are more and more cost and time efficient. Since sequencing cost is decreasing faster than storing and processing costs, optimization of these operational expenditures has become a substantial issue [2]. On the other hand, access to numerous sequenced genomes enhances the potential of comparative genomics.
*Correspondence: dojer@mimuw.edu.pl 1 Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Banacha 2, 02-097 Warsaw, Poland Full list of author information is available at the end of the article Consequently, the idea of a pan-genome has emerged. Initially proposed as a single data structure for joint analysis of a group of bacterial genes [3], in the presence of a variety of whole genome sequences available this term has evolved. Currently, it refers to a model of joint analysis of genomes of different organisms. Related data structure is expected to support various operations, including construction, comparison, visualization, annotation, read mapping etc. [4].
Simultaneous access to many genomes makes the data structure information-rich. Pan-genome can serve as a demonstration of single organism's genome compared to its relatives, but also as a complex scheme exposing the variety of the component sequences. The first perspective is particularly important for applications in personalized medicine, because it makes possible to highlight one's individual properties which can be crucial in the mean of its phenotype. The other one can be useful when a reference genome for a population is needed.

Modeling pan-genomes
Several pan-genome models were proposed, ranging from collections of unaligned sequences to sophisticated (e.g. HMM-based) models that require complex preprocessing of sequence data [4]. Compromise solutions include multiple sequence alignment (MSA) and various sequence graphs.
The classical MSA model has a form of a matrix with rows obtained from input sequences by inserting gaps in appropriate positions. This model has several advantages, e.g. matrix columns naturally define a joint coordinate system for all input sequences. Additionally, most common sequence evolutionary events -insertions, deletions and substitutions -are well highlighted. However, such representation cannot handle rearrangements that violate colinearity of input sequences, e.g. duplications or inversions. In order to properly represent such rearrangements, the whole alignment is usually split into blocks, which represent aligned fragments of input sequences as classical colinear MSA. A small example of such block alignment scheme is presented in Fig. 1.
Sequence graphs naturally represent rearrangements of any kind. Moreover, in these models identical fragments of different input sequences may be combined into one, reducing the MSA redundancy in sequence representation. Such reduction is particularly important when a pan-genome is used for read mapping. However, it comes with the cost of loosing information regarding the input sequences, so additional graph annotation is required to provide a common coordinate system.
Graph-based MSA models were also proposed. One of the first such solutions was probably Partial Order Alignment (POA) [5]. In this approach graph nodes represent residues of input sequences and directed edges represent consecutive residues. Moreover, undirected edges join aligned residues with different labels. As illustrated in Fig. 2, this graph can be constructed from an alignment matrix straightforwardly: in each column identical letters are collapsed into one node, undirected edges link nodes obtained from the same column and directed edges join nodes representing consecutive residues from at least one of the sequences.
Several graph-based alignment models were applied to the problem of whole genome alignment, e.g. A-Bruijn graph [6], Enredo graph [7], Cactus graph [8] (see [9] for review). There were also proposed alignment-free sequence graph models (e.g. colored de Bruijn graphs [10]), having meaningful advantages in terms of scalability and performance.

Pan-genome structures
All multiple sequence alignment representations are complex, even classical colinear one. Some MSA applications require previous extraction of selected properties that could be represented in a more compact way. For example, sequence profile is a vector of symbol frequency distributions in particular MSA columns, consensus sequence is composed of symbols dominating consecutive columns, while information content vector quantifies the levels of that domination.
The above concepts could be straightforwardly extended to most graph-based alignment models, but we anticipate possible results unsatisfactory for at least two reasons. Firstly, the result would be still a sequence graph with the same or similarly complex structure, and this structure is what makes the graph alignment hard to use. Secondly, in all the above objects MSA columns are treated independently, while graphs encode some dependency between residues (e.g. two nodes or paths can represent alternative variants of the same locus). We find this property an important advantage of graph-based  One of the possible solutions was proposed in [11]: instead of a single sequence profile or consensus there should be computed several consensus sequences, which in common would represent the whole (potentially heterogeneous) set of aligned sequences. The paper presents an algorithm identifying homogeneous sequence subgroups and computing for each subgroup a consensus sequence. Input alignment for the algorithm is represented as a POA and each consensus sequence forms a path in a POA graph. Fig. 3 Distributions of the value of sequence compatibility with consensuses C 1 (top), C 2 (middle) and C 3 (bottom), computed in consecutive iterations. Vertical green lines indicate resulting compatibility thresholds, i.e. right ends of the largest gaps between individual compatibility values (indicated by short vertical blue lines). C 1 is calculated from all input sequences, while C 2 and C 3 -only from sequences having in the previous iteration compatibility above the threshold. The set of sequences exceeding the compatibility threshold has changed in the second iteration, since a more homogeneous subgroup was recognized with C 2 . The third iteration doesn't change the respective set and the procedure converges. However, consensus C 3 represents selected sequences better than C 2 , which is reflected by their larger average compatibility

Our goals
The concept of POA consensus sequences has found several applications. Examples include paralog separation in EST alignments [11], gene isoform identification in de-novo assembly of RNA-seq data [12] and assembly of third generation sequencing reads [13]. However, to our best knowledge, it has never been applied to pan-genomic research.
The aim of this article is to fill this gap. For this purpose we extend the above concept, based on the following principles: • pan-genome may be split into different genome subsets, depending on the assumed subset homogeneity level, • subsets of a wide range of levels can contribute to the description of the relationships between genomes; the same applies to their consensus sequences, • the image of the pan-genome structure should be complemented by subset hierarchy.
In the following sections we introduce the notion of affinity tree that meets the above requirements. Furthermore, we propose an algorithm computing affinity tree for given MSA, implemented in the package PangTreeBuild. Functionality of our approach is evaluated on simulated data and on the MSA of Ebola virus genomes. Finally, we present a tool for affinity tree visualization, called PangTreeVis.

POA and POA consensus sequences
Partial Order Alignment (POA) [5] was probably one of the first graph-based MSA models. In this approach graph nodes represent residues of input sequences (aligned residues are combined into one if they are labeled alike) and directed edges join residues that are neighboring in at least one of the input sequences. Moreover, nodes representing aligned residues with different labels are connected with undirected edges.
Lee [11] presented a tool for calculating POA consensus sequences. In this method each edge is assigned a weight corresponding to the number of covering input sequences. Then HeaviestBundle algorithm searches the heaviest path in the graph using dynamic programming. Next, the algorithm calculates the compatibility of each sequence with the consensus, defined as where N i is the length of seuquence i and N ij is the number of POA nodes belonging to both the sequence i and the consensus j. The sequence is assigned to the consensus, if the result is larger than the assumed threshold. Finally, edge weights are recalculated and the whole procedure is repeated for remaining sequences until all sequences are assigned or none of them exceeds a compatibility threshold. This forms a partition of the input sequences into subsets with assigned consensus paths.

Affinity tree
As opposite to the flat division produced by the algorithm of Lee [11], affinity tree allows to represent the hierarchy of sequence subsets. Each node of this tree has the following attributes assigned: • a subset of input sequences, • a consensus sequence being their common representation, • minimum of the sequence-consensus compatibilities (minComp).
The value of the minComp attribute reflects node's homogeneity level. The root node has all input sequences assigned and each non-leaf node has at least two children nodes that form a partition of the sequences assigned to their parent into more homogeneous subsets. Subsets with homogeneity level bigger than the assumed parameter stop form leafs.

Affinity tree building algorithm
The input of the algorithm consists of a set of aligned sequences, represented with the POA model. Affinity tree is created in BFS ordering: starting from the root, sequences assigned to consecutive nodes are split into smaller, more homogeneous subsets until the stop homogeneity level is achieved. The pseudocode of the algorithm is shown in Algorithm 1.
The procedure of splitting sequences into subsets is similar to that of Lee [11], but we introduced some important modifications. The first one regards the choice of subset consensus sequences. In Lee's algorithm they are computed for the whole current set of sequences and then sequence subsets are selected based on the compatibility with the consensus. The artifact of this approach is that fragments of the consensus sequence may be representative to sequences not assigned to it rather than assigned ones. In the case of longer (e.g. genomic) sequences this behavior occurs more commonly than in the case of short ones which Lee's algorithm was designed for. In order to avoid this artifact we recalculate consensuses based on selected sequences only and repeat this procedure until convergence. We observed that very seldom more than two iterations are needed, typical case is presented on Fig. 3.  Another problem with a direct application of Lee's algorithm is that the nodes of the tree have various homogeneity levels, which are not known a priori. In order to avoid using an arbitrarily chosen threshold of Lee's approach, we invented a data-driven compatibility threshold selection method. The idea is that the compatibilities between sequences and a consensus should form clusters corresponding to homogeneous subsets. In particular, good candidates for consensus-assigned sequences should be separated from poor ones by a large difference in compatibility. Therefore, as the cutoff threshold we choose the right boundary of largest gap in sorted compatibility values.
This rule works well in the case of the first child of a node. However, in the other case this procedure would force splitting remaining sequences into smaller and smaller subsets with inadequately high homogeneity levels. Therefore, after creating the first child of a node, for the purpose of threshold selection we introduce two modifications to the list of sequence compatibilities: • parent minComp value is added, • all but the smallest compatibilities larger than all minComp values of already created siblings are removed.
Since the parent minComp value is smaller than all considered compatibilities, the first modification allows (and defines criteria for) creating a node covering all remaining sequences. The second modification restricts searching to compatibility gaps overlapping the interval covering the minComp values of a parent and already created siblings (however, the right end of the gap still can be larger).
Although the restriction could potentially cause choosing an accidental threshold, the interplay between data-driven threshold selection and iterative consensus recalculation (see Algorithm 2 for details) minimizes this effect. Finally, we introduced an additional parameter P that controls the granularity level of the resulting tree. For the purpose of calculation of the largest gap compatibilities and minComp values are transformed according to the formula t p (c) = c p For P > 1 the distance between smaller compatibilities decreases and the distance between higher compatibilities increases. Consequently, the sequence set assigned to a particular node is split into smaller and more homogeneous subsets and hence more children are created for the node. The opposite happens for P < 1. The effect of this parameter is presented in Fig. 4.

Implementation
The algorithm was implemented in software called PangTreeBuild, which is available as a Python 3.6 package or a console application for Linux operating system. Moreover, it could be executed from the PangTreeVis web browser interface (see Fig. 5).
As an input PangTreeBuild requires a block alignment in MAF format, which is widely used to represent whole genome alignments and can be interpreted as a pangenome model. PangTreeBuild internal data model is POA so alignment blocks are converted to it as shown in Fig. 2. Subsequently, edges between blocks are filled in and the connections that create cycles are identified and removed using the Mafgraph tool [14]. Next, unaligned fragments of input sequences, which are usually absent in MAF files, are complemented from NCBI database or provided FASTA files.
Given such a pan-genome model and parameters P and stop, AffinityTree procedure is run. The constructed data model together with the resulting affinity tree are saved as a JSON file, which can be visualized using PangTreeVis tool.
PangTreeVis is an interactive browser application. The visualization has two parts (see Fig. 6). The first onepan-genome graph -consists of two views: general (for navigating purposes) and detailed (single residue resolution). The second part is affinity tree visualization and contains tree diagram featured by its detailed characteristics and metadata that could be provided in advance for PangTreeBuild.

Results
The PangTreeBuild algorithm was applied to two types of MSA datasets: simulated (yielded from genome sequence evolution simulations) and real-life (computed for Ebolavirus genomes).

Simulated data
To test whether affinity tree correctly discovers evolutionary patterns, we generated MSA following sequence evolution simulation. We applied the scheme used in the data simulated for the purpose of Alignathon assessment [15]: MSA was generated using genome simulation software Evolver [16] together with evolverSimControl [17] -a wrapper for running Evolver's simulations guided by phylogenetic trees.
The root sequence consisted of two 100kbp-long fragments of human chromosomes 20 and 21. Two evolutionary parameter sets were prepared, both based on the example provided in the evolverSimControl repository. In the first parameter set, possible evolutionary events were restricted to substitutions, deletions and nonduplicative insertions, resulting in colinear MSAs (i.e. without rearrangements). In the second one, genome rearrangement events were added: intra-chromosomal inversions, tandem repeats, transpositions and duplications, as well as inter-chromosomal transpositions and duplications.
The simulations were guided by a phylogenetic tree of 138 yeast strains (reconstructed in [18] and deposited in TreeBASE repository under accession no. S12670). From each simulation, the MSA reflecting simulated evolutionary events was created (see [17] for details) and restricted to chromosome 20 sequences of leaf genomes. In this way 10 sequence sets and their MSAs were generated for each simulation parameter set. All input datasets, parameters and simulation scripts are provided as example data in the PangTreeBuild repository.
For each MSA affinity trees were generated by PangTreeBuild with parameters stop= 0.999 and P ∈ {0.25, 1, 4}. Similarly, Lee's algorithm was applied to each  MSA with three compatibility thresholds: 0.98, 0.99 and 0.995. Resulting partitions were converted to affinity trees in the following way: each leaf was a child of the root and represented either a subset of Lee's partition or an original sequence not assigned to any subset. For the purpose of comparison with the original phylogenetic tree using the Robinson-Foulds (RF) metric, we added to each leaf children, representing original sequences assigned to a corresponding subset. Results are presented on Fig. 7.
RF distance is defined as the number of splits that are induced by one of the trees but not by the other. The number of splits induced by a single tree equals the number of its nodes. For a fixed number of leafs, the largest possible number of splits is obtained by bifurcating trees. In our case phylogenetic tree is bifurcating, while affinity trees are highly multifurcating, since they are intended to express only the most evident partitions. Consequently, the part of the RF distance that is due to splits induced by the phylogenetic tree only reflects the granularity of affinity trees rather than the inconsistency between trees. As is shown on Fig. 7, such splits contribute to > 80% of the distances for PangTree affinity trees and > 95% of the distances for Lee's affinity trees. However, in the last case this is very close to the total number of non-trivial splits in the phylogenetic tree (i.e. 137), which means that only very few of them are shared by Lee's affinity trees.
The above observation is confirmed by Fig. 8, which presents more detailed analysis of the reconstruction of splits induced by the phylogenetic tree. PangTree algorithm reconstructs splits induced by most of longer edges, especially in the case of datasets without rearrangements, while Lee's algorithm performs very poorly.
The red bars on Fig. 7 represent the splits that can be considered as false positives (i.e. supported by affinity trees, but absent in the original phylogenetic tree). Absolute numbers of such splits are visibly higher for the PangTree algorithm than for Lee's algorithm, which is due to the difference in the total number of splits yielded by both methods. The proportion of false positive splits (i.e. false discovery rate) for the PangTree algorithm is near the half of the one for the Lee's algorithm.
In order to evaluate the behavior of minComp as a subset homogeneity measure, we compared it against the corresponding measure in the phylogenetic tree -maximal distance from an internal node to its leafs. Results are presented on Fig. 9.

Ebola dataset
Ebola outbreak in 2014 intensified studies concerning this dangerous virus. As it is crucial to find cure and vaccine for the illness it causes, a large set of sequenced samples was prepared. The data and associated studies are collected in UCSC Ebola Portal [19]. The dataset include MSA of 158 genomes of ebolavirus and 2 genomes of marburgvirus (the closest Ebola relative). Every sequence from this alignment is assigned to one of the seven groups:  [20], the comparison is presented in Table 1.
The length of a single genome of Ebola or Marburg virus is ∼ 19kbp. Due to high sequence similarity POA for all 160 genomes contains only 70603 nodes. Affinity tree computed with the parameters P = 0.25, stop= 0.99 is presented in Fig. 10. In order to look into details of the relationships between species, we calculated compatibilities between their consensus sequences. Results are presented in Table 2. All the compatibilities are around 0.6, but slightly higher are those between Zaire ebolavirus and other species, as well as those between Tai Forest and Bundibugyo ebolaviruses.
The compatibility measure describes similarity averaged over the whole sequence. In order to analyze its variation over the genome sequence, we calculated local compatibilities between consensuses of the ebolavirus species. Results presented on the Fig. 11 show that local compatibilities for all species pairs share similar patterns. Firstly, compatibility significantly decreases in non-coding areas. Secondly, there are few regions of lower similarity Fig. 9 Homogeneity of sequence groups measured in affinity trees and in the phylogenetic tree. Each inner node of an affinity tree is represented by a single dot. Homogeneity of the corresponding sequence group is measured in the affinity tree by 1 − minComp (X-axis) and in the phylogenetic tree by the maximum distance from leafs representing these sequences to their lowest common ancestor (Y-axis). Top row: PangTree algorithm, bottom row: Lee's algorithm. Left column: simulations without rearrangements, right column: simulations with rearrangements in coding areas. One of them, near the 3' end of gene L (around position 16.5kbp), is known to exhibit some interspecies variation [21].

Performance
Computational requirements of affinity tree calculations for datasets used in this study are compared in Table 3. All the computations were performed on a laptop with a 1.8GHz Intel Core i7-8565U CPU and 16GB RAM. Bombali -

Marburg 1987
The requirements seem to scale linearly with respect to the size of input data, so PangTreeBuild probably may be applied to hundreds of virus genomes or dozens of bacterial ones. However, the impact of MSA complexity could be crucial and is hard to estimate.

Discussion
The relationships between homologous sequences are usually represented using two complementary structures -phylogenetic tree and multiple sequence alignment. In the current work we proposed a novel structure, called affinity tree, which joins both perspectives. As opposite to phylogenetic trees, affinity trees are not intended to be a detailed reconstruction of evolutionary history. Instead, they provide: • hierarchy of most evident homogeneous sequence subgroups, • subgroup reference (or consensus) sequences, • graph-based multiple sequence alignment, joining input and reference sequences, • local and global sequence homogeneity measures.
Consequently, affinity tree can serve as a pan-genome model, supporting interesting features for comparative  genomics studies. Given a whole-genome alignment, affinity tree specifies homogeneous subgroups of contributing genomes. Each subgroup is characterized in terms of its consensus sequence and subgroup genetic diversity. Graph-based alignment induced on consensus sequences represents spatial distribution of similarity between subgroups, while alignment between sequences constituting a subgroup and their consensus -of similarity within the subgroup. Furthermore, spatial distribution of compatibility of mixed individual's genome with consensus sequences of various subgroups can delineate the mosaic structure of that genome. In summary, graphbased alignment of consensus sequences can serve as a diversified population reference genome. Several practical issues should be considered when using affinity trees in comparative genomics studies. First, joint visualization of the features supplied by the affinity tree model is challenging. PangTreeVis provides the visualization of the alignment graph (on both genomescale and nucleotide level) and of the tree structure, enables interactive subgroup selection and viewing genome-subgroup relationships. Although the basic requirements are satisfied, the advantages of the affinity tree model could be highlighted better if some extensions were provided, e.g. graph visualization on intermediate scale levels, support for local genetic diversity layer and additional layers (annotation, experimental etc.).
Second, the quality of the affinity tree hardly depends on the provided whole-genome alignment. PangTreeBuild algorithm was designed to pan-genomes with limited number of structural variants. It performs well on datasets examined in the current study, but our experiments show that large-scale rearrangements influence both the algorithm efficiency and the quality of resulting affinity tree. Consequently, complex alignment graphs may require an adjusted building algorithm.

Conclusions
Affinity tree and its generation algorithm support the pan-genomic research field. Apart from hierarchical division of aligned genomes, affinity tree describes homogeneity of resulting subgroups and provides subgroup reference sequences. The introduced method gives new insight into multiple sequence alignment analysis -its result can serve as both taxonomic study and a population reference pan-genome. Two complementary software packages: PangTreeBuild and PangTreeVis enable affinity tree construction and visualization.