 Research
 Open Access
 Published:
A program to compute the soft Robinson–Foulds distance between phylogenetic networks
BMC Genomics volume 18, Article number: 111 (2017)
Abstract
Background
Over the past two decades, phylogenetic networks have been studied to model reticulate evolutionary events. The relationships among phylogenetic networks, phylogenetic trees and clusters serve as the basis for reconstruction and comparison of phylogenetic networks. To understand these relationships, two problems are raised: the tree containment problem, which asks whether a phylogenetic tree is displayed in a phylogenetic network, and the cluster containment problem, which asks whether a cluster is represented at a node in a phylogenetic network. Both the problems are NPcomplete.
Results
A fast exponentialtime algorithm for the cluster containment problem on arbitrary networks is developed and implemented in C. The resulting program is further extended into a computer program for fast computation of the Soft Robinson–Foulds distance between phylogenetic networks.
Conclusions
Two computer programs are developed for facilitating reconstruction and validation of phylogenetic network models in evolutionary and comparative genomics. Our simulation tests indicated that they are fast enough for use in practice. Additionally, the distribution of the Soft Robinson–Foulds distance between phylogenetic networks is demonstrated to be unlikely normal by our simulation data.
Background
Since Darwin’s The Origin of Species, the evolutionary history of life has been widely depicted as phylogenetic trees. However, the simplified treelike evolutionary models are being challenged by the accumulating amount of evidence of lateral genetic transfer between lineages, particularly in prokaryotes [1–3]. Additionally, other reticulate evolutionary events also cause complications in constructing treelike models, such as hybridization and introgression between species [4, 5], and recombination of various forms [6]. The recognized limitations of phylogenetic trees motivated the adoption of phylogenetic networks to model these reticulation events [7, 8]. Phylogenetic networks can be used to either visualize conflicting phylogenetic information or model reticulation events explicitly. The former are typically unrooted, whereas the latter are rooted, which is the focus of this study. In recent years, phylogenetic networks have been the subject of intensive theoretic studies [9–12]. However, considerable challenges in reconstructing phylogenetic networks still exist [13].
In a phylogenetic tree, the taxa below a node form a unique subset of the taxa, called its cluster. A phylogenetic tree is uniquely determined by the set of “nested" clusters in the tree (see, for example, [10]).
A phylogenetic network is a generalization of a phylogenetic tree in which there are additional reticulation nodes, which are the nodes with an indegree of at least two. Since most gene families have treelike evolutionary histories, the network model of the evolution of a set of genomes is often built and validated by checking its consistency with the available related gene trees and/or clusters [10].
In a phylogenetic network, a nonreticulation node is called a tree node. Each tree node represents a cluster and a set of soft clusters. Similar to the case of phylogenetic tree, a node’s cluster consists of all taxa below it, whereas its soft clusters are the clusters represented by this node in the phylogenetic trees that are displayed in the network. Here, a phylogenetic tree is said to be displayed in a phylogenetic network if it can be obtained by deleting all but one incoming edges from each reticulation node and then contracting all the nodes of degree two.
The tree containment problem (TCP) and the cluster containment problem (CCP) have arisen from reconstructions of phylogenetic networks [14]. The TCP asks whether a phylogenetic tree is displayed in a phylogenetic network. The CCP asks whether a cluster is a soft cluster of some tree node in a phylogenetic network. Both the TCP and CCP are NPcomplete [10, 14], even for restricted networks [15].
A polynomialtime algorithm for the CCP is given for reticulationvisible networks [10]. A network is reticulationvisible if, for each reticulation node, a leaf exists such that every path from the network root to the leaf contains this reticulation node. Recently, a lineartime algorithm is presented for the CCP on this class of networks [16]. Given that a large fraction of phylogenetic networks are not reticulationvisible [17], however, it is necessary to develop an algorithm for the CCP for arbitrary networks for the following reason.
Measuring the dissimilarity between phylogenetic networks is important for assessing a network reconstruction method. One of the metric functions that has been proposed for this purpose is the Robinson–Foulds (RF) distance, which is a generalization of the same metric for phylogenetic trees. Simply put, it is the half of the cardinality of the symmetric difference of the two sets of clusters respectively contained in the two networks [18]. It takes lineartime to compute the RF distance between phylogenetic networks [10].
By replacing clusters with soft clusters, we obtain the Soft Robinson–Foulds (SRF) distance [10]. Since the CCP is NPcomplete, there is unlikely a polynomialtime algorithm for computing the SRF distance. To the best knowledge of the authors, only a straightforward method has been implemented in the software Dendroscope [19]. The method exhaustively searches the clusters that are in a phylogenetic tree displayed in one network but are not in any phylogenetic tree displayed in another.
Recently, Gunawan et al. [20] developed a computer program for solving the TCP on arbitrary networks. Although it has exponentialtime complexity in the worst case, it runs fast enough to be used in practice.
Here, we first develop an algorithm for the CCP by using the decomposition theorem in [16]. We then extend it into an algorithm for computing the SRF distance. We implemented these two algorithms in C and tested them on empirical and simulated network datasets. As an application of the programs, we examined the differences of networks reconstructed for two datasets in the literature. We also conducted a preliminary study of the distributions of the RF and SRF distances in the phylogenetic network space.
Methods
We first introduce the basic concepts and notation, then recap the decomposition technique for arbitrary phylogenetic networks, and finally describe the algorithms for the CCP and the SRF distance.
Concepts and notation
Let X be a set of taxa. A rooted phylogenetic network (network for short) over X is an acyclic digraph in which the leaves (i.e., nodes of outdegree zero) are bijectively mapped to X. A taxon typically represents some extant organism or species. A network has a unique root (of indegree zero).
There can be two types of internal nodes in a network: tree nodes, which include the root and nodes of indegree one and outdegree of at least one, and reticulation nodes, which have outdegree one and indegree of at least two. The tree nodes represent speciation events and the reticulation nodes represent reticulation events. We allow degreetwo nodes in a network.
Here, we use the following notation for a network N:

T(N): the set of tree nodes in N.

L(N): the set of leaves in N.

R(N): the set of reticulation nodes in N.

V(N): the set of all nodes in N, namely T(N)∪L(N)∪R(N).

E(N): the set of edges in N.

ρ(N): the root of N.

N−E: the subnetwork (V(N),E(N)∖E) for a subset E⊆E(N).

N−S: the subnetwork (V(N)∖V(S),E ^{′}), where E ^{′}={(x,y)∈E(N)  {x,y}⊆V(N)∖V(S)} for a subnetwork S of N.
For u,v∈V(N), u is a parent of v and v is a child of u if (u,v)∈E(N). We use c(r) to denote the unique child of r∈R(N). If there is a direct path from u to v, v is called a descendant of u.
We use [ r]_{ N } to denote the subnetwork below r∈V(N), which consists of all the descendants of r and the edges between them in N. For a leaf ℓ below r, we use N−[ r]_{ N }+ℓ to denote the subnetwork obtained by replacing [ r]_{ N } with ℓ so that ℓ becomes the child of r.
If each reticulation node in a network has exactly two parents, the network is bicombining. A bicombining network is binary if each tree node is of outdegree two. A phylogenetic tree is a binary network without reticulation nodes. If the unique child of each reticulation node in a network is a tree node or a leaf, this network is called reduced.
Following Gunawan et al. [16], we allow a network to have dummy nodes (i.e., unlabelled nodes of outdegree zero) because such a network may be generated in a recursive step of our algorithms.
Given the set of taxa X, a cluster is any proper subset of X (excluding the empty set and the full set). A cluster is trivial if it contains only one element.
In a phylogenetic tree T over X, each nonroot node induces a unique set of taxa that consists of the labels of the leaves below the node, which is called the cluster of the node. A cluster is in T if it is the cluster of some node in T.
Given a network N over X and a phylogenetic tree T over X, we say that T is displayed in N if T can be obtained from N by the following operations: removing all but one incoming edges for each reticulation node in N, removing nodes that are not in any path from ρ(N) to a leaf ℓ∈X, and contracting degreetwo nodes (i.e., nodes of indegree one and outdegree one). To contract a degreetwo node w which has two incident edges (u,w) and (w,v), we merge the two edges into one edge (u,v).
A cluster B⊂X is a soft cluster in N if there is a tree T displayed in N such that B is a cluster in T. A tree node in a network may represent multiple soft clusters, which could be obtained from different trees displayed in the network. We use S C(N) to denote the set of soft clusters in N.
Given B⊂X and a network T on X, the CCP asks whether B is a soft cluster in N [10], which is formulated as below:

CLUSTER CONTAINMENT

Instance: A phylogenetic network N over a set of taxa X and B⊂X.

Question: Is B∈S C(N)?
Let N _{1} and N _{2} be two networks over the same set of taxa X. The SRF distance between them is defined to be (S C(N _{1})∖S C(N _{2})+S C(N _{2})S C(N _{1}))/2 denoted by d _{ SRF }(N _{1},N _{2}).
It is worth noting that the SRF distance is not a strict metric, since two distinct networks may represent the same set of soft clusters and hence the SRF distance between them will be zero [10]. Nevertheless, the SRF distance provides a useful measure of network dissimilarity.
A decomposition theorem
The key to solving the CCP and computing the SRF distance is the decomposition theorem, which was first proposed by Gunawan et al. [16] for reticulationvisible networks and used later for arbitrary networks in [20].
The decomposition theorem says that an arbitrary network N can be decomposed into a set of connected tree components which are separated by reticulation nodes. Specifically, there is a tree component C _{ r } for each r∈R(N)∪{ρ(N)}, which is either {c(r)} if r∈R(N) and c(r)∈R(N), or a subtree induced by all the tree nodes and leaves that are reachable from r. A tree component is trivial if it contains only one leaf or if it is empty (for a dummy reticulation node).
A node is visible on a leaf ℓ if it lies on all the paths from ρ(N) to ℓ. If a node r∈R(N)∪{ρ(N)} is visible on a leaf ℓ, its tree component C _{ r } is visible on ℓ as well. Given two tree components \(C_{r'}\phantom {\dot {i}\!}\) and \(\phantom {\dot {i}\!}C_{r^{\prime \prime }}\), r ^{′} and \(\phantom {\dot {i}\!}C_{r^{\prime }}\) are right below \(\phantom {\dot {i}\!}C_{r''}\) if a parent of r ^{′} is in \(\phantom {\dot {i}\!}C_{r^{\prime \prime }}\). A tree component is exposed if it contains only one leaf or if all the tree components right below it are trivial.
Obviously, N contains at least one exposed nontrivial tree component. In addition, an exposed tree component C _{ r } is visible if and only if C _{ r } contains a leaf or if a reticulation node r ^{′} exists right below C _{ r } such that all the parents of r ^{′} are in C _{ r }.
These concepts are briefly illustrated in Fig. 1. See [16, 20] for more details of the decomposition theorem.
Description of the algorithm
The CCP algorithm
With the aid of the generalized decomposition theorem, we extend the lineartime CCP algorithm for reticulationvisible networks in [16] to arbitrary networks.
Roughly speaking, our new CCP algorithm works as follows:
To determine whether or not a cluster C is in a phylogenetic network N, the algorithm selects a nontrivial exposed component M of N. If M is visible, we either find the negative answer to the problem by working on M or we obtain an instance of the problem that is simpler than the input instance (C,N) in linear time proportional to the size of M. In the latter, we reduce the original instance of the CCP to a simpler instance.
If M is not visible, there is then a reticulation node which has a unique leaf child and does not have all parents in M. In this case, two phylogenetic networks N _{1} and N _{2} are derived from N, which contain fewer nodes than N. The algorithm is then called on both instances ( C,N _{1} ) and ( C,N _{2} ) recursively.
Although this algorithm seems simple, it has significantly less time complexity when the input network is binary. In the rest of this section, we present a formal description of the algorithm.
Let N be a network over X and B⊂X, respectively. We examine a nontrivial exposed tree component C _{ r } of N.
The reticulation nodes below C _{ r } are divided into innerreticulation nodes for which the parents are all in C _{ r }, and crossreticulation nodes for which some parents are not in C _{ r }. We use I R(C _{ r }) and C R(C _{ r }) to denote the sets of inner and cross reticulation nodes, respectively. For example, in Fig. 1, I R(C _{r5})=∅ and C R(C _{r5})={r4,r7,r8}.
We use L _{ r } to denote the set of leaves on which C _{ r } is visible:
We use \(\check {L}_{r}\) to denote the set of leaves below C _{ r } which are in B and on which C _{ r } is not visible:
For example, in Fig. 1, L _{r5}=∅ and we can get \(\check {L}_{\text {r5}}=\{\text {leaf1}, \text {leaf2}\}\) when assuming B={leaf1,leaf2,leaf5}.
Suppose that L _{ r } is nonempty. C _{ r } is then visible with respect to a leaf ℓ∈L _{ r }. We first check whether B is a soft cluster in C _{ r }. This can be solved by a lineartime algorithm [16]. If not, we then solve the CCP according to the relationship between L _{ r } and B.
Let \(\bar {B}=X \backslash B\). If L _{ r }∩B≠∅ and \( L_{r} \cap \bar {B} \neq \emptyset \), B must be a soft cluster of a node in C _{ r } if B is a soft cluster in N [16].
If \(L_{r} \cap \bar {B} = \emptyset \), B may be a soft cluster of ρ(C _{ r }) or a node in a larger subnetwork containing C _{ r }. Assuming that r ^{′}∈C R(C _{ r }), we then define:
The leaves below the root of C _{ r } in N _{ a } (i.e., \(\phantom {\dot {i}\!}L([\rho (C_{r})]_{N_{a}})\)) are then \(\phantom {\dot {i}\!}L_{r} \cup \check {L}_{r}\). We denote \(\phantom {\dot {i}\!}L([\rho (C_{r})]_{N_{a}})\) as \(\hat {B}\) for convenience.
Since L _{ r }⊆B and \(\check {L}_{r} \subseteq B\), \(\hat {B} \subseteq B\). If \(\hat {B} = B\), B is a soft cluster of ρ(C _{ r }) in N _{ a }. Otherwise, if \(\hat {B} \subset B\), we set:
If L _{ r }∩B=∅, B may be a soft cluster of a node in C _{ r } if \(\check {L}_{r} \neq \emptyset \). Otherwise, when B is not a soft cluster of a node in C _{ r } and r ^{′}∈C R(C _{ r }), we define:
We can then set:
With this notation, we can get Theorem 1 for arbitrary networks, which is similar to a theorem proved for reticulationvisible networks in [16]. Theorem 1 is proved in the Additional file 1.
Theorem 1
Assume that C _{ r } is a nontrivial, exposed and visible tree component in a network N over the taxa set X, and that B⊂X. Let L _{ r }, \(\hat {B}\), B ^{′}, N a′, and N b′ be defined above.

(i) If \(\hat {B} \subset B\), B is a soft cluster in N if and only if B ^{′} is a soft cluster in N a′.

(ii) If B is not a soft cluster of a node in C _{ r } and L _{ r }∩B=∅, B is a soft cluster in N if and only if B is a soft cluster in N b′.
Suppose that C _{ r } is not visible. If C _{ r }≠{c(r)}, there is at least one reticulation node r ^{′} right below C _{ r } such that \(\phantom {\dot {i}\!}C_{r'}\) is trivial and at least one parent of r ^{′} is not in C _{ r }. If C _{ r }={c(r)} and \(c(r)=r'\phantom {\dot {i}\!}\), then at least one parent of r ^{′} is not r. We can now define:
and
Clearly, B is a soft cluster in N if and only if B is a soft cluster in either N ^{′} or N ^{″}.
In consideration of all the cases above, we have come up with an algorithm for solving the CCP on an arbitrary network, which is given in Fig. 2.
The SRF distance algorithm
We now use the CCP algorithm to compute the SRF distance between two arbitrary networks on the same taxa set X.
For X, we define a kcluster as a cluster having k taxa. We enumerate all the possible clusters over X by generating all the kclusters of X for each k ranging from 1 to X−1. We then call the CCP algorithm on each cluster to see whether it is a soft cluster in only one network.
The time complexity of this SRF distance algorithm is O(2^{L(N)} T(N)), where T(N) is the time complexity of the CCP algorithm.
The program for computing the SRF distance in Dendroscope first finds trees displayed in each network, then extracts clusters from these trees to get the soft clusters in each network, and finally traverses the two sets of soft clusters to compute their symmetric difference. If the networks are bicombining, the time complexity for this method is O(2L(N)∗2^{R(N)}+2q), where q is the number of the soft clusters in a network. We will compare this approach and our SRF distance program in next section.
Results and discussion
Performance of the CCP program
In this subsection, we first analyze the time complexity of the CCP algorithm. We then report the performance of the CCP program on both simulated and empirical networks. The simulated networks were generated by using a network generator reported by Zhang [17].
Theoretical analysis of the time complexity
According to the analysis in [16], the runtime of Step 3 of the CCP algorithm is O(E(C _{ r })), where E(C _{ r }) is the set of edges in the tree component C _{ r }. Thus the time complexity of the CCP algorithm is O((m+1)E(N)) for a general network N, where m is the number of times Step 3 is executed. Note that m should be an exponential function of R(N) because of the NPcompleteness of the CCP. If N is a bicombining reduced network, the time complexity of the CCP algorithm is (2^{0.694R(N)}E(N)) [20].
We denote log2(m) as b(N,B) and call it the effective reticulation number of the CCP algorithm for the network N and the cluster B [20]. We use b(N)= maxB b(N,B) to represent the effective reticulation number of the CCP algorithm for the network N.
To the best of our knowledge, the only previously known algorithm for solving the CCP on an arbitrary network is the naive algorithm which considers all the trees displayed in a network and checks whether the input cluster is in one of them. The number of possible trees displayed in a network N can be as large as \(\prod _{r\in R(N)} deg^{}(r)\), where d e g ^{−}(r) is the indegree of r. This number equals 2^{R(N)} when N is bicombining. It takes O(L(T)) time to check whether a cluster is in a tree T [10]. Thus the effective reticulation number seems to be a good indicator of the efficiency of the CCP algorithm. If log2(m) is smaller than R(N), the CCP algorithm will be faster than the naive algorithm in theory.
Performance on random networks
We examined the performance of the CCP program on random networks in term of the effective reticulation number. The tests were done on computers each with 32 GB RAM and a 2.1 GHz AMD Opteron 32core CPU.
We tested the CCP program on random networks with 10 to 30 leaves and 10 to 80 reticulation nodes. Given that the number of clusters over 15 leaves is huge, it was hard to conduct the evaluation on the whole space of clusters. We therefore generated random clusters for testing on networks with more than 15 leaves. According to the results, the effective reticulation number for each network–cluster pair was frequently smaller than half the number of reticulation nodes in the network.
Here, we report the performance of the CCP program on five groups of networks with 10 leaves and all the possible 1022 (=2^{10}−2) clusters. Each group contained 20 networks, and the networks in the k ^{th} group had 5(1+k) reticulation nodes for each k from 1 to 5. The wall clock time on 102,200 (=5×20×1022) network–cluster pairs was 15 minutes and 15 seconds, implying that on average, the program took about one centisecond for each network–cluster pair.
Figure 3 shows the percentages of the clusters in the entire cluster space with the same effective reticulation numbers for each network. Several facts were observed from the test. Firstly, the effective reticulation numbers for the networks in each group increase with the number of reticulation nodes. For example, the effective reticulation numbers for most networks are <5 for the first group, whereas the effective reticulation numbers for more than half of the networks are >9 for the last group. Secondly, there are at least three distinct values of effective reticulation numbers for each network and all the clusters, except for five networks. The effective reticulation number of value one appears for all the networks, since it is easy to determine whether the trivial clusters are soft clusters in a network. Thirdly, the highest effective reticulation number 12 only appears for the 12th network in the last group and one cluster, which is barely seen in Fig. 3 because of the extremely low percentage.
Application to a network in the literature
We selected one of the largest networks in the literature to validate the performance of the CCP algorithm. This is a bicombining network (denoted A, Additional file 1: Figure S1) from [21] that has 7 leaves and 32 reticulation nodes. This network is an ancestral recombination graph reconstructed to study the phylogenetic relationships among the M2 doublestranded RNA in the Rhizoctonia species complex.
Our test showed that all the clusters on the seven taxa appear as soft clusters in the network A. We calculated b(A,B) for each cluster B on the seven taxa. The distribution of b(A,B) is shown in Table 1. The effective reticulation number b(A) is 8, 1/4 of the number of reticulation nodes in A. This suggests that the CCP program is about thousands of times as fast as the naive method for this real network.
Performance of the program for the SRF distance on random networks
In this subsection, we first compare the program in Dendroscope and our program for the SRF distance. We then report the performance of a parallel version of our program.
The tests were performed on computers each with 128 GB RAM and a 2.6 GHz Intel Xeon E52690 24core CPU. For the generation of random networks, we considered six cases. In the k ^{th} case, we generated six groups of network pairs. The j ^{th} group consists of 3000 pairs of networks with 4k leaves and k j/4 reticulation nodes, where k was from 1 to 6 and j=1,2,4,5,6.
For the comparison test, we computed the SRF distance for each pair of networks in every group. The results are summarized in Fig. 4.
Our program ran faster than the program in Dendroscope for networks with up to 16 leaves. However, our program became slower than the latter when there were more than 16 leaves. This is reasonable, since the number of clusters increases exponentially with the number of taxa and it takes even long time for our program to merely list all the possible clusters when there were more than 16 leaves.
Additionally, the memory usage of our program was extremely low compared with the program in Dendroscope. The memory usage of the Dendroscope program increased rapidly with the number of reticulation nodes in a network. For example, the average maximum resident memory for networks with 12 leaves and 18 reticulation nodes was around 95 GB, which is approximately six times that for networks with 12 leaves and 15 reticulation nodes. Because of this, the average runtime of the Dendroscope program for networks with 12 leaves and 18 reticulation nodes sharply increased. During test, the Dendroscope program failed to get results for networks with more than 12 leaves and 20 reticulation nodes. Hence, some data points are missing for the Dendroscope program in the two panels at the bottom in Fig. 4. In contrast, our program can run on networks with more than 30 reticulation nodes. Even for networks with 24 leaves and 36 reticulation nodes, the average maximum resident memory of our program was less than 32 MB. Thus the test shows that our program is computationally efficient when the number of reticulation nodes in the input network is large.
Although our program runs slow for networks with many leaves, it can be easily parallelized for speeding up. We used OpenMP to implement a parallel version of it by parallelizing the enumerations of clusters. This parallel version ran at least 20 times faster than the original program with slightly extra memory. For 3000 pairs of networks each with 20 leaves and 25 reticulation nodes, the parallel version finished in about 36 seconds with less than 40 MB memory on average.
Computing the SRF distances on real biological data
In this subsection, we examine the SRF distance between phylogenetic networks reconstructed from two datasets in the literature.
Computing the SRF distance between networks over a set of grass species
The Proaceae dataset, originally from the Grass Phylogeny Working Group [22], has often been used for validating network reconstruction methods. The dataset contains sequences for six loci: ITS (internal transcribed spacer of ribosomal DNA), ndhF (NADH dehydrogenase, subunit F), phyB (phytochrome B), rbcL (ribulose 1,5biphosphate carboxylase/oxygenase, subunit), rpoC (RNA polymerase II, subunit β ^{″}), and waxy (granule bound starch synthase I). Rooted binary gene trees were built for these loci previously by Schmidt [23]. From the six trees, van Iersel et al. [24] constructed 57 subsets of gene trees for comparisons of network reconstruction methods.
A recent method called Hybroscale [25] can compute all the representative networks with the minimum number of reticulation nodes from a set of multiple binary phylogenetic trees. We ran Hybroscale on three subsets of gene trees from the grass dataset, which are on 11, 12, and 15 taxa, respectively (Table 2). The reconstructed networks have less than seven reticulation nodes. Since there are tens of output networks for each input dataset, we computed their pairwise SRF distances to examine their dissimilarity. As shown in Table 2, the average SRF distances between the networks for all the datasets are relatively small, which implies that the computed networks are rather similar.
On the other hand, different network reconstruction methods on the same data could produce very different networks. Using five gene trees (ITS, ndhF, phyB, rbcL, rpoC2), we constructed three networks: a cluster network (Additional file 1: Figure S2) obtained from a program in [26], a galled network (Additional file 1: Figure S3) obtained from a program in [27], and a reticulate network (Additional file 1: Figure S4) obtained from PIRN [28]. Since the original network reconstructed by PIRN had reticulation nodes with more than one child and leaves with more than one parent, it was transformed into an equivalent one satisfying our definition in this paper. The three networks have 18, 7, and 13 reticulation nodes and contain 445, 261, and 209 soft clusters, respectively. The SRF distance between the cluster network and the galled network is 199. The SRF distance between the galled network and the reticulate network is 118. The SRF distance between the reticulate network and the cluster network is 185. This suggests that the galled network is more similar to the reticulate network than to the cluster network. This also reflects that the SRF distance is sensitive to the structural properties of phylogenetic networks.
Computing the SRF distance between networks over six mosquito species
To study phylogenetic relationships and introgression among six mosquito species in the Anopheles gambiae species complex, Fontaine et al. [5] constructed a network (denoted M1) by employing treebased methods on the wholegenome sequences. Later, Wen et al. [29] rebuilt a similar network (denoted M2) for the six species by directly applying a network inference method on the gene trees. The two networks are shown in Additional file 1: Figure S5. M1 has three reticulation nodes and M2 has four reticulation nodes. There are 18 and 24 soft clusters in M1 and M2, respectively. The SRF distance between M1 and M2 is 7, implying that the two networks are still quite different in the embedded soft clusters.
Comparison of the RF distance and the SRF distance
Although the RF and SRF distances were proposed to measure the dissimilarity of networks, their relationship is unclear [10]. In this subsection, we present our preliminary comparison of these two measures.
Given a fixed number of leaves and reticulation nodes, we generated 100,000 random network pairs and computed their RF and SRF distances. Figure 5 shows the distributions of these two measures in the space of networks with different numbers of leaves and reticulation nodes. The results suggest the following three facts:

(i.)
There are at least as many soft clusters as clusters in a network. Therefore, as expected, the SRF distance has larger range than the RF distance.

(ii.)
The RF distance seems to have a normal distribution of small mean and small variance.

(iii.)
The distribution of the SRF distances seems not to be normal. It is skewed towards small distances (especially for networks with more leaves) and a small fraction of network pairs had much larger SRF distances than the average SRF distance.
Taken altogether, these three facts indicate that the SRF distance is a fine metric for networks and hence more suitable than the RF distance for measuring the dissimilarity of networks.
Conclusions
The generalized decomposition technique developed in [20] was shown to be powerful for solving the TCP on arbitrary networks. In this work, by applying this technique, we have developed efficient algorithms for solving the CCP and computing the SRF distance for arbitrary networks. These two algorithms were implemented in C.
Both programs facilitate reconstructing and validating network models in evolutionary and comparative genomics. Our simulation experiments showed that the SRF distance program ran fast for networks with an intermediate number of leaves and reticulation nodes. Therefore, the SRF distance program is ready for assessing a network reconstructed by a new method via comparing it with other networks.
References
Gogarten JP, Townsend JP. Horizontal gene transfer, genome innovation and evolution. Nat Rev Microbiol. 2005; 3(9):679–87.
Soucy SM, Huang J, Gogarten JP. Horizontal gene transfer: building the web of life. Nat Rev Genet. 2015; 16(8):472–82.
Doolittle WF, Brunet TDP. What is the tree of life?PLoS Genet. 2016; 12(4):1005912.
Mallet J. Hybrid speciation. Nature. 2007; 446(7133):279–83.
Fontaine MC, Pease JB, Steele A, Waterhouse RM, Neafsey DE, Sharakhov IV, Jiang X, Hall AB, Catteruccia F, Kakani E, et al. Extensive introgression in a malaria vector species complex revealed by phylogenomics. Science. 2015; 347(6217):27–8.
Martin DP, Lemey P, Posada D. Analysing recombination in nucleotide sequences. Mol Ecol Resour. 2011; 11(6):943–55.
Doolittle WF, Bapteste E. Pattern pluralism and the tree of life hypothesis. Proc Nat Acad Sci USA. 2007; 104(7):2043–9.
Bapteste E, van Iersel L, Janke A, Kelchner S, Kelk S, McInerney JO, Morrison DA, Nakhleh L, Steel M, Stougie L, et al. Networks: expanding evolutionary thinking. Trends Genet. 2013; 29(8):439–41.
Huson DH, Bryant D. Application of phylogenetic networks in evolutionary studies. Mol Biol Evol. 2006; 23(2):254–67.
Huson DH, Rupp R, Scornavacca C. Phylogenetic Networks: Concepts, Algorithms and Applications. Cambridge: Cambridge University Press; 2010.
Morrison DA. Introduction to Phylogenetic Networks. Uppsala: RJR Productions; 2011.
Gusfield D. ReCombinatorics: The Algorithmics of Ancestral Recombination Graphs and Explicit Phylogenetic Networks. Cambridge: MIT Press; 2014.
Huber KT, van Iersel L, Moulton V, Wu T. How much information is needed to infer reticulate evolutionary histories?Syst Biol. 2015; 64(1):102–11.
Kanj IA, Nakhleh L, Than C, Xia G. Seeing the trees and their branches in the network is hard. Theor Comput Sci. 2008; 401(1–3):153–64.
van Iersel L, Semple C, Steel M. Locating a tree in a phylogenetic network. Inform Process Lett. 2010; 110(23):1037–43.
Gunawan ADM, DasGupta B, Zhang L. A decomposition theorem and two algorithms for reticulationvisible networks. Inform. Comput. (to be accepted). In: Proc. of RECOMB 2016: 2016. p. 266. The journal version is available online: arXiv:1603.08655.
Zhang L. On treebased phylogenetic networks. J Comp Biol. 2016; 23(7):553–65.
Cardona G, Llabrés M, Rosselló F, Valiente G. Metrics for phylogenetic networks i: Generalizations of the robinsonfoulds metric. IEEEACM Trans Comput Biol Bioinform. 2009; 6(1):46–61.
Huson DH, Scornavacca C. Dendroscope 3: an interactive tool for rooted phylogenetic trees and networks. Syst Biol. 2012; 61(6):1061–7.
Gunawan ADM, Lu B, Zhang L. A program for verification of phylogenetic network models. Bioinformatics. 2016; 32(17):i503–i510.
Charlton ND, Carbone I, Tavantzis SM, Cubeta MA. Phylogenetic relatedness of the M2 doublestranded RNA in Rhizoctonia fungi. Mycologia. 2008; 100(4):555–64.
Grass Phylogeny Working Group. Phylogeny and subfamilial classification of the grasses (Poaceae). Ann Mo Bot Gard. 2001; 88(3):373–457.
Schmidt HA. Phylogenetic trees from large datasets. Düsseldorf: PhD thesis, University of Düsseldorf; 2003.
van Iersel L, Kelk S, Rupp R, Huson D. Phylogenetic networks do not need to be complex: using fewer reticulations to represent conflicting clusters. Bioinformatics. 2010; 26(12):124–31.
Albrecht B. Computing all hybridization networks for multiple binary phylogenetic input trees. BMC Bioinforma. 2015; 16(1):236.
Huson DH, Rupp R. Summarizing multiple gene trees using cluster networks. In: Algorithms in Bioinformatics (WABI), Vol. 5251 of Lecture Notes in Bioinformatics. Berlin: Springer: 2008. p. 296–305.
Huson DH, Rupp R, Berry V, Gambette P, Paul C. Computing galled networks from real data. Bioinformatics. 2009; 25(12):85–93.
Wu Y. Close lower and upper bounds for the minimum reticulate network of multiple phylogenetic trees. Bioinformatics. 2010; 26(12):140–8.
Wen D, Yu Y, Hahn MW, Nakhleh L. Reticulate evolutionary history and extensive introgression in mosquito species revealed by phylogenetic network analysis. Mol Ecol. 2016; 25(11):2361–72.
Acknowledgements
The authors thank Andreas DM Gunawan for discussion on the topic and anonymous reviewers for helpful comments on the first submission of this work.
Funding
The work in this paper and the publication costs were funded by the Singapore Ministry of Education Academic Research Fund MOE2014T21155.
Availability of data and material
The programs and related data are available for download on https://github.com/icelu/PhyloNetwork.
Authors’ contributions
Project design: LXZ. Program implementation and experiments: BXL. Manuscript preparation: BXL, LXZ and HWL. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
About this supplement
This article has been published as part of BMC Genomics Volume 18 Supplement 2, 2017. Selected articles from the 15th Asia Pacific Bioinformatics Conference (APBC 2017): genomics. The full contents of the supplement are available online http://bmcgenomics.biomedcentral.com/articles/supplements/volume18supplement2.
Author information
Authors and Affiliations
Corresponding author
Additional information
From The Fifteenth Asia Pacific Bioinformatics Conference Shenzhen, China. 16–18 January 2017
Additional file
Additional file 1
Supplementary material. Supplementary material contains the proof for Theorem 1 and Supplementary Figures. (PDF 229 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Lu, B., Zhang, L. & Leong, H. A program to compute the soft Robinson–Foulds distance between phylogenetic networks. BMC Genomics 18 (Suppl 2), 111 (2017). https://doi.org/10.1186/s1286401735005
Published:
DOI: https://doi.org/10.1186/s1286401735005
Keywords
 Phylogenetic network
 Cluster containment problem
 Tree containment problem
 (Soft) Robinson–Foulds distance
 Exponentialtime algorithm