Phylogenomic incongruence in Ceratocystis: a clue to speciation?

BACKGROUND
The taxonomic history of Ceratocystis, a genus in the Ceratocystidaceae, has been beset with questions and debate. This is due to many of the commonly used species recognition concepts (e.g., morphological and biological species concepts) providing different bases for interpretation of taxonomic boundaries. Species delineation in Ceratocystis primarily relied on genealogical concordance phylogenetic species recognition (GCPSR) using multiple standard molecular markers.


RESULTS
Questions have arisen regarding the utility of these markers e.g., ITS, BT and TEF1-α due to evidence of intragenomic variation in the ITS, as well as genealogical incongruence, especially for isolates residing in a group referred to as the Latin-American clade (LAC) of the species. This study applied a phylogenomics approach to investigate the extent of phylogenetic incongruence in Ceratocystis. Phylogenomic analyses of a total of 1121 shared BUSCO genes revealed widespread incongruence within Ceratocystis, particularly within the LAC, which was typified by three equally represented topologies. Comparative analyses of the individual gene trees revealed evolutionary patterns indicative of hybridization. The maximum likelihood phylogenetic tree generated from the concatenated dataset comprised of 1069 shared BUSCO genes provided improved phylogenetic resolution suggesting the need for multiple gene markers in the phylogeny of Ceratocystis.


CONCLUSION
The incongruence observed among single gene phylogenies in this study call into question the utility of single or a few molecular markers for species delineation. Although this study provides evidence of interspecific hybridization, the role of hybridization as the source of discordance will require further research because the results could also be explained by high levels of shared ancestral polymorphism in this recently diverged lineage. This study also highlights the utility of BUSCO genes as a set of multiple orthologous genes for phylogenomic studies.


Background
Delineation of species boundaries is a complex and highly contentious topic among evolutionary biologists. Ideally, a species should be defined as representing a single lineage that maintains its identity from others, with its own evolutionary tendencies and historical fate [1]. In fungi, species recognition is generally based on three commonly applied concepts i.e., the Biological Species Concept (BSC), the Morphological Species Concept (MSC) and the Phylogenetic Species Concept (PSC) [2,3]. Typically, species are recognised based on the application of systematic characters to reliably distinguish all individuals belonging to a defined group or lineage. MSC and BSC are trait-based and species are grouped using visibly measurable traits such as morphology or reproductive compatibility [4]. PSC differs from MSC and BSC in that it makes use of conservation in DNA sequences to represent shared ancestry [4].
Species delineation is the taxonomic practice that is used to describe an organism in relation to others [5]. Species boundaries defined using BSC, PSC and MSC in fungal systematics are challenged by recurrent inconsistencies [4,5]. For example, in the case of the BSC and MSC, species numbers could be underestimated due to the extended time periods for changes in morphology or mating compatibility to become evident [2]. PSC determines species boundaries objectively by measuring DNA changes over time [6]. As such, it could be argued that the PSC offers the best possible approach because changes in gene sequences can be easily related to evolutionary time. Trait-based concepts typically lead to ambiguous outcomes due to convergent evolution of morphological traits and where cryptic species are commonly overlooked [2]. In this regard, cryptic speciation is common in, but not limited to, groups that comprise large numbers of species such as the prokaryotes and fungi [5].
Ceratocystis is one of numerous genera that reside in the family Ceratocystidaceae, order Microascales, and class Sordariomycetes [7]. Species in this family include important plant pathogens that cause serious disease, both in agricultural crops and in natural ecosystems [8][9][10]. Application of the PSC for Ceratocystis reveals four geographically defined groups. These include the North American clade (NAC) [11], the Latin American clade (LAC) [12,13], the African clade (AFC) [14,15] and the Asian-Australian clade (AAC) [11,16,17]. Yet problems regarding the taxonomy of Ceratocystis remain prominent. For example, Fourie et al. [18] were not able to distinguish between C. manginecans and C. acaciivora using commonly used molecular markers and reduced these species to synonymy. Similarly, Oliveira et al. [19] could not distinguish among phylogenetic lineages of C. manginecans, C. eucalypticola and C. fimbriata using BSC and consequently regarded these three species as a single taxon represented by multiple distinct genotypes. Other researchers (Harrington et al. and Li et al. [20,21]) suggest that isolates of C. fimbriata, C. manginecans and C. eucalypticola represent a single South American species that has been introduced on different hosts to other continents by humans.
The Internal Transcribed Spacer (ITS) region of ribosomal RNA genes is generally treated as the barcode region used for fungal species identification [22]. It is often used in combination with additional gene regions such as β-tubulin and translation elongation factor 1-α to delineate species utilising Genealogical Concordance Phylogenetic Species Recognition (GCPSR) [2]. But the ITS region, especially when it is used alone, is not considered reliable for species delineation in Ceratocystis [18,19]. This is due to intragenomic variation of multiple ITS gene within individual isolates of Ceratocystis [23,24]. This variation was initially observed in a single C. manginecans isolate (LAC), which included ITS types similar to the ITS of two distinct species [23,25,26] but many other examples have arisen more recently [27,28].
Intragenomic variation in the ITS region has been associated with hybridization [29,30]. Ribosomal genes occur as tandem repeats and the intragenomic copies, or paralogs, are usually conserved due to concerted evolution [31]. The mechanisms responsible for this phenomenon include gene conversion and unequal crossing over [32]. In plants, hybridization leads to the retention of both parental ITS types, homogenization to a single ITS sequence and/or homogenization of elements of each parental ITS type into a single composite sequence [29]. Hybridization was first suggested to occur in Ceratocystis by Engelbrecht and Harrington [12]. A study on Ceratocystis manginecans to elucidate the causes of intragenomic variation in the ITS region demonstrated the effects of unequal crossing over, and potentially gene conversion, to explain the random homogenization toward a specific ITS type in culture [23]. The results suggested that the observed polymorphisms in the ITS region could have originated from a hybridization event.
Phylogenetic incongruence in Ceratocystis, and the presence of multiple ITS types within individual isolates has raised many questions regarding species boundaries in this genus. Phylogenomic analyses have been used to resolve incongruent phylogenetic relationships [33], analyse incongruence of genes and their histories, understand population dynamics and to explore evolutionary patterns acting across the genome [34]. The aim of this study was to use a phylogenomic approach to (i) identify a set of orthologous genes shared across the Ceratocystidaceae (ii) use these genes to identify the extent of discordance among gene trees, (iii) and analyse the alternative topologies within Ceratocystis, specifically within the LAC. The overall objective was to explore the possible role of hybridization and/or introgression that might explain phylogenetic discordance in the group. This approach allowed for a comprehensive species tree estimation using GCPSR with the largest dataset used thus far for this genus. This phylogenomic study made use of the Benchmarking Universal Single-Copy Orthologs tool [BUSCO] method [35] as the basis for ortholog selection.

Genome information
The genomes and genome assembly statistics are summarised in Table 1. Genome sizes in Ceratocystis varied between 27 to 30 Mb. These genomes were of high quality, as shown by their N50 values (Table 1) and genome completeness based on BUSCO analyses ( Table 2). The representative isolates have a broad geographical distribution, including North America, Africa, Europe and South East Asia.
Ortholog selection using BUSCO analysis BUSCO analysis of the 17 Ceratocystidaceae genomes showed high levels of completeness (Table 2) with scores between 97 and 98%. An average of 1409 complete, single-copy BUSCO genes were successfully identified across all genomes. The average number of duplicated BUSCOs was approximately 7.5%, with all genomes showing little fragmentation and low levels of missing genes (± 1%). Orthologs for phylogenomic analysis were selected based on BUSCO genes that were complete, and present in single copy in each genome. A total of 1123 BUSCOs were found to be shared within Ceratocystis. Of these, 1121 BUSCO sequences were retained after curation and considered for phylogenomic analysis. When the outgroup taxa Davidsoniella and Endoconidiophora were used, the total was 1082 BUSCOs with 1069 nucleotide alignments being retained after curation.

Phylogenetic analyses
Functional annotation of the 1082 complete BUSCOs revealed that these genes were predominantly associated with primary cellular functions, including cellular regulation, organization and related key processes (Additional file 1: Figure S1). To determine the phylogenetic relatedness of Ceratocystis spp., initial analyses only included C. smalleyi, C. manginecans, C. albifundus, C. platani, C. fimbriata, and C. eucalypticola. Two maximum likelihood (ML) species trees were generated using curated concatenated amino acid sequence alignments (633,499 aa) and nucleotide alignments (approximately 2.2 Mbp long). These data were obtained from a total of 1121 shared BUSCO genes. The species tree nodes were well supported with bootstrap values of 100% observed in all nodes ( Fig. 1). Incongruence between the amino acid and nucleotide ML species tree topologies was observed between C. manginecans, C. fimbriata and C. eucalypticola. The amino acid ML species tree placed C. fimbriata and C. eucalypticola as a sister clade to C. manginecans (Fig. 1a). In contrast, the nucleotide ML species tree placed C. eucalypticola and C. manginecans as a clade separate from C. fimbriata (Fig. 1b).
Further analysis of incongruence among the 1121 amino acid ML tree set using DensiTree revealed 448 consensus tree topologies present in the tree set (Fig. 2a). Tree topologies showed incongruent branches throughout the dataset, including inconsistencies in the deeper nodes of the tree. MetaTree analysis showed a star-like pattern, with support for four consensus nodes (Additional file 2: Figure S2 A). Although not a complete representation of the number of gene trees supporting each Number of contigs greater than 500 bp topology, the star-tree like pattern illustrated the major incongruence of this dataset. Topologies represented by the four consensus nodes lacked phylogenetic resolution and did not resolve the species relationships. None of the consensus trees resolved C. platani as a distinct lineage, while the two smaller consensus trees either lacked resolution for C. albifundus or showed no resolution across the analysed Ceratocystis spp.
DensiTree analysis of the nucleotide 1121 gene ML tree set showed a reduction in the number of alternative topologies (99) compared to the amino acid dataset (448). Discordance patterns were mostly observed within the C. manginecans, C. fimbriata and C. eucalypticola clade (Fig. 2b). Approximately 73% of the gene trees show incongruence occurring within C. fimbriata, C. manginecans and C. eucalypticola. Despite some incongruence involving C. platani and to a lesser extent C.  The number of Complete Duplicated Genes albifundus (CMW17620), the dataset supported the distinction of these species from C. manginecans and C. fimbriata. Three main topological patterns were evident within the C. manginecans and C. fimbriata lineage ( Fig.  2b and Additional file 3: Figure S3). These topologies were supported by approximately 17% of the ML gene trees. DensiTree analysis further showed that clade probability levels within this group range between 21 and 32%, with the larger percentage supporting the grouping of C. eucalypticola with C. manginecans. Meta-Tree analysis again revealed a star-like topology, but the improved resolution using nucleotide data revealed a greater number of tree clusters (Additional file 2: Figure  S2 B). Although most the consensus trees included C. platani as a part of the incongruent clade, the proportions of support for these consensus trees was masked by other topologies.
To better understand the levels of incongruence seen in the C. manginecans, C. eucalypticola and C. fimbriata clade, an expanded dataset including 5 C. albifundus isolates was analysed. These were specifically used to compare the patterns of incongruence within a well-defined species [37,38]. In addition, outgroups (D. virescens, E. polonica and E. laricicola) were included to root the phylogenetic trees. The final dataset included 17 Ceratocystidaceae isolates used in this study (Table 1). After concatenation and curation of the 1082 BUSCO genes shared among the expanded dataset, we inspected the alignment and removed genes that were not present in all 17 isolates leaving 1069 BUSCO genes. For this analysis only nucleotide data were considered due to the low signal caused by widespread conservation in the amino acid sequences in the initial analysis including only Ceratocystis species. The ML and Bayesian species tree estimation was performed using a concatenated dataset (again approximately 2 Mbp long) including all 1069 shared BUSCO sequences. Both ML and Bayesian species trees showed separation between C. manginecans and C. eucalypticola supporting previous findings [7] (Fig. 3 and Additional file 4). The branch lengths in the C. manginecans lineage were short however, there was evidence to suggest a deeper branching pattern compared to the C. albifundus lineage (Fig. 3).
Incongruence analysis of the nucleotide ML gene tree set of 1069 concatenated BUSCOs shared among the 17 Ceratocystidaceae genomes analysed using DensiTree revealed 977 consensus tree topologies ( Fig. 4a and b). There were several incongruent branches deep within the tree space, showing uncertainty in the divergence patterns of Ceratocystis. The deep branching pattern of the LAC was distinct, but a less uniform pattern was observed towards the terminal nodes. This was especially

Discussion
Several species concepts have recently been applied to determine species boundaries in Ceratocystis [18,19]. Species concepts in the phylogenetics era are however, constantly being challenged. This is particularly true when the regions/markers applied have conflicting signals due to lack of resolution, as seen for highly conserved genes or where there are high levels of ancestral polymorphism. The results of this study call to question the utility of employing small numbers of molecular markers when defining species boundaries.
The ML phylogenetic tree generated using the concatenated nucleotide dataset covering 17 genomes and seven species in this genus and over 1000 loci support the phylogenetic relationships established by the recent taxonomic study for alternative markers in Ceratocystis [18]. Previous studies have failed to differentiate between C. manginecans, C. eucalypticola and C. fimbriata isolates using BSC [19] but the ML phylogenetic tree placed C. fimbriata as a separate lineage from C. manginecans and C. eucalypticola. Results of the present study also suggest that BUSCOs [35], can be helpful in resolving taxonomic questions such as those for Ceratocystis, where commonly used nuclear markers fail to delineate species. Indeed, these BUSCO genes could complement previous efforts to identify molecular markers for delineating Ceratocystis species [18].
ML phylogenies obtained from nucleotide and amino acid datasets revealed incongruence in Ceratocystis. For example, discordance between the species tree topologies was observed among C. manginecans, C. eucalypticola and C. fimbriata. While the amino acid ML phylogenetic tree placed C. fimbriata and C. eucalypticola as a sister clade to C. manginecans, the nucleotide ML species tree placed C. eucalypticola and C. manginecans as a clade separated from C. fimbriata. Similar incongruence was observed between individual nucleotide and amino acid ML gene trees. The results of this study emphasise the importance of analysing a dataset comprised of multiple genes for species delineation [39]. This is particularly relevant for species of Ceratocystis residing in the LAC where the branching pattern is difficult to determine.
The hypothesis that Ceratocystis is a recently diverged lineage was raised in a recent study of Van der Nest et al. [40] where the age of speciation events in the Ceratocystidaceae was estimated. Short branch lengths Fig. 3 Maximum likelihood species phylogeny of the 17 Ceratocystidaceae isolates used in this study. The parameters used in the ML include the GTRGAMMA model of evolution and 1000 bootstrap replicates for branch support estimation. All nodes supporting each species are supported by 100% bootstrap values. Bootstrap for nodes supporting isolates of the same species were below 100% as expected (not shown). Insets A and B are zoomed in images of the C. manginecans and C. albifundus clades respectively separating these lineages as shown by the ML species phylogeny for Ceratocystis especially within the LAC, and the patterns of incongruence observed in this study are characteristics of recently diverged lineages [41]. Notwithstanding our findings, the possibility that the incongruence patterns in Ceratocystis are due to the use of highly conserved genes cannot be excluded. The resolution offered by the BUSCOs, which provide a large sample size of conserved orthologs present in all fungi [35], may not be sufficient, thus complicating the process of species delineation. As a case in point, in our study we were not able to resolve C. platani as a distinct lineage despite using more than 1000 gene loci.
Introgressive hybridisation or shared ancestral polymorphism are the most common biological causes of phylogenetic tree incongruence [42]. Both factors manifest in the same way when assessing tree topologies. There is no reliable way to distinguish between these possibilities, although several have been proposed [43,44]. The results of the present study show incongruence patterns in the LAC group of Ceratocystis, which may be expected in lineages that have undergone introgression. Introgression, or gene flow, is also most common in populations that constantly undergo admixture, or in populations that are in the process of divergence [6]. In a study by Lee et al. [45], an intermediate level of gene flow was reported in populations of C. albifundus. Overall, the results of the present study appear to reflect a situation in Ceratocystis where speciation is occurring and where gene flow will continue until barriers are established through absolute divergence [6].
Closely related species of Ceratocystis such as those related to C. fimbriata display a high level of host specificity. For example, the sweet potato pathogen that defines the genus infects only this host and isolates represent a single globally distributed clone that has recently been designated as a forma specialis of C. fimbriata [46]. Other species such as C. manginecans that also display relatively limited genetic variability have a much wider host range that could have been caused by undetected positive selection. How these should be treated taxonomically has yet to be resolved but this clearly requires an analysis of large populations of isolates, from different hosts and geographic locations. In this regard, species of Ceratocystis provide a useful example to explore species concepts in a fungal lineage that is currently undergoing divergence.
A phylogenomics analysis to resolve a taxonomic question utilises considerably more data than those based on multigene phylogenies. However, despite the larger body of data, this approach failed to resolve the issue as to whether the isolates of Ceratocystis residing in the LAC Fig. 4 DensiTree analysis of phylogenetic trees of 1069 concatenated gene sequences including all 17 isolates analysed in this study. This image illustrates the difference in branching patterns between the well-defined lineage of CALB (C. albifundus) and the more divergent groupings of CEUC-CMAN (C. eucalypticola and C. manginecans) and CFIM (C. fimbriata). a -DensiTree image of all trees drawn with default drawing settings using the 'Closest First' Shuffle. b -DensiTree image of the consensus tree topologies drawn using the star-tree drawing option to illustrate branching patterns of the ML phylogenies. LAC denotes Latin American Clade group are taxonomically robust species. All researchers in the field agree that C. platani is a species distinct from C. fimbriata but these two species have also been reported to be interfertile [12]. In addition, Fourie et al. [47] reported significant transmission ratio distortion in a cross between C. manginecans and C. fimbriata, providing evidence that these two species or populations have been reproductively isolated for some time. Results of the present study show there is greater separation between C. platani than between C. manginecans, C. fimbriata and C. eucalypticola. It is also clear that within a well-defined species such as C. albifundus there is more obvious recombination than is evident than in the LAC.

Conclusions
Phylogenomic analyses of representative species in Ceratocystidaceae revealed widespread incongruence among single gene trees. Our analyses showed evolutionary patterns consistent with those of introgressive hybridization although the similar effects of incomplete lineage sorting could not be completely ruled out and is subject to further studies. The concatenated dataset was able to resolve some of the incongruence suggesting a phylogenomic approach could be necessary for the phylogeny of these species. As such, we recommend that future taxonomic analyses of species in the Ceratocystidaceae should apply a phylogenomic approach incorporating larger populations sampled from different hosts and geographical regions to maximise genetic variability.

Isolates and genome information
Seventeen isolates representing 10 species across three genera of the Ceratocystidaceae [7], were chosen for this study ( Table 1). The collection included 14 isolates representing seven species of Ceratocystis, two representative isolates of Endoconidiophora and a single Davidsoniella isolate. These isolates were selected based on the fact that whole genome sequence data were available for them [40,[48][49][50][51][52][53][54]. The sequences were downloaded from the National Centre for Biotechnology Information database (NCBI; http://www.ncbi.nlm.nih.gov/).

Ortholog selection using BUSCO analysis
Single copy orthologs for phylogenomic analysis were identified using the Benchmarking Universal Single-Copy Orthologs (BUSCOs) tool (BUSCO v1.1b1). The BUSCO genomics tool performs an assessment of genome assembly completeness by quantifying the percentage of core genes present from a specific lineage (Simão et al. [35]). BUSCO was run in genome mode with the fungal lineage dataset. The remaining parameters focused on optimal training during the AUGUSTUS annotation step (−-long), with the gene models of Fusarium graminearum being specified. Fusarium is the phylogenetically closest fungal genus for which these models were available. The shared, complete, single copy BUSCOs were extracted into fasta files using BEDtools v2.26.0 [55,56].

Phylogenetic analyses
Characterization of the shared BUSCO genes across the Ceratocystidaceae was performed using Blast2GO's default pipeline [57][58][59] to identify functions possibly linked to patterns observed in the phylogenomic analysis. Multiple sequence alignments (MSAs) of the BUSCO nucleotide sequences were generated using MAFFT v7 [60,61] and curated using Gblocks v0.91b [62] to remove gaps and to rectify misaligned regions. The curated MSAs were concatenated using FASconCAT-G [63,64], and a Maximum Likelihood (ML) species phylogenetic tree constructed using RAxML [65,66] with 1000 bootstrap replicates performed. Additionally, to confirm the ML species tree topology, a Bayesian species tree was generated with MRBAYES with one million generations [67]. Individual ML gene trees were generated for each BUSCO alignment using RAxML with 100 bootstrap replicates. Substitution models for the nucleotide ML gene trees (individual gene trees and species tree) used the GTR model as a default parameter. DensiTree [68] was used to visualise the incongruent phylogenies by drawing all topologies in a tree set on the same image using transparency to show conflict. MetaTree was used to compare and visualize multiple alternative tree topologies [69]. The gene trees were first transformed in Figtree v1.4.2 (tree.bio.ed.ac.uk/software/figtree/) to make all branch lengths proportional, ensuring overlap for tree topology comparison in DensiTree. Compare2trees was used for comparisons between two alternative topologies [36].
Additional file 1: Figure S1. Pie chart summarizing the biological processes of the shared BUSCO genes in the analysed Ceratocystidaceae. The numbers in brackets represent the number of GO (Gene Ontology) annotations.
Additional file 2: Figure S2. Additional file 3: Figure S3. The three main consensus topologies of the DensiTree analysis of the 1069 nucleotide ML gene trees including all 17 Ceratocystidaceae genomes analysed. Topology 1 representing 17% of all gene trees is coloured in blue, topology 2 representing 16.5% of all ML gene trees is coloured in red, and topology 3 representing 16% of all ML gene trees is coloured in green. See Table 1 in main article for full species names.
Additional file 4: Figure S4. A Bayesian species tree for the Ceratocystidaceae species analysed. The GTR model with gamma distribution and one million generations in two runs were used. A burnin of 25% was applied when summarising the trees. All other parameters were set to default. The average standard deviation of tree splits was zero and the species tree nodes were absolutely supported with posterior probabilities of 1.