Streptomycetaceae genomes have highly variable gene content
We began by assembling the set of all Streptomycetaceae genomes publicly available from NCBI. We grouped proteins into COGs based on reciprocal best BLAST hits having pairwise amino acid identity of at least 60% (Additional file 1: Figure S1). This process resulted in identification of only 157 core genes shared by 701 complete or nearly-complete Streptomycetaceae genomes, as contrasted with 235,145 COGs comprising the pan-genome (see Methods). To explore pan-genome evolution, for each pair of genomes we calculated their protein-coding gene content similarity (the fraction of shared COGs; also known as proteome similarity) using the Jaccard index, a measure of binary similarity. Impressively, proteome similarity between pairs of Streptomycetaceae strains could be as low as 12.3%, with a median of 27.7% shared COGs. Even strains assigned the same species name had as little as 24.2% average gene content similarity and a median similarity of just 81.2%. The existence of divergent proteomes sharing the same species name justifies ongoing efforts to reclassify Streptomyces strains [33, 53, 54].
Correlations with conventional species similarity criteria
A common approach to delimit species of bacteria is to use a cutoff of > 97% sequence similarity in 16S rRNA gene sequence. We found 16S rRNA similarity to be a poor predictor of proteome similarity (R2 = 0.38). For example, strains with identical 16S sequences varied between 39 and 100% in proteome similarity (Fig. 1). The low correlation between 16S and proteome similarity is likely a result of the limited size of the 16S rRNA gene and its slow rate of sequence evolution relative to the rate at which auxiliary genes are gained or lost. The poor predictive value of 16S sequences corroborates previous reports that, regardless of threshold, 16S rRNA similarity cannot be employed to reliably identify species [13, 16, 55,56,57,58], although it is still widely used for this purpose (e.g., [59]).
The rpoB gene has been proposed as a better species-level phylogenetic marker for Streptomyces because it exhibits a strong correlation with the results of multi-locus sequence typing [60]. Similarity in rpoB sequence showed good correlation (R2 = 0.75) with proteome similarity, but provided little resolution above 75% proteome similarity, where rpoB sequences are nearly identical (Fig. 1). This strong correlation was impressive, given that rpoB is a single gene and is believed to undergo homologous recombination between strains [19]. As might be expected, two genome-wide measures of similarity, ANI and AF [52], also displayed strong correlations with proteome similarity (R2 = 0.87 and 0.80, respectively).
Pan-genome content shows signals of both horizontal and vertical inheritance
We next sought to determine whether the core- and pan-genomes shared the same predominant genealogical history, as has been demonstrated in studies of other bacterial groups [61, 62]. To accomplish this, we built trees from these non-overlapping data partitions using distinct tree-building methods. For the core-genome, we computed a maximum likelihood tree based on a concatenated alignment of the 157 core genes. For the pan-genome, we constructed a UPGMA tree from the matrix of gene content similarities. Figure 2 shows that these two trees differed in the resolution of basal nodes but were nonetheless highly congruent, especially in their branching order within smaller clades, as seen in studies of some other bacterial groups [63, 64]. The RF distance between the two trees is 598, much lower than the distance expected for random trees (~ 1396); the probability of two random trees being this similar by chance is 10− 1230. The remarkable congruence between the two trees strongly suggests that both have been shaped by the same, predominantly tree-like, genealogical history.
Genes that were not included in the core-genome must have experienced at least one gain or loss event within the Streptomycetaceae. Although Fitch parsimony is an imperfect method, given its prior assumption that gains and losses are equally likely [65], it does provide a minimum estimate of the total number of gene gains/losses. Mapping the pan-genome onto the core-genome tree using Fitch parsimony [66] requires 1,113,832 changes, of which 850,208 are gains and 263,624 are loses (under ACCTRAN optimization). Since there are 1,399 branches on the tree, this implies an average of about 608 gene gains and about 188 gene losses per branch, consistent with prior findings that there is a considerable amount of horizontal gene transfer within Streptomyces [19, 35]. Nonetheless, the fact that the core- and pan-genome trees agree significantly despite such a high frequency of HGT (and other causes of discordance), suggests that HGT usually involves a diversity of donor and recipient lineages, which allows an overall signal of vertical inheritance to prevail.
Exclusive taxa exist despite HGT
Species, like other ranks in the taxonomic hierarchy, are expected to meet the requirement of exclusivity, in which group members are more closely related to each other than they are to any non-group members [27, 67]. Given the shared genealogical history of the core- and pan-genome, we hypothesized that exclusive groups of bacteria may exist. Previously, the assessment of exclusivity has been based on the concordance factor: the proportion of the genome for which a given clade applies [27, 29]. The concordance factor treats all gene trees that lack a clade equally, regardless of how distantly related a group of tips are for each gene. Additionally, there is no specific concordance factor at which a group becomes exclusive: a group’s exclusivity depends on its concordance factor being greater than that of any conflicting group, which cannot be determined without checking the concordance factor of all clades with overlapping content. Here we developed a new approach based on the patristic (i.e., cophenetic) distances between each pair of strains. Since patristic distances differed across core-genes (mean R2 = 0.44), we averaged the patristic distances across all genes in the core-genome. This allowed us to quantify the degree of exclusivity as the minimum out-group patristic distance minus the maximum in-group patristic distance. Clades with a positive exclusivity score are exclusive, with higher scores indicating a greater degree of exclusivity.
Next, we compared the degree of exclusivity across three alternative approaches for defining groups, all while using the same matrix of average patristic distances to calculate exclusivity. First, we defined clades based on the pan-genome (gene content) tree. This was the most conservative approach because the data partition and analysis method used to delineate clades is distinct from that used to quantify exclusivity. For these data, exclusivity scores tended to be greater for clades with higher concordance factors, although larger clades were occasionally exclusive despite low concordance (Additional file 1: Figure S2). As might be expected, the exclusivity scores of internal edges were correlated with edge length on the pan-genome tree, with longer edges generally denoting clades having greater exclusivity (Additional file 1: Figure S3). In total, 358/699 (51.2%) of the internal edges on the pan-genome tree subtend exclusive taxa (as determined from the core-genome), and all but 41 of the 701 strains belong to at least one non-trivial (i.e., including > 1 strain) exclusive taxon (Fig. 3). The 41 strains that are not part of any multiply-sampled exclusive group all lack a close relative in the dataset, and therefore could reasonably be considered exclusive taxa by themselves (i.e., singletons). Second, we calculated exclusivity on the tree derived from a concatenated alignment of the 157 core genes, finding that 64.2% of clades were exclusive. Third, we used UPGMA to infer a tree based on the average patristic distances themselves, finding that 71.5% of clades on this tree were exclusive. Taken together, these three approaches demonstrate that exclusive clades are abundant among Streptomycetaceae genomes.
Exclusivity is potentially contingent upon the set of organisms included in a study, since groups can lose their exclusivity if a new strain is discovered that is closely related to a subset of organisms in two non-nested exclusive groups. Therefore, to assess whether the large number of exclusive taxa we identified could be an artifact of under-sampling the total population of Streptomycetaceae, we analyzed increasingly larger random subsets of the 701 genomes, using the original matrix of average patristic distances to recalculate exclusivity scores for clades in a re-computed pan-genome tree for each subsample of genomes. Whether considering the average exclusivity score or the proportion of clades that are exclusive, exclusivity quickly converges as more genomes are sampled (Fig. 4), plateauing at around 100 genomes. These findings suggest that further increasing the sampling of genomes would continue to add additional exclusive clades at the same rate that exclusive clades were disrupted, implying that the observed degree of hierarchical structure is not simply an artifact of under-sampling.
Absence of a clear transition at which to define the species rank
Having identified exclusive groups, we wished to explore whether there is an objective basis for treating some exclusive taxa as species. The fact that existing species names often match clades with relatedness above a certain threshold has previously been used to argue for an objective species boundary [13, 52]. However, this reasoning is suspect because defining a cutoff for the species rank based on what has previously been considered a species amounts to circular reasoning. To explore the objectivity of the species rank, we tested the hypothesis that there might be a sharp transition in the degree of exclusivity as a function of clade depth (i.e., increasing maximum in-group distance). Baum and Shaw [67] hypothesized that exclusivity would not apply within species but would emerge as one considered progressively more inclusive clades at or above the species rank. As can be seen in Fig. 5, this hypothesis is incorrect. Instead of an increase in exclusivity these data show a gradual decline in the degree of exclusivity with increasing in-group distance. Furthermore, there is no sudden step in the exclusivity function and, thus, no reason to believe that the species rank can be associated with an abrupt change in patterns of relationship. Therefore, while exclusive taxa are real, there does not appear to be any objective criterion based on genealogical concordance for denoting the species rank within bacterial genera.
Exclusive taxa can be ranked as species using sequence similarity cutoffs
In bacterial systematics, it has been common to associate species with a threshold of sequence similarity, often based on a single reference gene region (i.e., rDNA or rpoB). More recently, it has been argued to use a combination of both a genome-wide ANI of ≥96.5% and an AF of ≥60% [52, 68]. Here, we applied this combined ANI/AF criterion, equating species with the largest exclusive groups whose members satisfy the ANI/AF threshold with respect to one another (treating singleton tips as trivially exclusive). As above, we used a matrix of average patristic distances to calculate exclusivity and a UPGMA tree constructed from this matrix to define clades (Fig. 6). The ANI/AF threshold was applied in two ways: complete-linkage (members of a species meet the criteria with respect to all other members) and single-linkage (each member of a species meets the criteria with respect to at least one other member). Among the 500 exclusive groups detected, the complete- and single-linkage criteria assigned 374 and 372 to the species rank, respectively, of which 250 and 249 are singletons. The largest number of genomes assigned to a single species was 38. The large proportion of singleton species reflects the stringency of the joint ANI/AF criterion and the fact that the genus Streptomyces is evolutionarily ancient [35]. Notably, the addition of an exclusivity criterion does not result in over-splitting of species: among all pairs of genomes that meet the ANI/AF threshold, only 0.8% and 0.4% (for complete-linkage and single-linkage, respectively) were split into different species by considering exclusivity.
Similar results are observed in Bacillus
To determine whether our conclusions were applicable beyond Streptomycetaceae, we conducted the same set of analyses on a set of 1,586 genomes belonging to Bacillus. This resulted in a matrix of 140,300 COGs shared by at least two strains and 155 core-genes shared by all sampled Bacillus genomes. In comparison to Streptomyces, Bacillus was more redundantly sampled, with four named species being represented by more than 100 genomes each. Nonetheless, analyses of these Bacillus genomes yielded similar results to Streptomycetaceae (Fig. 5, Additional file 1: Figures S2-S3). Again, we observed substantial congruence between trees constructed from the core-genome and pan-genome data partitions (Additional file 1: Figure S4) and the probability of two random trees being this similar by chance is vanishingly small (10− 1712). On the pan-genome tree, 447 clades (28.2%) were exclusive according to the matrix of average core-genome patristic distances. Similarly, a UPGMA tree based on average patristic distances had 922 (58.2%) exclusive clades. These lower percentages (than Streptomycetaceae) mainly reflect a lack of exclusive groups in densely sampled clades of closely related genomes. Only 4 tips were not part of any exclusive group and, as was the case for Streptomycetaceae, all of these tips could be considered singleton species due to their substantial distance from any other genome in the set.
As with the Streptomycetaceae, we applied automatic delimitation of species in Bacillus based on exclusivity using the core-genome patristic distances combined with the same joint ANI/AF threshold. This approach identified 330 species with complete-linkage and 219 with single-linkage, of which 168 and 107 were singletons (Additional file 1: Figure S5). The average number of genomes per species for the single-linkage case in Bacillus (7.2) is higher than Streptomycetaceae (1.9) reflecting the existence of a higher average pairwise similarity among Bacillus strains, especially in the most densely sampled clade (Additional file 1: Figure S5, right side). Furthermore, the Bacillus data set includes many more pairs of potential conspecifics (i.e., strains whose similarity is greater than the ANI/AF threshold) that were split into separate species by the exclusivity criterion: 58.5% and 20.5% by complete-linkage and single-linkage, respectively. This discrepancy reflects the relatively dense sampling of closely related Bacillus genomes, which results in a large number of clades with exclusivity scores very close to zero. Indeed, if we permit species groups with exclusivity scores slightly less than zero (> − 0.01), we find that only 0.8% of potential conspecifics are assigned to different species by single-linkage. This looser definition of exclusivity prevents minor discrepancies in the distance matrix from breaking exclusivity, and resembles relaxed definitions of clades and the core-genome that have been applied in other studies [62, 69]. Notwithstanding these differences between Streptomycetaceae and Bacillus, our analyses suggest that exclusive taxa are widespread among bacteria and illustrate that our approach for automatically delimiting species groups is generalizable.