Distribution and diversity of dimetal-carboxylate halogenases in cyanobacteria

Background Halogenation is a recurring feature in natural products, especially those from marine organisms. The selectivity with which halogenating enzymes act on their substrates renders halogenases interesting targets for biocatalyst development. Recently, CylC – the first predicted dimetal-carboxylate halogenase to be characterized – was shown to regio- and stereoselectively install a chlorine atom onto an unactivated carbon center during cylindrocyclophane biosynthesis. Homologs of CylC are also found in other characterized cyanobacterial secondary metabolite biosynthetic gene clusters. Due to its novelty in biological catalysis, selectivity and ability to perform C-H activation, this halogenase class is of considerable fundamental and applied interest. The study of CylC-like enzymes will provide insights into substrate scope, mechanism and catalytic partners, and will also enable engineering these biocatalysts for similar or additional C-H activating functions. Still, little is known regarding the diversity and distribution of these enzymes. Results In this study, we used both genome mining and PCR-based screening to explore the genetic diversity of CylC homologs and their distribution in bacteria. While we found non-cyanobacterial homologs of these enzymes to be rare, we identified a large number of genes encoding CylC-like enzymes in publicly available cyanobacterial genomes and in our in-house culture collection of cyanobacteria. Genes encoding CylC homologs are widely distributed throughout the cyanobacterial tree of life, within biosynthetic gene clusters of distinct architectures (combination of unique gene groups). These enzymes are found in a variety of biosynthetic contexts, which include fatty-acid activating enzymes, type I or type III polyketide synthases, dialkylresorcinol-generating enzymes, monooxygenases or Rieske proteins. 
Our study also reveals that dimetal-carboxylate halogenases are among the most abundant types of halogenating enzymes in the phylum Cyanobacteria. Conclusions Our data show that dimetal-carboxylate halogenases are widely distributed throughout the Cyanobacteria phylum and that BGCs encoding CylC homologs are diverse and mostly uncharacterized. This work will help guide the search for new halogenating biocatalysts and natural product scaffolds. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-021-07939-x.


Introduction
Nature is a rich source of new compounds that fuel innovation in the pharmaceutical and agriculture sectors [1]. The remarkable diversity of natural products (NPs) results from a similarly diverse pool of biosynthetic enzymes [2]. These often are highly selective and efficient, carrying out demanding reactions in aqueous media, and therefore are interesting starting points for the development of industrially relevant biocatalysts [2]. Faster and more accessible DNA sequencing technologies have enabled, in the past decade, a large number of genomics and metagenomics projects focused on the microbial world [3]. The resulting sequence data holds immense opportunities for the discovery of new microbial enzymes and their associated NPs [4].
Halogenation is a widely used and well-established reaction in synthetic and industrial chemistry [5], which can have significant consequences for the bioactivity, bioavailability and metabolic activity of a compound [5][6][7]. Halogenating biocatalysts are thus highly desirable for biotechnological purposes [6,8]. The mechanistic aspects of biological halogenation can also inspire the development of organometallic catalysts [9]. Nature has evolved multiple strategies to incorporate halogen atoms into small molecules [6], as illustrated by the structural diversity of thousands of currently known halogenated NPs, which include drugs and agrochemicals [10,11]. Until the early 1990s, haloperoxidases were the only known halogenating enzymes. Research on the biosynthesis of halogenated metabolites eventually revealed a more diverse range of halogenases with different mechanisms. Currently, biological halogenation is known to proceed by distinct electrophilic, nucleophilic or radical mechanisms [6]. Electrophilic halogenation is characteristic of the flavin-dependent halogenases and the heme- and vanadium-dependent haloperoxidases, which catalyze the installation of C-I, C-Br or C-Cl bonds onto electron-rich substrates. Two families of nucleophilic halogenases are known, the halide methyltransferases and the SAM halogenases. Both utilize S-adenosylmethionine (SAM) as an electrophilic cofactor or cosubstrate and halide anions as nucleophiles. Notably, these are the only halogenases capable of generating C-F bonds [12]. Finally, radical halogenation has only been described for nonheme iron/2-oxoglutarate (2OG)-dependent enzymes. This type of halogenation allows the selective insertion of a halogen into a non-activated, aliphatic C-H bond. A recent review by Agarwal et al. (2017) thoroughly covers the topic of enzymatic halogenation.
Cyanobacteria are a rich source of halogenases among bacteria, in particular for nonheme iron/2OG-dependent and flavin-dependent halogenases (Fig. 1a and b). AmbO5 and WelO5 are cyanobacterial enzymes that belong to the nonheme iron/2OG-dependent halogenase family [13][14][15]. AmbO5 is an aliphatic halogenase capable of site-selectively modifying ambiguine, fischerindole and hapalindole alkaloids [13,14]. Its close homolog WelO5 (79 % sequence identity) is capable of performing analogous halogenations in hapalindole-type alkaloids and is involved in the biosynthesis of welwitindolinone [14,16]. BarB1 and BarB2 are also nonheme iron/2OG-dependent halogenases that catalyze trichlorination of a methyl group from a leucine substrate attached to the peptidyl carrier protein BarA in the biosynthesis of barbamide [17][18][19]. Other halogenases from this enzyme family include JamE, CurA, and HctB. JamE and CurA catalyze halogenations in intermediate steps of the biosynthesis of jamaicamide and curacin A, respectively [20,21], while HctB is a fatty acid halogenase responsible for chlorination in hectochlorin assembly [22]. Flavin-dependent halogenases, for which characterized examples include the tryptophan-7-chlorinase PrnA from Pseudomonas fluorescens [23], the phenol brominase Bmp5 from members of the bacterial genera Marinomonas and Pseudoalteromonas [24,25] and the recently described chlorinase AoiQ, which functionalizes the sp3-hybridized carbon atoms of 1,3-diketones [26], have also been reported in cyanobacterial biosynthetic pathways. While, to our knowledge, a cyanobacterial flavin-dependent halogenase has not been characterized in vitro, ApdC and McnD are predicted to be FAD-dependent enzymes responsible for the halogenation of tyrosine in cyanopeptolin-type peptides. It has been proposed that these enzymes chlorinate, respectively, anabaenopeptilides in Anabaena and micropeptins in Microcystis strains [27][28][29][30]. 
AerJ is another example of a predicted cyanobacterial FAD-dependent chlorinase, likely involved in the modification of a tyrosine-derived moiety in aeruginosin biosynthesis in Planktothrix and Microcystis strains [29].
Recent efforts to characterize the biosynthesis of structurally unusual cyanobacterial natural products have uncovered a distinct class of halogenating enzymes (Fig. 1c). Using a genome mining approach, Nakamura et al. (2012) discovered the biosynthetic gene cluster (BGC) responsible for the biosynthesis of the cylindrocyclophanes in the cyanobacterium Cylindrospermum licheniforme ATCC 29412 [31]. These natural paracyclophanes were found to be assembled from two identical chlorinated alkylresorcinol units [32]. The paracyclophane macrocycle is created by forming two C-C bonds using a Friedel-Crafts-like alkylation reaction catalyzed by the enzyme CylK [32] (Fig. 1c). Therefore, although many cylindrocyclophanes are not halogenated, their biosynthesis involves a halogenated intermediate [31,32], a process termed a cryptic halogenation [33]. Nakamura et al. (2017) showed that the CylC enzyme was responsible for regio- and stereoselectively installing a chlorine atom onto the fatty acid-derived sp3 carbon center of a biosynthetic intermediate that is subsequently elaborated to the key alkylresorcinol monomer (Fig. 1c). To date, CylC is the only characterized dimetal-carboxylate halogenase (this classification is based on both biochemical evidence and similarity to other diiron-carboxylate proteins) [32]. Homologs of CylC have been found in the BGCs of the columbamides [34], bartolosides [35], microginin [32], puwainaphycins/minutissamides [36], and chlorosphaerolactylates [37], all of which produce halogenated metabolites. CylC-type enzymes bear low sequence homology to dimetal desaturases and N-oxygenases [32], functionalize C-H bonds in aliphatic moieties at either terminal or mid-chain positions, and are likely able to carry out gem-dichlorination [34,35]. The reactivity displayed by CylC and its homologs is of interest for biocatalysis, in particular because this type of carbon center activation is often inaccessible to organic synthesis [16,38]. 
An understanding of the molecular basis for the halogenation of different positions and for chain-length preference will also be of value for biocatalytic applications. Hence, accessing novel variants of CylC enzymes will facilitate the functional characterization of this class of halogenases, mechanistic studies, and biocatalyst development.
Here, we provide an in-depth analysis of the diversity, distribution and context of CylC homologs in microbial genomes. Using both publicly available genomes and our in-house culture collection of cyanobacteria (LEGEcc), we report that CylC enzymes are common in cyanobacterial genomes, found in numbers comparable to those of flavin-dependent or nonheme iron/2OG-dependent halogenases. We additionally show that CylC homologs are distributed throughout the cyanobacterial phylogeny and are, to a great extent, part of cryptic BGCs with diverse architectures, underlining the potential for NP discovery associated with this new halogenase class.
Fig. 1 Selected examples of halogenation reactions catalyzed by different classes of microbial enzymes, with a focus on cyanobacterial halogenases. a Chlorination of carrier protein-bound or free-standing aliphatic substrates by the nonheme iron/2OG-dependent halogenases CurA and WelO5; b Chlorination and bromination of electron-rich substrates by the flavin-dependent halogenases PrnA, Bmp5 and McnD; c Chlorination of terminal or mid-chain alkyl positions by dimetal-carboxylate halogenases CylC, ColD/ColE, ClyC/ClyD and BrtJ. An asterisk denotes that the enzyme has been biochemically characterized. ACP, acyl carrier protein.

Sequence similarity networks and Genomic Neighborhood Diagrams
Sequence similarity networks (SSNs) were generated using the EFI-EST server, following a "Sequence BLAST (Basic Local Alignment Search Tool)" of CylC (AFV96137) as input [39], using negative log e-values of 2 and 40 for UniProt BLAST retrieval and SSN edge calculation, respectively. This SSN edge calculation cutoff was found to segregate the homologs into different SSN clusters; less stringent cutoff values resulted in a single SSN cluster. The 153 retrieved sequences and the query sequence were then used to generate the SSNs with an alignment score threshold of 42 and a minimum length of 90. The networks were visualized in Cytoscape (v3.80). The full SSN obtained in the previous step was used to generate Genomic Neighborhood Diagrams (GNDs) using the EFI-GNT tool [39]. A Neighborhood Size of 10 was used and the Lower Limit for Co-occurrence was 20 %. The resulting GNDs were visualized in Cytoscape (Fig. 2).
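The effect of the alignment score threshold on SSN clustering can be illustrated with a minimal sketch. It is not the EFI-EST implementation: the function name `ssn_clusters`, the toy sequence identifiers and their scores are hypothetical, and the sketch assumes that an edge is drawn whenever a pairwise score is greater than or equal to the threshold, with SSN clusters taken as connected components.

```python
from collections import defaultdict

def ssn_clusters(pair_scores, threshold):
    """Group sequences into SSN clusters (connected components).

    pair_scores: dict mapping (id_a, id_b) -> pairwise alignment score.
    threshold:   minimum score for drawing an edge (assumed inclusive).
    """
    neighbors = defaultdict(set)
    nodes = set()
    for (a, b), score in pair_scores.items():
        nodes.update((a, b))
        if score >= threshold:
            neighbors[a].add(b)
            neighbors[b].add(a)

    clusters, seen = [], set()
    for start in sorted(nodes):
        if start in seen:
            continue
        # Depth-first walk over edges that passed the threshold.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters

# Hypothetical scores for four made-up sequences (not data from this study).
toy_scores = {
    ("seqA", "seqB"): 80,
    ("seqB", "seqC"): 41,
    ("seqC", "seqD"): 60,
    ("seqA", "seqD"): 10,
}
strict = ssn_clusters(toy_scores, threshold=42)  # two clusters
loose = ssn_clusters(toy_scores, threshold=40)   # one merged cluster
```

In this toy case, lowering the threshold from 42 to 40 admits the seqB-seqC edge and merges the two clusters into one, which mirrors the behavior described for less stringent cutoffs.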

Cyanobacterial strains and growth conditions
Freshwater and marine cyanobacterial strains from the Blue Biotechnology and Ecotoxicology Culture Collection (LEGEcc) (CIIMAR, University of Porto) were grown in 50 mL Z8 medium [40]

Genomic DNA extraction
50 mL of each cyanobacterial strain were centrifuged at 7000 ×g for 10 min. The cell pellets were used for genomic DNA (gDNA) extraction using the PureLink Genomic DNA Mini Kit (Thermo Fisher Scientific) or the NZY Plant/Fungi gDNA Isolation kit (NZYTech), according to the manufacturer's instructions.

Primer design
BLAST searches using CylC [Cylindrospermum licheniforme UTEX B 2014] as query identified related genes (for tBLASTn: 31-93 % amino acid identity). We discarded nucleotide hits with a length < 210 and E-values < 1 × 10^-10. The complete sequences (56 cylC homolog sequences, Table S1) were collected from NCBI and aligned using MUltiple Sequence Comparison by Log-Expectation (MUSCLE) [41]. Phylogenetic analysis of the hits was performed using FastTree GTR with a rate of 100. Streptomyces thioluteus aurF, encoding a distant dimetal-carboxylate protein [32], was used as an outgroup (AJ575648.1:4858-5868). We divided the phylogeny of cylC homologs into five groups with moderate similarity (Fig. S1). The regions of higher similarity within each group were selected for degenerate primer design (Table 1).

PCR conditions
The PCRs to detect cylC homologs were conducted in a final volume of 20 µL, containing 6.9 µL of ultrapure water, 4.0 µL of 5× GoTaq Buffer (Promega), 2.0 µL of MgCl2, 1.0 µL of dNTPs, 2.0 µL of reverse and 2.0 µL of forward primer (each at 10 µM), 0.1 µL of GoTaq and 2.0 µL of cyanobacterial gDNA (20-100 ng/µL). PCR thermocycling conditions were: denaturation for 5 min at 95°C; 35 cycles with denaturation for 1 min at 95°C, primer annealing for 30 s at different temperatures (55°C for group A; 57°C for group B; 55°C for group C; 51°C for group D; 51°C for group X) and extension for 1 min at 72°C; and final extension for 10 min at 72°C.
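As a quick consistency check of the reaction setup, the listed component volumes can be summed and the final primer and buffer concentrations derived from the stock values given above. The sketch below is illustrative only: the helper names `final_conc` and `master_mix` and the 10 % pipetting excess are assumptions, not part of the published protocol.

```python
# Component volumes (µL) for one cylC screening reaction, as listed in the text.
reaction_ul = {
    "ultrapure water": 6.9,
    "5x GoTaq buffer": 4.0,
    "MgCl2": 2.0,
    "dNTPs": 1.0,
    "reverse primer (10 µM)": 2.0,
    "forward primer (10 µM)": 2.0,
    "GoTaq polymerase": 0.1,
    "gDNA template": 2.0,
}

total_ul = round(sum(reaction_ul.values()), 1)

def final_conc(stock_conc, stock_ul, total):
    """Dilution formula: C_final = C_stock * V_stock / V_total."""
    return stock_conc * stock_ul / total

primer_final_um = final_conc(10.0, 2.0, total_ul)  # each primer, in µM
buffer_final_x = final_conc(5.0, 4.0, total_ul)    # buffer strength, in x

def master_mix(n_reactions, excess=0.10):
    """Scale all components except the template for n reactions.

    The 10 % excess is a hypothetical allowance for pipetting loss.
    """
    factor = n_reactions * (1 + excess)
    return {name: round(vol * factor, 1)
            for name, vol in reaction_ul.items()
            if name != "gDNA template"}

mix_for_10 = master_mix(10)
```

Under these values, the components add up to the stated 20 µL, each primer ends up at 1 µM and the buffer at 1×.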
When not already available, the 16S rRNA gene for a tested strain was amplified by PCR, using standard primers for amplification (CYA106F 5' CGG ACG GGT GAG TAA CGC GTG A 3' and CYA785R 5' GAC TAC WGG GGT ATC TAA TCC 3'). The PCRs were conducted in a final volume of 20 µL, containing 6.9 µL of ultrapure water, 4.0 µL of 5× GoTaq Buffer, 2.0 µL of MgCl2, 1.0 µL of dNTPs, 2.0 µL of reverse and 2.0 µL of forward primer (each at 10 µM), 0.1 µL of GoTaq and 2.0 µL of cyanobacterial DNA (5-10 ng/µL). PCR thermocycling conditions were: denaturation for 5 min at 95°C; 35 cycles with denaturation for 1 min at 95°C, primer annealing for 30 s at 52°C and extension for 1 min at 72°C; and final extension for 10 min at 72°C.
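Primers containing IUPAC ambiguity codes, such as the W (A or T) in CYA785R, represent a pool of distinct oligonucleotides. The following sketch enumerates that pool for a given primer using the standard IUPAC nucleotide code table; the function name `expand_primer` is a hypothetical helper introduced here for illustration.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def expand_primer(primer):
    """Return every unambiguous sequence encoded by a degenerate primer."""
    bases = primer.replace(" ", "").upper()
    return ["".join(combo) for combo in product(*(IUPAC[b] for b in bases))]

# Primer sequences as given in the text (5' to 3').
CYA106F = "CGG ACG GGT GAG TAA CGC GTG A"
CYA785R = "GAC TAC WGG GGT ATC TAA TCC"

forward_variants = expand_primer(CYA106F)  # no ambiguity codes: one sequence
reverse_variants = expand_primer(CYA785R)  # one W: two sequences
```

The degeneracy of a primer is the product of the number of bases allowed at each position, so a primer pair designed for a divergent gene family can represent many more variants than the two obtained here.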
Amplicon sizes were confirmed after separation in a 1.0 % agarose gel.

Cloning and sequencing
The cylC homolog and 16S rRNA gene sequences were obtained either directly from the NCBI or through sequencing. To obtain high quality sequences, TOPO PCR cloning (Invitrogen) was used. The TOPO cloning reaction was conducted in a final volume of 3 µL, containing 1 µL of fresh PCR product, 1 µL of salt solution, 0.5 µL of TOPO vector and 0.5 µL of water. The reaction was incubated for 20 min at room temperature. 3 µL of TOPO reaction were added into a microcentrifuge tube containing chemically competent E. coli (Top10, Life Technologies) cells. After 30 min of incubation on ice, the cells were placed for 30 s at 42°C without shaking and were then immediately transferred to ice. 250 µL of room temperature SOC medium were added to the previous mixture and the tube was horizontally shaken at 37°C for 1 h (180 rpm). 60 µL of the different cloning reactions were spread onto LB ampicillin/X-gal plates and incubated overnight at 37°C.
Two or three positive colonies from each reaction were tested by colony-PCR. The PCR was conducted in a final volume of 20 µL, containing 10.9 µL of ultrapure water, 4.0 µL of 5× GoTaq Buffer, 2.0 µL of MgCl2, 1.0 µL of dNTPs, 1.0 µL of reverse pUCR and 1.0 µL of forward pUCF primers (each at 20 µM), 0.1 µL of GoTaq and the target colony. PCR thermocycling conditions were: denaturation for 5 min at 95°C; 35 cycles with denaturation for 1 min at 95°C, primer annealing for 30 s at 50°C and extension for 1 min at 72°C; and final extension for 10 min at 72°C. Amplicon sizes were confirmed after separation in a 1.0 % agarose gel. Selected colonies were incubated overnight at 37°C (180 rpm), in 5 mL of LB supplemented with 100 µg/mL ampicillin. The plasmids containing the amplified PCR products were extracted (NZYMiniprep) and Sanger sequenced using pUC primers.

Cyanobacteria genome sequencing
Many of the LEGEcc strains are non-axenic, and so before extraction of gDNA for genome sequencing, the amount of heterotrophic contaminant bacteria in cyanobacterial cultures was evaluated by plating onto Z8 agar medium or Z8 agar medium with added 2.5 % sea salts (Tropic Marin) and vitamin B12 (10 µg/L) (depending on the original environment), supplemented with casamino acids (0.02 % wt/vol) and glucose (0.2 % wt/vol) [42]. The plates were incubated for 2-4 days at 25°C in the dark and examined for bacterial growth. Those cultures with minimal contamination were used for DNA extraction for genome sequencing. The DNA extraction methodology was selected based on the morphological features of each strain (coccoid or non-coccoid strains). Total genomic DNA was isolated from a fresh or frozen pellet of 50 mL culture using the commercial PureLink Genomic DNA Mini Kit (Thermo Fisher Scientific) (for coccoid strains) or the NZY Plant/Fungi gDNA Isolation kit (NZYTech) (for non-coccoid strains) or using a CTAB-chloroform/isoamyl alcohol-based protocol [43] (if the previously indicated protocols failed). The latter included a homogenization step (grinding cells using a mortar and pestle with liquid nitrogen) before extraction using the standard kit protocol. The quality of the gDNA was evaluated in a DS-11 FX Spectrophotometer (DeNovix) and by 1 % agarose gel electrophoresis, before genome sequencing, which was performed elsewhere (Era7, Spain and MicrobesNG, UK) using 2 × 250 bp paired-end libraries and the Illumina platform (except for Synechocystis sp. LEGE 06099, whose genome was sequenced using the Ion Torrent PGM platform). A standard pipeline including the identification of the closest reference genomes for read mapping using Kraken 2 [44] and BWA-MEM to check the quality of the reads [45] was carried out, while de novo assembly was performed using SPAdes [46]. The genomic data obtained for each strain was treated as a metagenome. 
The contigs obtained as previously mentioned were analyzed using the binning tool MaxBin 2.0 [47] and checked manually in order to obtain only cyanobacterial contigs. The draft genomes were annotated using the NCBI Prokaryotic Genome Annotation Pipeline (PGAP) [48] and submitted to GenBank under the BioProject number PRJNA667061. In the case of Hyella patelloides LEGE 07179 and Sphaerospermopsis sp. LEGE 00249 the assemblies had been previously deposited in NCBI under the BioSample numbers SAMEA4964519 and SAMN15758549, respectively.

Genomic context of CylC homologs
BLASTp searches using CylC [Cylindrospermum licheniforme UTEX B 2014] as query identified related CylC homologs within the publicly available cyanobacterial genomes and in the genomes of LEGEcc strains. We annotated the genomic context for each CylC homolog using antiSMASH v5.0 [49] and manual annotation through BLASTp of selected proteins. Some BGCs were not identified by antiSMASH and were manually annotated using BLASTp searches.

Phylogenetic analysis
Nucleotide sequences of cylC homologs obtained from the NCBI and from genome sequencing in this study, were aligned using MUSCLE from within the Geneious R11.0 software package (Biomatters). The nucleotide sequence of the distantly-related dimetal-carboxylate protein AurF [32] from Streptomyces thioluteus (AJ575648.1:4858-5868) was used as an outgroup. The alignments, trimmed to their core 788, 673, 506, 299 and 499 positions (for group A, B, C, D and X, respectively), were used for phylogenetic analysis, which was performed using FastTree 2 (from within Geneious), using a GTR substitution model (from jmodeltest, [50]) with a rate of 100 (Fig. S2).
For the phylogenetic analysis based on the 16S rRNA gene (Fig. 3, Fig. S3), the corresponding nucleotide sequences were retrieved from the NCBI (from publicly available genomes as of March 16, 2020) or from sequence data (amplicon or genome) obtained in this study. The sequences were aligned as detailed for cylC homologs and trimmed to the core shared positions (656). Phylogenetic tree inference was performed with RAxML-HPC2 on XSEDE (8.2.12), using maximum likelihood/rapid bootstrapping with 1000 bootstrap iterations, in the Cipres platform [51].
The amino acid sequences of CylC homologs were aligned using MUSCLE from within the Geneious software package (Biomatters). The alignments were trimmed to their core 333 residues and used for phylogenetic analysis, which was performed using RAxML-HPC2 phylogenetic tree inference using maximum likelihood/rapid bootstrapping run on XSEDE (8.2.12) with 1000 bootstrap iterations in the Cipres platform [51] (Fig. 4c).
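Trimming an alignment to its core shared positions can be sketched as follows. The exact procedure used in Geneious is not described in the text, so this is only one plausible interpretation: it assumes that the core spans from the first to the last alignment column in which every sequence has a residue, keeping internal gaps. The function name `trim_to_shared_core` and the toy sequences are hypothetical.

```python
def trim_to_shared_core(alignment, gap="-"):
    """Trim aligned sequences to the region shared by all of them.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    Returns a dict with every sequence sliced to the same column range.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must all have the same length")
    n_cols = lengths.pop()

    # Columns in which no sequence has a gap character.
    full_cols = [i for i in range(n_cols)
                 if all(seq[i] != gap for seq in alignment.values())]
    if not full_cols:
        return {name: "" for name in alignment}

    start, end = full_cols[0], full_cols[-1] + 1
    return {name: seq[start:end] for name, seq in alignment.items()}

# Made-up aligned fragments standing in for amplicon and reference sequences.
toy_alignment = {
    "amplicon_1": "--ATGCCGT-A",
    "amplicon_2": "GGATG-CGTTA",
    "reference":  "GGATGCCGTT-",
}
core = trim_to_shared_core(toy_alignment)
core_length = len(next(iter(core.values())))
```

Here the ragged ends are removed while the internal gap in the second sequence is retained, so all three trimmed sequences keep the same length for downstream tree inference.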

CylC-like halogenases are mostly found in cyanobacteria
To investigate the distribution of CylC homologs encoded in microbial genomes, we first searched the reference protein (RefSeq) or non-redundant protein sequences (nr) databases (NCBI) for homologs of CylC or BrtJ, using BLASTp (minimum 25 % identity, E-value of 9.9 × 10^-20 and 50 % coverage). A total of 128 and 246 homologous unique protein sequences were retrieved using the RefSeq or nr databases, respectively; in both cases, sequences were primarily from cyanobacteria (96 and 88 %, respectively) (Fig. 2a). We then used the Enzyme Similarity Tool of the Enzyme Function Initiative (EFI-EST) [39] to evaluate the sequence landscape of dimetal-carboxylate halogenases. Using CylC as query, we obtained an SSN (sequence similarity network) composed of 154 sequences (153 retrieved from the UniProt database and the query sequence) [53] (Fig. 2b). The SSN featured two major clusters, one containing homologs from diverse cyanobacterial genera, the other composed of homologs from several cyanobacteria, with a few from proteobacteria (mostly deltaproteobacteria) and two from the cyanobacteria sister-phylum Melainabacteria. A third SSN cluster was composed only of the previously reported BrtJ enzymes and, finally, a homolog from the cyanobacterial genus Hormoscilla remained unclustered. We were unable to recover any SSN that included clusters containing other characterized enzyme functions.
Fig. 3 RAxML cladogram of the 16S rRNA gene of LEGEcc strains (grey squares) and from cyanobacterial strains with NCBI-deposited reference genomes, screened in this study. Taxonomy is presented at the order level (colored rectangles). Strains whose genomes encode CylC homologs are denoted by black squares. Green squares indicate that at least one homolog was detected by PCR-screening and verified by retrieving the sequence of the corresponding amplicon by cloning followed by Sanger sequencing. Gloeobacter violaceus PCC 7421 served as an outgroup. A version of this cladogram including the bootstrap values for 1000 replications is provided as Supplementary Material.
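The BLASTp cutoffs used for homolog retrieval (25 % identity, E-value of 9.9 × 10^-20, 50 % coverage) can be expressed as a simple filter over parsed hits. This is a sketch under stated assumptions: it treats the E-value as an upper bound and coverage as query coverage, and the `Hit` record, the `passes` function, the hit names and the query length of 470 residues are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    subject: str
    pident: float  # percent identity over the aligned region
    evalue: float
    qstart: int    # 1-based, inclusive query coordinates
    qend: int
    qlen: int      # full length of the query protein

    @property
    def query_coverage(self):
        """Percentage of the query spanned by the alignment."""
        return 100.0 * (self.qend - self.qstart + 1) / self.qlen

def passes(hit, min_ident=25.0, max_evalue=9.9e-20, min_cov=50.0):
    """Apply the identity, E-value and coverage cutoffs to one hit."""
    return (hit.pident >= min_ident
            and hit.evalue <= max_evalue
            and hit.query_coverage >= min_cov)

# Made-up hits against a hypothetical 470-residue query.
hits = [
    Hit("hypothetical_1", 62.0, 1e-150, 1, 450, 470),  # passes all cutoffs
    Hit("hypothetical_2", 24.0, 1e-30, 10, 400, 470),  # identity too low
    Hit("hypothetical_3", 40.0, 1e-15, 1, 460, 470),   # E-value too high
    Hit("hypothetical_4", 35.0, 1e-40, 1, 200, 470),   # coverage too low
]
kept = [h.subject for h in hits if passes(h)]
```

Each of the three rejected toy hits fails exactly one criterion, which makes it easy to see how the cutoffs act independently.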

CylC homologs are widely distributed throughout the phylum Cyanobacteria
With the intent of accessing a wide diversity of CylC homolog sequences, we decided to use a degenerate-primer PCR strategy to discover additional homologs in cyanobacteria from the LEGEcc culture collection [54], because the phylum Cyanobacteria is diverse and still underrepresented in terms of genome data [55][56][57][58][59][60]. The LEGEcc maintains cultures isolated from diverse freshwater and marine environments, mostly in Portugal, and, for example, contains all known bartoloside-producing strains [35]. We considered that strains with CylC homologs detected by PCR could then be selected for genome sequencing and subsequent recovery of full-length sequences and genomic contexts. To carry out the PCR screen, primers were designed based on 54 nucleotide sequences retrieved from the NCBI (these in turn were selected to represent the phylogenetic diversity of CylC homologs) (Fig. S1). Due to the lack of highly conserved nucleotide sequences among all homologs considered, we divided the nucleotide alignment into five groups and designed a degenerate primer pair for each. Upon screening 326 strains from LEGEcc using the five primer pairs, we retrieved 89 sequences encoding CylC homologs, confirmed through cloning and Sanger sequencing of the obtained amplicons. We were unable to directly analyze the diversity of the entire set of LEGEcc-derived cylC amplicons due to low overlap between sequences obtained with different primers. As such, we performed a phylogenetic analysis of the diversity retrieved with each primer pair (Fig. S2), by aligning the PCR-derived sequences with a set of diverse cylC genes retrieved from the NCBI. For some strains, our PCR screen retrieved more than one homolog using different primer pairs (e.g. Nostoc sp. LEGE 12451 or Planktothrix mougeotii LEGE 07231).
To access the full-length sequences of CylC homologs identified among LEGEcc strains, as well as their genomic context, we undertook a genome-sequencing effort informed by our PCR screen. We selected 21 strains for genome sequencing, which represents the diversity of CylC homologs observed in the different PCR screening groups. The resulting genome data was used to generate a local BLAST database and the homologs were located within the genomes. In some cases, additional homologs that were not detected in the PCR screen were identified. Overall, 33 full-length genes encoding CylC homologs were retrieved from LEGEcc strains.
To explore the phylogenetic distribution of CylC homologs encoded in publicly available reference genomes and the herein sequenced LEGEcc genomes, we aligned the 16S rRNA genes from 648 strains with RefSeq genomes and the LEGEcc strains that were screened by PCR in this study. Using this dataset, we performed a phylogenetic analysis which indicated that CylC homologs are broadly distributed through five cyanobacterial orders: Nostocales, Oscillatoriales, Chroococcales, Synechococcales and Pleurocapsales (Fig. 3, Fig. S3). We could not detect CylC homologs in the genomes of picocyanobacterial strains (genera Synechococcus and Prochlorococcus), which are overrepresented among currently available cyanobacterial genomes. It was also noteworthy that the cyanobacterial orders for which we did not find CylC homologs (Chroococcidiopsidales, Spirulinales, Gloeomargaritales and Gloeobacterales) are poorly represented in our dataset (Fig. 3, Fig. S3). However, our previous BLASTp search against the nr database did retrieve two close homologs in a couple of Chroococcidiopsidales strains (genera Aliterella and Chroococcidiopsis) and a more distant homolog in a Gloeobacter strain (Gloeobacterales) (Table S3).

Diversity of BGCs encoding CylC homologs
To characterize the biosynthetic diversity of BGCs encoding CylC homologs, which were found in 78 cyanobacterial genomes (21 from LEGEcc and 57 from RefSeq) from different orders, we first submitted these genome sequences for antiSMASH [49] analysis. In total, 55 CylC-encoding BGCs were detected, which were classified as resorcinol, NRPS, PKS, or hybrid NRPS-PKS. Given the number of CylC homolog-encoding genes detected in these genomes (105), we considered that several BGCs might not have been identified with antiSMASH. Therefore, we performed manual annotation of the genomic contexts of the CylC homologs and were able to identify 40 additional BGCs (i.e. a total of 95 BGCs). Upon analysis of the entire set of CylC-encoding BGCs, we classified the BGCs in seven major categories, based on their overall architecture, which we designated as follows (listed in decreasing abundance): Rieske-containing (n = 36), type I PKS (chlorosphaerolactylate/columbamide/microginin/puwainaphycin-like, n = 29), type III PKS (n = 13), dialkylresorcinol (n = 8), PriA-containing (n = 5), nitronate monooxygenase-containing (n = 3) and cytochrome P450/sulfotransferase-containing (n = 1) (Fig. 4a, Figs. S4, S5, S6, S7, S8, S9 and S10). Three BGCs were excluded from our classification since they were only partially sequenced (Fig. S11). Examples of each of the cluster architectures are presented in Fig. 4a and schematic representations of each of the classified BGCs are presented in Supplementary Figures S4, S5, S6, S7, S8, S9 and S10. It should be stressed that within several of these seven major categories, there is still considerable BGC architecture diversity, notably within the dialkylresorcinol, type I and type III PKS BGCs. Rieske-containing BGCs are not associated with any known NP and encode between two and four proteins with Rieske domains. Most contain a sterol desaturase family protein, feature a single CylC homolog and are chiefly found among Nostocales and Oscillatoriales (Fig. S4). 
PriA-containing BGCs encode, apart from the primosomal protein N' (PriA), a diguanylate cyclase/phosphodiesterase, an aromatic ring-hydroxylating dioxygenase subunit alpha and a ferritin-like protein, and were only detected in Synechocystis spp. (Fig. S5). These are similar to the Rieske-containing BGCs; however, in strains harboring PriA-containing BGCs, the additional functionalities that are found in the Rieske-containing BGCs can be found dispersed throughout the genome (Table S4). In our dataset, a single sulfotransferase/P450-containing BGC was detected in Stanieria sp. and was unrelated to the above-mentioned architectures (Fig. S6). Type I PKS BGCs encode clusters similar to those of the chlorosphaerolactylates, columbamides, microginins and puwainaphycins and typically feature a fatty acyl-AMP ligase (FAAL) and an acyl carrier protein upstream of one or two CylC homologs and a type I PKS downstream of the CylC homolog(s). These were found in Nostocales and Oscillatoriales strains (Fig. S7). Taken together with the known NP structures associated with these BGCs [34,61,62], we can expect that the encoded metabolites feature halogenated fatty acids in terminal or mid-chain positions. BGCs of the dialkylresorcinol type, which contain DarA and DarB homologs [35,63], including several bartoloside-like clusters (found only in LEGEcc strains), were detected in Nostocales, Pleurocapsales and Chroococcales (Fig. S8). Type III PKS BGCs encoding CylC homologs, which include a variety of cyclophane BGCs, were detected in the Nostocales, Oscillatoriales and Pleurocapsales (Fig. S9). Finally, nitronate monooxygenase-containing BGCs, which are not associated with any known NP, were only found in Nostocales strains from the LEGEcc and also featured genes encoding PKSI, ferredoxin, ACP or glycosyl transferase (Fig. S10). 
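The BGC counts reported above can be cross-checked with a short tally. The sketch below only restates numbers given in the text (55 BGCs detected by antiSMASH, 40 added by manual annotation, and the seven category counts); the variable names are arbitrary.

```python
# BGC counts as reported in the text.
antismash_bgcs = 55
manually_annotated_bgcs = 40

category_counts = {
    "Rieske-containing": 36,
    "type I PKS": 29,
    "type III PKS": 13,
    "dialkylresorcinol": 8,
    "PriA-containing": 5,
    "nitronate monooxygenase-containing": 3,
    "cytochrome P450/sulfotransferase-containing": 1,
}

total_detected = antismash_bgcs + manually_annotated_bgcs
total_classified = sum(category_counts.values())

# Categories ordered by decreasing abundance, with their share of the total.
ranked = sorted(category_counts.items(), key=lambda item: item[1], reverse=True)
shares = {name: round(100.0 * n / total_classified, 1) for name, n in ranked}
```

With these figures, the seven categories add up to 95, matching the sum of the antiSMASH and manually annotated BGCs, and the Rieske-containing architecture alone accounts for roughly 38 % of the classified clusters.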
A less BGC-centric perspective of the genomic context of CylC homologs could be obtained through the Genome Neighborhood Tool of the EFI (EFI-GNT, [64]). Using the previously generated SSN as input, we analyzed the resulting Genomic Neighborhood Diagrams (Fig. 4b), which indicated that the three SSN clusters had entirely different genomic contexts (herein defined as 10 upstream and 10 downstream genes from the cylC homolog). In the SSN cluster that encompasses CylC and its closest homologs, these enzymes associate most often with PP-binding (ACP/PCPs) and AMP-binding (such as FAALs) proteins. In the SSN cluster that includes both cyanobacterial and non-cyanobacterial CylC homologs, the genomic contexts most prominently feature Rieske/[2Fe-2S] cluster proteins as well as fatty acid hydroxylase family enzymes. The cyanobacterial homologs in this SSN cluster are exclusively encoded in Rieske- or PriA-containing BGCs. Such homologs may not require a phosphopantetheine-tethered substrate, as no substrate activation or carrier proteins/domains are found in their genomic neighborhoods. However, acyl transferases and desaturases are encoded in most of these BGCs, so we propose that these enzymes may act on central fatty acid metabolism intermediates or their derivatives. Finally, the BrtJ SSN cluster, composed only of the two reported BrtJ enzymes, shows entirely different surrounding genes, which correspond to the brt genes. Overall, there is a considerable number of proteins with unknown function found in the vicinity of dimetal-carboxylate halogenases, suggesting that uncharted biochemistry is associated with these enzymes.
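The genomic context definition used here (10 genes upstream and 10 genes downstream of the cylC homolog) amounts to a fixed window over the ordered genes of a contig. A minimal sketch follows; it is not the EFI-GNT implementation, it ignores strand orientation (upstream and downstream simply follow contig order), and the function name `gene_neighborhood` and the gene labels are hypothetical.

```python
def gene_neighborhood(contig_genes, target, size=10):
    """Return (upstream, downstream) genes flanking `target` on one contig.

    contig_genes: gene identifiers in their order along the contig.
    size:         number of genes to take on each side (fewer at contig ends).
    """
    idx = contig_genes.index(target)
    upstream = contig_genes[max(0, idx - size):idx]
    downstream = contig_genes[idx + 1:idx + 1 + size]
    return upstream, downstream

# A made-up contig of 25 genes; gene_12 plays the role of the cylC homolog.
contig = [f"gene_{i:02d}" for i in range(25)]
up, down = gene_neighborhood(contig, "gene_12")

# Near a contig edge, the window is truncated rather than padded.
edge_up, edge_down = gene_neighborhood(contig, "gene_03")
```

Truncation at contig ends matters for draft genomes, where a homolog located close to a contig boundary will yield an incomplete neighborhood.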
Since SSN analysis generated only three clusters of CylC homologs, we next investigated the genetic relatedness among these enzymes and how it correlates with BGC architecture. We performed a phylogenetic analysis of the CylC homologs from the 98 classified and 3 unclassified BGCs (Fig. 4c). Our analysis indicated that homologs from PriA-containing and Rieske-containing BGCs formed a well-supported clade. Its sister clade contained homologs from the remaining BGCs. Within this larger clade, homologs associated with the type I PKS, dialkylresorcinol or type III PKS BGCs were found to be polyphyletic. In some cases, the same BGC contained distantly related CylC homologs (e.g. Hyella patelloides LEGE 07179, Anabaena cylindrica PCC 7122) (Fig. 4c). This analysis also revealed that several strains (Fig. 4c) encode two or three phylogenetically distant CylC homologs in different BGCs.

CylC enzymes and other cyanobacterial halogenases
We sought to understand how CylC-type halogenases compare to other halogenating enzyme classes found in cyanobacteria in terms of prevalence and association with BGCs. To this end, we carried out a CORASON [52] analysis of publicly available cyanobacterial genomes (including non-reference genomes) and the herein acquired genome data from LEGEcc strains (a total of 2,115 cyanobacterial genomes). We used different cyanobacterial halogenases as input, namely CylC, McnD, PrnA, Bmp5 and the 2OG-Fe(II) oxygenase domains from CurA and BarB1. CORASON attempts to retrieve genome context by exploring gene cluster diversity linked to enzyme phylogenies [52]. The CORASON analysis retrieved 117 (5.6 %) dimetal-carboxylate halogenases, 61 (2.9 %) nonheme iron-dependent halogenases and 226 (10.7 %) flavin-dependent halogenases from the cyanobacterial genomes (Fig. 5a). Using the protein homologs detected in BGCs by CORASON, a sequence alignment was performed for dimetal-carboxylate, nonheme iron/2OG-dependent and flavin-dependent halogenases. For nonheme iron/2OG-dependent halogenases, we excised the halogenase domain from multi-domain enzyme sequences. After removing repeated sequences and trimming the alignments to their core shared positions, maximum-likelihood phylogenetic trees were constructed for each halogenase class and BGCs were annotated manually (Figs. S12, S13 and S14). Flavin-dependent halogenases were commonly associated with cyanopeptolin, 2,4-dibromophenol and pyrrolnitrin BGCs and with orphan BGCs of distinct architectures (Fig. S12). Regarding nonheme iron/2OG-dependent halogenases, we identified barbamide, curacin, hectochlorin and terpene/indole [65] BGCs and several distinct orphan BGCs (Fig. S13). For dimetal-carboxylate halogenases, columbamide, microginin, chlorosphaerolactylate, bartoloside and cyclophane BGCs were identified (Fig. S14).
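The two clean-up steps mentioned above (removing repeated sequences and trimming alignments to shared positions) can be sketched in Python as follows. This is an illustration only, not the tooling used in the study: the function names are hypothetical, and "core shared positions" is interpreted here, as an assumption, as alignment columns in which no sequence has a gap.

```python
from typing import List, Tuple

Record = Tuple[str, str]  # (sequence name, sequence)


def drop_repeated(records: List[Record]) -> List[Record]:
    """Keep the first record for each distinct sequence (case-insensitive)."""
    seen = set()
    unique = []
    for name, seq in records:
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            unique.append((name, seq))
    return unique


def trim_to_shared_columns(aligned: List[Record]) -> List[Record]:
    """Keep only alignment columns that contain no gap character ('-').

    Assumption: this gap-free rule stands in for the 'core shared positions'
    of the text, whose exact definition is not given there.
    """
    if not aligned:
        return []
    length = len(aligned[0][1])
    if any(len(seq) != length for _, seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    keep = [i for i in range(length) if all(seq[i] != "-" for _, seq in aligned)]
    return [(name, "".join(seq[i] for i in keep)) for name, seq in aligned]


# Toy data with made-up names and sequences.
print(drop_repeated([("a", "MKV"), ("b", "mkv"), ("c", "MKL")]))
print(trim_to_shared_columns([("a", "MK-LV"), ("b", "MKALV"), ("c", "-KALV")]))
```

Dedicated alignment-trimming programs apply more nuanced column criteria; the strict gap-free rule is chosen here only because it is easy to verify by eye.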
However, while some of the CylC homolog-encoding orphan BGCs previously identified by antiSMASH and manual searches were detected by CORASON, the Rieske- and PriA-containing BGCs were not. Hence, several CylC homologs were not accounted for in this analysis. For the same reasons, the other two halogenase types could also be missing some of their members in the CORASON-derived datasets. To circumvent this limitation and obtain a more comprehensive picture of the abundance of the three types of halogenase in cyanobacterial genomes, we used BLASTp searches against available cyanobacterial genomes in the NCBI database (including non-reference genomes). Several representatives of each halogenase class were used as queries in each search: CylC, BrtJ, "Mic" (the halogenase in the putative microginin gene cluster), ColD, ColE, NocO and NocN for dimetal-carboxylate halogenases; PrnA, Bmp5 and McnD for flavin-dependent halogenases; and the halogenase domain from CurA and the halogenases BarB1, HctB, WelO5 and AmbO5 for nonheme iron-dependent halogenases. Non-redundant sequences obtained for these searches using a 1 × 10⁻²⁰ e-value cutoff (corresponding to > 30 % sequence identity) were considered to share the same function as the query. It is worth mentioning that, for nonheme iron/2OG-dependent enzymes, a single amino acid difference can convert hydroxylation activity into halogenation [66], so it is possible that, at least for this class, the sequence space considered does not correspond exclusively to halogenation activity. Dimetal-carboxylate and flavin-dependent halogenase homologs were found to be the most abundant in cyanobacteria, each with roughly 0.2 homologs per genome, while nonheme iron/2OG-dependent halogenase homologs are less common (0.05 per genome) (Fig. 5b).
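The abundance estimate described above amounts to filtering hits by e-value, collapsing redundant hits and dividing by the number of genomes searched. The following Python sketch illustrates that arithmetic under stated assumptions: the hit tuples, identifiers and the function name `homologs_per_genome` are hypothetical, redundancy is resolved by protein identifier rather than by sequence, and hits with an e-value equal to the cutoff are retained.

```python
from typing import Iterable, Set, Tuple

EVALUE_CUTOFF = 1e-20  # cutoff stated in the text

# (genome identifier, subject protein identifier, e-value); hypothetical layout
Hit = Tuple[str, str, float]


def homologs_per_genome(hits: Iterable[Hit], n_genomes: int,
                        cutoff: float = EVALUE_CUTOFF) -> float:
    """Count non-redundant hits passing the cutoff and normalize by genome count."""
    if n_genomes <= 0:
        raise ValueError("n_genomes must be positive")
    kept: Set[Tuple[str, str]] = set()
    for genome, protein, evalue in hits:
        # Assumption: a hit at exactly the cutoff is kept.
        if evalue <= cutoff:
            kept.add((genome, protein))  # the same protein hit by several queries counts once
    return len(kept) / n_genomes


# Toy example with made-up hits from a search over 10 genomes.
toy_hits = [
    ("g1", "p1", 1e-50),  # found with one query
    ("g1", "p1", 1e-30),  # same protein found with a second query
    ("g2", "p2", 1e-10),  # fails the cutoff
    ("g3", "p3", 1e-25),
]
print(homologs_per_genome(toy_hits, n_genomes=10))
```

Because several queries per class were used, collapsing hits to unique proteins before counting avoids inflating the per-genome value.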

Discussion
CylC is the only characterized member of the dimetal-carboxylate halogenases. A handful of homologs are encoded in BGCs whose corresponding NPs are known, and their halogenase function can be deduced to some extent from the NP structures. In this study, we show that the remaining homologs, which are mostly found in cyanobacteria, can be used to guide the discovery of new chemistry. In particular, SSN analyses of CylC homologs attest to the uniqueness of these dimetal-carboxylate enzymes in the current protein-sequence landscape, as no homologs with additional functions could be retrieved. CylC homologs therefore represent a region of protein sequence space that is largely unexplored. Their activity might not be limited to halogenation: as in the case of iron/2OG-dependent enzymes, it is possible that some CylC homologs perform other types of oxidative transformations.
To obtain an expanded representation of CylC homolog sequences, apart from retrieving these from publicly available genomes, we used a PCR-based strategy to screen our in-house culture collection and retrieve additional homologs. To increase effectiveness, we used more than one primer pair. In general, and for each primer pair, the PCR screen retrieved mostly sequences that were closely related and clustered in a single or, at most, two phylogenetic clades. This can likely be explained by the geographical bias that might exist in the LEGEcc culture collection [54] and/or by primer design and PCR efficiency issues, which might have favored certain phylogenetic clades. Overall, the two approaches showed a wide but punctuated presence of CylC homologs among the cyanobacterial diversity considered in this study. In light of this, it is unclear how much of the current CylC homolog distribution reflects vertical inheritance or horizontal gene transfer events.
The phylogenetic analysis and the genomic context analyses that we have carried out for dimetal-carboxylate halogenases show that they have evolved to interact with different partner enzymes to generate chemical diversity, but that their phylogeny is, in some cases, not entirely consistent with BGC architecture. These observations suggest that functionally convergent associations between CylC homologs and other proteins have emerged multiple times during evolution. Examples include the CylC/CylK and BrtJ/BrtB associations, which use cryptic halogenation to achieve C-C and C-O bond formation, respectively [32,67]. However, the role of the CylC homolog-mediated halogenation of fatty acyl moieties observed for other cyanobacterial metabolites is not currently understood. Interestingly, while a number of CylC homologs, including those that are part of characterized BGCs, likely act on ACP-tethered fatty acyl substrates [32,67], those from the PriA, Rieske and cytochrome P450/sulfotransferase categories do not have a neighboring carrier protein and therefore might not require a tethered substrate. This would be an important property for a CylC-like biocatalyst [16].
When comparing dimetal-carboxylate halogenases with the nonheme iron/2OG-dependent and flavin-dependent halogenases, we found that the former are clearly a major group of halogenases in cyanobacteria, despite having been the latest to be discovered [32]. Notwithstanding, homologs of each of the three halogenase classes are associated with a large number of orphan BGCs and all classes represent opportunities for NP discovery.

Conclusions
The discovery of a new biosynthetic enzyme class brings with it tremendous possibilities for biochemistry and catalysis research, both fundamental and applied. The functional characterization of such enzymes can also be used as a handle to identify and deorphanize BGCs that encode their homologs. CylC typifies an unprecedented halogenase class, which is almost exclusively found in cyanobacteria. By searching for CylC homologs in both public databases and our in-house culture collection, we report here more than 100 new cyanobacterial CylC homologs. We found that dimetal-carboxylate halogenases are widely distributed throughout the phylum. The genomic neighborhoods of these halogenases are diverse and we identify a number of different BGC architectures associated with either one or two CylC homologs that can serve as starting points for the discovery of new NP scaffolds. In addition, the herein reported diversity and biosynthetic contexts of these enzymes will serve as a roadmap to further explore their biocatalysis-relevant activities. Despite their prevalence and distribution, there is no strong evidence for a role of CylC-like halogenases in primary metabolism and their diverse genomic contexts suggest otherwise. Finally, bartoloside-like BGCs and another CylC-associated BGC architecture (nitronate monooxygenase-containing) were found only in the LEGEcc, reinforcing the importance of geographically focused strain isolation and maintenance efforts for the Cyanobacteria phylum. However, to fully realize the potential of this new halogenase class, biochemical and structural characterization of additional homologs is warranted. This will not only provide mechanistic insight into catalysis but also enable sequence-based predictions of carrier-protein requirement and regioselectivity, all of which are open questions for these new enzymes.
Additional file 1: Table S1. Accession numbers of cylC homologs and aurF genes used for primer design. Figure S1. (a) Phylogenetic tree (FastTree GTR with a rate of 100) of cylC homologs highlighted according to the groups selected for degenerate primer design. (b) Schematic representation of the different pairs of degenerate primers. Figure S2. PCR-based detection of cylC homologs in the LEGEcc. Five pairs of primers were designed based on conserved regions identified in the cylC gene. Each primer pair was used in a PCR screen of the gDNA obtained from diverse strains (n = 326) of the LEGEcc. The resulting amplicons were cloned and sequenced. Sequences for each primer pair were aligned with the corresponding regions of cylC genes found in the NCBI reference genomes (cyanobacteria only) and those from LEGEcc strains' genomes. Shown are the resulting cladograms (RaxML, 1000 replicates) for each primer pair used in the screening. Blue squares indicate sequences obtained from the PCR screen. Figure S3. RaxML cladogram (1000 replicates) of the 16S rRNA gene of LEGEcc strains (grey squares) and from cyanobacterial strains with NCBI-deposited reference genomes, screened in this study. Taxonomy is presented at the order level (colored ranges). Strains whose genomes encode CylC homologs are denoted by black squares. Green squares indicate that at least one CylC homolog was detected by PCR-screening and verified by retrieving the sequence of the corresponding amplicon through cloning followed by Sanger sequencing. The cladogram topology is the same as shown in Fig. 3 of the main manuscript, but here bootstrap values (equal to or above 0.7) are shown. Table S2. GenBank or RefSeq assembly accession number and LEGEcc genome used for CORASON analysis. Table S3. BLASTp search of CylC homologs against Aliterella sp., Chroococcidiopsis sp. and Gloeobacter sp. Figure S4. Rieske-containing biosynthetic gene clusters encoding CylC homolog(s). Figure S5. PriA-containing biosynthetic gene clusters encoding CylC homolog(s). Figure S6. Cytochrome P450/sulfotransferase-containing biosynthetic gene cluster encoding a CylC homolog. Figure S7. Type I PKS (chlorosphaerolactylate/columbamide/microginin/puwainaphycin-like) biosynthetic gene clusters encoding CylC homolog(s). Figure S8. Dialkylresorcinol biosynthetic gene clusters encoding CylC homolog(s). Figure S9. Type III PKS biosynthetic gene clusters encoding CylC homolog(s). Figure S10. Nitronate monooxygenase-containing biosynthetic gene clusters encoding a CylC homolog. Figure S11. Unclassified (likely incomplete) biosynthetic gene clusters encoding a CylC homolog. Table S4. BLAST search of Rieske-containing BGC genes from Calothrix brevissima NIES 22 against Synechocystis sp. PCC 6803. Figure S12. Phylogenetic tree of FAD-dependent halogenases based on CORASON outputs with illustrative BGC architectures. Figure S13. Phylogenetic tree of nonheme iron-dependent halogenases based on CORASON outputs with illustrative BGC architectures. Figure S14. Phylogenetic tree of dimetal-carboxylate halogenases based on CORASON outputs with illustrative BGC architectures.