Skip to main content

Assembling highly repetitive Xanthomonas TALomes using Oxford Nanopore sequencing

Abstract

Background

Most plant-pathogenic Xanthomonas bacteria harbor transcription activator-like effector (TALE) genes, which function as transcriptional activators of host plant genes and support infection. The entire repertoire of up to 29 TALE genes of a Xanthomonas strain is also referred to as TALome. The DNA-binding domain of TALEs is comprised of highly conserved repeats and TALE genes often occur in gene clusters, which precludes the assembly of TALE-carrying Xanthomonas genomes based on standard sequencing approaches.

Results

Here, we report the successful assembly of the 5 Mbp genomes of five Xanthomonas strains from Oxford Nanopore Technologies (ONT) sequencing data. For one of these strains, Xanthomonas oryzae pv. oryzae (Xoo) PXO35, we illustrate why Illumina short reads and longer PacBio reads are insufficient to fully resolve the genome. While ONT reads are perfectly suited to yield highly contiguous genomes, they suffer from a specific error profile within homopolymers. To still yield complete and correct TALomes from ONT assemblies, we present a computational correction pipeline specifically tailored to TALE genes, which yields at least comparable accuracy as Illumina-based polishing. We further systematically assess the ONT-based pipeline for its multiplexing capacity and find that, combined with computational correction, the complete TALome of Xoo PXO35 could have been reconstructed from less than 20,000 ONT reads.

Conclusions

Our results indicate that multiplexed ONT sequencing combined with a computational correction of TALE genes constitutes a highly capable tool for characterizing the TALomes of huge collections of Xanthomonas strains in the future.

Peer Review reports

Background

Xanthomonas oryzae plant-pathogenic bacteria cause highly devastating diseases of the major food crop rice. X. oryzae pv. oryzae (Xoo) and X. oryzae pv. oryzicola (Xoc) are two related groups of these pathogens which differ by their mode of infection. While Xoo spreads via the vascular systems, Xoc predominantly infects the parenchyma tissue of the rice leaves [1]. The virulence of both pathovars is dependent on a type-III secretion system which injects a collection of different effector proteins directly into plant cells to manipulate and reprogram these to support the bacterial infection [2].

Transcription activator-like effectors (TALEs) constitute the largest effector family in X. oryzae with up to 19 (Xoo) and 29 (Xoc) different TALEs within one strain [3,4,5]. TALEs function inside the plant nucleus as transcription factors to induce transcription of specific target genes [6, 7]. The outstanding feature of TALEs is their modular DNA-binding domain, consisting of a variable number of 34 amino acid-repeats with each repeat facilitating binding to one DNA base in a sequential fashion [8, 9]. The amino acid 13 in each repeat directly contacts the base in the leading strand of the double-stranded DNA and the combinations of amino acids 12 and 13 in each repeat form repeat-variable diresidues (RVD), which determine the binding preference of an RVD and, consequently, the DNA-binding specificity of each TALE protein (Fig. 1A) [8,9,10,11]. TALEs interact with subunits of the transcription initiation factor IIA [12] and initiate transcription via a C-terminal acidic activation domain in a somewhat defined distance downstream of the TALE binding site [13, 14]. Accordingly, bioinformatic prediction of TALE target sites in plant promoters in combination with gene expression analysis of host plant genes after Xanthomonas infection is the standard procedure to identify potential virulence targets of TALEs [4, 7, 15, 16].

Fig. 1
figure 1

(A) Schematic representation of the structure of a TAL effector with four noncanonical repeats (NCR, light red), tandem repeats (red), nuclear localisation signals (NLS) and acidic activation domain (AAD). The 34 amino acids of each repeat are highly conserved, excluding the amino acids at position 12 and 13, which are referred to as repeat variable diresidue (RVD). (B) A magnetic bead-based method to isolate high molecular weight genomic DNA for ONT sequencing

Different strains of Xoo, Xoc and other Xanthomonads carry individual repertoires of TALEs which evolve by base substitution in RVD codons, recombination between TALEs and deletion of repeats within a TALE [17,18,19,20]. TALEs and other effector genes in Xanthomonas species are often flanked by 62 bp inverted repeats [21] suggesting that they can be mobilized by transposition [22]. The structures of TALE gene clusters in Xoc additionally suggest an integron-like mechanism for the reassortment of TALEs [18]. Functional TALEs contain between 7.5 and 33.5 repeats [6] and this repetitiveness precludes the correct assembly of TALE genes using short read-based next generation sequencing techniques. Accordingly, assemblies of Xanthomonas genome sequences following Illumina sequencing typically exclude the TALEs [23]. The first complete Xoo [24] and Xoc [25] genome sequences including TALEs were produced using Sanger sequencing on subcloned genomic fragments, but this approach is now outdated as it is time-consuming and not feasible for high throughput. Shortcuts to determine the set of TALEs (e.g., the TALome) within a strain used either direct subcloning of conserved restriction fragments [26, 27], Gibson assembly of these fragments [28] or PCR to amplify repeat regions of TALEs and subsequently subcloning and sequencing those [29]. However, these approaches omit the information how TALEs are organized in clusters and risk to overlook individual TALEs.

The current gold standard for sequencing Xanthomonas genomes including TALE genes is PacBio sequencing, which can produce high-quality, circular, single contig genome assemblies [4, 17, 19, 30,31,32]. Nevertheless, this technique requires complex protocols and expensive platforms compared to standard short-read technologies. Furthermore, some Xoc TALE gene clusters contain ten consecutive TALE genes, reaching a length of 40.7 kbp (Xoc CFBP2286) [3] which corresponds to only the top five percent of the typical PacBio read length [33]. Also, genomes of Xanthomonas (as well as those of other plant pathogen bacteria) contain high numbers (up to ten percent) of mobile elements including insertion sequences (IS), transposons and phage-related genes which trigger frequent genomic rearrangements [22, 34, 35]. Duplication of TALE clusters as well as rearrangements of TALEs between clusters by inversions can lead to repetitive genomic structures that are not easily resolved.

In general, decoding the correct arrangement of repetitive elements in genomes has been a long-standing challenge for sequencing techniques. A notorious example for this are the centromeric and telomeric regions of eukaryotic genomes [36]. The ultra long reads (> 100 kbp) of single DNA molecules obtained from Oxford Nanopore Technologies (ONT) sequencing now finally provide a solid approach to this problem, although the high error rate of 5–15 percent is currently still a challenge [37]. Combining these long reads with circular PacBio HiFi reads or Illumina reads can significantly increase the read accuracy while solving the genome structure at the same time [38,39,40,41].

Here, we report the establishment of an ONT sequencing approach for Xoo and Xoc, and a specific computational read assembly pipeline including an in silico correction method to increase the accuracy of TALE gene assemblies. With this, we were able to solve the arrangement of a particularly complex genomic TALE cluster in Xoo strain PXO35, which was not possible for us based on PacBio sequencing data. Contiguity and genome completeness were evaluated for assemblies based on Illumina, PacBio and ONT sequencing data, and based on hybrid approaches, respectively. The possibility to multiplex this approach will enable sequencing and characterization of TALomes of multiple Xanthomonas strains in parallel on a single ONT flow cell, and will allow the identification of evolutionary events within Xoo and Xoc populations.

Methods

Bacterial growths

Xanthomonas oryzae pv. oryzae (Xoo) strains PXO35, FXO38 and Huang604 and Xanthomonas oryzae pv. oryzicola (Xoc) strains BAI35 and MAI23 were cultivated in PSA medium at 28 \(^{\circ }\)C.

DNA extraction and library preparation

Illumina

Bacteria were washed in 1M NaCl and genomic DNA was isolated using the Wizard® genomic DNA purification kit (Promega, Charbonnières, France).

PacBio

Genomic DNA from Xoo PXO35 was isolated using the DNeasy Blood & Tissue kit with pretreatment for Gram-negative bacteria according to the protocol from QIAGEN.

Production of Magnetic Nano Particles

Magnetic Nano Particles (MNPs) were made and silica-coated according to Oberacker et al. (2019) [42].

gDNA Extraction from Xanthomonas oryzae using Magnetic Nano Particles

In the following, we describe the protocol for gDNA extraction using magnetic nano particles (MNPs), which is also illustrated in Fig. 1B. Additional details about the preparation of the beads solution is provided as Supplementary Methods.

First, 4 ml Xoo/Xoc overnight culture were collected by centrifugation (5 min, 10,000 g) in a 2 ml reaction tube. The pellet was washed by resuspension in 1 ml of 10 mM MgCl2 and pelleted again by centrifugation. The pellet was then resuspended in 900 \(\mu l\) pre-heated lysis buffer (0.5 M NaCl; 100 mM Tris-HCl, pH 8.0, 50 mM EDTA; 1.5 % SDS; 65 \(^{\circ }\)C) by pipetting up and down. The mixture was incubated at 65 \(^{\circ }\)C for 20 min with shaking of 500 rpm. After 10 min the tube was inverted for 10 times. 4 \(\mu l\) RNase A (10 mg/ml; DNase free) was added and incubated for 10 min at room temperature. After RNA digestion, 300 \(\mu l\) potassium acetate (5 M) was added and mixed by pipetting up and down with a 1,000 \(\mu l\) pipet tip, which was cut 0.5 cm from the tip end. For pelleting the insoluble compounds, the mixture was centrifuged for 10 min at 20,000 g. 800 \(\mu l\) of the supernatant was transferred to a new 2.0 ml tube, mixed by inverting with 800 \(\mu l\) of well homogenized beads solution and incubated for 1 h with gentle shaking. Tubes were then placed on a magnetic rack until all beads were stuck to the tube wall. The supernatant was discarded and the beads were washed in 70 % EtOH for 3 times, by adding the EtOH to the tube with the beads, and rotating the tube for 180\(^{\circ }\) while it stays in a magnetic rack. After a second rotation for 180\(^{\circ }\) and a complete binding of the beads to the tube wall, the EtOH was removed by pipetting. After washing, remaining EtOH was removed by incubation for 5 min at 37 \(^{\circ }\)C with an open lid. To elute the DNA from the beads, beads were resuspended in 100 \(\mu l\) 10 mM Tris-HCl, pH 8.5 and incubated at 37 \(^{\circ }\)C with shaking of 500 rpm for 60 min. Using a magnetic rack, the supernatant containing DNA was seperated from the MNPs into a new 1.5 ml tube. By adding 1/10 Volume of 5 M potassium acetate and 2 volumes of EtOH, a precipitation of the DNA was performed. After an incubation of 30 min at room temperature, the precipitated DNA was centrifuged for 10 min at 20,000 g. The supernatant was removed and the pellet was washed with 70 % EtOH for two times. After washing, the remaining EtOH was removed by incubation at 37 \(^{\circ }\)C for 5 min with an open lid. The DNA was resuspended in 50 \(\mu l\) 10 mM Tris-HCl (pH 8.5) by shaking (500 rpm) for 1 hour at 37 \(^{\circ }\)C.

gDNA Quality Control

To ensure the integrity and determine the overall size of the high molecular weight genomic DNA, purified samples were loaded on a 0.5% agarose gel and compared to a 1 kb extend marker from New England Biolabs. The size of genomic DNA prepared with MNPs is significantly larger than the one prepared using spin columns (Blood and Tissue culture kit, QIAGEN; Supplementary Fig. S1). Purity and concentration of the DNA was measured using a NanoDrop One and a Qubit (dsDNA HS) from Thermo Fischer Scientific.

Sequencing

Illumina sequencing of Xoo PXO35

Strain PXO35 was sequenced using the Illumina HiSeq 2500 platform (Fasteris SA, Switzerland). The shotgun sequencing yielded 2,978,197 100-bp paired-end reads (596 Mb), with insert sizes ranging from of 250 bp to 1.5 kb.

PacBio sequencing of Xoo PXO35

Libraries for Xoo PXO35 were sequenced in two Pacific Biosciences SMRT cells as described previously [4], yielding a total of 143,946 continuous long reads (CLRs) with an N50 read length of 21,760 bp.

Oxford Nanopore

Library was made using Oxford Nanopore Technologies protocol: Native barcoding genomic DNA with EXP-NBD104 and SQK-LSK109 kit. Manufacturer’s instructions were followed with slight changes. 700 ng of each end-prepped sample was barcoded with barcode (1-FXO38, 2-PXO35, 3-MAI23, 4-Huang604, 5-BAI35). Incubation time for end-prep reaction was prolonged to total of 30 minutes, and elution was 5 minutes on 37 degree. 150 ng of each barcoded sample was pooled together (total 750ng) for adapter ligation reaction, and here also incubation was prolonged to 15 minutes, and elution to 20 minutes on 37 degree to improve the recovery of the long fragments. Total of 476 ng library was loaded without loading beads to R9.4 flow cell. Sequencing was performed for all samples (5 isolates) on a MinION Mk1B device and basecalled with Guppy v4.0.11 and the dna_r9.4.1_450bps_modbases_dam-dcm-cpg_hac.cfg model. Parameters were specified for an increased chunk size of 10000 (--chunk_size) and to enable barcode demultiplexing according to the used kit (--barcode_kits).

General statistics of all sequenced libraries are given in Table 1. Although the N50 read lengths of the PacBio and ONT libraries are similar, the ONT library contains a large fraction of very long reads that are less present in the PacBio data. On the contrary, about half of the ONT reads but only approx. 5% of the PacBio reads are shorter than 1 kbp (cf. Supplementary Table A, Supplementary Fig. S2).

Table 1 Sequencing statistics of libraries

Genome assembly

We used matching assembly tools to generate genome assemblies of reads obtained from the different sequencing methods. We assembled ONT reads of Xoo PXO35, FXO38 and Huang604 and Xoc BAI35 and MAI23 each with Flye (v2.8.3) [43] using parameters “--plasmids -meta --min-overlap 5000”. For the initial assembly, we filtered reads using Filtlong (v0.2.0, https://github.com/rrwick/Filtlong) with parameter “--target_bases 500000000” to obtain a subset of high-quality reads. Mapping of the complete set of ONT reads to the respective assemblies was performed with minimap2 (v2.17-r941) [44]. Assemblies were initially polished using these long-read-mappings and one round of Racon (v1.4.13) [45] followed by medaka (v1.2.6, https://github.com/nanoporetech/medaka).

For all strains, we additionally ran Flye without the “--min-overlap” parameter and proceeded with the assembly with the best contiguity. Only in case of MAI23 the latter works best obtaining 1 instead of 2 contigs.

The origin of replication (ORI) was determined based on the proximity to dnaA, dnaN and gyrB genes and on the G-C content along the genome. The assemblies were rearranged according to their ORI to compare them using the genomic alignement viewer Mauve 2.4.0 [46, 47]. In general, the circular chromosomal contig generated by Flye had sequence sniplets or duplications at the circularisation site, which were not covered by mapped reads.

However, this could be solved by further polishing rounds with medaka: For FXO38 we polished two rounds with medaka and for Huang604 three rounds. For all other strains one polishing round with medaka already solved circularization problems.

For Xoo PXO35, we generated genomic Illumina and PacBio sequencing data in addition to the ONT data described previously. For the assembly of Illumina reads, we used MaSuRCA genome assembler version 4.0.3 [48] with default settings.

For the assembly of PacBio reads, we used HGAP4 of SMRT LINK (v.7.0.1) [49] with default settings. We additionally generated an assembly of the PacBio reads using Flye in analogy to the ONT-based assembly. This assembly was similar in structure to the HGAP4 assembly. However, a lower number of TALE genes could be detected from the Flye assembly than from the HGAP4 assembly based on the same PacBio data and, hence, this assembly will not be considered in the remainder of this manuscript.

For the hybrid assembly of Illumina and PacBio reads, we also used MaSuRCA genome assembler version 4.0.3 [48] with default settings. For the hybrid assembly of Illumina and ONT reads, we used Unicycler version v0.4.8 [50] with default settings.

Polishing

Illumina-based

For the ONT-based assembly (after initial long-read polishing) and the PacBio-based assembly, we investigated the effect of polishing with Illumina reads using Pilon (v1.24) [51]. For this purpose, the reads were mapped to the obtained contigs with BWA-MEM (v0.7.17-r1188) [52].

We also polished the pure Illumina assembly generated with MaSuRCA with Pilon and BWA-MEM mapped Illumina reads.

PacBio-based

To polish the ONT-based assembly with PacBio reads, we use the Resequencing application from SMRT LINK (v.7.0.1) with default parameters and pbmm2 for the mapping, and choose Arrow as the polishing algorithm.

The PacBio assembly resulting from HGAP4 is polished with PacBio reads in the same manner.

Homopolish

Often, Homopolish [53] is used for polishing an ONT-based assembly if additional PacBio or illumina-reads are not available. It polishes the assembly based on homologous sequences. We polish the assembly of Xoo PXO35 with Homopolish using the online available homologous sequence data set for bacteria (http://bioinfo.cs.ccu.edu.tw/bioinfo/mash_sketches/bacteria.msh.gz) with parameters “-s bacteria.msh -m R9.4.pkl” and in a second trial with Xanthomonas oryzae as NCBI taxonomic name with parameters “-g Xanthomonas_oryzae -m R9.4.pkl”.

Comparison of assemblies

To suggest a suitable sequencing method for TALE-containing Xanthomonas genomes, we compare the PXO35 assemblies obtained with the different sequencing and assembly methods mentioned above based on the number of contigs, contig N50, BUSCO completeness, as well as class and number of containing TALEs. A suitable candidate contains few contigs, a high contig N50, a high BUSCO completeness and many TALEs belonging to known TALE classes. In addition, we compare the assembly candidates based on a genomic alignment. Furthermore, we compare the ONT-sequenced Xoo and Xoc strains with each other based on the above mentioned features and additionally include two known representatives from Xoo and Xoc in the comparison of TALomes.

Busco genome completeness

To assess the completeness of the assemblies with regard to the genes they contain, we used BUSCO 5.1.3 [54], which uses universal single-copy orthologs to quantify completeness. We restricted the calculation to the orthologs of the “xanthomonadales_odb10” dataset, which contains 1,152 BUSCO markers. We used BUSCO in genome mode with the parameter “-m genome” and report the proportion of complete BUSCOs of each assembly.

Alignment using progressiveMauve

To visualise similarities and differences between the different PXO35 assemblies on a genomic level, we created a genomic alignment of all PXO35 assemblies with progressiveMauve 2.4.0 [47]. To compare the newly sequenced and assembled strains of the Xoo and the Xoc pathovar, we generate corresponding genomic alignments using progessiveMauve to investigate similarities and rearrangements between the genomes.

TALE annotation using AnnoTALE

For identification and analysis of TAL effectors on the assembled genomes we use the application suite AnnoTALE (v1.5) [4]. We use the tool ’TALE Prediction’ and ’TALE Analysis’ for the detection of TALEs within the sequenced genomes and for determining the RVD composition of TALEs. In addition to the complete TALE DNA sequences, we receive a FastA file with TALE sequences split into the parts N-terminus, individual repeats and C-terminus.

Assignment to TALE classes

To assign predicted TALEs to existing TALE classes or to a new class, we use the AnnoTALE tool ’TALE Class Assignment’ and obtain systematic TALE names due to the nomenclature based on the RVD sequences of the TALEs as introduced with AnnoTALE [4].

Comparison of annotated TALEs

For the different assemblies we compare the detected TALEs, based on the number of complete TALEs, iTALEs, pseudo genes and classes. We consider iTALEs separately, as they contain a truncated C-terminus without activation domain and have a distinct role in pathogenicity by interfering with the disease resistance of the plant [55, 56].

Comparison to other strains

With the AnnoTALE tool ’TALE Class Presence’ one can visualize the TALE class composition of the TALomes of different strains. As a reference for the 5 Xanthomonas strains sequenced with ONT, we choose one known representative for each Xanthomonas oryzae pathovar: Xoo PXO99A and Xoc BLS256.

Computational correction of TALEs in ONT assembly

Using Oxford Nanopore’s sequencing technology to sequence TALE-containing genomes offers the advantage of long reads covering entire TALE clusters, facilitating assembly into a complete genome. However, accuracy in long homopolymer regions is low, even in case of high coverage [57]. In case of low coverage, the lower read accuracy of long-read technologies can also cause a problem in non-homopolymeric regions. We present a method for the computational correction of such sequencing errors within TALE sequences from ONT assemblies. To this end, we learned three separate profile hidden Markov models (HMMs) with HMMER (v3.3.2) [58] based on known TALE sequences from AnnoTALE (v1.5). The first HMM represents the N-terminus, the second one individual repeats, and the third one the C-terminus of a TALE. We trained each HMM separately for Xoo and Xoc TALEs. We then searched for corresponding TALE parts within each ONT assembly using nhmmer [58]. We only considered hits with an average posterior probability (acc) above 0.75 and a bit score above 200 in case of N- and C-terminus and above 45 in case of a repeat hit. We searched for insertions and deletions in comparison to the consensus of the HMMs. Only continuous stretches of insertions are considered, which are at most 3 bp (N- and C-terminus) or 2 bp (repeats) long. Within repeats, the di-codon coding for the RVD is excluded from the correction as i) this di-codon is hyper-variable and, hence, the consensus of the HMM is rather meaningless and ii) otherwise RVDs ’N*’ with the 13th amino acid missing might be corrected erroneously. In case of an insertion, we distinguish between two cases: (i) the insertion occurs within a homopolymer context with at least five identical nucleotides, which results in the insertion of the homopolymer nucleotide, or (ii) otherwise, the corresponding consensus nucleotide of the HMM is inserted. Inserted bases are lower-cased in the corrected TALE assembly to make the correction identifiable in downstream analyses. As an optional step, we aim to improve correction with a custom substitution polishing. First, we map the ONT reads against the TALE-corrected assembly with minimap2 (v2.17-r941) [44] and then adapt the previous correction if another base is 1.5 times as frequent as the base chosen from the HMM consensus. Here, we count occurrence of each base with igvtools count (v2.5.3) [59].

Testing multiplexing capacity by sub-sampling

We investigate, which coverage is sufficient to decode the TALome of Xoo PXO35 and, hence, how many Xanthomonas strains could be multiplexed on a single ONT flow cell, by sub-sampling from the ONT reads. For this purpose we randomly sample 75%, 50%, 25%, 12.5%, 6.25% and 3.12% of the original ONT reads with seqtk sample https://github.com/lh3/seqtk. For each of the generated sub-samples, we generate an assembly with Flye, which we further polish with Racon and medaka as described above for the complete set of ONT reads. Subsequently, we apply the computational correction of TALE genes proposed in this manuscript including custom substitution polishing and finally use AnnoTALE for TALE annotation.

Availability

The scripts for computationally correcting TALE genes within ONT-based assemblies of Xoo and Xoc genomes are available from github at https://github.com/Jstacs/Jstacs/tree/master/projects/talecorrect. All assemblies of the Xoo PXO35 genome considered in Table 3 and the ONT-based assemblies of the remaining strains (cf. Table 5) are available from zenodo (https://doi.org/10.5281/zenodo.6854866, doi: 10.5281/zenodo.6854866). Raw reads are available from the European Nucleotide Archive (https://www.ebi.ac.uk/ena) under study accession PRJEB54678.

Results and discussion

Xoo PXO35 – A genome with highly repetitive TALE genes only fully assembled from ONT reads

For Xoo PXO35, we generated reads from three alternative sequencing technologies, namely Illumina short reads, and PacBio and ONT long reads, each with their specific limitations and error profiles. This gives us the opportunity to systematically assess the assembly quality and contiguity achieved based on each of the technologies, or using hybrid and polishing approaches applied to combinations of these. Here, we especially focus on the correct assembly of TALE genes within their genomic context. TALE genes are notoriously difficult to sequence because of the repetitive structure of their DNA-binding domain (Fig. 2), which is composed of typically 34 amino acid long tandem repeats, corresponding to 102 bp on the DNA level. In addition, the DNA sequence coding for the N-terminal and C-terminal domains is highly conserved among TALE genes as well. Within the genomic sequence, TALEs are often organized in clusters of several TALE genes separated by short linker sequences [18], which further complicates a contiguous assembly, even from longer reads.

Fig. 2
figure 2

Schematic illustration of assembly problems based on read length. (A) Using short Illumina reads, the assembly of a single TALE may be ambiguous due to highly conserved tandem repeats, which often differ only in their RVD di-codon indicated by coloured boxes. Starting from the same set of Illumina reads, two alternative assemblies are consistent with the reads. Using long read technologies, however, the bottom assembly may be identified as the correct one. (B) Even using longer reads spanning complete TALEs, the organization and order of TALE within clusters may be ambiguous due to highly conserved N-terminus, C-terminus and linker sequence between TALEs. Here, very long ONT reads spanning complete TALE clusters are capable of resolving the arrangement of TALEs in large clusters

In the following, we first compare assemblies based on reads from individual sequencing technologies and combinations thereof using general statistics of assembly contiguity and completeness. We then compare the assembly on a broader scale using genomic alignments with a focus on the placement and completeness of TALE clusters. Finally, we compare the TALE annotations on these assemblies with regard to the number of reconstructed TALEs and their assignment into TALE classes.

Comparison of assemblies by general statistics

For the purpose of comparing the different basic sequencing methods, we first assembled the genome of Xoo PXO35 in a non-hybrid manner from either only Illumina, only PacBio, or only ONT reads. For the assembly from Illumina reads only, we used the MaSuRCA method which is a hybrid approach that allows variable read lengths and significant sequencing errors [48]. For the assembly based solely on short Illumina reads, we obtain a very large number of contigs (314) with a relative low N50 weighted median contig length of 26,708 bp. To assess the assembly with regard to the genes they contain, we used BUSCO [54], which relies on single-copy orthologs to quantify completeness. Applying the 1,152 BUSCO markers of the “xanthomonadales_odb10” dataset, the Illumina assembly still had a decent Busco score indicating that despite the low contiguity, many genes are still fully covered by a contig (cf. Table 2).

PacBio reads were de novo assembled using the classical hierarchical genome-assembly process 4 (HGAP4) pipeline (SMRT LINK v.7.0.1) [49]. Even the much longer PacBio reads did not succeed in closing a complete chromosome, resulting in five contigs. Two of these correspond to two plasmid sequences whereas the chromosomal sequence is fragmented into three contigs.

Here, contig N50 is substantially larger than for the Illumina-based assembly (4,203,214 bp) and a large fraction of the approx. 5 Mbp chromosome is covered by the largest contig. Busco completeness is on the same level as observed for the Illumina-based assembly. Polishing the PacBio-based assembly with Illumina reads using pilon only mildly affects contig N50 and leads to a small increase in Busco completeness.

Table 2 General statistics of genome assemblies of PXO35 using different combinations of input reads, assembly tools and polishing methods including computational correction of TALE genes (comp. correction), listing the number of contigs (#contigs), additional contigs that are likely related to plasmid sequences (#plasmid contigs), the contig N50 value, and Busco completeness per assembly

The assembly applying Flye to the ONT reads results in a completely contiguous genome of Xoo PXO35 comprising one chromosomal and two plasmid contigs. One plasmid contig has high similarity to a plasmid of Xoo CIAT and Xoo BXO1 and the other plasmid contig has high similarity to a plasmid of Xoo IX-280. The contig N50 corresponds to the length of the complete chromosomal sequence. Due to the specific error profile of ONT reads resulting in a substantial number of insertions and deletions, Busco completeness is clearly lower than for the Illumina-based and PacBio-based assemblies. Since the computational correction proposed in this paper only affects the TALE genes, which are not covered by the Busco gene set, it does not influence Busco completeness. By contrast, polishing this assembly with pilon using Illumina reads recovers Busco completeness to a similar range as for the previous two sequencing technologies. Similar results are also obtained by polishing with Arrow using PacBio reads, and by polishing with Homopolish using either bacteria or X. oryzae as the reference.

The hybrid assembly by MaSuRCA jointly considering PacBio and Illumina reads leads to a larger number of contigs and a lower Busco completeness than using only PacBio reads in the HGAP4 pipeline. The hybrid assembly by Unicycler jointly considering ONT and Illumina reads, however, yields a similar quality of the assembly as the ONT-based assembly with Illumina polishing.

In summary, we conclude that ONT reads are essential to assemble a fully contiguous Xoo PXO35 genome. Combined with Illumina short reads, either in a polishing step or in a hybrid assembly approach, this also yields a competitive Busco completeness compared with the alternative options.

Comparison of assemblies by genomic alignments

We further compare the different assemblies by genomic alignments using progressiveMauve to obtain an overview of their concordance in the arrangement of larger genomic regions. We find that – as expected – all assemblies using ONT reads are in large agreement (Fig. 3). Here, the different polishing or assembly strategies lead to local differences in the assembled sequence, whereas genomic blocks (indicated by colour) stay in the same order and strand orientation. The general placement of TALE clusters is also identical between all four assemblies. The two PacBio-based assemblies (with and without pilon polishing) may be aligned well to the ONT assemblies, but two contig borders (at approx. 1,230 kbp and approx. 2,000 kbp) are located within TALE clusters according to the ONT-based assemblies. By contrast, for the hybrid assembly based on PacBio and Illumina reads we may visually perceive the large number of contigs with frequently changing strand orientations in the alignment. Again, many of the contig borders are placed within TALE clusters. The solely Illumina-based assembly, finally, results in a highly fragmented assembly, which is lacking the majority of TALE clusters. We provide an archive containing the progressiveMauve alignment file for further exploration at zenodo (doi: 10.5281/zenodo.7123417, https://doi.org/10.5281/zenodo.7123416).

Fig. 3
figure 3

Complete alignment of different PXO35 assemblies using progressiveMauve (c.c: computational correction; i.p. Illumina (pilon); p.a. PacBio (Arrow)). Contig borders are marked by red vertical lines. All ONT-based assemblies yield one contiguous chromosome and two plasmid sequences, whereas PacBio-based assemblies are split into multiple, long contigs with TALE clusters located at the contig borders. The hybrid and Illumina-based assemblies are characterized by a large number of smaller contigs. The location of TALE genes is indicated by boxes below the asssembly representation

The genomic alignments of the different assemblies show that major differences between the ONT-based and the PacBio-based assemblies (and also the Illumina-based assembly) lie within large TALE clusters (Supplementary Fig. S3). To gain further insights into these differences, we consider the region around the first contig border within the PacBio-based assembly in Fig. 4. From the ONT-based assembly, we observe that this cluster contains seven TALEs from different TALE classes. The PacBio-based assembly correctly reconstructed the first two (BJ, AE) and the last two (AQ, AP) of these TALEs, while the remaining TALEs are located on either border of the two contigs and fail to be annotated and classified by AnnoTALE. Finally, the complete TALE cluster is missing from the Illumina-based assembly, where the end of a first contig is placed upstream, while the start of the following contig is located already downstream of this TALE cluster. However, it might be still questionable if the ONT-based assembly of this TALE cluster is indeed correct. Hence, we further consider ONT and PacBio reads mapped to the ONT-based assembly in Supplementary Fig. S4. We find that a large number of long ONT reads spans the complete TALE cluster, which provides further confidence in this assembly result. PacBio reads map also well to this region, although these had not been sufficient to reconstruct this TALE cluster initially. In general, each of the TALE clusters or single TALEs within the genome is fully covered by a sufficient number of ONT reads, which increases confidence in their correct assembly (Supplementary Table D). The second contig border within the PacBio-based assembly (at approx. 2,000 kbp) is also located within a large TALE cluster and we further consider it in Supplementary Fig. S5. The cluster observed in the ONT assembly contains TalAB, TalAL and TalBA on the forward strand as well as TalAC, TalAS and TalAG on the reverse strand. However, in the PacBio-only assembly the TALEs on the forward strand are TalAB, a truncated TalAL and on the reverse strand AS and AG, which means that the two TALEs TalBA and TalAC in the center of the large cluster could not be assembled correctly. In order to further investigate the correctness of this ONT-based TALE cluster, we have considered the mapped ONT and PacBio reads in Supplementary Fig. S6, and find that this region is well covered by long ONT reads as well as the shorter PacBio reads. In summary, our results indicate that both long-read techniques are generally suited to assemble Xanthomonas genomes containing (clusters of) TALE genes as shown previously [4, 16]. However, in case of very large TALE clusters, as in the first and second example with a cluster of seven and six TALEs, respectively, spanning 35 kbp, the repetitive nature and high conservation of TALE genes may require typically longer ONT reads for a complete assembly result. Nevertheless, PacBio reads longer than 35 kbp were produced in our hands (Supplementary Fig. S2) and it is possible that an optimized PacBio run could resolve such large clusters.

Fig. 4
figure 4

A large cluster of repetitive TALE genes in Xoo PXO35 at approx. 1,220 kbp prevents a contiguous assembly. Contig borders are marked by red vertical lines. In the ONT-based assembly, seven TALEs are present in this cluster (boxes), whereas only four of them could be reconstructed from the PacBio-based assembly and none in the Illumina-based assembly. The black vertical rectangles represent the homologous position between the three assemblies

Table 3 Comparison of PXO35 assemblies on level of TALEs. Per assembly, we record the total number of TALE genes (#TALEs), the number of iTALEs with a truncated C-terminus (#iTALEs), the number of TALE pseudo genes (#pseudo genes), and the number of TALE classes (#classes)

ONT sequencing entails specific problems with regard to insertions and deletions, especially in homopolymer regions [60]. Such sequencing errors may also be observed within TALE genes, which are in the focus of this manuscript. With the assemblies of the Xoo PXO35 chromosome polished by either Illumina or PacBio reads, we get insights into the prevalence of these sequencing errors and also derive and test a computational method for their correction in ONT-only assemblies.

In Fig. 5, we present a cartoon of the genomic sequence of one TALE (TalBJ8) from Xoo PXO35. Here, we assume that the Illumina-polished and PacBio-polished variants of this genomic region have the correct base sequence. In addition, this TALE is located in the portion of the large TALE cluster (cf. Fig. 4) that could be correctly assembled from PacBio reads alone. Hence, we have three reference sequences against which we may compare the sequence of the ONT-based assembly. Here, we find four homopolymer stretches, two in the sequence coding for the N-terminus and two in the sequence coding for the C-terminus that appear to be affected by deletions in the ONT-based assembly. Notably, these deletions occur despite the very high coverage with ONT reads (\(\sim\) 662 X), and are unlikely to be resolved by even deeper sequencing.

Fig. 5
figure 5

Typical errors in homopolymers for ONT sequencing and how these may be corrected. The initial ONT-based assembly is lacking nucleotides in homopolymer stretches, which may be corrected computationally by comparison to the corresponding consensus from the HMM. The resulting corrected sequence is in accordance with the genome version after Illumina-based or PacBio-based polishing of the initial ONT-based assembly

Here, we propose a computational correction method for ONT-only assemblies based on hidden Markov models learned from the sequences of N-terminus, C-terminus, and individual repeats of known Xoo or Xoc TALEs. In all four cases, we observe that the consensus of the corresponding HMM as aligned using nhmmer matches the sequence of the three reference sequences at the deletion sites. As such errors predominantly appear within homopolymers, we correct these by filling the deletion sites with the homopolymer nucleotide. In this case, this procedure yields the same result as simply transferring the consensus nucleotide of the HMM to a corrected version of the ONT-based sequence. Outside of homopolymer stretches we, hence, use the HMM consensus for computational correction. Optionally, this may be followed by a read-based custom substitution polishing, which is based on the ONT reads mapped back to the corrected sequence. If in those mapped reads another nucleotide appears at least 1.5 times as frequent as the previous correction choice, the nucleotide from the mapped reads is used instead.

With the computational correction at hand, we may now reconstruct complete TALomes of Xanthomonas strains based on ONT reads only without the need for additional Illumina or PacBio reads used for polishing. However, this comes at the potential risk that true deletions within TALEs are corrected by mistake. Hence, we indicate the sites of computational corrections by lower case letters to make them identifiable in downstream analyses.

Comparison of annotated TALEs

In the remainder of this section, we compare the different assemblies, including the ONT-only assembly combined with the computational correction of TALE genes, on the level of their TALomes as annotated by AnnoTALE. We find that – as already indicated by the genomic alignments – the Illumina-based assembly is able to reconstruct a single TALE gene and one further TALE pseudo gene only (Table 3). The PacBio-based assemblies with and without pilon polishing yield a substantially larger number of TALE genes, but contain a larger number of pseudo genes than might be expected. Still, polishing by pilon helps to recover two of these pseudo genes. A subset of pseudo genes is located within the large TALE cluster as indicated by the question marks in Fig. 4. The largest number of TALE genes is annotated from the ONT-based assemblies. Here, we find up to 20 TALE genes and one iTALE [55, 56]. iTALEs are truncated versions of TALEs which don’t act as gene activators, but instead block TALE recognition by specific rice resistance proteins [55, 56]. They are therefore functionally different from normal TALEs. Notably, the ONT-only assembly combined with the computational correction yields the same TALEs, iTALE, and TALE classes as the PacBio-polished variant, and the TALEs obtained from both assembly variants have completely identical repeat regions including the di-codons coding for RVDs. Based on the Illumina-polished ONT assembly, one of the TALEs (TalAS) is classified as an additional pseudo gene due to a frame shift in the N-terminal sequence. However, the repeat regions including di-codons coding for RVDs are again perfectly identical between the Illumina-polished and the computationally corrected ONT-based assembly. Hence, both alternative polishing variants (PacBio-based and Illumina-based) support the RVDs that are determined by AnnoTALE from the computationally corrected ONT assembly.

The hybrid assembly based on PacBio and Illumina reads, which already showed low contig N50 and Busco completeness in the general assessment, does not contain a single complete TALE gene and only four TALE pseudo genes. Surprisingly, also the ONT + Illumina hybrid assembly that appeared to be of high quality in the general assessment contains only TALE pseudo genes, although the number of pseudo genes indicates that the total set of TALE regions might be complete. One reason for this observation could be that the internal correction of ONT reads using Illumina reads within the Unicycler pipeline fails to adequately correct the highly conserved N-terminus and C-terminus of TALE genes, but especially the highly conserved tandem repeats including the di-codons of RVDs. To illustrate the frequency and location of differences between the computationally corrected, ONT-based assembly and the ONT + Illumina hybrid assembly, we align the TALE DNA sequences determined from the computationally corrected assembly against the genomic sequence of the hybrid assembly using blastn [61], and count the number of mismatches and InDels in TALE regions (Supplementary Table B). Indeed, the hybrid assembly frequently deviates from the ONT-based and computationally corrected assembly within N-terminal, C-terminal, and also repeat regions of all genomic TALE sequences.

Table 4 Quality of assemblies based on sub-samples of fractions of the complete set of ONT reads. In all cases, the assembly yields contiguous sequences of the chromosome and two plasmids. u.u.: uncorrected, unpolished; i.p. Illumina-polished; c.c.: computationally corrected

Typically, ONT sequencing suffers from problems within homopolymers. Homopolish may improve assemblies in homopolymeric regions based on homologous sequences. Applying Homopolish to the ONT-based assembly of Xoo PXO35 results in a reduced number of annotated TALEs compared with the other polishing/correction options due to the introduction of frame shifts and/or premature stop codons. More importantly, we observe that Homopolish may introduce modifications in the di-codon of TALE repeats and, subsequently, their RVD sequence. In the case of Xoo PXO35, this issue occurs for both Homopolish reference data sets, i.e., bacteria and X. oryzae. Supplementary Table C lists all changes within the RVDs of all TALEs of PXO35 for both Homopolish variants, while Supplementary Fig. S7 shows the effect of Homopolish in comparison to other polishing variants within the RVD sequence of TalAO. As polishing with Homopolish may lead to altered RVD sequences, we consider it unsuitable for polishing TALE genes, since a correct RVD sequence is essential for downstream analyses such as the prediction of TALE targets.

In summary, the comparison of different assembly, polishing, and correction strategies leads to the following conclusion for sequencing TALomes of Xanthomonas bacteria, which may likely also apply to similar repetitive structures within genomes: First, long ONT reads are required to yield fully contiguous genomes of strains with large TALE clusters. Second, further polishing by Illumina or PacBio reads is necessary if one is interested in a complete gene annotation. Using Homopolish as an alternative may recover many protein coding genes, but may also introduce undesirable modifications within TALE genes. Third, hybrid assembly strategies starting from an initial short-read graph may fail for conserved, repetitive regions like TALEs. Fourth, ONT-only assemblies combined with a computational correction step are sufficient to yield complete and correct TALomes of Xanthomonas bacteria. Hence, we follow the latter strategy in the remainder of this manuscript and focus on TALE genes.

Table 5 Comparison of the ONT-based assemblies of the three Xoo and the two Xoc strains on the level of TALEs. Per assembly, we record the total number of TALE genes (#TALEs), the number of iTALEs with a truncated C-terminus (#iTALEs), the number of TALE pseudo genes (#pseudo genes), and the number of TALE classes (#classes)

Multiplexing capacity of complete TALome-decoding from Xanthomonas strains using ONT sequencing

As mentioned previously, the total set of ONT reads yields a very high coverage of the approx. 5 Mbp genome of Xoo PXO35, although the true mean coverage after re-mapping the reads to the assembly is slightly lower than could be expected from the raw reads (cf. Table 1). However, it remains an open question how much coverage is indeed required to correctly reconstruct a TALome from ONT reads or, put differently, how many genomes of TALE-carrying Xanthomonas strains could be multiplexed to be sequenced on a single ONT flow cell. The rather complex organization of the Xoo PXO35 genome with its large TALE clusters and two plasmids combined with a large number of available ONT reads provides an ideal scenario to test the potential multiplexing capacity in silico.

To this end, we perform the assembly using Flye on sub-samples of different fractions of the original ONT reads, followed by Illumina-based polishing or our computational correction, respectively. Although sub-samples contain down to \(3.12\%\) of the original number of ONT reads, the N50 read length remains largely stable across all sub-samples (Supplementary Table E).

We find that the contig N50 of the resulting assemblies stays highly stable across the whole range of sub-sampling fractions (Table 4). Even from a subset of only \(3.12\%\) of the original ONT reads, Flye is able to generate a fully contiguous assembly of the chromosome and the two plasmids. Without any polishing, the Busco completeness starts to drop early, and only yields \(73.9\%\) of the Busco genes for the lowest number of reads. However, when polishing with Illumina reads, Busco completeness stays rather constant as well. As the computational correction does only affect TALE genes, Busco completeness after computational correction is identical to the uncorrected, unpolished case.

Regarding the reconstruction of TALE genes, we find that without polishing or correction, none of the TALE genes is reconstructed correctly leading to a large number of pseudo genes due to insertions or deletions. Illumina polishing of the ONT-based assembly yields an (almost) complete set of TALEs and iTALE down to a fraction of \(25\%\) of the original ONT reads. With even lower sub-sampling fractions, however, the number of correct TALEs drops, which is only partly explained by these TALE becoming pseudo genes. By contrast, the computational correction is able to correctly reconstruct all TALE genes and the iTALE even for the lowest sub-sampling fraction. Hence, it would have been possible to determine the complete TALome of Xoo PXO35 from less than 20, 000 ONT reads, which illustrates the huge multiplexing capacity of ONT sequencing combined with a computational correction of TALE genes.

TALEs and TALE classes present in strains

Having established a pipeline for assembling fully contiguous Xanthomonas genomes from ONT reads combined with a computational correction of TALE genes, we applied this pipeline to two further Xoo strains, namely FXO38 and Huang604, and to two African Xoc strains, BAI35 and MAI23.

For these remaining strains with ONT data generated in this study, each assembly leads to a single, circular chromosome with a length in the expected range of approx. 5 Mbp (Table 5). Busco completeness is in the same range as observed for Xoo PXO35 previously. Here, we observe the lowest Busco completeness for Xoc BAI35 (\(93.2\%\)) and the highest Busco completeness for Xoc MAI23 (\(96.0\%\)). The number of TALEs annotated for these genomes using AnnoTALE after computational correction of TALE genes varies between 16 (Xoo Huang604) and 20 (Xoo PXO35, Xoc BAI35 and MAI23), with up to one additional pseudo gene, and one to two iTALEs per strain.

From the genomic alignments of the assemblies of the three Xoo strains (Fig. 6A), we observe substantial rearrangements and inversions, which also affect the location and composition of TALE clusters. Compared with Xoo PXO35, TALE clusters of the other two Xoo strains contain a lower number of TALEs per cluster, where the largest cluster is composed of five (FXO38) and four (Huang604) TALE genes. The TALEs of the large cluster in Xoo PXO35 (TalBJ–TalAP) either do not occur in FXO38 and Huang604 (TalBJ, TalAL), are part of smaller clusters (TalJQ–TalAD in FXO38 and TalAE–TalAP in Huang), or are located in another cluster together with an additional TALE (TalAL–TalAD in Huang604). In all three genomes, the iTALE TalAI is located outside of clusters. There are two types of iTALEs, type A and type B, that differ in the C-terminal truncation [55, 56, 62, 63]. TalAI from PXO35 and Huang604 are type A iTALEs, whereas TalAI from FXO38 is a type B iTALE. In addition, FXO38 contains a type A iTALE with TalDQ.

Fig. 6
figure 6

Complete alignment of assemblies using progressiveMauve. (A) Alignment of the three Xoo strains PXO35, FXO38 and Huang604. (B) Alignment of the two Xoc strains BAI35 and MAI23. Each iTALE is marked with a superscript i

For the two Xoc strains (Fig. 6B), the number of rearrangements and inversions is lower with a large collinear block between approx. 1.5 Mbp and 3.7 Mbp. Within this block, also the location and composition of TALE clusters is largely conserved with one single TALE replaced (TalAZ vs. TalCI) and one switch of TALE positions within a cluster (TalBI, TalAX, TalBD vs. TaBI, TalBD, TalAX). In both strains, many TALEs are located outside of clusters as single TALEs and both have one iTALE of type B (TalCC).

We further compared the TALomes of the newly sequenced strains to those of well-known strains, namely Xoo PXO99\(^A\) and Asian Xoc BLS256 (Fig. 7). In general, we find a clear distinction between the TALE classes of the Xoo and Xoc strains. The two classes that may be present in both pathovars (TalAD and TalAF) do not occur in the two newly sequenced Xoc strains. TalAD is present in all three Xoo strains, while TalAF is missing in Xoo PXO35. Several further common Xoo TALEs are present in all three Xoo strains (TalAA, TalAB, TalAE, TalAH, TalAI, TalAO, TalAQ, TalAR, TalAS). Xoo PXO35 has many unique TALEs that are not present in the other strains (TalBJ, TalBK, TalCA) but have been observed in Xoo strains previously. By contrast, the TALome of Huang604 appears to be composed of only common TALEs.

Fig. 7
figure 7

Presence of TALE classes in 5 ONT sequenced strains compared to 2 well-known strains. The nomenclature for TALEs is based on AnnoTALE [4]. AnnoTALE groups similar TALEs according to their RVDs into classes, as indicated by the last two capital letters in their name. Accordingly, TALEs from the same class can be expected to target the same plant gene(s). To further distinguish TALEs from different strains within a class an allele number is added which is omitted here, because only TALE classes are shown

For the two Xoc strains, the situation is similar with many common Xoc TALEs (TalAT, TalAV, TalAX, TalAY, TalAZ, TalBC, TalBD, TalBE, TalBF, TalBI, TalBL, TalBO, TalBU, TalBW) but also some less frequent TALEs (TalCB, TalCC, TalCF, TalCH, TalCJ, TalCP) that have been observed in other Xoc strains but are not present in BLS256. TalCD and TalAZ are present in BAI35 but not in MAI23, while TalCI and TalCR are present in MAI23 but not in BAI35. However, TalCD, TalAZ, TalCI and TalCR have all been observed in other Xoc strains before [4]. A graphical overview of the TALEs in all fully sequenced Xoo and Xoc strains can be found in Supplementary Figs. S8 and S9, respectively.

Conclusions

We have solved the composition of the largest TALE cluster in Xoo, so far, which contains seven individual TALEs and is a unique rearrangement between cluster VIII and IX. Xoo strain PXO35 was originally isolated in 1972 in the Philippines [64] and belongs to race 1 which is characterized by a relative high diversity between strains. Although the virulence of this strain is high, this particular TALE cluster composition does not occur in other Xoo strains from this geographic location. It is possible, that this strain has a disadvantage on certain rice varieties in comparison to more frequently found strains. Alternatively, we might still underestimate the full diversity and frequency of TALE rearrangements in native Xoo populations.

ONT reads are perfectly suited to yield highly contiguous genomes of Xanthomonas bacterial strains that could not be resolved based on Illumina or PacBio reads due to clusters of highly repetitive TALE genes. While polishing by, e.g., Illumina reads, is still required to obtain complete and correct gene models of all bacterial genes, the computational correction pipeline presented in this manuscript is capable of correctly reconstructing the TALomes of Xanthomonas strains from ONT reads alone. As the quality of ONT reads is developing rapidly with new types of flow cells and improved chemistry, it can be expected that ONT sequencing will become the standard method for sequencing Xanthomonas strains soon. Due to the moderate coverage required for obtaining complete TALomes of Xanthomonas strains, sequencing on an ONT flongle might be sufficient for studying individual strains.

Given the huge multiplexing capacity of ONT sequencing combined with the computational correction of TALE genes, it is now possible to sequence large numbers of Xanthomonas strains in parallel to address pathogen populations and characterize strains from many more geographic regions. The expected expansion of TALE sequences is a good argument to support our unified AnnoTALE (http://www.jstacs.de/index.php/AnnoTALE) nomenclature, which groups similar TALEs into classes with a common name and allows for a quick understanding of TALE cluster compositions (Fig. 7, Supplementary Fig. S8, Supplementary Fig. S9) [4]. The ONT pipeline presented here may lead to new insights into the importance and prevalence of certain TALE classes, but also evolutionary mechanisms shaping their TALomes.

Availability of data and materials

Assemblies of the Xoo PXO35 genome from all three sequencing technologies and the ONT-based assemblies of the remaining strains are available from zenodo (https://doi.org/10.5281/zenodo.6854866, doi: 10.5281/zenodo.6854866). Raw reads are available from the European Nucleotide Archive (https://www.ebi.ac.uk/ena) under study accession PRJEB54678.

References

  1. Liu W, Liu J, Triplett L, Leach JE, Wang GL. Novel Insights into Rice Innate Immunity Against Bacterial and Fungal Pathogens. Annu Rev Phytopathol. 2014;52(1):213–41.

    Article  CAS  PubMed  Google Scholar 

  2. White FF, Yang B. Host and Pathogen Factors Controlling the Rice-Xanthomonas oryzae Interaction. Plant Physiol. 2009;150(4):1677–86.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  3. Wilkins K, Booher N, Wang L, Bogdanove A. TAL effectors and activation of predicted host targets distinguish Asian from African strains of the rice pathogen Xanthomonas oryzae pv. oryzicola while strict conservation suggests universal importance of five TAL effectors. Front Plant Sci. 2015;6. https://doi.org/10.3389/fpls.2015.00536.

  4. Grau J, Reschke M, Erkes A, Streubel J, Morgan RD, Wilson GG, et al. AnnoTALE: bioinformatics tools for identification, annotation and nomenclature of TALEs from Xanthomonas genomic sequences. Sci Rep. 2016;6(1):21077.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  5. Doucouré H, Pérez-Quintero AL, Reshetnyak G, Tekete C, Auguy F, Thomas E, et al. Functional and genome sequence-driven characterization of tal effector gene repertoires reveals novel variants with altered specificities in closely related Malian Xanthomonas oryzae pv. oryzae strains. Front Microbiol. 2018;9. https://doi.org/10.3389/fmicb.2018.01657.

  6. Boch J, Bonas U. Xanthomonas AvrBs3 Family-Type III Effectors: Discovery and Function. Annu Rev Phytopathol. 2010;48(1):419–36.

    Article  CAS  PubMed  Google Scholar 

  7. Zhang B, Han X, Yuan W, Zhang H. TALEs as double-edged swords in plant-pathogen interactions: Progress, challenges, and perspectives. Plant Commun. 2022;3(3): 100318.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  8. Boch J, Scholze H, Schornack S, Landgraf A, Hahn S, Kay S, et al. Breaking the Code of DNA Binding Specificity of TAL-Type III Effectors. Science. 2009;326(5959):1509–12.

    Article  CAS  PubMed  Google Scholar 

  9. Moscou MJ, Bogdanove AJ. A Simple Cipher Governs DNA Recognition by TAL Effectors. Science. 2009;326(5959):1501.

    Article  CAS  PubMed  Google Scholar 

  10. Mak ANS, Bradley P, Cernadas RA, Bogdanove AJ, Stoddard BL. The Crystal Structure of TAL Effector PthXo1 Bound to Its DNA Target. Science. 2012;335(6069):716–9.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  11. Deng D, Yan C, Pan X, Mahfouz M, Wang J, Zhu JK, et al. Structural Basis for Sequence-Specific Recognition of DNA by TAL Effectors. Science. 2012;335(6069):720–3.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  12. Ma W, Zou L, Zhiyuan J, Xiameng X, Zhengyin X, Yang Y, et al. Xanthomonas oryzae pv. oryzae TALE proteins recruit OsTFIIAγ1 to compensate for the absence of OsTFIIAγ5 in bacterial blight in rice. Mol Plant Pathol. 2018;19(10):2248–2262.

  13. Hummel AW, Doyle EL, Bogdanove AJ. Addition of transcription activator-like effector binding sites to a pathogen strain-specific rice bacterial blight resistance gene makes it effective against additional strains and against bacterial leaf streak. New Phytol. 2012;195(4):883–93.

    Article  CAS  PubMed  Google Scholar 

  14. Streubel J, Baum H, Grau J, Stuttmann J, Boch J. Dissection of TALE-dependent gene activation reveals that they induce transcription cooperatively and in both orientations. PLoS ONE. 2017;12(3):1–24.

    Article  Google Scholar 

  15. Tran TT, Pérez-Quintero AL, Wonni I, Carpenter SCD, Yu Y, Wang L, et al. Functional analysis of African Xanthomonas oryzae pv. oryzae TALomes reveals a new susceptibility gene in bacterial leaf blight of rice. PLoS Pathog. 2018;14(6):1–25.

  16. Mücke S, Reschke M, Erkes A, Schwietzer CA, Becker S, Streubel J, et al. Transcriptional reprogramming of rice cells by Xanthomonas oryzae TALEs. Front Plant Sci. 2019;10. https://doi.org/10.3389/fpls.2019.00162.

  17. Booher NJ, Carpenter SCD, Sebra RP, Wang L, Salzberg SL, Leach JE, et al. Single molecule real-time sequencing of Xanthomonas oryzae genomes reveals a dynamic structure and complex TAL (transcription activator-like) effector gene relationships. Microb Genomics. 2015;1(4). https://doi.org/10.1099/mgen.0.000032.

  18. Erkes A, Reschke M, Boch J, Grau J. Evolution of Transcription Activator-Like Effectors in Xanthomonas oryzae. Genome Biol Evol. 2017;9(6):1599–615.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  19. Denancé N, Szurek B, Doyle EL, Lauber E, Fontaine-Bodin L, Carrère S, et al. Two ancestral genes shaped the Xanthomonas campestris TAL effector gene repertoire. New Phytol. 2018;219(1):391–407.

    Article  PubMed  Google Scholar 

  20. Teper D, Wang N. Consequences of adaptation of TAL effectors on host susceptibility to Xanthomonas. PLoS Genetics. 2021;17(1):1–23.

    Article  Google Scholar 

  21. Noël L, Thieme F, Gäbler J, Büttner D, Bonas U. XopC and XopJ, Two Novel Type III Effector Proteins from Xanthomonas campestris pv. vesicatoria. J Bacteriol. 2003;185(24):7092–7102.

  22. Ferreira RM, de Oliveira ACP, Moreira LM, Belasque J, Gourbeyre E, Siguier P, et al. A TALE of Transposition: Tn3-Like Transposons Play a Major Role in the Spread of Pathogenicity Determinants of Xanthomonas citri and Other Xanthomonads. mBio. 2015;6(1):e02505–14.

  23. Midha S, Bansal K, Kumar S, Girija AM, Mishra D, Brahma K, et al. Population genomic insights into variation and evolution of Xanthomonas oryzae pv. oryzae. Sci Rep. 2017;7(1):40694.

  24. Lee BM, Park YJ, Park DS, Kang HW, Kim JG, Song ES, et al. The genome sequence of Xanthomonas oryzae pathovar oryzae KACC10331, the bacterial blight pathogen of rice. Nucleic Acids Res. 2005;33(2):577–86.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  25. Bogdanove AJ, Koebnik R, Lu H, et al. Two New Complete Genome Sequences Offer Insight into Host and Tissue Specificity of Plant Pathogenic Xanthomonas spp. J Bacteriol. 2011;193(19):5450–64.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  26. Cernadas RA, Doyle EL, Niño-Liu DO, Wilkins KE, Bancroft T, Wang L, et al. Code-Assisted Discovery of TAL Effector Targets in Bacterial Leaf Streak of Rice Reveals Contrast with Bacterial Blight and a Novel Susceptibility Gene. PLoS Pathog. 2014;10(2):1–24.

    Article  Google Scholar 

  27. Tran TT, Doucouré H, Hutin M, Jaimes Niño LM, Szurek B, Cunnac S, et al. Efficient enrichment cloning of TAL effector genes from Xanthomonas. MethodsX. 2018;5:1027–32.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  28. Li C, Ji C, Huguet-Tapia JC, White FF, Dong H, Yang B. An efficient method to clone TAL effector genes from Xanthomonas oryzae using Gibson assembly. Mol Plant Pathol. 2019;20(10):1453–62.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  29. Zárate-Chaves CA, Osorio-Rodríguez D, Mora RE, Pérez-Quintero ÁL, Dereeper A, Restrepo S, et al. TAL Effector Repertoires of strains of Xanthomonas phaseoli pv. manihotis in commercial cassava crops reveal high diversity at the country scale. Microorganisms. 2021;9(2):315. https://doi.org/10.3390/microorganisms9020315.

  30. Cox KL, Meng F, Wilkins KE, Li F, Wang P, Booher NJ, et al. TAL effector driven induction of a SWEET gene confers susceptibility to bacterial blight of cotton. Nat Commun. 2017;8(1):15588.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  31. Carpenter SCD, Kladsuwan L, Han SW, Prathuangwong S, Bogdanove AJ. Complete Genome Sequences of Xanthomonas axonopodis pv. glycines Isolates from the United States and Thailand Reveal Conserved Transcription Activator-Like Effectors. Genome Biol Evol. 2019;11(5):1380–1384.

  32. Goettelmann F, Roman-Reyna V, Cunnac S, Jacobs JM, Bragard C, Studer B, et al. Complete genome assemblies of all Xanthomonas translucens pathotype strains reveal three genetically distinct clades. Front Microbiol. 2022;12. https://doi.org/10.3389/fmicb.2021.817815.

  33. Rhoads A, Au KF. PacBio Sequencing and Its Applications. Genomics Proteomics Bioinform. 2015;13(5):278–89.

    Article  Google Scholar 

  34. Monteiro-Vitorello CB, De Oliveira MC, Zerillo MM, Varani AM, Civerolo E, Sluys MAV. Xylella and Xanthomonas Mobil’omics. OMICS: J Integr Biol. 2005;9(2):146–159.

  35. Varani AM, Monteiro-Vitorello CB, Nakaya HI, Van Sluys MA. The Role of Prophage in Plant-Pathogenic Bacteria. Annu Rev Phytopathol. 2013;51(1):429–51.

    Article  CAS  PubMed  Google Scholar 

  36. Suzuki Y, Morishita S. The time is ripe to investigate human centromeres by long-read sequencing. DNA Res. 2021;28(6):dsab021. https://doi.org/10.1093/dnares/dsab021.

  37. Wang B, Yang X, Jia Y, Xu Y, Jia P, Dang N, et al. High-quality Arabidopsis thaliana genome assembly with Nanopore and HiFi long reads. Genomics Proteomics Bioinforma. 2021;20(1):4–13. https://doi.org/10.1016/j.gpb.2021.08.003.

  38. Wenger AM, Peluso P, Rowell WJ, Chang PC, Hall RJ, Concepcion GT, et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat Biotechnol. 2019;37(10):1155–62.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  39. Scott AD, Zimin AV, Puiu D, Workman R, Britton M, Zaman S, et al. A Reference Genome Sequence for Giant Sequoia. G3 Genes|Genomes|Genet. 2020;10(11):3907–3919.

  40. Wang Y, Zhao Y, Bollas A, Wang Y, Au KF. Nanopore sequencing technology, bioinformatics and applications. Nat Biotechnol. 2021;39(11):1348–65.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  41. Nurk S, Koren S, Rhie A, et al. The complete sequence of a human genome. Science. 2022;376(6588):44–53.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  42. Oberacker P, Stepper P, Bond DM, Höhn S, Focken J, Meyer V, et al. Bio-On-Magnetic-Beads (BOMB): Open platform for high-throughput nucleic acid extraction and manipulation. PLoS Biol. 2019;17(1):1–16.

    Article  Google Scholar 

  43. Kolmogorov M, Yuan J, Lin Y, Pevzner PA. Assembly of long, error-prone reads using repeat graphs. Nat Biotechnol. 2019;37(5):540–6.

    Article  CAS  PubMed  Google Scholar 

  44. Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34(18):3094–100.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  45. Vaser R, Sović I, Nagarajan N, Šikić M. Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res. 2017;27(5):737–46.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  46. Darling ACE, Mau B, Blattner FR, Perna NT. Mauve: Multiple Alignment of Conserved Genomic Sequence With Rearrangements. Genome Res. 2004;14(7):1394–403.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  47. Darling AE, Mau B, Perna NT. progressiveMauve: Multiple Genome Alignment with Gene Gain. Loss and Rearrangement PLoS ONE. 2010;5(6):1–17.

    Google Scholar 

  48. Zimin AV, Marçais G, Puiu D, Roberts M, Salzberg SL, Yorke JA. The MaSuRCA genome assembler. Bioinformatics. 2013;29(21):2669–77.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  49. Chin CS, Alexander DH, Marks P, Klammer AA, Drake J, Heiner C, et al. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat Methods. 2013;10(6):563–9.

    Article  CAS  PubMed  Google Scholar 

  50. Wick RR, Judd LM, Gorrie CL, Holt KE. Unicycler: Resolving bacterial genome assemblies from short and long sequencing reads. PLoS Comput Biol. 2017;13(6):1–22. https://doi.org/10.1371/journal.pcbi.1005595.

    Article  CAS  Google Scholar 

  51. Walker BJ, Abeel T, Shea T, Priest M, Abouelliel A, Sakthikumar S, et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS ONE. 2014;9(11):e112963. https://doi.org/10.1371/journal.pone.0112963.

  52. Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. 2013. arXiv preprint arXiv:1303.3997.

  53. Huang YT, Liu PY, Shih PW. Homopolish: a method for the removal of systematic errors in nanopore sequencing by homologous polishing. Genome Biol. 2021;22(1):95. https://doi.org/10.1186/s13059-021-02282-6.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  54. Simão FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 2015;31(19):3210–2.

    Article  PubMed  Google Scholar 

  55. Ji Z, Ji C, Liu B, Zou L, Chen G, Yang B. Interfering TAL effectors of Xanthomonas oryzae neutralize R-gene-mediated plant disease resistance. Nat Commun. 2016;7(1):13435.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  56. Read AC, Rinaldi FC, Hutin M, He YQ, Triplett LR, Bogdanove AJ. Suppression of Xo1-mediated disease resistance in rice by a truncated, non-DNA-binding TAL effector of Xanthomonas oryzae. Front Plant Sci. 2016;7. https://doi.org/10.3389/fpls.2016.01516.

  57. Rang FJ, Kloosterman WP, de Ridder J. From squiggle to basepair: computational approaches for improving nanopore sequencing read accuracy. Genome Biol. 2018;19(1):90.

    Article  PubMed  PubMed Central  Google Scholar 

  58. Wheeler TJ, Eddy SR. nhmmer: DNA homology search with profile HMMs. Bioinformatics. 2013;29(19):2487–9.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  59. Thorvaldsdóttir H, Robinson JT, Mesirov JP. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform. 2012;14(2):178–92.

    Article  PubMed  PubMed Central  Google Scholar 

  60. Delahaye C, Nicolas J. Sequencing DNA with nanopores: troubles and biases. PLoS ONE. 2021;16(10):e0257521. https://doi.org/10.1371/journal.pone.0257521.

  61. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009;10(1):421.

    Article  PubMed  PubMed Central  Google Scholar 

  62. Salzberg SL, Sommer DD, Schatz MC, et al. Genome sequence and rapid evolution of the rice pathogen Xanthomonas oryzae pv. oryzae PXO99A. BMC Genomics. 2008;9(1):204.

  63. Ji C, Ji Z, Liu B, Cheng H, Liu H, Liu S, et al. Xa1 Allelic R Genes Activate Rice Blight Resistance Suppressed by Interfering TAL Effectors. Plant Commun. 2020;1(4): 100087.

    Article  PubMed  PubMed Central  Google Scholar 

  64. Leach JE, Rhoads ML, Cruz CMV, White FF, Mew TW, Leung H. Assessment of genetic diversity and population structure of Xanthomonas oryzae pv. oryzae with a repetitive DNA element. Appl Environ Microbiol. 1992;58(7):2188–2195. https://journals.asm.org/doi/abs/10.1128/aem.58.7.2188-2195.1992.

Download references

Acknowledgements

We acknowledge the financial support of the Open Access Publication Fund of the Martin Luther University Halle-Wittenberg (https://www.uni-halle.de).

Funding

Open Access funding enabled and organized by Projekt DEAL. This work was supported by grants from the Deutsche Forschungsgemeinschaft (http://www.dfg.de) (BO 1496/8-2 to JB, GR 4587/1-2 to JG and FZT 118/2 to SK and MM). Work in RK’s laboratory was supported by a grant from the French Agence Nationale de la Recherche (ANR-2010-BLAN-1723). MM and SK were supported by the Ministry for Economics, Sciences and Digital Society of Thuringia (TMWWDG), under the framework of the Landesprogramm ProDigital (DigLeben-5575/10-9). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Author information

Authors and Affiliations

Authors

Contributions

AE developed the computational correction pipeline for TALE genes, performed assemblies, analyzed the data, wrote an initial manuscript draft and contributed to revising and finalizing the manuscript. RPG developed the protocol for gDNA extraction using magnetic nano particles, performed DNA extraction, and contributed to data interpretation and revising and finalizing the manuscript. MZ performed ONT sequencing and contributed to revising and finalizing the manuscript. SK acquired funding, contributed to base calling and genome assembly, and contributed to revising and finalizing the manuscript. RK acquired funding, provided Illumina sequencing data and contributed to revising and finalizing the manuscript. RDM performed PacBio sequencing and contributed to revising and finalizing the manuscript. GGW performed PacBio sequencing and contributed to revising and finalizing the manuscript. MH contributed to base calling, genome assembly and data interpretation, and contributed to revising and finalizing the manuscript. MM acquired funding, supervised experiments and analyses, contributed to data interpretation and to revising and finalizing the manuscript. JB conceptualized the study, acquired funding, supervised experiments and analyses, contributed to data interpretation and to revising and finalizing the manuscript. JG conceptualized the study, acquired funding, supervised implementation and analyses, contributed to the design of computational experiments and data interpretation, and to revising and finalizing the manuscript.

Authors’ information

Not applicable.

Corresponding author

Correspondence to Jan Grau.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

GGW and RDM are employees of New England Biolabs Inc., Ipswich, USA. JB is a part owner of patents governing the use of transcription activator-like effectors (TALEs). AE, RPG, MZ, SK, RK, MH, MM, and JG declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Additional file 1.

PDF file integrating Supplementary Figs. S1 – S9.

Additional file 2.

PDF file integrating Supplementary Tables A, B, C, E.

Additional file 3.

Number of ONT reads completely spanning individual TALE clusters (+ 1 kbp upstream and downstream) for each of the sequenced Xoo and Xoc strains.

Additional file 4.

Preparation of Beads Solution.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Erkes, A., Grove, R.P., Žarković, M. et al. Assembling highly repetitive Xanthomonas TALomes using Oxford Nanopore sequencing. BMC Genomics 24, 151 (2023). https://doi.org/10.1186/s12864-023-09228-1

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s12864-023-09228-1

Keywords