Phenotypic and genomic comparison of Photorhabdus luminescens subsp. laumondii TT01 and a widely used rifampicin-resistant Photorhabdus luminescens laboratory strain

Background: Photorhabdus luminescens is an enteric bacterium that lives in mutualistic association with soil nematodes and is highly pathogenic for a broad spectrum of insects. A complete genome sequence for the type strain P. luminescens subsp. laumondii TT01, which was originally isolated in Trinidad and Tobago, has been described earlier. Subsequently, a rifampicin-resistant P. luminescens strain was generated with superior possibilities for experimental characterization. This strain, which is widely used in research, was described as a spontaneous rifampicin-resistant mutant of TT01 and is known as TT01-RifR.

Results: Unexpectedly, upon phenotypic comparison between the rifampicin-resistant strain and its presumed parent TT01, major differences were found with respect to bioluminescence, pigmentation, biofilm formation, haemolysis, and growth. Therefore, we renamed the strain TT01-RifR to DJC. To unravel the genomic basis of the observed differences, we generated a complete genome sequence for strain DJC using PacBio long-read technology. As strain DJC was supposed to be a spontaneous mutant, only a few sequence differences were expected. In order to distinguish these from potential sequencing errors in the published TT01 genome, we re-sequenced a derivative of strain TT01 in parallel, also using PacBio technology. The two TT01 genomes differed at only 30 positions. In contrast, the genome of strain DJC varied extensively from TT01, showing 13,000 point mutations, 330 frameshifts, and 220 strain-specific regions with a total length of more than 300 kb in each of the compared genomes.

Conclusions: Given the major phenotypic and genotypic differences, the rifampicin-resistant P. luminescens strain, now named strain DJC, has to be considered an independent isolate rather than a derivative of strain TT01. Strains TT01 and DJC both belong to P. luminescens subsp. laumondii.
Electronic supplementary material The online version of this article (10.1186/s12864-018-5121-z) contains supplementary material, which is available to authorized users.

displayed in Figure 3b; the copies and organization of these regions in the DJC genome are shown in Figure 3c, and details are listed in Supplementary Tables S7 and S8. The core region codes for a central gene pair, one gene containing a DNA primase (IPR13264 and IPR034151) and the other an integrase/recombinase (IPR011010) domain. This gene pair is highly conserved among all copies of PhRepA, including a 14 bp overlap between the two genes. Such a long overlap is not uncommon for genes with a -2 bp shift. Adjacent to the integrase is a short gene coding for a DNA-binding protein with a Cro/C1-type HTH domain (IPR001387), which is also found in the phage lambda repressor. Genes encoding proteins with such a domain are not well conserved among PhRepA copies but belong to 5 distinct subtypes (plu0545, plu1127, plu3581, plu3715 and PTE_00007). Located next to the gene encoding the DNA-binding protein is a gene coding for a protein with a SymE-like toxin domain (IPR014944), which also occurs in several distinct subtypes. On the other side of the primase gene are the two short, conserved proteins plu1142-like and plu0542, which both lack assigned domains. The plu1142 homologs are located in a 550 bp region, in which various ORFs have been assigned in the published TT01 genome. These ORFs are, however, self-contradictory. We selected plu1142 as the most likely protein-coding gene in this region, although the plu1142 homolog adjacent to plu0542 is a missing gene call and several other copies in both strains are disrupted by an in-frame stop codon. The other annotated ORFs would have even more pronounced annotation problems if assigned to all PhRepA copies. Therefore, we consider them to be spurious.
The adhesion region of complete PhRepA elements, which can be either singlets or terminal copies of clusters, codes for a long protein (2135-4582 amino acids) with adhesion-related domains. These include several copies of pectin lyase fold domains (IPR012334) and of hemagglutinin repeats (IPR025157). Between this gene and the core region is a rather variable set of 1 to 7 genes. Adjacent tandem-duplicated copies of the PhRepA repeat tend to share the same gene set and may contain an adhesion protein remnant as a truncated gene. The PhRepA remnants B3 and D3 in strain DJC lack the whole adjacent sequence element except for a partial copy of the adhesion protein gene. In cluster B3, the gene is truncated by an ISPlu9 transposon.
In cluster D3, a gene fusion couples the N-terminal 2135 residues of the adhesion protein in-frame to the C-terminal 883 residues of the protein preceding the PhRepA cluster. The N-terminal part of the gene in cluster D3 is highly similar to the genes of copy A and of the DJC-specific copy E8. All genes encoded on the different copies are schematically drawn in Figure 3b.

TT01/TT01m and DJC genomes
We performed a detailed transposon analysis of the P. luminescens DJC and both TT01 genomes. According to ISFinder, there are 22 distinct transposons present in P. luminescens [2], some of which have been submitted in the course of this study. Some of these have a high number of copies (up to ~20 complete copies). In addition, there are many disrupted transposons. Only a few of those have intact termini, and all of them have a disrupted transposase gene, making them unsuitable for ISFinder. We also identified a few types of MITEs.
Transposons identified in the three P. luminescens genomes. Many insertions in the indels and approximate inserts represent mobile genetic elements. They commonly include a target site duplication (TSD). The relative frequency of individual transposon classes is shown in Supplementary Figure S2. Initially, we found that ISPlu1 and ISPlu4 seemed to lack a TSD. However, we identified extra bases at the insertion points and the extended elements are bounded by a TSD. Therefore, these elements are a few bases longer than annotated by ISFinder. The extended bases are not palindromic and thus were initially not recognized as part of the inverted terminal repeats.
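The TSD check described above can be illustrated with a minimal sketch. It assumes 0-based, end-exclusive element coordinates on a plain genome string; the function names `find_tsd` and `scan_extensions`, the length limits, and the toy sequences are hypothetical and are not taken from the pipeline used in this study. The second function mimics the situation of ISPlu1 and ISPlu4: if the annotated boundaries yield no TSD, the boundaries are extended by a few bases and the flanks are re-examined. All hits are returned as candidates, because short direct repeats can occur by chance and need manual review.

```python
def find_tsd(genome: str, start: int, end: int,
             min_len: int = 2, max_len: int = 12) -> str:
    """Return the longest direct repeat immediately flanking genome[start:end].

    An empty string is returned if no repeat of at least min_len bases is found.
    """
    for k in range(max_len, min_len - 1, -1):
        if start - k < 0 or end + k > len(genome):
            continue  # flank would run off the sequence
        if genome[start - k:start] == genome[end:end + k]:
            return genome[end:end + k]
    return ""


def scan_extensions(genome: str, start: int, end: int,
                    max_extend: int = 4, min_len: int = 3):
    """List (ext_left, ext_right, tsd) for boundary extensions that expose a TSD."""
    hits = []
    for ext_left in range(max_extend + 1):
        for ext_right in range(max_extend + 1):
            s, e = start - ext_left, end + ext_right
            if s < 0 or e > len(genome):
                continue
            tsd = find_tsd(genome, s, e, min_len=min_len)
            if tsd:
                hits.append((ext_left, ext_right, tsd))
    return hits


# Toy example: the annotated element misses two bases ("TG") at its left end.
core = "CAGGGGTCTGATTTTCCCCTG"
genome = "CCGGA" + "ATTCG" + "TG" + core + "ATTCG" + "GACCT"
annotated_start = 12
annotated_end = annotated_start + len(core)
print(find_tsd(genome, annotated_start, annotated_end))        # no TSD at annotated ends
print(scan_extensions(genome, annotated_start, annotated_end))  # candidates after extension
```

In this toy case, extending the left boundary by two bases exposes the 5 bp duplication, which corresponds to the observation that the elements are a few bases longer than annotated.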
The transposons with the highest mobility are related to IS630. These belong to the IS630/Tc1/mariner superfamily, which is found in both prokaryotes and eukaryotes [3][4][5]. Although this class of transposons has been preferentially analysed in plants, such elements have also been identified in nematodes. In vitro transposition of a bacterial element showed that a TA dinucleotide is used as the specific target for integration and that the element generates a 2 bp target site duplication (another TA dinucleotide) [6]. Defining element boundaries exclusively by sequence conservation consequently leads to some ambiguity, because the target site/TSD will be included as part of the element. We categorized IS630-type elements from P. luminescens as CCC-type (ISPlu3, ISPlu8, ISPlu19) and as AATAA-type (ISPlu10, ISPlu16), according to characteristic sequences at or very close to the beginning of the element (Supplementary Figure S2).
MITEs identified in the three P. luminescens genomes. MITEs are mobile genetic elements, which are too short to carry a transposase gene. However, they have inverted terminal repeats related to other transposons and thus are mobilized in trans by the corresponding transposase [7]. During our analysis, we identified 6 new MITE types and submitted these to ISFinder.
The most frequent repeat is MITEPlu5, with 552 complete copies in the TT01 genome and a typical length of 123 bp. Of the 552 complete copies, 467 have a length of 123 bp and were used to compute a sequence logo (Figure 4a) and subsequently a consensus sequence. Given the obviously high sequence conservation, it is remarkable that only a few of these elements are truly identical to each other and that none of the copies exactly matches the consensus sequence. If fragments, targeted elements, and those with internal deletions are included, the total is 660 copies in strain TT01 (649 in DJC). Of these elements, 47 represent indels between the TT01m and the DJC genome. A related element has been described as an ERIC sequence [10]. We analysed this element in more detail (Figure 5). We detected sequence similarities between MITEPlu5 and a subset of the IS630-type transposons, with marked conservation of a CCC trinucleotide close to the terminus. This makes it likely that MITEPlu5 is related to the CCC subtype of IS630-type transposons, which is represented by ISPlu3, ISPlu8 and ISPlu19. A similar situation exists for MITEPlu2, which is related to the AATAA subtype of IS630-type transposons represented by ISPlu10 and ISPlu16. Of all transposons/MITEs found inserted in one of the two genomes, 49% belong to the CCC subtype of IS630-type elements, making this the most active class for transposition.
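The observation that no copy matches the consensus can be illustrated with a minimal sketch. It assumes equal-length, gap-free copies (as for the 123 bp subset) and a simple per-column majority rule; the function name `majority_consensus` and the toy sequences are hypothetical, and the method actually used to derive the MITEPlu5 consensus may differ.

```python
from collections import Counter


def majority_consensus(seqs):
    """Per-column majority consensus of equal-length, gap-free sequences."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    # zip(*seqs) iterates over alignment columns; ties go to the base seen first.
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*seqs))


# Toy copies: each one deviates from the majority at a different position.
copies = ["ACGA", "TCGT", "AGGT"]
cons = majority_consensus(copies)
print(cons)                # consensus of the toy copies
print(copies.count(cons))  # number of copies identical to the consensus
print(len(set(copies)))    # number of distinct sequence variants
```

Because each toy copy carries its own private substitution, the consensus is not identical to any single copy, which is the same effect reported above for the MITEPlu5 copies.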
We consider MITEPlu5 as non-coding. However, some of the copies lack stop codons in some frames. This has resulted in protein coding gene annotation by the PGAP pipeline. We have retained these ORFs but have assigned the protein name "pseudocoding frame MITEPlu5" as a warning for annotation robots. We also encountered several cases in which MITEPlu5 is fused in-frame with upstream or downstream genes. We consider these genes disrupted and removed the protein sequence segment translated from MITEPlu5. In very few cases the MITEPlu5 is integrated into the original stop codon, as it targets TA dinucleotides. This led to a MITEPlu5-derived C-terminal extension. To avoid annotation problems caused by such extremely well conserved extensions, we truncated the ORFs at the junction between the genomic sequence and MITEPlu5.
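How a short non-coding repeat can attract a gene call may be sketched as follows. The snippet lists the reading frames of a sequence that contain no stop codon; it assumes the standard stop codons TAA, TAG and TGA and an unambiguous A/C/G/T sequence. The function name `stop_free_frames` and the toy input are hypothetical and do not reproduce the PGAP logic.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def stop_free_frames(seq: str):
    """Return labels ('+1'..'+3', '-1'..'-3') of frames lacking a stop codon."""
    seq = seq.upper()
    frames = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            # Only complete codons are inspected; trailing bases are ignored.
            codons = (s[i:i + 3] for i in range(offset, len(s) - 2, 3))
            if not any(codon in STOP_CODONS for codon in codons):
                frames.append(f"{strand}{offset + 1}")
    return frames


# Toy sequence, not a real MITEPlu5 copy.
print(stop_free_frames("ATGAAACCC"))
```

A copy for which this list is non-empty can be read through in at least one frame, which is the situation in which an automated pipeline may extend or call an ORF across the element.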
Our observations suggest that MITEs and potentially other transposable elements can lead to mis-annotations by the PGAP pipeline. Short ORFs consisting largely of MITEPlu5 and only a few bases of adjacent unique genome sequence (<100 bp) were mis-annotated with specific protein names. The ORFs were annotated as "riboflavin synthase", "chorismate lyase", "addiction toxin module relE", "SprT family protein", and "pirin family protein". We performed BLASTx comparisons against the UniProt and NCBI nr databases to validate that the genome-derived section does not support the mis-assigned protein name. In several cases, identical mis-annotations have been made for both genomes. To avoid such mis-annotations in the future, we suggest that automated annotation robots be optimized to deal with such situations.
Supplementary Table S1: Differences between the newly sequenced P. luminescens genome TT01m and the originally published genome TT01. A total of 30 differences were detected (nr). For each genome, the differing position/region and the sequence-specific bases (or their length) are provided. Alternatively, the sequence of a tandem repeat with its copy number in parentheses is given.
A deletion is indicated by a hyphen. In this case, the adjacent bases are specified, separated by a slash (/). The type of difference is named as "point" (point mutation), "shift" (one-base indels, possibly as part of a polynucleotide run), "diff" (for adjacent mutations and/or indels of a few bases), "indel" (all of which are longer than 1 kb and are related to the long repeat PhRepA), "inversion", or "copynr" (copy number differences for tandem repeats). Note that the 1st difference is an inversion (1a) having an internal "shift" difference (1b).

for CRISPRs, counts are provided for strain-specific spacers (tagged extra), for matching spacers (tagged shared) and for multiple occurrences of an identical spacer (tagged duplicated). The partial deletion of cas1 is indicated (cas1_truncated as opposed to cas1_complete); (d4) for indel, approximate insert, and replace, the sequences are categorized by the terms "specific" (strain-specific, may be completely unrelated or may have up to ca. 85% sequence identity), "SeqRelated" (sequences have more than 85% but below 98% sequence identity), "InternallyRepeated" (a genome-internal duplication so that a highly related sequence is found elsewhere), "HighSimElsewhere" (a near-identical sequence which is found at an unrelated genome position in the other strain, precluding its classification as matchSEG) or "FewAdjacent" (which indicates a short extra region of less than 100 bp outside a mobile genetic element).

P. luminescens
A small number of divSEGs are tagged by a more extended wording, including a reference to an atypical sequence of 121 bp which is found exclusively in a subset of the 23S rRNAs of strain TT01. For long insertions or long replacement sequences, overlaps with prophage regions (see Supplementary Table S5)   if a pseudocoding frame has been called for both the TT01m and TT01 genomes, the start codon assignment may be discrepant (comment term: StartDiscrep_pseudocoding); (j,k,l) all gene calls in the TT01 genome which cannot be mapped to the TT01m genome have been evaluated and genes were post-predicted for the TT01m genome. Exceptions are (j) TT01 ORFs which are rated to be spurious (comment term: spurious); (k) short remnants of disrupted genes, typically with a homology region of less than 50 codons (comment term: pseudoSkipped); (l) pseudocoding frames annotated only in the TT01 genome; (m) in some cases, a single TT01m gene maps to more than one TT01 gene, indicated by multiple TT01 codes (comment term: multiTT01). Typical cases are disrupted genes where the N-terminal partial coding region is called as one gene and a C-terminal partial coding region is called as an independent gene with an internal ATG codon mis-assigned as the start codon. It should be noted that the underlying genome sequence is identical between the TT01m and TT01 genomes for the cases marked by this flag; (n) in some cases, a single TT01m gene is mapped to more than one DJC gene; in such cases, the DJC gene has been disrupted and has been targeted by a mobile genetic element; (o) similarly, if the TT01m gene has been targeted by a mobile genetic element, a single DJC gene maps to more than one TT01m gene.
Because this table is based on TT01m genes, this situation is indicated by adding "(part)" after the DJC code; (p,q) gene disruption may generate chimeric genes; (p) these are listed only if discrepantly annotated in the TT01m and TT01 genomes (comment term: chimericPseudo); (q) this complication may be combined with mapping of a single TT01m gene to multiple TT01 genes (comment term: multiTT01+chimericPseudo); (r,s,t,u,v,w,x) while all the above refer to genes encoded on regions of sequence identity between the TT01m and TT01 genomes, the genome differences (see Supplementary Table S1) lead to gene differences; (r) some genes were not automatically mapped because they are encoded on an inverted genome region (comment term: onInversion); (s) genes encoded on an insertion in TT01m cannot be mapped (comment term: TT01m_insertion); (t) likewise, genes encoded on an insertion in TT01 cannot be mapped (comment term: TT01_insertion); (u) some genes carry a point mutation (comment term: e.g. plu0049(G177D)); (v) in case of a frameshift, a single TT01m gene may map to more than one TT01 gene, indicated by multiple TT01 codes (comment term: e.g. splitORFs(plu1144+plu1145)); (w) a frameshift may lead to an aberrant terminal sequence (comment term: e.g. aberrantN(plu2762)); (x) genes may be affected by more than one genome difference (comment term: e.g. severalDiffs(plu1426)); (y,z) the genome inversions between TT01m and TT01 may affect protein-coding genes traversing the junction; (y) in one scenario, automatic mapping failed but manual mapping shows that the encoded proteins are identical, probably due to the sequence duplication outside the inversion (comment term: [seqid;traverseJunction]); (z) in one case, the inversion resulted in gene truncation (comment term: traversesInversionJunction+truncatedAtN(PluTT01m_15460)); (Aa) an inserted sequence may affect a gene encoded across the integration point (comment term: divergentN_byTT01m_insertion); (Ab) one disrupted gene could not be automatically mapped because a CTG codon had been mis-assigned as the start codon (translated as Met) in the TT01 genome (comment term: plu4790(CTGcodon_notStart)). This sample table lists all of the exceptions and all combinations thereof. The complete Supplementary Table S3b is provided in two formats: as a space-separated list (*.txt) and in Excel format (*.xlsx).

TT01m code TT01 code DJC code comment
PluDJC_06985 severalDiffs(plu1426) PluTT01m_23020 plu4505 PluDJC_22300

consist of a letter indicating the method, a serial number (by method), and a genome indicator. Alternatively, the term "targeting" is assigned (see below). DJC refers to strain DJC, TT01m to the newly sequenced version of strain TT01 and TT01 to the published version of TT01. The letter S indicates a PhiSpy prediction, the letter A a prediction with Prophinder from the ACLAME server. PhiSpy predictions of the newly sequenced genomes were computed with the initial, non-curated PGAP annotation. For the published strain TT01, the Prophinder prediction was retrieved from the ACLAME server rather than being recomputed. For each prediction, the position in the corresponding genome and the predicted prophage length are indicated. For prophages from strain DJC and from the published version of the TT01 genome, the corresponding positions of the TT01m genome were identified to permit positional comparisons (column "relocated to TT01m"). Prophages in DJC inserts do not have an equivalent on the TT01m genome (indicated by "none"). Coordinates in parentheses indicate that the prophage terminus could not be located on the TT01m genome as it is not located on a matchSEG. The method term "targeting" refers to a manual assignment of prophage boundaries, which is based on analysis of splitting (or targeting) of protein-coding genes. Relevant details are mentioned in the comment column, commonly referenced to Supplementary Tables S1 and S2, including the serial number of that

Table S8: Genes and intergenic regions on PhRepA in the element copies. All genes and intergenic regions (as specified in the column "blocktype") occurring in copies of PhRepA are listed. These are the raw data from which Fig. 4b and 4c have been drawn. (a) for intergenic regions, there is commonly only one line for each strain which contains the distance between the adjacent genes in the corresponding column.
Negative numbers indicate gene overlaps. The corresponding genes are commonly in the adjacent gene block but in few cases they are several blocks apart. The final intergenic block defines the distance to the start of the element. For some element copies, there is no upstream gene. In this case, the value represents the distance to the end of the element copy.
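The sign convention for intergenic distances can be illustrated with a small worked example. It assumes 1-based, inclusive gene coordinates on the same strand orientation; the function name `intergenic_distance` and the coordinates are hypothetical and are not taken from Table S8.

```python
def intergenic_distance(upstream_end: int, downstream_start: int) -> int:
    """Number of bases between two adjacent genes (1-based, inclusive coordinates).

    Zero means the genes abut; a negative value is the length of the overlap.
    """
    return downstream_start - upstream_end - 1


# Hypothetical coordinates for three neighbouring gene pairs.
print(intergenic_distance(1000, 1051))  # genes separated by a spacer
print(intergenic_distance(1000, 1001))  # directly adjacent genes
print(intergenic_distance(1000, 987))   # overlapping genes, negative value
```

With these toy coordinates the third pair overlaps by 14 bases and is therefore reported as -14, the same magnitude as the overlap of the conserved primase/integrase gene pair described for PhRepA.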