Complete sequencing of Novosphingobium sp. PP1Y reveals a biotechnologically meaningful metabolic pattern

Background: Novosphingobium sp. strain PP1Y is a marine α-proteobacterium adapted to grow at the water/fuel oil interface. It exploits the aromatic fraction of fuel oils as a carbon and energy source. PP1Y is able to grow on a wide range of mono-, poly- and heterocyclic aromatic hydrocarbons. Here, we report the complete functional annotation of the whole Novosphingobium genome.
Results: PP1Y genome analysis and its comparison with other Sphingomonadal genomes have yielded novel insights into the molecular basis of PP1Y’s phenotypic traits, such as its peculiar ability to encapsulate and degrade the aromatic fraction of fuel oils. In particular, we have identified and dissected several highly specialized metabolic pathways involved in: (i) aromatic hydrocarbon degradation; (ii) resistance to toxic compounds; and (iii) the quorum sensing mechanism.
Conclusions: In summary, the unraveling of the entire PP1Y genome sequence has provided important insight into PP1Y metabolism and, most importantly, has opened new perspectives about the possibility of its manipulation for bioremediation purposes.
Electronic supplementary material: The online version of this article (doi:10.1186/1471-2164-15-384) contains supplementary material, which is available to authorized users.


Background
Aromatic compounds are among the most widespread dangerous pollutants [1]. Petroleum and its derivatives are the main sources of aromatic molecules released into the environment. The aromatic hydrocarbon content of petroleum can range from about 20% to more than 40% [2][3][4], whereas the aromatic hydrocarbon content of gasoline and diesel oil is about 30% and 25%, respectively [5,6].
Novosphingobium sp. strain PP1Y is a recently isolated marine α-proteobacterium that is able to grow on a surprisingly wide spectrum of pure mono-, poly- and heterocyclic aromatic hydrocarbons and on complex mixtures of aromatic hydrocarbons dissolved in paraffin oil phases, including gasoline and especially diesel oil, which is an optimal growth substrate. Moreover, PP1Y can emulsify diesel oil by producing small (<1 mm) regular biofilm-covered oil drops that have been described as spherical colonies harbouring a reservoir of growth substrates [7].
Strain PP1Y belongs to the Sphingomonadaceae family, which is characterized by the presence of glycosphingolipids in the outer membrane, instead of the more common lipopolysaccharides. This peculiarity renders the surface of their cells more hydrophobic than that of other Gram-negative strains and has probably contributed to the development of the ability to degrade mono- and polycyclic aromatic hydrocarbons (PAHs). Moreover, many Sphingomonadales harbour several (up to six) large conjugative plasmids, ranging in length from less than 50 kbp to more than 500 kbp [8]. Thanks to these megaplasmids, several Sphingomonadales have "collected" genes for the degradation of xenobiotics and continuously exchange them with other bacterial strains [9][10][11]. Interesting examples are Novosphingobium aromaticivorans F199, which uses alkyl-benzenes as the sole carbon and energy source [12], Novosphingobium pentaromativorans US6-1, which degrades PAHs with 3-5 aromatic rings [13], Novosphingobium sp. TYA-1, which simultaneously degrades bisphenol A and 4-alkylphenols [14], Sphingomonas paucimobilis EPA505, which degrades several polycyclic compounds [15], Sphingomonas wittichii RW1, which can grow using dibenzofuran and dibenzo-p-dioxin [16], Sphingomonas sp. TTNP3, which uses alkylphenolic compounds as a source of carbon and energy [17], and Sphingobium chlorophenolicum L-1, which degrades pentachlorophenol [11].
Here, we report the analysis of the genome of Novosphingobium sp. strain PP1Y and its comparison with the genomes of N. aromaticivorans F199 (genome accession number NC_007794.1) [18] and S. wittichii RW1 [19], the closest genomes in terms of nucleotide sequence. This comparison has yielded insights into PP1Y and its ability to encapsulate and degrade the aromatic fraction of fuel oils.

Results and discussion
Complete genome features and chromosomal architecture
PP1Y genome sequence assembly produced four replicons classified according to their size, as we previously reported [20] (Figure 1A-D). Because the coverage of "small" plasmid (Spl) sequences was, on average, about twice that of the other replicons, Spl is expected to be present in two copies in each bacterial cell. At present, very few complete sequences of bacterial chromosomes and plasmids are available for organisms of the genus Novosphingobium (see Table 1). These sequences have a similar G + C content (about 60%), but PP1Y appears to have the largest and most complex genomic organization of the genus.
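The copy-number inference for Spl rests on a simple ratio of sequencing depths. As a minimal sketch, assuming the mean read coverage of each replicon is already known, the reasoning could be expressed as follows; the function name and the coverage values are hypothetical and are not taken from the PP1Y assembly.

```python
def relative_copy_number(coverage_by_replicon, reference="Chr"):
    """Divide each replicon's mean read coverage by that of the reference replicon."""
    ref = coverage_by_replicon[reference]
    if ref <= 0:
        raise ValueError("reference coverage must be positive")
    return {name: cov / ref for name, cov in coverage_by_replicon.items()}


# Hypothetical mean coverage values, chosen only to illustrate the reasoning.
example_coverage = {"Chr": 30.0, "Mpl": 31.0, "Lpl": 29.0, "Spl": 61.0}

ratios = relative_copy_number(example_coverage)
# Rounding the ratio gives an integer copy-number estimate per chromosome copy.
estimated_copies = {name: round(ratio) for name, ratio in ratios.items()}
print(estimated_copies)
```

With coverage roughly twice that of the chromosome, the ratio for Spl rounds to two copies, which is the argument made in the text.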
Various predictive and comparative bioinformatics tools supported by biological databases were used to annotate putative open reading frames (ORFs) and other functional elements [21][22][23][24][25][26][27]. As in other bacteria, most of the genome sequence is predicted to be coding and a substantial fraction of predicted ORFs (12-22%, depending on the replicon) appear to have TTG or GTG as the starting codon. Most of them (73% of the 4,709 coding sequences predicted in the four replicons taken together) and all rRNA and tRNA genes are located on the Chr molecule. The same applies to other RNA elements; the only exceptions being three RNAs predicted on Lpl (see Table 2, "Other RNA elements" section and Additional file 1: Table S1).
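The proportion of ORFs that begin with an alternative start codon can be tallied directly from the predicted coding sequences. The sketch below assumes the ORFs are available as plain nucleotide strings; the function name and the toy sequences are illustrative only and do not come from the PP1Y annotation.

```python
from collections import Counter


def start_codon_usage(orf_sequences):
    """Return the fraction of ORFs that begin with each start codon."""
    counts = Counter(seq[:3].upper() for seq in orf_sequences if len(seq) >= 3)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


# Toy sequences (hypothetical), three starting with ATG and two with alternatives.
example_orfs = ["ATGAAATAA", "GTGCCCTAA", "ATGTTTTGA", "TTGAAATAG", "ATGGGGTAA"]

usage = start_codon_usage(example_orfs)
alternative_fraction = usage.get("TTG", 0.0) + usage.get("GTG", 0.0)
print(usage, alternative_fraction)
```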
Evaluation of the putative DNA replication origins
DNA replication was investigated by searching for putative replication origins with the bioinformatic tool Orifinder [28], which locates predicted bacterial replication origins within each DNA sequence by taking into account base composition asymmetry, the distribution of DnaA boxes and the presence of genes frequently located close to the bacterial replication start (Additional file 1: Table S1). This tool revealed a putative Type-III replication origin on Chr, around base 1, where there is a region of base composition asymmetry containing three DnaA boxes, close to the hemE gene (as in the N. aromaticivorans DSM 12444 genome) and to a dnaA gene. In contrast, on the Mpl, Lpl and Spl replicons, Orifinder failed to locate an acceptable putative replication origin, suggesting that other mechanisms may be involved in the initiation of DNA replication on these replicons. Interestingly, a typical plasmid replication parA/parB/parS cluster was found on each of these replicons, and Mpl and Lpl also contain a predicted plasmid replication repA gene close to the parA/parB/parS cluster, but in a different orientation to those predicted on the N. aromaticivorans DSM 12444 pNL1 and pNL2 plasmids.
On the Spl plasmid, a complete protein killer gene system is also found, namely, an operon containing two genes that force the host bacteria to retain the plasmid [29].
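Base composition asymmetry, one of the signals used to locate replication origins, is often summarized as a cumulative GC skew curve. The sketch below is not the Orifinder algorithm; it illustrates the asymmetry signal alone, under the assumption that the minimum of the cumulative curve lies near the origin, a heuristic that holds for many but not all bacterial chromosomes. The function names, the window size and the toy sequence are illustrative.

```python
def cumulative_gc_skew(sequence, window=1000):
    """Cumulative sum of (G - C) / (G + C) over consecutive, non-overlapping windows."""
    seq = sequence.upper()
    curve = []
    total = 0.0
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        skew = (g - c) / (g + c) if (g + c) else 0.0
        total += skew
        curve.append(total)
    return curve


def predicted_origin_window(sequence, window=1000):
    """Index of the window where the cumulative skew reaches its minimum."""
    curve = cumulative_gc_skew(sequence, window)
    return min(range(len(curve)), key=curve.__getitem__)


# Toy sequence: a C-rich stretch followed by a G-rich stretch.
toy = "C" * 30 + "G" * 30
print(cumulative_gc_skew(toy, window=10))
print(predicted_origin_window(toy, window=10))
```

A real analysis would combine this signal with DnaA box positions and the location of origin-proximal genes, as described in the text.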

Protein genes identified and their significance
The gene products encoded by the 4,709 ORFs were characterized by searching for sequence similarity with known bacterial proteins contained in various collections (Figure 2A). About 94% of the ORFs matched at least one protein stored in the Uniref50 or KEGG GENES databases, although the fraction of matched sequences varies and is significantly lower for Lpl and Spl (about 80%). It is noteworthy that about 20% of the ORFs matched proteins annotated as "hypothetical", "putative" or "uncharacterized", and are thus classified as coding for "conserved hypothetical proteins". When the same search was done against protein sequences stored in the COG database, the fraction of identified gene products was lower. In fact, most of the ORFs coding for "conserved hypothetical proteins" did not show any similarity. About 6% of the ORFs did not match any sequence stored in the three databases and are thus classified as coding for "hypothetical proteins".
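The three-way classification described above can be written as a short decision rule. This is a minimal sketch under one stated assumption: only the description of the best database hit is inspected, which may differ from the criteria applied in the actual annotation pipeline. The function name, the label "annotated protein" and the example descriptions are hypothetical.

```python
UNINFORMATIVE_TERMS = ("hypothetical", "putative", "uncharacterized")


def classify_orf(hit_descriptions):
    """Classify an ORF from its database hit descriptions (best hit first)."""
    if not hit_descriptions:
        # No match in any database.
        return "hypothetical protein"
    best = hit_descriptions[0].lower()
    if any(term in best for term in UNINFORMATIVE_TERMS):
        # Matches a known sequence, but that sequence has no assigned function.
        return "conserved hypothetical protein"
    return "annotated protein"


print(classify_orf([]))
print(classify_orf(["Uncharacterized protein"]))
print(classify_orf(["Catechol 2,3-dioxygenase"]))
```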
An all-against-all comparison of the protein sequences encoded within each replicon was done using BLAST [30] under very stringent conditions (see Table 3, PP1Y-PP1Y section, and Additional file 1: Figure S1A) to look for inter-duplicated genes. About 20% of ORFs from Mpl have a counterpart within the main chromosome (Chr), thereby indicating a partial genome duplication. The whole complement of protein-coding genes was also compared to the one encoded within other complete genomes and plasmids from bacteria of the Sphingomonadaceae family (Table 3 and Additional file 1: Figure S1B-D). A number of ORFs ranging between 1,500 and 1,700, i.e. 45-50% of those encoded within Chr in PP1Y, have a counterpart in the main chromosome of the closest analysed bacterial species, the most similar gene set being that from N. aromaticivorans, putatively from the same genus. There is no clear evidence that the three smaller replicons are functionally equivalent to other known plasmids in terms of protein coding genes: many protein genes predicted within Mpl appear to have counterparts in N. aromaticivorans, although only 20% of Lpl ORFs have a counterpart in the pNL1 plasmid (Accession: NC_009426.1), while others are in the pNL2 plasmid (Accession: NC_009427.1) and some are scattered along the main chromosome. Half the Spl-encoded proteins are encoded by the main chromosome in N. aromaticivorans. Sphingobium japonicum [31] and PP1Y share elements of comparable size, although the latter has an additional smaller chromosome. The two species have a 45% similarity within the main chromosome in terms of protein-encoding content, but diverge more extensively in the plasmids. A plasmid from S. japonicum UT26 pLB1 [32], which is involved in gamma-hexachlorocyclohexane degradation, is somewhat similar to Spl (data not shown).
To assign a putative biological function to protein-coding genes, they were classified, when possible, into COG functional categories based on the result of a BLAST search against COG genes. The predicted protein sequences were also analyzed with the KEGG Automatic Annotation Server (KAAS), which assigns a functional annotation to genes following a BLAST alignment against the manually curated KEGG genes database [33] (Additional file 1: Figure S2 A-B). Overall, the Chr sequence of PP1Y contains practically all the core metabolism genes; notably, a number of predicted transporters and transcription factors are present in Mpl (Figure 2B-C).
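The search for inter-duplicated genes amounts to filtering all-against-all BLAST hits with strict thresholds and counting the ORFs of one replicon that retain a hit on another. The sketch below assumes the standard 12-column tabular BLAST output (query, subject, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, e-value, bit score). The identity and e-value cut-offs and the ORF identifiers are hypothetical; the conditions actually used are those given in Table 3.

```python
def stringent_hits(blast_lines, min_identity=90.0, max_evalue=1e-50):
    """Return (query, subject) pairs that pass both thresholds, excluding self-hits."""
    pairs = set()
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        identity, evalue = float(fields[2]), float(fields[10])
        if query == subject:
            continue
        if identity >= min_identity and evalue <= max_evalue:
            pairs.add((query, subject))
    return pairs


def fraction_with_counterpart(pairs, query_prefix, subject_prefix, n_query_orfs):
    """Fraction of ORFs on one replicon with at least one hit on another replicon."""
    matched = {q for q, s in pairs
               if q.startswith(query_prefix) and s.startswith(subject_prefix)}
    return len(matched) / n_query_orfs


# Hypothetical hits: one strong Mpl-Chr match, one weak match, one self-hit.
example = [
    "Mpl0001\tAT0010\t98.5\t300\t4\t0\t1\t300\t1\t300\t1e-150\t550",
    "Mpl0002\tAT0020\t45.0\t280\t150\t3\t1\t280\t5\t284\t1e-20\t120",
    "Mpl0003\tMpl0003\t100.0\t200\t0\t0\t1\t200\t1\t200\t1e-100\t400",
]
pairs = stringent_hits(example)
fraction = fraction_with_counterpart(pairs, "Mpl", "AT", n_query_orfs=3)
print(pairs, fraction)
```

The replicon is recognized here by the ORF identifier prefix ("AT" for the chromosome, "Mpl" for the megaplasmid), following the naming scheme described in the sequence annotation methods.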

Characterization of the PP1Y genes involved in aromatic hydrocarbon degradation
The degradation of aromatic hydrocarbons requires activation of the aromatic ring. This generally occurs by dihydroxylation of the aromatic ring catalyzed by pairs of monooxygenases or dioxygenases/dehydrogenases that constitute the upper pathways. Ring activation is followed by ring cleavage catalyzed by specialized dioxygenases (intra- and extradiol dioxygenases) that start the lower pathways. In the case of methylated aromatic compounds, the initial step can be a monooxygenation reaction of a methyl group followed by oxidation to carboxylate. These reactions can be catalyzed by soluble dioxygenases or by membrane monooxygenases related to xylene monooxygenase XylM. The arylcarboxylate eventually undergoes ring dihydroxylation and cleavage [34]. Analysis of the PP1Y genome revealed at least 81 ORFs (Table 4) that potentially code for the enzymes of both the upper (ring activation) and lower (ring cleavage) pathways.
No soluble multicomponent monooxygenase that resembled the well characterized methane monooxygenases and toluene/o-xylene monooxygenase [35] was found in the present study. Thirty-eight ORFs, which were predicted to code for 34 different multicomponent aromatic hydroxylating dioxygenases [36], were identified, a number clearly higher than in the closely related strains N. aromaticivorans F199 and N. pentaromativorans US6-1 (27 and 18 dioxygenases, respectively) (Figure 3). PP1Y has a close counterpart of each F199 dioxygenase: three of these are present in double copy with a 100% identity, which is indicative of a very recent duplication event; and four others have a 90-95% identity, which suggests a less recent duplication event followed by divergence. All duplicated ORFs are closely related to seven ORFs coding for hydroxylating dioxygenases found on plasmid pNL1 from strain F199. Indeed, replicon A of strain PP1Y contains two copies of a region of plasmid pNL1 probably derived by multiple fusion/duplication events (Additional file 1: Figure S3A). Six PP1Y oxygenases from the megaplasmids (Mpl6792, Mpl2166, Mpl5621, Mpl5540, Mpl5477, Mpl5466) do not have homologues in strains F199 and US6-1 but are closely related to predicted oxygenases from strain RW1 (Additional file 1: Figure S4A and B), suggesting that strain PP1Y combined the dioxygenase pools of strains F199 and RW1 and later expanded the pool by duplication events. This strategy enabled PP1Y to expand the pathway for the degradation of naphthalene and methylnaphthalenes, and to degrade larger PAHs. The predicted pathway is shown in Additional file 1: Figure S3B. Two potential membrane monooxygenases are predicted in PP1Y; they show a 96% identity with each other and a 71-75% identity with the sole membrane monooxygenase found in strain F199, which suggests another recent event of gene duplication.
The two PP1Y monooxygenases (Additional file 1: Figure S5) mainly differ in the substrate-binding region, possibly to allow different substrate specificity. No membrane monooxygenase is present in the genomes of Sphingomonas sp. MM-1, Sphingobium japonicum UT26, Sphingobium chlorophenolicum L-1, Sphingobium sp. SYK-6, Sphingomonas wittichii RW1 or Novosphingobium pentaromativorans US6-1. This suggests that, in this case too, the PP1Y enzymatic repertoire was expanded by horizontal gene transfer and duplication events.
Homology models of these three ring cleavage dioxygenases (RCDs) are shown in Additional file 1: Figure S7. The four RCDs were cloned and expressed in Escherichia coli, and their cleavage activity was assayed on 3-methylcatechol, 2,3-dihydroxybiphenyl (2,3-DHBP), and 4-hydroxyoestradiol (4-OHE). The latter was used as an analogue of dihydroxy PAHs because these compounds are unstable, difficult to synthesize and not commercially available. The protein coded in AT15599/AT31688 is a very versatile enzyme, able to cleave substrates with 1 to 4 rings (Table 5). Enzyme AT32663 is only active on polycyclic substrates, while Mpl3065 is active only on 2,3-DHBP, as predicted. Finally, AT15671/AT31616 is very active on monocyclic catechols, even though its substrate specificity is wider than that of P. putida MT2 catechol 2,3-dioxygenase. Taken together, these four enzymes are able to cleave all classes of 3- and/or 4-substituted catechols in complex mixtures. The other three PP1Y RCDs, AT33026, Mpl10251 and Mpl4329/Mpl4634, are poorly characterized. Preliminary modelling studies suggest that they are dioxygenases specialized in cleaving catechols bearing substituents at positions 3,5 and/or 4,5 and/or 3,6. Therefore, these dioxygenases have a substrate specificity complementary to the four described above.
The Neighbor-Joining tree of RCDs (Additional file 1: Figure S6) shows a great heterogeneity among sphingomonads both in the number of potential RCDs (from 1, in the case of strain L-1, to 8 in the case of strain RW1) and in the distribution of the proteins among the RCD subfamilies. Only strains F199 and PP1Y have at least one representative for each subfamily. This particular set of RCDs could allow strain PP1Y to metabolize complex mixtures of catechols deriving from the simultaneous oxidation of several mono- and polycyclic aromatic hydrocarbons (Additional file 1: Figure S8), which are the preferred growth substrates of this strain.
Besides the seven homomultimeric extradiol RCDs, the PP1Y genome also contains four potential ORFs for heterodimeric extradiol RCDs that are able to cleave catechol rings bearing substituents with carboxylate groups like protocatechuate (see also Additional file 2: Supplementary Results and Discussion). The genome of strain PP1Y contains several other ORFs coding for hypothetical mono- and dioxygenases whose involvement in the degradation of xenobiotics is less clear. Among these, CDS AT10830 is particularly interesting as it codes for a 2-oxoglutarate-dependent oxygenase. These oxygenases cleave different substrates, namely alkyl-sulphonates and phenoxy acids, by catalyzing monooxygenation reactions of C-H bonds adjacent to good leaving groups. Interestingly, no other sphingomonad contains a homologous enzyme. Moreover, AT10830 is a member of a group of adjacent ORFs coding for: (i) a hydroxylating dioxygenase (AT10866) that is only distantly related to RW1 and F199 dioxygenases (Additional file 1: Figure S4A); (ii) a heterodimeric extradiol ring cleavage dioxygenase related to 3,4-dihydroxybenzoate dioxygenases; and (iii) a hypothetical acetamidase (AT10838). This cluster of ORFs is present in several distantly related strains including some beta and gamma proteobacteria, thus suggesting a horizontal gene transfer event. At present, nothing is known about the physiological role of this pathway, but its wide diffusion suggests a potentially important ecological role. The data related to Additional file 1: Figures S9-S11 are reported under "Additional file 2: Supplementary Results and Discussion".

Stress response genes and their functions
The PP1Y genome contains several ORFs potentially coding for the so-called resistance-nodulation-cell division (RND)-type efflux pumps [38] that actively excrete toxic molecules, and have thus been implicated in the capacity of PP1Y to grow in close contact with a diesel oil phase.
RND-type efflux pumps consist of three subunits: the inner membrane, the outer membrane and the membrane fusion component. The PP1Y genome contains eight potential ORFs for the inner membrane subunit and even more for the other components (Additional file 1: Table S2), suggesting the possible formation of hybrid pumps. The evolutionary relationships among the inner membrane subunits are shown in Additional file 1: Figure S12A. Three PP1Y RND pumps belong to a subfamily of pumps specific for neutral molecules like aromatic hydrocarbons, acriflavine and other toxic aromatic molecules. The product of AT9347 is closely related to toluene resistance proteins and is very likely an aromatic hydrocarbon resistance protein. Three PP1Y RND pumps belong to a subfamily specific for mono- and divalent transition metals and are closely related to a set of RND pumps from Cupriavidus metallidurans CH34, a benchmark among strains able to tolerate very high concentrations of transition metals [39]. The PP1Y genome also contains eight potential ORFs for P-type ATPases (Additional file 1: Figure S12B), which are membrane ATP-dependent efflux pumps specialized in the excretion of metal cations [40]. For comparison, the C. metallidurans CH34 genome codes for 9 P-type ATPases.
On the basis of these findings, we assayed the ability of PP1Y to grow in liquid medium containing high concentrations of metal cations. Figure 4 shows that PP1Y can grow in the presence of millimolar concentrations of nickel (2.5 mM), lead (10 mM), copper (10 mM) and zinc (5 mM). At higher concentrations, the growth rate steeply decreases to zero (not shown). Interestingly, all the metals increase the carbohydrates/proteins ratio with respect to the control culture, thus suggesting that modification of the cell envelope could contribute to resistance to metals. These results show that the ability of PP1Y to tolerate heavy metals is comparable to that of heavy metal-tolerating strains like C. metallidurans CH34 [40], which suggests that PP1Y could play a role in the bioremediation of hydrocarbons in environments polluted by heavy metals.
Tellurite anion is highly toxic to microorganisms (much more than arsenate and arsenite) owing to its ability to catalyze the oxidation of cell thiols and produce reactive oxygen species [41]. Therefore, the wide diffusion of tellurite-resistance mechanisms among bacteria is not surprising; these mechanisms might include a nonspecific increase in radical scavenger systems and specific tellurite anion transporters [42]. The PP1Y genome contains three ORFs potentially coding for proteins belonging to three different tellurite-resistance mechanisms: telA (from the E. coli kilA/telA/telB system), tehB (from the E. coli tehA/tehB system) and terC from Proteus mirabilis [43]. Due to the scarce knowledge about these systems, it is difficult to predict their role in tellurite resistance. However, all these ORFs are located in a cluster of ORFs coding for proteins probably involved in detoxification. Interestingly, a similar cluster of ORFs is present in the genome of strain RW1, but not in other sphingomonads (data not shown). The importance of glutathione as a radical scavenger and mediator of detoxification systems varies among bacteria. However, several bacteria use glutathione and glutathione-dependent enzymes to detoxify reactive organic compounds (like epoxides), halogenated compounds or alkylhydroperoxides, and reactive oxygen species (ROS) such as oxygen radicals [44].
Figure 3 Neighbor-Joining tree summarizing the relationships among the alpha subunits of the dioxygenases of strains PP1Y, F199 and US6-1. Colours indicate the localization of the ORFs: blue, PP1Y/chromosome; green, PP1Y/megaplasmid; red, F199/chromosome; magenta, F199/pNL1; brown, F199/pNL2; black, US6-1/chromosome; gray, US6-1/pLA1. The numbers following the name of the oxygenases refer to the gi accession numbers of the NCBI protein database. The analysis involved 164 amino acid sequences (the sequences used to prepare the tree in Additional file 1: Figure S4 ...
In addition to genes involved in glutathione synthesis and in the reduction of oxidized glutathione, the PP1Y genome codes for 18 glutathione S-transferases (Additional file 1: Table S3). This number is about double that of E. coli and suggests that glutathione could play an important role in detoxification of toxic diesel oil components and of toxic metabolites produced by the oxidation of aromatic hydrocarbons, like epoxides and ROS.
The PP1Y genome also codes for six members of a peculiar family of very small (about 100 amino acids) monooxygenases known as "antibiotic biosynthesis monooxygenases" [45]. These enzymes are the only known monooxygenases that do not contain any metal or flavin cofactors [46], and they predominantly oxidize phenolic groups to quinones. They are involved in at least two very different physiological processes: (i) the synthesis of polyketide antibiotics (e.g. the products of ActVA-Orf6 of Streptomyces coelicolor), and (ii) the quinol redox cycle (e.g. quinol monooxygenase YgiN from E. coli). In particular, E. coli YgiN could prevent the accumulation of the semiquinone intermediate formed during the oxidation of quinols to quinones, thus minimizing the formation of free radical species [47]. At least some of the six PP1Y antibiotic biosynthesis monooxygenases could have similar functions. However, some of them could also be involved in the synthesis of secondary metabolites. It is noteworthy that PP1Y is able to inhibit the growth of molds (unpublished results), which suggests it secretes antifungal compounds.

Identification of genes involved in extracellular polymer secretion and biofilm formation
The analysis of the PP1Y genome has revealed potential regulatory mechanisms (quorum sensing, QS) and secretion systems for extracellular polymers, including polysaccharides and poly-gamma-glutamate, which may play a role in the complex "social" behavior of PP1Y, a strain able to form different types of multicellular amorphous aggregates and ordered biofilm (see also Additional file 2: Supplementary Results and Discussion). Quorum sensing is a simple molecular mechanism that results in coordinated behavior in response to cell density [48]. The presence in PP1Y of two QS systems is interesting since they could work simultaneously in response to two different cell densities or could be activated alternatively under specific conditions. Both possibilities could account for PP1Y's complex behavior.
Figure 4 (legend, partial) The control growth shown in all graphs was performed in 1% glutamic acid. Empty squares and circles: total proteins and total carbohydrates, respectively, in the control culture. Filled squares and circles: total proteins and total carbohydrates, respectively, in the cultures containing metals. Error bars are omitted for clarity; relative error was invariably lower than 8%.
Although several ORFs for sphingan synthesis have distantly related homologues in the PP1Y genome (identity <30-40%), a gene cluster similar to those present in other Sphingomonas does not exist in PP1Y. Therefore, it is unlikely that PP1Y could produce a sphingan-like polysaccharide. However, several clusters potentially coding for the synthesis of extracellular polysaccharides are distributed among the larger replicons (chromosome and Mpl), as shown in Additional file 1: Table S4, Table S5 and Figure S14A. Lpl contains two regions that are probably involved in the synthesis of exopolysaccharides (Additional file 1: Figure S14B), and are widely distributed among sphingomonads. The closest sequences can be found in S. japonicum UT26 with an identity of 70-90% at protein level. Interestingly, the region between these two pairs of ORFs in Lpl contains five ORFs coding for hypothetical glycosyl transferases and four ORFs coding for the subunits of an ABC-type polysaccharide transport system with high homology in several sphingomonads (Additional file 1: Figure S15 A-B). Lpl651 is particularly interesting as it codes for a large protein containing three glycosyl transferase-like domains. No other sphingomonad contains a representative of this subfamily of glycosyl transferases, which can be found in distantly related bacteria, suggesting another case of horizontal gene transfer. Taken together, these findings suggest that Lpl codes for the synthesis and export of one or more capsular polysaccharide(s) that probably contain mannose and rhamnose, like sphingans, but whose structures could differ from those produced by other sphingomonads.
Several biofilm-forming strains secrete cellulose as a matrix component. Lpl from PP1Y shares with Sphingobium japonicum UT26 a cluster of ORFs coding for a two-subunit cellulose synthase (Additional file 1: Figure S16A), which implicates Lpl in both biofilm synthesis and remodelling. Another CDS coding for a hypothetical cellulase is located on the chromosome (AT36325), not far from a CDS coding for an exo-1,3/1,4-beta-glucanase which could act downstream of the cellulase (endo-1,4-beta-glucanase) (Additional file 1: Figure S16B). Interestingly, PP1Y has the largest number of glycosyl hydrolases and glycosyl transferases among Sphingomonadales and related groups of alpha proteobacteria (Additional file 1: Table S6).
The PP1Y genome contains three ORFs coding for γ-PGA polymerases (Additional file 1: Figure S16C), which are involved in the synthesis of poly-gamma-glutamate, a strongly anionic homopolymer composed of glutamate residues linked by amide bonds between α-amino and γ-carboxyl groups [49]. This polymer can perform different functions, including the stabilization of the extracellular matrix, glutamate storage and the binding of toxic metals (Additional file 2: Supplementary Results and Discussion).

Conclusions
This analysis of the annotated Novosphingobium sp. PP1Y genome has revealed peculiar biochemical and biotechnological properties, namely, the metabolic pathways specifically involved in: (i) the degradation of a wide range of aromatic hydrocarbons, (ii) the resistance to toxic compounds and (iii) the QS social behavior mechanism. This detailed functional evaluation opens new translational perspectives regarding the possible manipulation of the PP1Y genome for bioremediation purposes. Moreover, the comparison between the enzymatic machinery of PP1Y and those of the other sphingomonads able to degrade environmental pollutants suggests that each sphingomonad has independently evolved its own repertoire of degradative enzymes through a complex combination of vertical heredity, horizontal gene transfers, duplications and rearrangements. This process is still ongoing, as demonstrated by the presence of multiple copies of pNL1-like regions at different locations of the PP1Y chromosome. As a consequence, even closely related strains like PP1Y, F199 and US6-1, which belong to the genus Novosphingobium, have unique features and adaptations to specific environments, including polluted ones. The analysis reported in this paper strongly supports the general belief that sphingomonads are very adaptable bacteria with extraordinary genomic plasticity. It also raises biotechnological perspectives of using sphingomonads in bioremediation processes.

Bacterial growth and DNA extraction
Novosphingobium sp. strain PP1Y was routinely grown and genomic DNA was extracted as previously described [7].

Genome sequencing and assembly
The de novo whole-genome shotgun sequencing of Novosphingobium sp. PP1Y was carried out as described in a preliminary report (EMBL database under accession numbers: FR 856862, FR 856861, FR 856860 and FR 856859 for Chr, Mpl, Lpl and Spl, respectively) [20].

Sequence annotation
Sequence annotation includes predicted ORFs, rRNAs, tRNAs and other ncRNAs, identified by using the following tools: ORFs were predicted by Grc [20] combined with the Uniref50 [21] and KEGG GENES [22] databases; rRNA and tRNA genes were identified by using RNAmmer [23] and tRNAScan-SE [24], respectively; other predicted ncRNA elements were found by Infernal using the RFAM database records as models [25,26].
In-house developed pipelines guided the whole annotation process, scheduling and running single applications on a 56-blade cluster. ORFs on the chromosome and on the mega-, large and small plasmids are identified by a number preceded by "AT", "Mpl", "Lpl" and "Spl", respectively. All the PP1Y ORFs and their protein sequences discussed in the text and/or included in the trees are available on the "Gene" database at http://www.ncbi.nlm.nih.gov/gene/.

Phylogenetic analysis
The sequences included in this study were selected by searching public protein databases with BLAST and PSI-BLAST [50]. Clustal Omega [http://www.ebi.ac.uk/Tools/msa/clustalo/] was used to obtain multiple alignments. Alignments were visualized and examined using JalView [51] and MEGA5.1 [52]. Phylogenetic trees were obtained, visualized and manipulated using MEGA5.1. Bootstrap confidence analysis was performed on 1,000 replicates using the Neighbor-Joining method [53]. The evolutionary distances were computed using the Poisson correction method [54] and were expressed as the number of amino acid substitutions per site. All positions containing gaps and missing data were eliminated.
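The Poisson correction converts the observed proportion p of differing amino acid sites into an estimated number of substitutions per site, d = -ln(1 - p), after all alignment columns with gaps or missing data have been removed. The following sketch illustrates that calculation only; it is not the MEGA implementation, and the function names and the toy alignment are hypothetical.

```python
import math


def complete_deletion(aligned_sequences, missing="-?"):
    """Drop every alignment column in which any sequence has a gap or missing symbol."""
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("aligned sequences must have equal length")
    keep = [i for i in range(length)
            if all(seq[i] not in missing for seq in aligned_sequences)]
    return ["".join(seq[i] for i in keep) for seq in aligned_sequences]


def poisson_distance(seq_a, seq_b):
    """Poisson-corrected distance d = -ln(1 - p), in substitutions per site."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    p = sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)
    if p >= 1.0:
        raise ValueError("distance is undefined when every site differs")
    return -math.log(1.0 - p)


# Toy alignment (hypothetical): column 4 contains a gap and is removed.
alignment = ["MKT-LLV", "MKTALLI", "MRTALLV"]
cleaned = complete_deletion(alignment)
d01 = poisson_distance(cleaned[0], cleaned[1])
print(cleaned, d01)
```

The resulting pairwise distances would then form the matrix given to a Neighbor-Joining tree builder.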

Subcloning, expression and activity analysis of RCDs
Open reading frames coding for RCDs were amplified by PCR using genomic DNA as template. Gene sequences were engineered to introduce an NdeI site at the 5'-end and a HindIII site at the 3'-end. PCRs were performed in a total reaction volume of 50 μl, containing 50 ng of genomic DNA, 1 μM of each primer, 0.2 mM dNTPs (Roche, Basel, Switzerland), 1× PCR buffer and 2.5 U of Platinum Pfx polymerase from Pyrococcus sp. (Invitrogen). The amplification program was optimized as follows: initial denaturation at 95°C for 2 min, followed by 20 cycles of denaturation at 92°C for 1 min, annealing at 56°C for 1 min and extension at 68°C for 1 min. The amplified fragments cut with NdeI and HindIII were cloned into the pET22b (+) expression vector (Novagen) previously cut with the same enzymes. RCDs were expressed in E. coli strain BL21(DE3) transformed with the appropriate expression vector, purified by ion-exchange chromatography on Q-Sepharose FF resin and analyzed for quality as described previously [55]. Assays were performed at 25°C in 50 mM Tris/HCl (pH 7.5) in a final volume of 500 μl by spectrophotometric determination of the product of the reaction as described elsewhere [55]. The amount of the products was measured using their extinction coefficients: ε388 = 13,800 M⁻¹ cm⁻¹ for the product of 3-methylcatechol (3-MC) [55]; ε434 = 13,200 M⁻¹ cm⁻¹ for the product of 2,3-dihydroxybiphenyl (2,3-DHBP) [56]; ε298 = 9,100 M⁻¹ cm⁻¹ for the product of 4-hydroxy-oestradiol (4-OHE). One unit of enzyme activity was defined as the amount of enzyme required to form 1 μmol of the product per minute under the assay conditions. Specific activity is given as units per milligram of protein.
Synthesis of 4-OHE was achieved by Dr. Pezzella (Department of Chemistry, University of Naples Federico II) via the o-iodoxybenzoic acid (IBX)-mediated phenolic oxygenation procedure as previously described [57]. All chemicals were of the highest grade available and were from Amersham Biosciences, Promega, New England Biolabs, Sigma, ABCR GmbH, Fluka, or Applichem. Escherichia coli strain BL21 (DE3) and plasmid pET22b (+) were purchased from Novagen (Madison, WI, USA). DNA sequencing and oligonucleotide synthesis were performed by Eurofins MWG Operon (Germany).
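The conversion from an absorbance change to enzyme units follows from the Beer-Lambert law: the rate of product formation in mol per litre per minute is ΔA/min divided by ε times the path length, and multiplying by the assay volume gives μmol per minute. The sketch below uses the extinction coefficients and the 500 μl assay volume stated above; the 1 cm path length, the absorbance slope and the protein amount are assumptions chosen for illustration, and the function names are hypothetical.

```python
# Extinction coefficients of the ring cleavage products (M^-1 cm^-1), from the text.
EPSILON = {"3-MC": 13800.0, "2,3-DHBP": 13200.0, "4-OHE": 9100.0}


def enzyme_units(delta_abs_per_min, epsilon, volume_ml=0.5, path_cm=1.0):
    """Activity in units (μmol of product per minute) from an absorbance slope.

    path_cm = 1.0 is an assumed cuvette path length, not stated in the text.
    """
    molar_per_min = delta_abs_per_min / (epsilon * path_cm)  # mol L^-1 min^-1
    litres = volume_ml / 1000.0
    return molar_per_min * litres * 1e6  # mol -> μmol


def specific_activity(units, protein_mg):
    """Units per milligram of protein."""
    return units / protein_mg


# Hypothetical measurement: 0.138 absorbance units per minute at 388 nm, 1 μg enzyme.
units = enzyme_units(0.138, EPSILON["3-MC"])
activity = specific_activity(units, protein_mg=0.001)
print(units, activity)
```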

Availability of supporting data
The following additional data are available with the online version of this paper: Additional file 1, which includes Tables S1 to S6 and Figures S1 to S16; and Additional file 2, which includes Supplementary Results and Discussion. Phylogenetic tree Newick files are available online as Additional file 3.