Complete genome of Phenylobacterium zucineum – a novel facultative intracellular bacterium isolated from human erythroleukemia cell line K562

Background Phenylobacterium zucineum is a recently identified facultative intracellular species isolated from the human leukemia cell line K562. Unlike the known intracellular pathogens, P. zucineum maintains a stable association with its host cell without affecting the growth and morphology of the latter. Results Here, we report the whole genome sequence of the type strain HLK1T. The genome consists of a circular chromosome (3,996,255 bp) and a circular plasmid (382,976 bp). It encodes 3,861 putative proteins, 42 tRNAs, and a 16S-23S-5S rRNA operon. Comparative genomic analysis revealed that it is phylogenetically closest to Caulobacter crescentus, a model species for cell cycle research. Notably, P. zucineum has a gene that is strikingly similar, both structurally and functionally, to the cell cycle master regulator CtrA of C. crescentus, and most of the genes directly regulated by CtrA in the latter have orthologs in the former. Conclusion This work presents the first complete bacterial genome in the genus Phenylobacterium. Comparative genomic analysis indicated that the CtrA regulon is well conserved between C. crescentus and P. zucineum.


Background
Phenylobacterium zucineum strain HLK1 T is a facultative intracellular microbe recently identified by us [1]. It is a rod-shaped Gram-negative bacterium 0.3-0.5 × 0.5-2 μm in size. It belongs to the genus Phenylobacterium [2], which presently comprises 5 species, P. lituiforme (FaiI3T) [3], P. falsum (AC49T) [4], P. immobile (ET) [2], P. koreense (Slu-01T) [5], and P. zucineum (HLK1 T ) [1]. They were isolated from subsurface aquifer, alkaline groundwater, soil, activated sludge from a wastewater treatment plant, and the human leukemia cell line K562, respectively. Except for P. zucineum, they are environmental bacteria, and there is no evidence that these microbes are associated with eukaryotic cells. The HLK1 T strain, therefore, represents the only species so far in the genus Phenylobacterium that can infect and survive in human cells. Since most, if not all, of the known microbes that can invade human cells are pathogenic, we proposed that HLK1 T may have pathogenic relevance to humans [1]. Unlike the known intracellular pathogens that undergo a cycle involving invasion, overgrowth, and disruption of the host cells, and repeating the cycle by invading new cells, HLK1 T is able to establish a stable parasitic association with its host, i.e., the strain does not overgrow intracellularly to kill the host, and the host cells carry them to their progeny. One cell line (SW480) infected with P. zucineum has been stably maintained for nearly three years in our lab (data not shown).
In this report, we present the complete genome sequence of P. zucineum.

Genome anatomy
The genome is composed of a circular chromosome (3,996,255 bp) and a circular plasmid (382,976 bp) (Figure 1; Table 1). The G + C contents of chromosome and plasmid are 71.35% and 68.5%, respectively. There are 3,861 putative protein-coding genes (3,534 in the chromosome and 327 in the plasmid), of which 3,180 have significant matches in the non-redundant protein database. Of the matches, 585 are conserved hypothetical proteins and 2,595 are proteins with known or predicted functions. Forty-two tRNA genes and one 16S-23S-5S rRNA operon were identified in the chromosome.
There are 7 families of protein-coding repetitive sequences and a family of noncoding repeats in the genome (Table  2). Notably, identical copies of repeats 02-04 were found in both the chromosome and the plasmid, suggesting their potential involvement in homologous recombination.
On the basis of COG (Cluster of Orthologous Groups) classification, the chromosome is enriched in genes for basic metabolism, such as categories E (amino acid transport and metabolism) and I (lipid transport and metabolism), accounting for 8.29% and 6.09% of the total genes in the chromosome, respectively. On the other hand, the plasmid is enriched for genes in categories O (posttranslational modification, protein turnover, chaperones) and T (signal transduction mechanisms), constituting 12.96% and 9.72% of the total genes in the plasmid, respectively.
As to genes in the plasmid that cope with environmental stimuli, about half of the genes in category O are molecular chaperones (17/32), including 2 dnaJ-like molecular chaperones, 2 clusters of dnaK and its co-chaperonin grpE (PHZ_p0053-0054 and PHZ_p0121-122), a cluster of groEL and its co-chaperonin groES (PHZ_p0095-0096), and 9 heat shock proteins Hsp20. Of 23 genes in category T, there is one cluster (FixLJ, PHZ_p0187-0188), which is essential for the growth of C. crescentus under hypoxic conditions [6].

General metabolism
The enzyme sets of glycolysis and the Entner-Doudoroff pathway are complete in the genome. All genes comprising the pentose phosphate pathway except gluconate kinase were identified, consistent with our previous experimental result that the strain cannot utilize gluconate [1]. The genome lacks two enzymes (kdh, alpha ketoglutarate dehydrogenase and kgd, alpha ketoglutarate decarboxylase), making the oxidative and reductive branches of the tricarboxylic acid cycle operate separately. The genome has all the genes for the synthesis of fatty acids, 20 amino acids, and corresponding tRNAs. Although full sets of genes for the biosynthesis of purine and pyrimidine were identified, enzymes for the salvage pathways of purine (apt, adenine phosphoribosyltransferase; ade, adenine deaminase) and pyrimidine (cdd, cytidine deaminase; codA, cytosine deaminase; tdk, thymidine kinase; deoA, thymidine phosphorylase; upp, uracil phosphoribosyltransferase; udk, uridine kinase; and udp, uridine phosphorylase) were absent. The plasmid encodes some metabolic enzymes, such as those participating in glycolysis, the pentose phosphate pathway, and the citric acid cycle. However, it is worth noting that the plasmid has a gene (6-phosphogluconate dehydrogenase) that is the only copy in the genome (PHZ_p0183).
Like most other species in the genus Phenylobacterium, the strain is able to use L-phenylalanine as a sole carbon source under aerobic conditions [1]. A recent study revealed that phenylalanine can be completely degraded through the homogentisate pathway in Pseudomonas putida U [7]. P. zucineum may use the same strategy to utilize phenylalanine, because all the enzymes for the conversion of phenylalanine through intermediate homogentisate to the final products fumarate and acetoacetate are present in the chromosome (Table 3).  (Table 4). Notably, we annotated the proteins of 93 bacteria (see  1 Size in base pairs of the consensus and the direct repeat (DR) generated by insertion into the genome target site. 2 A copy is complete if the length of the repeat is ≥ 90% of the consensus, otherwise, the copy is partial. 3 One complete copy which harbors a 7 bp direct repeat (TCCTAAC) that disrupts the VirD4. 4 The partial copy is located in the plasmid. 5 Two partial copies are located in the chromosome, of which a "partial" copy with full length is inserted by a copy of repeat04. 6 repeats 01-04 are IS elements methods -comparative genomics) with the same annotation criteria used for P. zucineum and found that the fraction of two-component signal transduction proteins and transcriptional regulators was positively correlated with the capacity for environmental adaptation ( Figure 2). The genome contains 17 extracytoplasmic function (ECF) sigma factors (3 in the plasmid) ( Table 5). ECFs are suggested to play a role in environmental adaptation for Pseudomonas putida KT2440, whose genome contains 19 ECFs [8]. P. zucineum has 3 heat shock sigma factors rpoH (2 in the plasmid) and 33 heat shock molecular chaperons (17 in the plasmid) ( Table 6), which can cope with a variety of stresses, including cellular energy depletion, extreme concentrations of heavy metals, and various toxic substances. [9].
The genes for cell motility include 3 chemotaxis operons, 7 MCP (methyl-accepting chemotaxis) genes, 15 other genes related to chemotaxis (Table 7), and 43 genes for the biogenesis of the flagellum ( Table 8).
The genome contains sec-dependent, sec-independent, typical type II (Table 9) and IV secretion systems (Table  10), which are known to play important roles in adapting to diverse conditions [10,11].
To better understand the roles of proteins responsible for environmental transition, we computed the distributions of those proteins in 5 representative alphaproteobacteria with typical habitats (see methods -comparative genomics). Like other multiple bacteria and facultative bacteria, which can survive in multiple niches, P. zucineum encodes a higher fraction of ECFs, transcriptional regulators and two-component signal transduction proteins than obligate bacteria (Table 9). Notably, P. zucineum has the largest number of heat shock related proteins (Table 6), in comparison to the 5 representative alphaproteobacteria  and 93 bacteria (data not shown). Among the plasmidencoded heat shock related proteins are 2 RpoH (PHZ_p0049 and PHZ_p0288) and 2 DnaK-GrpE clusters (PHZ_p0053-0054 and PHZ_p0121-0122). Further phylogenetic analysis suggested that the plasmid-encoded DnaK-GrpE clusters may have undergone a genus-specific gene duplication event ( Figure 3C &3D).

Adaptation to an intracellular life cycle
To survive intracellularly, P. zucineum must succeed in adhering to and subsequently invading the host cell [12], defending against a hostile intracellular environment [13][14][15][16], and capturing iron at very low concentration [17].
It is well known that the pilus takes part in adhering to and invading a host cell [12]. We identified one pili biosynthesis gene (pilA) and 2 operons for pili biosynthesis ( Table 11).
The genes involved in defense against oxidative stress include superoxide dismutase (PHZ_c0927, PHZ_c1092), catalase (PHZ_c2899), peroxiredoxin (PHZ_c1548), hydroperoxide reductase (ahpF, alkyl hydroperoxide  Since intracellular free Fe is not sufficient to support the life of bacteria, to survive intracellularly, they must use protein-bound iron, such as heme and transferrin, via transporters and/or the siderophore system. The P. zucineum genome has one ABC type siderophore transporter system (PHZ_c1893-1895), one ABC type heme transporter system (PHZ_c0136, PHZ_c0139, PHZ_c0140), and 60 TonB-dependent receptors which may uptake the iron-siderophore complex (Table 12).

Comparative genomics between P. zucineum and C. crescentus
Comparative genomic analysis demonstrated that P. zucineum is phylogenetically the closest to C. crescentus [18] (Figure 4), consistent with the phylogenetic analysis based on 16S RNA gene sequences ( Figure 5).
Though the genome size and protein number of P. zucineum (4.37 Mb, 3,861 proteins) are similar to those of C. crescentus (4.01 Mb, 3,767 proteins), no large-scale synteny was found between the genomes. The largest synteny region is only about 30 kb that encodes 24 proteins. The conservation region with the largest number of proteins is the operon encoding 27 ribosomal proteins. In addition, the species share only 57.8% (2,231/3,861) of orthologous proteins. Categories J (translation, ribosomal structure and biogenesis), F (nucleotide transport and metabolism), and L (replication, recombination and repair) are the top 3 conservative COG categories between the species, sharing 88.01%, 81.67%, and 80.65% of the orthologs, respectively.

Comparison of cell cycle genes between P. zucineum and C. crescentus
Since P. zucineum is phylogenetically closest to C. crescentus, and since the latter is a model organism for studies of the prokaryotic cell cycle [19,20], we compared the genes regulating the cell cycle between these species.
The cell cycle of C. crescentus is controlled to a large extent by the master regulator CtrA, which controls the transcription of 95 genes involved in the cycle [19,20]. On the other hand, ctrA is regulated at the levels of transcription, phosphorylation, and proteolytic degradation by its target genes, e.g., DNA methyltransferase (CcrM) regulates the transcription of ctrA, histidine kinases (CckA, PleC, DivJ, DivL) regulate its activity, and ClpXP degrades it. These regulatory 'loops' enable CtrA to precisely control the progression of the cell cycle.
P. zucineum has most of the orthologs mentioned above (Table 13). Among the 95 CtrA-regulated genes in C. crescentus, 75 have orthologs in the P. zucineum genome (Additional file 1). The fraction of CtrA-regulated genes with orthologs in P. zucineum (76.9%, 73/95) is significantly greater than the mean level of the whole genome (57.8%, 2,231/3,861), indicating that the CtrA regulatory system is highly conserved. Genes participating in regulating central events of the cell cycle, such as CcrM (CC0378), Clp protease (CC1963) and 14 regulatory proteins, except for one response regulator (CC3286), are present in the P. zucineum genome. The genes without counterparts in P. zucineum are mostly for functionally unknown proteins.
Notably, the sequence of CtrA is strikingly similar between P. zucineum and C. crescentus, with 93.07% identity of amino acid sequence and 89.88% identity of nucleotide sequence. In addition, they share identical promoters (p1 and p2) [21] and the motif (GAnTC) recognized by DNA methyltransferase (CcrM) (Figure 6) [22], suggesting that they probably share a similar regulatory loop of CtrA.
Consistent with the results from in silico sequence analysis, the CtrA of P. zucineum can restore the growth of temperature-sensitive strain LC2195 (a CtrA mutant) of C. crescentus [23] at 37°C, indicating that the CtrA of P. zuci-Comparative analysis of transcriptional regulators and two-component signal transduction proteins in 6 groups of bacte-ria classified according to their habitats neum can functionally compliment that of C. crescentus in our experimental conditions (data not shown).
Taken together, the comparative genomics of P. zucineum and C. crescentus suggests that the cell cycle of the former is likely to be regulated similarly to that of the latter.

Presence of ESTs of the strain in human
Since P. zucineum strain HLK1 T can invade and persistently live in several human cell lines [1], we were curious about whether this microbe can infect humans. By blasting against the human EST database (dbEST release 041307 with 7,974,440 human ESTs) with the whole genome sequence of P. zucineum, we found 9 matched ESTs (Table  14), of which 3 were from a library constructed from tissue adjacent to a breast cancer, and 6 were from a library constructed from a cell line of lymphatic origin. The preliminary data suggest that P. zucineum may invade humans.

Conclusion
This work presents the first complete bacterial genome in the genus Phenylobacterium. Genome analysis reveals the fundamental basis for this strain to invade and persistently survive in human cells. P. zucineum is phylogeneti-

Methods
Bacterial growth and genomic library construction P. zucineum strain HLK1 T was grown in LB (Luria-Bertani) broth at 37°C and then harvested for the preparation of genomic DNA [1]. Genomic DNA was prepared using a bacterial genomic DNA purification kit (V-Gene Biotech., Hangzhou, China) according to the manufacturer's instructions. Sheared DNA samples were fractionated to construct three different genomic libraries, containing average insert sizes of 2.0-2.5 kb, 2.5-3.0 kb and 3.5-4.0 kb. The resulting pUC18-derived library plasmids were extracted using the alkaline lysis method and subjected to direct DNA sequencing with automated capillary DNA sequencers (ABI3730 or MegaBACE1000).

Sequencing and finishing
The genome of P. zucineum was sequenced by means of the whole genome shotgun method with the phred/ phrap/consed software packages [24][25][26][27]. Sequencing and   [42]. 2 According to our recent publication [1], P. zucineum was classified as "facultative". 3 Given that G. oxydans is often isolated from sugary niches (such as flowers and fruits) and associated soil (such as garden soil and baker's soil) [43], we classified G. oxydans as "multiple". subsequent gene identification was carried out as described in our earlier publications [28][29][30]. Briefly, during the shotgun sequence phase, clones were picked randomly from three shotgun libraries and then sequenced from both ends. 44,667 successful sequence reads (>100 bp at Phred value Q13), accounting for 5.47× sequence coverage of the genome, were assembled into 563 sequence contigs representing 60 scaffolds connected by end-pairing information.
The finishing phase involved iterative cycles of laboratory work and computational analysis. To reduce the numbers of scaffolds, reads were added into initial contig assembly by using failed universal primers as primers and by using plasmid clones that extended outwards from the scaffolds as sequence reaction templates. To resolve the low-quality regions, resequencing of the involved reads in low quality regions with universal primers and primer walking the plasmid clones were the first choice, otherwise, rese-  quencing with alternate temperature conditions resolved the remaining low-quality regions. New sequence reads obtained from the above laboratory work were assembled into existing contigs, which yielded new contigs and new scaffolds connected by end-pairing information. Then, consed interface helped us to do nest round of laboratory work based on new arisen contig assembly. After about four iterative cycles of the above "finish" procedures to close gaps and to resolve the low-quality regions, the PCR product obtained by using total genomic DNA as template was sequenced from both ends to close the last physical gap. In addition, the overall sequence quality of the genome was further improved by using the following criteria: (1) two independent high-quality reads as minimal coverage, and (2) Phred quality value = Q40 for each given base. Collectively, 3,542 successful reads were incorporated into initial assembles during the finishing phase. The final assembly was composed of two circular "contigs", of which a smaller one with a protein cluster (including repA, repB, parA and parB) related to plasmid replication was assigned as the plasmid, and the larger one was the chromosome.

Neighbor
Annotation tRNA genes were predicted with tRNAscan-SE [31]. Repetitive sequences were detected by REPuter [32,33], coupled with intensive manual alignment. We identified and annotated the protein profiles of chromosome and plasmid with the same workstream. For the chromosome, the first set of potential CDSs in the chromosome was established with Glimmer 2.0 trained with a set of ORFs longer than 500 bp from its genomic sequence at default settings [34]. The resulting 5,029 predicted CDSs were BLAST searched against the NCBI non-redundant protein database to determine their homology [35]. 1,174 annotated proteins without the word "hypothetical" or "unknown" in their function description, and without frameshifts or in-frame stop codons, were selected as the second training set. The resulting second set of 4,018 predicted CDSs (assigned as "predicted CDSs") were searched against the NCBI non-redundant protein database. Predicted CDSs that accorded with the following BLAST search criteria were considered "true proteins": (1) 80% of the query sequence was aligned and (2) E-value ≤ 1e -10 . Then, the ORFs extracted from the chromosome region among "true proteins" were searched against the NCBI non-redundant protein database. The ORFs satisfying the same criteria as true proteins were considered "true ORFs". Overlapping proteins were manually inspected and resolved, according to the principle we described previously [30]. The final version of the protein profile comprised three parts: true proteins, true ORFs, and predicted CDSs located in the rest of the genome. The translational start codon of each protein was identified by the widely used RBS script [36] and then refined by comparison with homologous proteins [30].
To further investigate the function of each protein, we used InterProScan to search against the InterPro protein family database [37]. The up-to-date KEGG pathway database was used for pathway analysis [38]. All proteins were searched against the COG database which included 66 completed genomes [39,40]. The final annotation was manually inspected by comprehensively integrating the results from searching against the databases of nr, COG, KEGG, and InterPro.

Phylogenetic tree construction
16S rRNA genes were retrieved from 63 alphaproteobacteria, P. zucineum and Escherichia coli O157:H7 EDL933. A neighbor-joining tree with bootstrapping was built using List of top 10 complete sequenced bacteria closest to P. zuci-neum Figure 4 List of top 10 complete sequenced bacteria closest to P. zucineum. All 10 are alphaproteobacteria. Among all the sequenced bacterial genomes, C. crescentus shares the greatest number of similar ORFs with P. zucineum Neighbor-joining tree of the alphaproteobacteria, inferred from 16S rRNA genes Figure 5 Neighbor-joining tree of the alphaproteobacteria, inferred from 16S rRNA genes. The node labels are bootstrap values (100 replicates). C. crescentus is phylogenetically the closest to P. zucineum.

Comparative genomics
Sequence data for comparative analyses were obtained from the NCBI database ftp://ftp.ncbi.nlm.nih.gov/gen bank/genomes/Bacteria/. The database has 520 completely sequenced bacterial genomes (sequences downloaded on 2007/06/05). All P. zucineum ORFs were searched against the ORFs from all other bacterial genomes with BLASTP. The number of P. zucineum ORFs matched to each genome with significance (E value = 1e -10 ) was calculated.
To illustrate the contribution of transcriptional regulators and two-component signal transduction proteins to environmental adaptation, we compared the mean fraction of these two types of proteins in bacteria living in 6 different habitats, as described by Merav Parter [42]. These are: (1) obligate bacteria that are necessarily associated with a host, (2) specialized bacteria that live in specific environments, such as marine thermal vents, (3) aquatic bacteria that live in fresh or seawater, (4) facultative bacteria, freeliving bacteria that are often associated with a host, (5) multiple bacteria that live in many different environments, and (6) terrestrial bacteria that live in the soil. For bacteria with more than one sequenced strain, we chose only one strain for the comparative study. The numbers of bacterial species in each group were: 26 obligate, 5 specialized, 4 aquatic, 28 facultative, 27 multiple, and 3 terrestrial. We annotated the proteins of these 93 species with the same workflow used for P. zucineum and calculated the mean fraction of transcriptional regulators and two-component signal transduction proteins.
In addition, we annotated the ORFs of 5 representative alphaproteobacteria with different habitats (multiple bacteria S. meliloti 1021 and G. oxydans 621H, facultative bacterium B. suis 1330, aquatic bacterium C. crescentus CB15, and obligate bacterium R. conorii str. Malish 7) using the same workflow and computed the distributions of proteins involved in environmental adaptation.

Ortholog identification
All proteins encoded by one genome were BLASTP searched against a database of proteins encoded by another genome [35], and vice versa. The threshold used in these comparisons was 1e -10 . Orthology was identified if two proteins were each other's best BLASTP hit (best reciprocal match).

Data accessibility
The sequences reported in this paper have been deposited in the GenBank database. The accession numbers for chromosome and plasmid are CP000747 and CP000748, respectively. Nucleotide acid sequence alignment of the ctrA promoter regions (-200 to +21) of C. crescentus and P. zucineum Figure 6 Nucleotide acid sequence alignment of the ctrA promoter regions (-200 to +21) of C. crescentus and P. zucineum. Blue background: identical nucleotides; "-": gaps; red and black box: P1 and P2 promoter; black underline: motif recognized by CcrM; red underline: first 21 nucleotides starting with initial codon "ATG.".