Bacteria strains sequenced
The four bacteria strains that were sequenced in this study were isolated from the hemolymph of wild caught Drosophila melanogaster. They are the Providencia sneebia Type strain (DSM 19967) [GenBank:AKKN00000000], Providencia rettgeri strain Dmel1 [GenBank:AJSB00000000], Providencia alcalifaciens strain Dmel2 [GenBank:AKKM00000000], and Providencia burhodogranariea Type strain (DSM 19968) [Genbank:AKKL00000000]. The versions described in this paper are the first versions, XXXX01000000.
Genome sequencing and assembly of P. rettgeri and P. sneebia
Bacterial DNA was extracted using Puregene DNA Purification Kit (Gentra Systems Inc., Minneapolis, MN) according to the manufacturer’s directions for Gram-negative bacteria. The DNA was then sequenced using FLX Roche/454 Sequencing Technology at Cornell University’s Life Science Core Laboratory Center in Ithaca, NY.
P. sneebia and P. rettgeri were sequenced with approximately 500,000 reads of an average length of 250 bp, providing about 30X coverage for each genome. The sequences for each species were obtained from separate full-plate sequencing runs and independently assembled using the LaserGene SeqMan (version 8) software with the manufacturer’s recommended parameters (Additional file
1: Figure S1). An additional 180,000 reads were obtained from a 454 sequencing plate on which the DNA from P. sneebia and P. rettgeri was separated by a rubber gasket. During the sequencing process, this gasket leaked allowing a very small amount of reciprocal contamination. We did not want to discard these sequences entirely, but we also wanted to avoid any contaminating reads fouling our assemblies. Therefore our second assembly step was to assemble the reads from the half-plate to those contigs initially assembled with the uncontaminated full-plate reads using SeqMan (Additional file
1: Figure S1). From this second step of assembly, we retained: (1) contigs that contained half-plate reads assembled to full-plate contigs, increasing the depth of those contigs, (2) contigs in which half-plate reads bridged previously separate contigs from the full-plate assemblies, allowing them to be stitched together, and (3) novel contigs containing only half-plate reads but with a coverage depth of 30X or greater. Contaminating sequences in the half-plate reads would presumably fail to map to full-plate assemblies or would result in low-coverage contigs, so we infer that the small number of molecules that leaked through the gasket have been effectively discarded. After the second round of assembly, the P. sneebia genome was mapped into 72 contigs and the P. rettgeri genome was mapped into 71 contigs.
As we were annotating the P. sneebia and P. rettgeri genome sequences (see “Annotation methods” section below), we noticed several instances of sequential open reading frames (ORFs) that were annotated with the same predicted function and whose combined length equaled the size of genes with the same functional annotation in other bacteria. Closer inspection revealed that these instances were generally due to a stop codon or frameshift mutation that interrupted the ORF, causing it to be annotated as two genes with identical function. These inferred mutations tended to happen after short homopolymer runs. Individual reads varied in the lengths of these homopolymer sequences, and the contig assembly often did not reflect the most common sequence length among the reads. It is a known problem that Roche/454 Sequencing often results in errors in homopolymer run lengths
. To improve the accuracy of inferred homopolymer lengths, we re-aligned all of the Roche/454 sequencing reads to our assembled reference sequences using the program BWA
 (Additional file
1: Figure S1). The consensus homopolymer length from the BWA alignment was used to fix the assembled contigs before any further analysis was performed. This correction improved our gene annotations by eliminating sequencing errors that interrupted ORFs.
We aimed to improve the assemblies of our P. sneebia and P. rettgeri genomes by joining contigs through PCR followed by direct Sanger sequencing. However, the order and orientation of the contigs was unknown. We hypothesized that there would be synteny among the genomes of Providencia species and isolates as well as species in the closely related genus Proteus, which we could use to predict the order and orientations of the contigs in our assemblies (Additional file
1: Figure S1). We used MUMmer (version 3.22)
 to compare P. sneebia and P. rettgeri to the draft genomes of Providencia rettgeri DSM 1131 (283 contigs) [GenBank:ACCI00000000], Providencia alcalifaciens DSM 30120 (79 contigs) [GenBank:ABXW00000000], Providencia stuartii ATCC 25827 (120 contigs) [GenBank:ABJD00000000], and Providencia rustigianii DSM 4541 (127 contigs) [GenBank:ABXV00000000] as well as the completed genome of Proteus mirabilis strain HI4320 [GenBank:NC_010554.1]. Where two of our P. rettgeri or P. sneebia contigs had similarity to a single contig of one of the other genome sequences, we designed PCR primers to amplify across the inferred gap. Successful amplifications were sequenced by primer walking and the resultant sequences were used to bridge contigs in the assemblies. PCR and sequencing methods are described below in the “PCR and Sanger sequencing methods” section. We found that designing the primers inset about 900 bp from the contig breakpoints helped to ensure specificity in amplification, especially because repetitive sequences in the genome can be the cause of contig breaks in genome assemblies. Using this method, we reduced the number of contigs in the P. sneebia assembly from 72 to 67 and in the P. rettgeri assembly from 71 to 64.
To further connect the P. sneebia and P. rettgeri assemblies, we contracted the MapIt optical mapping service from OpGen, Inc. (Gaithersburg, MD) and analyzed the resulting data using their program MapSolver (version 2.1.1) (Additional file
1: Figure S1). An in silico digestion of our contigs allowed them to be oriented onto an in vitro restriction digestion map of each bacterium’s physical genome. This ordered and oriented the contigs and allowed us to identify those contigs that comprised the majority of each genome. We used the optical map to identify physically consecutive contigs, then designed primers for PCR and Sanger sequencing to close most of the remaining gaps. The optical map also indicated a small number of computational misassemblies and allowed them to be fixed. After optical mapping and final gap closing, our draft genome sequences were assembled into 14 contigs for P. sneebia and 9 contigs for P. rettgeri.
Genome sequencing and assembly of P. alcalifaciens and P. burhodogranariea
P. alcalifaciens and P. burhodogranariea were sequenced by paired-end 454 sequencing. Libraries were constructed for each bacterium with an approximately 3 kb insert size, and roughly 1 million paired-end sequence reads of an average length of 250 bases were collected. The paired-end reads were assembled using Roche/454’s Newbler Assembler (version 2.5.3), which resulted in 15 scaffolds for P. alcalifaciens and 8 scaffolds for P. burhodogranariea, sequenced to roughly 35X coverage.
PCR and Sanger sequencing methods
PCR primers for gap closing were designed either using Primer3
 or with a primer design function within SeqMan. PCRs were performed using a genomic DNA template with a final concentration of 1.2 ng/μl in each PCR reaction volume.
When the size of the expected product was unknown or was expected to be less than 5 kb, the PCR was done with Taq polymerase (New England Biolabs, Beverly, MA) with an annealing temperature gradient ranging from 2°C higher to 2°C lower than the melting temperature of the primers. PCR cycling parameters were as follows: (1) 2 minutes at 95°C, (2) 30 seconds at 95°C, (3) 30 seconds at annealing temperature gradient, (4) 1 minute at 72°C, (5) repeat steps 2–4 for 34 more cycles, (6) 5 minutes at 72°C. 3.5 μl of each PCR product was prepared for sequencing by treatment with 5 units of Exonuclease I (USB Corp., Cleveland, OH) and 0.5 units of shrimp alkaline phosphatase (USB Corp., Cleveland, OH) at 37°C for one hour before heat-killing the enzymes at 65°C for 15 minutes. PCR products were then directly sequenced using ABI BigDye Terminator (Applied Biosystems, Foster City, CA) according to the manufacturer’s directions.
PCR for products with an expected size greater than 5 kb was done using high fidelity iProof polymerase (Bio-Rad, Hercules, CA). Annealing temperatures and extension times were determined using manufacturer’s suggested methods. The PCR cycling parameters were: (1) 30 seconds at 98°C, (2) 7 seconds at 98°C, (3) 20 seconds at appropriate annealing temperature, (4) appropriate extension time at 72°C, (5) repeat steps 2–4 for 29 more cycles, (6) 7 minutes at 72°C. Products were prepared for sequencing with PCR purification clean up columns (Invitrogen, Calsbad, CA) before being sequenced directly.
Genomic open reading frames were determined and annotated using the RAST Server (version 4)
. Gene ontology terms (GO terms) were assigned to the ORFs identified by RAST using Blast2GO (version 2.5)
. Fisher’s Exact Test for enriched GO categories was done within Blast2GO using a p-value cut off of 0.05 after adjusting for a standard false discovery rate (FDR) of 0.05 for multiple testing.
Plasmid identifications and analysis
The circular DNA structure of plasmids means that they will appear to be arbitrary broken when forming linear contigs during assembly of sequencing reads. We tested all potential plasmid contigs for a circular physical structure by designing PCR primers approximately 500 bases from the ends of the contig facing outward off each end of the contig. This primer design means that a product would be formed only if the ends of the contig were connected in the physical DNA. Any PCR product amplified from such primers was then sequenced with Sanger sequencing to confirm that the sequence supported a circular physical arrangement of the sequence. PCR reactions and Sanger sequencing was done as described above in “PCR and Sanger sequencing methods”.
Putative plasmids were identified in multiple ways. We speculated that one P. rettgeri contig might be a plasmid because it had substantially higher depth of coverage than other contigs, and when we compared the contig to itself using MUMmer, we found that it was composed of the same sequence repeated multiple times. We hypothesized that the contig might actually represent a completely sequenced, high copy number plasmid, and that the circular shape of the physical DNA sequence was resulting in a tandem repeat of the sequence in the in silico assembly. PCR and Sanger sequencing confirmed that this contig is a plasmid.
To more systematically assess whether contigs from the assemblies were plasmids, we looked for contigs with identical sequence present at both ends. We hypothesized that the arbitrary break point of the physical circular structure to form a linear contig could result in identical sequence at each end of the contig. We used MUMmer to compare contigs to themselves. For P. sneebia and P. rettgeri, we examined all contigs which did not align to the optical map since we do not expect plasmids to map to the chromosomal genome. We identified three P. sneebia plasmids using this method, all of which were confirmed by PCR and Sanger sequencing. We identified no additional P. rettgeri contigs as putative plasmids using this approach. For P. alcalifaciens and P. burhodogranariea, we examined every scaffold smaller than 6 kb in length, but none contained the same sequence at both ends of the contig. It should be noted that this method may result in an underestimation of plasmids since it will not detect plasmids that assemble into multiple contigs.
When examining the synteny of the genomes (see “Synteny and regional comparisons” section below), we noticed that some of the contigs of P. alcalifaciens had no similarity to sequences in any of the other genomes. We hypothesized that these contigs could be plasmids that are unique to P. alcalifaciens. We tested four contigs by PCR and Sanger sequencing. One contig was confirmed to be a plasmid while the other three did not produce PCR products and therefore showed no evidence of being plasmids.
All putative plasmids were compared to each other using MUMmer. In order to determine whether our confirmed plasmids or previously sequenced Providencia plasmids
[25–27] were present but undetected in the remaining sequences from this study, we constructed a BLAST database of all of the reads from the Roche/454 sequencing of each individual species and searched for reads matching each Providencia plasmid using BLAST+ (version 2.2.25). Four previously identified Providencia plasmids were used as query sequences: pDIJ09-518a [GenBank:HQ834472.1], pGHS09-09a [GenBank:HQ834473.1], pMR0211 [GenBank:JN687470.1], and R7K [Genbank:NC_010643.1].
Identification of orthologs
Orthologous genes were identified as shared among three different sets of bacteria: (1) the strains of P. sneebia, P. rettgeri, P. alcalifaciens, and P. burhodogranariea sequenced in this study; (2) the strains of P. sneebia, P. rettgeri, P. alcalifaciens, and P. burhodogranariea sequenced in this study plus the strains of P. rettgeri, P. alcalifaciens, P. stuartii, and P. rustigianii sequenced as part of the Human Microbiome Project
; (3) the strains of P. sneebia, P. rettgeri, P. alcalifaciens, and P. burhodogranariea sequenced in this study plus the strains of P. rettgeri, P. alcalifaciens, P. stuartii, and P. rustigianii sequenced as part of the Human Microbiome Project plus the outgroup Proteus mirabilis strain HI4320. Orthologous gene clusters were identified using OrthoMCL (version 2.0.2)
. BLAST results used within OrthoMCL were performed with an e-value cut off of 10-10. The output from OrthoMCL was parsed using custom Python scripts.
Alignments of orthologs
Orthologous gene clusters found among the strains sequenced in this study, those Providencia sequenced as part of the Human Microbiome Project, and Proteus mirabilis were aligned for phylogenetic analysis (see “Phylogenetic analysis” section below). Those orthologs of the strains sequenced in this study only were aligned for use in the recombination and positive selection analyses (see “Recombination analysis” and “Positive selection analysis” sections below). Only clusters with clear single gene orthology across each genome were retained. Alignments of the protein translation of the genes were done using ClustalW (version 2.1)
 followed by back-translation to the nucleotide alignment using PAL2NAL (version 13)
. Alignments were visually inspected and poor alignments were removed as follows. We eliminated alignments where the difference in amino acid identity between the most-similar and least-similar pairs of species were greater than 60% out of concern that these might not be true orthologs. We also excluded alignments that had both an average protein identity that was less than 65% and a difference between the highest and lowest pair-wise protein identities greater than 35%. Those alignments that had an average protein identity of less than 75% were examined by hand to ensure proper alignment.
The alignments of all 1651 ortholog clusters that included all eight sequenced Providencia genomes and Proteus mirabilis were concatenated using FASconCAT (version 1)
. RAxML (version 7.2.8)
 was used to construct the phylogenetic trees for the concatenation of all orthologous genes. Proteus mirabilis was set as an outgroup.
Synteny and regional comparisons
Synteny among genomes was examined using Mauve (version 2.3.1)
, Artemis Comparison Tool (version 1)
, and MUMmer. Comparisons of particular regions of the genomes were done using EasyFig (version 1.2)
Evidence for recombination was examined by executing the programs GENECONV which implements the Sawyer method
 and PhiPack
. GENECONV was run using the default settings, which estimates p-values on 10,000 permutations of each alignment. PhiPack runs 3 separate tests: Pairwise Homoplasy Index, Maximum χ2, and Neighbor Similarity Score. The Pairwise Homoplasy Index was calculated on a window size of 50 while Maximum χ2 was calculated on a window that is 2/3 the size of the polymorphic sites. We did 1000 permutations in PhiPack to calculate each p-value. The p-values of all tests were corrected for multiple testing using the program Q-value
 with a FDR of 10%, acknowledging that the multiple tests conducted are highly intercorrelated and a stricter correction would therefore be overly conservative
Positive selection analysis
Positive selection analysis was done using PAML (version 4.4)
 on the 1937 orthologous clusters found in the Providencia sequenced in this study. Site-model tests were implemented using codeml to compare model M8a (beta+ω=1) to M8 (beta+ω)
[41, 42]. The log-likelihoods from each test were compared in a likelihood ratio test assuming a χ2 distribution of the test statistic. RAxML
 was used to construct the phylogenetic trees for each ortholog cluster examined in this analysis. We corrected for multiple testing using a q-value cut off which was calculated with the program Q-value
 using a FDR of 20%, reflecting the conservative nature of the positive selection test
Phage genes were identified and classified using PHAST
Predicting the core genome size
Custom scripts were written in Python to determine the number of orthologs for all combinations of 2, 3, 4, 5, 6, or 7 genomes of the eight sequenced Providencia genomes by parsing the OrthoMCL output described in the “Identification of orthologs” section.