Gene fragmentation in bacterial draft genomes: extent, consequences and mitigation

Background Ongoing technological advances in genome sequencing are allowing bacterial genomes to be sequenced at ever-lower cost. However, nearly all of these new techniques concomitantly decrease genome quality, primarily due to the inability of their relatively short read lengths to bridge certain genomic regions, e.g., those containing repeats. Fragmentation of predicted open reading frames (ORFs) is one possible consequence of this decreased quality. In this study we quantify ORF fragmentation in draft microbial genomes and its effect on annotation efficacy, and we propose a solution to ameliorate this problem. Results A survey of draft-quality genomes in GenBank revealed that fragmented ORFs comprised > 80% of the predicted ORFs in some genomes, and that increased fragmentation correlated with decreased genome assembly quality. In a more thorough analysis of 25 Streptomyces genomes, fragmentation was especially enriched in some protein classes with repeating, multi-modular structures such as polyketide synthases, non-ribosomal peptide synthetases and serine/threonine kinases. Overall, increased genome fragmentation correlated with increased false-negative Pfam and COG annotation rates and increased false-positive KEGG annotation rates. The false-positive KEGG annotation rate could be ameliorated by linking fragmented ORFs using their orthologs in related genomes. Whereas this strategy successfully linked up to 46% of the total ORF fragments in some genomes, its sensitivity appeared to depend heavily on the depth of sampling of a particular taxon's variable genome. Conclusions Draft microbial genomes contain many ORF fragments. Where these correspond to the same gene they have particular potential to confound comparative gene content analyses. Given our findings, and the rapid increase in the number of microbial draft quality genomes, we suggest that accounting for gene fragmentation and its associated biases is important when designing comparative genomic projects.


Background
Beginning with the bacteriophage X174 genome in 1977 by Sanger et al. [1], microbes have spearheaded technological advances in genome sequencing. This is almost certainly because their small size and limited genomic complexity (at least compared to Eukaryotes) makes genomic analysis especially tractable, and because of their founding role and long history in the study of molecular biology. Microbial genomics also yields enormous and otherwise inaccessible informational gains compared to the relatively few easily observable phenotypes of microbes (at least compared to macroorganisms). This preponderance shows no signs of abating; the current Genomes Online Database (GOLD; http:// www.genomesonline.org/cgi-bin/GOLD/bin/gold.cgi; accessed August 15, 2011) lists 8212 bacterial and 308 archaeal genome projects either completed or ongoing. These statistics do not include a myriad of sequenced viruses or projects not registered in GOLD.
For approximately 30 years following its original publication in 1977 [2], the Sanger dideoxy terminator method was the standard for DNA sequencing. Whereas this technology is now fully mature and capable of routinely yielding reads > 800 bp long, its reliance on capillary electrophoresis created a ceiling for the volume and speed of sequences that can be generated inexpensively by this approach [3]. So-called "next-" or "second-generation" sequencing approaches emerged in 2005 following the release of the 454 pyrosequencing platform (Roche), followed closely thereafter by the Illumina/ Solexa (Illumina) and SOLiD (Applied Biosciences/Life Technologies) systems [3][4][5][6]. Second-generation sequencing platforms all produce reads shorter than those achieved by modern Sanger sequencing, although new PacBio and Roche FLX Titanium XL+ technologies may offer alternatives. Regardless, short-read sequencing remains the most cost-effective on a per-base basis (L. A. Pannacchio, unpublished data), leading to the dominance of these technologies (especially Illumina) in terms of market share http://www.genomicslawreport. com/wp-content/uploads/2011/04/JP-Morgan-NGS-Report.pdf.
Next-generation sequence assembly is an inherently challenging computational problem, stemming from the relatively low amount of data contained within each sequencing read, the huge volumes of data produced, and platform-specific error frequencies and profiles [7] including difficulties with regions of high %GC bias [8] and homopolymeric regions leading to frame-shifts [9]. Simulations suggest that completely unbroken coverage of a bacterial genome is impossible using short read lengths [10,11], although to some extent this can be compensated for using (more expensive) paired-end approaches [12]. Improving a draft-quality genome to completion is typically costly and laborious due to its general requirement for targeted PCR and Sanger sequencing. Fragmented genomes should therefore be the expectation from modern genome sequencing projects, at least for the foreseeable future, given the current economic reality and the market dominance of short-read sequencing platforms.
Consequently, data occurring between genomic contig ends are omitted from draft-quality genomes. This affects comparative analyses using phylogenetic techniques; because the state of these missing characters cannot be otherwise estimated, their phylogenetic analysis is impossible. In contrast, analyses based on presenceabsence of particular open reading frame (ORF) types, e. g., using BLAST [13], can be performed using ORF fragments. These estimations are primarily susceptible to two types of errors: false-negatives, where ORFs that should be present are not annotated; and false-positives, where multiple fragments actually belonging to the same ORF are annotated separately. In the present work, we attempt to understand the extent and consequences of the biases partial ORFs introduce into annotation analyses. Furthermore, we attempt to mitigate these biases by linking fragmented ORFs based on their relatedness to homologs in closely-related organisms. We acknowledge that ORF fragmentation can introduce other biases into comparative analyses, e.g., due to the increased difficulty of correct ORF modeling [14] or potentially, misannotation due to the lower informational content of ORF fragments relative to full-length sequences. However, these errors fall outside the scope this current work.

Extent of predicted ORF fragments in publicly available draft genomes
Whereas it is intuitive to expect ORF fragmentation in a draft genome due to the existence of multiple contigs, the extent of this problem is currently unknown. The majority of draft genomes in the NCBI database at the time of this study contained tens to hundreds of partial ORF fragments, although some fell both above and below this range (Figure 1a). The largest number of ORF fragments in a particular genome was 8,717 and the maximum percentage of the total number of predicted ORFs composed of partial sequences was 81.4% ( Figure 1b). The increased numbers of ORF fragments is a result of decreased assembly quality, as indicated by the negative correlation of both the number of fragments and the percentage of the total number of predicted ORFs composed of partial sequences with N50 (Figures 1c and 1d). Note that, whereas N50 values from different genomes cannot be directly compared due to their dependence on genome size, the strong negative correlation with genome fragment abundance supports its use here as an estimate of genome quality. These distributions likely reflect the technological transition from Sanger sequencing towards 454-, Illuminaand SOLID-based methods.
The effect of partial ORFs on annotation efficacy As noted, genome fragmentation can affect ORF annotation either by erroneously omitting ORFs or by duplicating annotations for fragments originating from the same ORF. We examined the effect of fragmentation on ORF annotation in 25 Streptomyces genomes using three different annotation types, representing different annotation targets (e.g., Pfam primarily annotates domains whereas COG and KEGG focus more on the entire protein) and sensitivities (e.g., KAAS incorporates annotation heuristics [15] whereas COG and Pfam annotations were based solely on RPSBLAST). Genome assembly quality was approximated using the percentage of predicted ORFs composed of fragments and the average ORF length. Note that these metrics reflect genome quality oppositely: high quality genome assemblies have longer mean ORF lengths and lower percentages of partial ORFs.
Regardless of the genome quality metric used, ORF fragmentation caused significant over-annotation using KEGG and significant under-annotation using Pfam (Table 1). Some under-annotation was also observed using COG, although this was not as significant as that observed for Pfam (Table 1). Under-annotation using Pfam may reflect the increased omission of partial or entire domains due to ORF fragmentation. Over-annotation using KEGG suggests that this annotation type is especially sensitive. All trends remained, although with lowered statistical significance, when Streptomyces sp. PP-C42, the only genome in our dataset generated using solely Illumina sequencing [16], was excluded (Table 1). This may suggest that annotation difficulties are especially exacerbated by some applications of short-read genome sequencing technologies.

Identifying functional categories enriched in partial ORFs
To further explore the effect of ORF fragmentation, we separately annotated the partial and complete ORFs according to their COG superfamilies in the draft genomes from our Streptomyces dataset ( Figure 2). We aggregated the COG superfamily annotations for partial and complete ORFs in this analysis because the low number of partial fragments in some genomes became non-representative when converted to a percentage. Furthermore, the average percentage of ORFs belonging to some COG superfamilies for these genomes exhibited a strongly bimodal distribution (i.e., some genomes differed greatly between partial and complete ORFs, whereas others did not), somewhat undermining statistical inferences based on the mean. Despite these limitations, most COG superfamilies were represented by complete and partial ORFs approximately equally ( Figure 2). This suggests that ORF fragmentation is a largely stochastic process driven by local, sequencespecific characteristics. However, three COG superfamilies Log N50 d c Figure 1 Quality of draft genomes available in the NCBI database as of May 10, 2011. The number of ORF fragments predicted by Prodigal and the percent of the total number of predicted ORFs (including ORF fragments) composed of ORF fragments, respectively, ordered by increasing number of ORF fragments (a and b) or plotted versus N50, the size of the contig for which 50% of the genome is contained in contigs of greater than or equal size (c and d). The number of ORF fragments, the percent of the total number of predicted ORFs composed of ORF fragments and N50 were logarithmically transformed to de-emphasize extreme values.
were substantially enriched in partial ORF fragments: "replication, recombination and repair", "signal transduction mechanisms" and "secondary metabolites biosynthesis, transport and catabolism" (Figure 2). Inspection of the underlying data revealed that these enrichments were almost entirely due to three COG families: COG1020non-ribosomal peptide synthetase (NRPS) modules and related proteins, on average 0.26% of complete ORFs but 2.25% of partial ORFs; COG3321-polyketide synthase (PKS) modules and related proteins, on average 0.20% of complete ORFs but 6.70% of partial ORFs; and COG0515-serine/threonine protein kinase, on average 0.41% of complete ORFs but 2.01% of partial ORFs. COG1020 and COG3321 both belong to the "secondary metabolites biosynthesis, transport and catabolism" superfamily, and COG0515 belongs to both the "replication, recombination and repair" and "signal transduction mechanisms" superfamilies. Fragmentation of PKSs, NRPSs and serine/threonine kinases are likely due to their multi-modular structures which are composed of homologous modules [17,18]. The repetitive nature of these enzymes, along with the lower per-base information content inherent to the GC-bias characteristic of actinomycetes, is likely to complicate genome-based drug-discovery efforts based on short-read sequencing platforms.

Linkage of partial ORFs using homologs in related genomes
Whereas missing data in fragmented genomes cannot be compensated for without improvements to the genome sequencing and assembly procedure itself, the same is not necessarily true for fragments leading to falsely inflated annotation estimates. We hypothesized that linkage between some fragmented ORFs could be inferred based on their common homology to complete ORFs in other genomes. After extensive parameterization to find the most sensitive and specific similarity and coverage thresholds for fragment assembly using our algorithm (Additional files 1 and 2), a substantial proportion of ORF fragments could be so linked, ranging from 9.3% for S. clavuligerus ATCC27064 #2 up to 46.2% for S. roseosporus NRRL11379 ( Table 2). The parameters used yielded a low overall false positive rate for the entire dataset (< 2%) as determined by the congruency of the linked ORF fragment sets with scaffold information (Additional files 1 and 2). Inspection of the relationship between genetic relatedness and partial ORF linkage revealed insights into the varied rate of partial ORF linkage observed between genomes in our test dataset ( Figure 3). In some cases, genome relatedness influenced partial ORF linkage, with genomes having average amino acid identities (AAIs) < 80% (the majority of the dataset; Figures 3 and 4) linking few sets of partial ORFs ( Figure 3). However, many exceptions to this trend could also be found resulting and an overall lack of correlation between the AAI between two compared genomes and the number of linked fragments (R 2 = 0.006; P = 0.061). Using the percentage of orthologous full-length proteins as the metric of genome similarity yielded equivalent results (data not shown). Even more important than genome relatedness per se was the degree to which related samples sampled  Figure  3). (Streptomyces sp. XylebKG-1 and S. griseus subsp. griseus NBRC13350 are 98.4% related to each other by AAI; Figure 4.) Note that 95-96% AAI correlates to 70% DNA-DNA hybridization, i.e., bacterial species as currently defined [20]. In contrast, the three replicate S.
clavuligerus ATCC27064 genomes (derivatives of the same strain sequenced by different institutes) could link relatively few of each other's partial ORFs despite the high relatedness of these genomes ( Table 2). This may indicate that: (i) some genome fragments are inherently not linkable using homology e.g., due to the presence of multiple, closely related paralogs in a genome; (ii) some highly similar genome regions may fragment similarly, requiring slightly more divergent homologs for linkage; (iii) even extremely highly related genomes (such as the S. clavuligerus ATCC27064 replicates) can differ in some respect, e.g., plasmid content [21]. Regardless, these results suggest that this homology-based approach to fragment linkage might become more effective as more related genomes become available, as is expected due to the ever-decreasing cost of genome sequencing. However, selecting appropriate reference genomes remains somewhat empirical due to the current lack of a strong correlation between genome similarity and fragment linkage.
To determine whether our partial ORF linkage method could compensate for errors in KEGG overannotation, we tested the correlation between the number of KEGG annotations after correction and both the mean ORF length and percentage of the total number of ORFs composed of fragments as described above (Table  1). All correlations between the proxies used to represent genome quality and the percentage of KEGG annotations decreased in significance or became statistically non-significant following fragment linkage. These results suggest that over-annotation can at least be partially compensated for by fragment linkage, especially as the variable genome for each species becomes better sampled. Similar correction of the COG and Pfam annotations increased the degree of under-annotation where this was previously statistically significant, albeit only slightly (Table 1).

Discussion
Next-generation DNA genome sequencing has irrevocably moved microbiology (and much of biology, for that matter) towards extensive data volumes and the concomitant challenges of their analyses. These advances are not without attendant compromises, especially the economic trade-off between dataset breadth and quality which appears to be increasingly towards the former at the expense of the latter. Emerging technologies such as PacBio and 454 FLX Titanium XL+ offer enhanced read lengths, but these formats still cannot match the perbase costs of shorter-read sequencing platforms. Cheaper short read techniques therefore seem poised to dominate the sequencing market for the foreseeable future, especially given that human genome re-sequencing, the main economic driver for innovation in genome sequencing technologies, is focused on clinical personal genomics where short reads appear largely sufficient for reference assembly and SNP detection. Given this technological reality, draft genomes are likely to dominate microbial genomic data for the immediate future. Here, we have attempted to address some of the limitations of this data type for comparative genomics analysis. We find that ORF fragmentation has the potential to dramatically cause false-negatives in some annotation types, especially those relying on relatively small stretches of sequence (e.g., Pfam domains; Table 1). These biases will be difficult to address without improvements in genome quality. Reciprocally, we also find that ORF fragmentation can cause false-positive annotations of multiple ORF fragments belonging to the same complete ORF where sufficiently sensitive detection methods exist (e.g., KEGG; Table 1). However, we also suggest that these problems can likely be at least partially mitigated by linkage of protein fragments based on their complete homologs in related organisms (Table 1, Figure 3). The impact of a well-sampled variable genome on ORF fragment linkage is especially intriguing (Figure 3), and suggests that sampling multiple related strains will not only more accurately represent biological diversity but will also improve the data quality of all these strains by unmasking otherwise obscured homologies among ORF fragments. This is likely to be particularly important for Streptomyces and other similar organisms, for which the most biotechnologically interesting genes involved in secondary metabolism are both variable between strains [22] and highly fragmented (Figure 2), and assembly is particularly challenged by low per-read information content due to %GC bias.
Although our analyses indicate the potential for biases, they in no way diminish the value of draft bacterial genome sequencing. Indeed, as described above, taking advantage of low cost-sequencing is essential to begin generating sufficient representation of the vast microbial diversity present in nature. With this increased sampling we can begin to differentiate between neutral and adaptive (or at least conserved) genome contents. However, we must also be aware of the limitations of these data types. For example, if > 80% of the predicted ORFs in a given bacterial genome are likely to be fragmented, analyses that can accommodate ORF fragmentation, even if imperfectly, will be crucial. Extensively fragmented draft genomic data will be insufficient for some applications, even with downstream in silico correction. This will necessitate proper resource allocation to ensure data usability for a given experimental objective. Furthermore, data analysis requirements scale even faster than data accumulation, e.g., allversus-all comparisons of n sequences most simply require n 2 comparisons. It may therefore be necessary to use judicious subsets of a global dataset when attempting fragment linkage. In the end, there is no substitute for well designed, tractable research; draft sequencing, for all its promise and pitfalls, is no different.   Figure 4 Phylogenetic tree of the Streptomyces genomes used in this study. This neighbor-joining phylogenetic tree is based on the average amino acid identity, calculated using each non-fragmented bidirectional best BLAST pair having ≥ 30% identities over ≥ 70% of both protein lengths. The tree was rooted based on a 16S rRNA gene tree constructed for all actinobacterial type strains which was consistent in all major respects to the tree presented here.

Conclusions
Draft genome quality will be the dominant form of microbial genomic data produced for the immediate and foreseeable future. This approach produces highly fragmented genomes, which concomitantly translates into abundant predicted ORF fragments. These fragmented ORFs lessen annotation efficacy, especially of repeating, multi-modular proteins. However, this problem can be somewhat ameliorated by increased sampling of a species' variable genome. Therefore, despite their great utility in many applications, there exist disadvantages to draft genome sequencing that need to be addressed by appropriate experimental design and deployment of sequencing resources to effectively answer the study questions posed.

Dataset construction
The contigs comprising all 1,510 partial genome sequences were obtained from the NCBI FTP folder "genomes/Bacteria_DRAFT" on May 10, 2011. Genes and their corresponding proteins were annotated using Prodigal v2.00 [23] according to default parameters, and the numbers and fragmentation of predicted proteins, the N50 and the %GC for each genome enumerated using a custom Perl script. (N50 is the contig size for which 50% of the genome sequence exists in contigs of at least that size. We define %GC bias as deviation from 50% (C+G)/(A+C+G +T), as would be expected if all possible nucleotide substitutions were equally probable.) ORF fragments were automatically annotated by Prodigal as ORFs that were bounded by a contig end. The list of Streptomyces genomes used for more detailed analysis is given in Table 3. These genomes were selected because they vary considerably in quality, from complete genomes to one generated exclusively using Illumina sequencing. They also range considerably in their phylogenetic relatedness (Figure 4), including three separate sequencing projects for the S. clavuligerus ATCC27064 using derivatives of the same strain. Streptomyces genomes are also expected to be especially challenging to assemble using short-read techniques due to their substantial %GC bias [22]. All of the selected genomes (with the exception of Streptomyces sp. PP-C42) had accompanying scaffold information, which allowed estimation of the false positive rate of our partial ORF linkage procedure (see below). Annotated protein sequences were downloaded from NCBI for those strains having complete genomes; otherwise, proteins were predicted from nucleotide contigs using Prodigal v2.00 [23] as described above.

Gene annotation
The COG v1.0 and Pfam v25.0 databases were downloaded from the NCBI FTP site and queried by RPSBLAST with an expectation cut off of 1 × 10 -5 using the predicted proteins in the Streptomyces dataset. The top COG and all Pfam hits for each query sequence were recovered for analysis. A Pfam annotation conducted using an expectation cut off of 1 × 10 -3 yielded similar but slightly stronger statistical correlations compared with those shown in Table 1 (data not shown). KEGG classifications for the predicted proteins in the Streptomyces dataset were annotated using the KAAS web server using the single direction best BLAST method [15]. Statistical analyses were all conduced using R v2.13.1.

Partial ORF linkage
Our partial ORF linkage algorithm was implemented as a Perl script, freely downloadable from the Currie lab website http://currielab.wisc.edu/. Note that this current implementation was designed to use pair-wise BLAST searches of entire proteomes due to the additional utility of these files for calculating AAIs. Other configurations (e.g., BLASTing only partial sequences) may be more computationally efficient depending on particular experimental needs. Pair-wise comparisons of each predicted proteome were conducted using BLASTp [13] with the following parameters: -a 8 -b 20 -v 20 -e 1e-05 -F F. AAIs were calculated according to Konstantinidis and Tiedje [20]. Pair-wise AAIs were averaged for each genome to construct a double-sided distance matrix, from which a neighbor-joining tree was constructed using NEIGHBOR in the PHYLIP package v3.69 [24].
A flow diagram of our approach is presented in Figure  5. Essentially, complete homologs in genome #2 were sought for protein fragments predicted in genome #1, which were then used to detect other protein fragments predicted in genome #1 likely to comprise a matching set with the first predicted protein fragment. The stringency of this matching process was constrained by user specified thresholds for: (i) the minimum percent identity required to define homologs between genomes #1 and #2; (ii) the minimum percent overlap required between the homologs in genomes #1 and #2; (iii) the minimum difference in the percent identities of a set protein fragments to a complete reference homolog required to consider multiple predicted protein fragments in genome #1 as belonging to the same predicted protein fragment set; and (iv) the number of overlapping amino acids allowed between predicted protein fragments in a set. Searches were also conducted reciprocally using homologs in genome #1 to find matching sets of predicted partial proteins in genome #2. The fragment-matching process was further constrained by allowing only two predicted partial proteins truncated at either their 5'-or 3'-ends in the same predicted protein fragment set; no limit was set for predicted partial proteins truncated at both their 5'-and 3'-ends. Note that this step is especially sensitive to errors in predicting whether a protein is truncated at one or both ends, which were often encountered in our test dataset, but was included to be more conservative. Finally, this same matching process was repeated using multiple predicted proteomes, with matched sets being rejected and replaced if alternative sets are found in another predicted proteome that satisfy the above criteria but have a higher mean AAI between the matched protein fragments and their complete protein homologs in another proteome. This step assumes that homologs with higher AAIs are more likely to be orthologous than more divergent ones and are therefore more reliable as scaffolds for matching partial protein fragments. We note that orthology is not an explicit requirement for fragment  Figure 5 Flow diagram of the partial ORF linkage approach used in this work. recruitment; a reference paralogous protein might the closest homolog to a query when the ortholog in that reference genome has been lost. Whereas we used translated protein sequences for fragment linkage to better compensate for the genetic divergence between the genomes that we analyzed, nucleotide sequences could be similarly analyzed if several closely-related genomes are available.

Additional material
Additional file 1: Parameterization of the fragment linkage algorithm by varying the maximum percentage identities, sequence overlap and minimum difference in the percent identities of a set protein fragments to a complete reference homolog. False and true positive linkage results were defined based on their concordance with the adjacent location of each fragment pair on the same contig. Each analysis was conducted on the entire Streptomyces test dataset except for Streptomyces sp. PP-C42, for which there was not available scaffold information.
Additional file 2: Parameterization of the fragment linkage algorithm by varying the maximum percentage of identities and sequence overlap while holding the minimum difference in the percent identities of a set protein fragments to a complete reference homolog constant at 40%. The height of the surface represents the number of fragments matched using each parameter combination, and the color represents the percentage of true positives recovered using those parameters by reference to the available scaffold information.