Here we present the first report of the sequencing and de novo assembly of a chloroplast genome using the PacBio RS sequencing platform, in which we recovered a single contig of 154,959 bp that covered the entire P. micrantha chloroplast genome. To evaluate the relative performance of the PacBio RS platform for sequencing and de novo assembly of the P. micrantha chloroplast genome, we compared the results to an assembly performed with a single library from the Illumina HiSeq2000 platform. Since the data from the two platforms were by necessity assembled using different assembly programs and parameters, the results cannot be compared on a like-for-like basis, and the experimental design did not provide for the ‘optimal’ results that could be obtained for the assembly of a chloroplast genome using the Illumina HiSeq2000 platform. Nevertheless, Illumina data are recognised as being of immense utility to sequencing and de novo assembly of draft genome sequences. Thus, whilst the comparison is not intended to reflect the performance of the HiSeq2000 platform per se, the resultant Illumina assembly provided a useful yardstick with which to judge the relative merits and shortcomings of the PacBio RS sequencing platform.
Short-read sequencing platforms, including the Illumina HiSeq2000, derive sequencing reads from template DNA that has undergone pre-sequencing amplification by PCR. This amplification step results in sequencing bias, and thus poor or no sequencing coverage in certain regions of the genome, and a strong positive correlation between %GC content and read coverage. This lack of coverage is evident even when average depths of sequence coverage are high. Such bias leaves regions without sequence coverage, so assemblies contain multiple small gaps and break into a large number of contigs and scaffolds, even in modest-sized genomes such as those of bacteria [8, 17] and chloroplasts. In this investigation, the P. micrantha chloroplast genome was sequenced on the HiSeq2000 platform to an average depth of 9,111× from a single Illumina TruSeq library. Despite this depth of coverage, a total of 14,588 nucleotides (9.41%) of the genome were not assembled from the Illumina data, and seven contigs were recovered from the genome assembly. The gap regions had a much lower average GC content than the entire chloroplast genome, in line with other studies that have reported a lower GC content in low-coverage and gapped regions of Illumina assemblies, and reinforcing the evidence of a strong positive dependency between coverage and GC content observed in the Illumina data set. In contrast, despite the lower depth of sequence coverage (320×) achieved following error-correction, data from the PacBio RS platform were assembled into a single contig spanning the entire P. micrantha chloroplast genome. Coverage of PacBio reads across the entire chloroplast consensus sequence was relatively even, demonstrating that data from this platform do not suffer from the %GC and other context-specific biases that affect data produced by short-read ‘second-generation’ sequencing platforms.
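The relationship between GC content and read depth described above can be examined with a windowed comparison. The following is a minimal sketch only, not the analysis used in this study: the function names, the window size, and the toy sequence and depth values are all hypothetical, and the per-base depth vector is assumed to have been computed beforehand from a read alignment.

```python
from math import sqrt


def gc_fraction(seq):
    """Fraction of G/C bases in seq (case-insensitive); 0.0 for an empty string."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc_and_depth(genome, depth, window=100):
    """Return per-window GC fraction and mean depth for non-overlapping windows.

    `depth` is assumed to hold one read-depth value per base of `genome`.
    A trailing partial window is ignored.
    """
    if len(genome) != len(depth):
        raise ValueError("genome and depth must have the same length")
    gc_values, mean_depths = [], []
    for start in range(0, len(genome) - window + 1, window):
        gc_values.append(gc_fraction(genome[start:start + window]))
        mean_depths.append(sum(depth[start:start + window]) / window)
    return gc_values, mean_depths


def pearson(xs, ys):
    """Pearson correlation coefficient; NaN if undefined (too few or constant values)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        return float("nan")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)


# Hypothetical toy data: depth rises with GC content, as in a GC-biased library.
toy_genome = "ATATATATAT" + "GCGCATATAT" + "GCGCGCGCAT"
toy_depth = [0] * 10 + [40] * 10 + [80] * 10
gc_values, mean_depths = windowed_gc_and_depth(toy_genome, toy_depth, window=10)
r = pearson(gc_values, mean_depths)
```

A coefficient near +1 would indicate the strong positive dependency of coverage on GC content, whereas a value near zero would be expected for evenly distributed coverage.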
Our data are also in accord with the recently reported findings of Tang et al., who recovered two contigs spanning the mitochondrial genome of tomato in an assembly using 122× of PacBio data, in contrast to 835 scaffolds covering the same genome using 4,098× of Illumina data. This suggests that longer read lengths and less genome coverage bias can result in significantly longer contigs in de novo plastid genome assemblies.
It is possible that, had multiple Illumina libraries been sequenced, including mate-pair libraries and overlapping fragment libraries, a single scaffold covering the chloroplast genome would have been recovered. However, due to the inherent biases in the PCR amplification performed prior to sequencing, it is likely that the scaffold would still have contained gaps associated with the regions of poor and no coverage, as was found in this investigation and in other studies of chloroplast assembly using second-generation sequencing platforms.
Indeed, assemblies performed following a titration of sequence depths for both PacBio and Illumina datasets demonstrated that the high depth of coverage of the Illumina dataset did not confound the assembly process, and no assembly at a lower depth of coverage performed better than the assembly utilising the entire Illumina dataset. PacBio assemblies at depths of coverage of 35× and above recovered a single contig spanning the chloroplast genome, suggesting that de novo non-hybrid assemblies with PacBio data could be possible at relatively low depths of sequencing coverage.
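A depth titration of this kind can be set up by randomly subsampling reads until a target depth is reached. The sketch below is an illustration under stated assumptions and is not the procedure used in this study: the function names are hypothetical, the read set is a made-up collection of equal-length reads, and depth is taken simply as total read bases divided by genome size.

```python
import random


def depth_of_coverage(read_lengths, genome_size):
    """Average depth, taken here as total read bases divided by genome size."""
    return sum(read_lengths) / genome_size


def subsample_to_depth(read_lengths, genome_size, target_depth, seed=0):
    """Return indices of a random subset of reads reaching `target_depth`.

    Reads are drawn in random order (fixed by `seed` for reproducibility)
    until their cumulative length reaches target_depth * genome_size.
    If the full read set is too shallow, all reads are returned.
    """
    rng = random.Random(seed)
    order = list(range(len(read_lengths)))
    rng.shuffle(order)
    target_bases = target_depth * genome_size
    chosen, total = [], 0
    for index in order:
        if total >= target_bases:
            break
        chosen.append(index)
        total += read_lengths[index]
    return chosen


# Hypothetical read set: 30,000 reads of 1,902 nt against a 154,959 bp genome.
genome_size = 154_959
read_lengths = [1_902] * 30_000
subset = subsample_to_depth(read_lengths, genome_size, target_depth=35)
subset_depth = depth_of_coverage([read_lengths[i] for i in subset], genome_size)
```

Each subset produced this way could then be passed to the assembler of choice, so that contig counts can be compared across depths.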
Error rates in single-read data generated on the PacBio RS platform have been reported to be relatively high, in the region of 15.4–18.7% [5, 6]. However, since sequencing errors are introduced randomly into the reads and are thus largely non-context-specific, they are likely to have minimal effect on the final assembled sequence if sufficient depth of coverage is achieved and error-correction is performed prior to assembly. Since data generated from the Illumina HiSeq2000 platform have been established as the ‘gold standard’ for second-generation sequencing technologies, we evaluated the error rate in the assembly of the PacBio RS data by comparison to the Illumina data. Where both assemblies resolved the same result for a nucleotide, we took this as an indication that the base had been called correctly in both assemblies. In this investigation, an error rate of 1.3% was observed in the PacBio RS data following processing and error correction using HGAP when compared to the chloroplast consensus sequence. Illumina sequencing data have been shown to contain a non-random distribution of errors, with 3% of all error positions accounting for 24.7% of all substitution errors in one study, and with no universal motif responsible for the occurrence of these error-prone positions. This type of error was observed at 187 nucleotide sites in the contigs derived from the Illumina assembly of the P. micrantha chloroplast genome in this investigation, which, despite high sequence coverage, returned ambiguous base calls following assembly. In all cases, however, these ambiguous nucleotides were unambiguously called in the assembly derived from PacBio RS data as one of the alternative bases in the Illumina assembly.
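The argument that random errors are diluted by depth can be illustrated with a simple binomial model. This is a deliberately simplified sketch and not the HGAP algorithm: it assumes that errors are independent between reads, that every erroneous read reports the same wrong base (a worst case), and that the consensus is a plain majority vote in which ties count as errors. PacBio errors are in practice dominated by insertions and deletions, which this model does not represent. The function name is hypothetical.

```python
from math import comb


def consensus_error(per_read_error, depth):
    """Probability that a majority-vote consensus base is wrong.

    Assumes independent errors with probability `per_read_error` in each of
    `depth` reads, all erroneous reads agreeing on the same wrong base, and
    ties counted as errors (both assumptions are conservative).
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    e = per_read_error
    k_min = (depth + 1) // 2  # smallest number of wrong reads that defeats the vote
    return sum(
        comb(depth, k) * e**k * (1 - e) ** (depth - k)
        for k in range(k_min, depth + 1)
    )


# Upper end of the reported single-read error range (18.7%) at increasing depths.
for depth in (1, 5, 15, 35):
    print(depth, consensus_error(0.187, depth))
```

Under these assumptions the consensus error probability falls steeply as depth increases, which is the behaviour expected when errors are not context-specific. Systematic errors that recur at the same position would not be removed by additional depth.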
The PacBio and Illumina assemblies were concordant at all other bases, indicating that, following error correction and assembly, PacBio data are potentially as robust as data derived from other sequencing platforms, provided that sufficient depth of coverage is achieved to permit reliable error-correction. Indeed, recent reports suggest that with the latest chemistry and the most recent version of the HGAP algorithm, accuracy rates as high as 99.999% could be achieved in PacBio RS datasets following error-correction. It is important to highlight, however, that the analyses performed for creating the consensus sequence favour the PacBio assembly, since it contains more nucleotides than the Illumina assembly. Thus, where no Illumina data were available for comparison, the PacBio data may contain a low percentage of errors that could not be verified in this study.
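A base-by-base comparison of two aligned assemblies, including the check that an unambiguous call falls within the alternatives of an ambiguous one, might be sketched as follows. This is an illustrative example rather than the comparison pipeline used in this study. It assumes that the two sequences have already been aligned to equal length with `-` marking gaps and that ambiguous calls use the standard IUPAC nucleotide codes; the function name and category labels are hypothetical.

```python
# Standard IUPAC nucleotide ambiguity codes and the bases each one represents.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
UNAMBIGUOUS = set("ACGT")


def compare_aligned(illumina, pacbio):
    """Classify each column of two pre-aligned, equal-length sequences.

    concordant: both assemblies call the same unambiguous base
    resolved:   Illumina call is ambiguous and the PacBio base is one of
                the alternatives it encodes
    uncompared: at least one assembly has a gap ('-') at this column
    discordant: anything else
    """
    if len(illumina) != len(pacbio):
        raise ValueError("sequences must be aligned to the same length")
    counts = {"concordant": 0, "resolved": 0, "discordant": 0, "uncompared": 0}
    for i_base, p_base in zip(illumina.upper(), pacbio.upper()):
        if i_base == "-" or p_base == "-":
            counts["uncompared"] += 1
        elif i_base == p_base and i_base in UNAMBIGUOUS:
            counts["concordant"] += 1
        elif (
            i_base not in UNAMBIGUOUS
            and p_base in UNAMBIGUOUS
            and p_base in IUPAC.get(i_base, "")
        ):
            counts["resolved"] += 1
        else:
            counts["discordant"] += 1
    return counts


# Hypothetical aligned fragments: 'R' (A or G) is resolved as G, '-' is a gap.
summary = compare_aligned("ACRT-N", "ACGTAT")
```

Columns counted as `uncompared` correspond to positions where only one assembly provides data and the call therefore cannot be cross-validated.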
In previous studies, PacBio RS data have been reported to contain maximum read lengths of up to 23,000 nucleotides and median lengths of 2,446 nucleotides. Such read lengths have been shown to significantly improve the quality of sequence assemblies when used for hybrid assemblies. In this investigation, the maximum and mean uncorrected read lengths were 17,407 and 3,937 nucleotides respectively, with an average read length following error-correction of 1,902 nucleotides. The data generated using the PacBio RS platform covered a greater proportion of the chloroplast genome and were able to resolve the small percentage of ambiguities that were present in the Illumina assembly. Thus the data from the chloroplast assembly reported here support previous findings that PacBio RS data can produce high-quality sequence assemblies covering a greater proportion of the genome than can be achieved by Illumina sequencing alone.
PacBio RS data are significantly less expensive to generate than data from traditional Sanger sequencing, and reports indicate that, for targeted exon sequencing used in genomic profiling of tumor biopsies, PacBio RS sequence data were in 100% concordance with traditional Sanger sequencing. Additionally, other researchers have demonstrated the utility of PacBio RS data for SNP validation in medical re-sequencing projects, where Sanger sequencing has traditionally been employed.
However, the additional read length of PacBio RS data comes at a higher cost per base than ‘second-generation’ short-read technologies, and the higher single-molecule error rates necessitate a greater depth of sequence coverage to reach consensus accuracies acceptable for de novo sequence assembly with currently available software. Additionally, since the PacBio sequencing platform performs real-time sequencing from single molecules, a greater quantity of DNA is required than for second-generation sequencing platforms, which could be a limiting factor for organisms from which DNA is hard to obtain or which are difficult to culture. Despite the advantages of PacBio RS sequencing data, and recent significant increases in throughput, the cost per base for de novo sequencing and assembly of larger genomes, such as those of plants, is still significantly higher than for data derived from the Illumina HiSeq platform. Thus, at the time of writing, de novo assemblies of the genomes of minority species may be best served through the combination of PacBio data with data from other platforms. Koren et al. demonstrated that the addition of a modest amount of Illumina error-corrected PacBio data to supplement 454 sequencing data from multiple libraries resulted in a 32% increase in N50 sizes in the parrot (Melopsittacus undulatus) genome sequence assembly, and other researchers have demonstrated the utility of PacBio sequence data for gap filling and genome finishing. The data presented here support the findings of those previous studies and illustrate the power and utility of PacBio RS sequencing data for sequencing and de novo assembly. They also demonstrate that, despite high initial single-read error rates, the data produced by the platform are robust and reliable following error-correction and assembly.