Organelle_PBA, a pipeline for assembling chloroplast and mitochondrial genomes from PacBio DNA sequencing data

Background The development of long-read sequencing technologies, such as single-molecule real-time (SMRT) sequencing by PacBio, has produced a revolution in the sequencing of small genomes. Sequencing organelle genomes using PacBio long-read data is a cost effective, straightforward approach. Nevertheless, the availability of simple-to-use software to perform the assembly from raw reads is limited at present. Results We present Organelle-PBA, a Perl program designed specifically for the assembly of chloroplast and mitochondrial genomes. For chloroplast genomes, the program selects the chloroplast reads from a whole genome sequencing pool, maps the reads to a reference sequence from a closely related species, and then performs read correction and de novo assembly using Sprai. Organelle-PBA completes the assembly process with the additional step of scaffolding by SSPACE-LongRead. The program then detects the chloroplast inverted repeats and reassembles and re-orients the assembly based on the organelle origin of the reference. We have evaluated the performance of the software using PacBio reads from different species, read coverage, and reference genomes. Finally, we present the assembly of two novel chloroplast genomes from the species Picea glauca (Pinaceae) and Sinningia speciosa (Gesneriaceae). Conclusion Organelle-PBA is an easy-to-use Perl-based software pipeline that was written specifically to assemble mitochondrial and chloroplast genomes from whole genome PacBio reads. The program is available at https://github.com/aubombarely/Organelle_PBA. Electronic supplementary material The online version of this article (doi:10.1186/s12864-016-3412-9) contains supplementary material, which is available to authorized users.


Background
Single Molecule Real Time (SMRT) sequencing technology developed by Pacific Biosciences (PacBio), can produce millions of long reads (1 Kb or longer, with a current average of 12Kb) per run. SMRT sequencing is based on single molecule real-time imaging of the incorporation of fluorescently tagged nucleotides to a DNA template molecule [1]. This technology has been successfully applied to a wide range of experiments and species such as the sequencing of DNA amplicons [2] and transcriptomes [3]. Nevertheless the most popular application is whole genome sequencing. It has been used for the sequencing of bacterial genomes such as the plant pathogen Xanthomonas oryzae [4]. PacBio reads have also been used for the sequencing of complex plant nuclear genomes, such as that of the Adzuki bean, Vigna angularis [5], demonstrating the advantage of this technology for resolving repetitive regions during sequence assembly.
Sequence assembly is the process whereby one or more consensus sequences are reconstructed from hundreds to billions of individual DNA sequence reads. Although there are dozens of programs to produce consensus sequences, they can be classified into two groups based on the algorithm they use: Overlap-Layout-Consensus (OLC) and De Bruijn Graph (DBG). OLC algorithms are best suited for low coverage long read approaches. The most popular PacBio assemblers such as HGAP [6] and Falcon (https:// github.com/PacificBiosciences/falcon) utilize OLC algorithms. A popular OLC-based program, the Celera Assembler (CA; [7]), has been updated to assemble PacBio reads [8] (this program, called PBcR, is being replaced by Canu [http://canu.readthedocs.org/] and is no longer maintained). Another option is the use of Sprai [9], a pipeline that employs the CA assembly algorithm. This pipeline pre-selects the best 20X coverage reads using BLAST searches [10], then corrects and assembles them using CA. PacBio sequencing and OLC assemblers have been successfully applied to the sequencing of yeast mitochondrial genomes [11] and chloroplast genomes such as those of Potentilla micrantha [12], Nelumbo nucifera [13], and sugar beet (Beta vulgaris) [14].
Mitochondrial and chloroplast DNA markers are the bridge between population genetics and systematics, primarily because they are maternally inherited and do not recombine; thus they can facilitate the reconstruction of maternal lineages [15]. Mitochondrial genomes vary in size depending on the eukaryotic lineage. For animals, the lengths range from 28,757 bp (Breviceps adspersus) [16] [19]. Chloroplast genomes, on the other hand, are typically much more conserved in their size and structure, ranging from 11,348 bp (Pilostyles aethopica) [20] to 521,168 bp (Floydiella terrestris) [21], with an average size of ca. 148,000 bp. At present, >10,000 mitochondrial genomes have been sequenced, while comparatively fewer (~990) chloroplast genomes have been sequenced (http://www.ncbi.nlm.nih.gov/genome).
The application of PacBio long-read DNA sequencing technology to organelle genome sequencing will duplicate the numbers given above in the next 2 years. As previously described, there are several tools designed to assemble PacBio reads (e.g. HGAP, Falcon, Canu, and Sprai); however, no single tool is available to assemble organelle genomes using total DNA sequencing reads derived from the PacBio platform. We have developed a new Perl-based tool, Organelle_PBA, designed expressly to reconstruct whole organelle genomes from PacBio data. First, the program selects the specific organelle reads by mapping raw reads to a reference organelle genome. Then, it produces a de-novo assembly using Sprai, a new re-scaffolding program, and removes the redundancy produced by the circular organization of these genomes. Organelle_PBA also resolves the inverted repeats found in chloroplast genomes. The tool is available at https://github.com/aubombarely/Organelle_PBA.

Material and methods
PacBio reads from Arabidopsis thaliana (SRR1284093, SRR1284094, SRR1284095, SRR1284703, SRR1284704), Mus musculus (ERR731675) and Picea glauca (SRR2148116) were downloaded from the SRA repository using the Prefech program from the SRA Toolkit. The SRA file format was then converted to Fastq format using the Fastq-dump program in the SRA Toolkit.
Sinningia speciosa PacBio reads were obtained by de novo PacBio DNA sequencing. Briefly, the S. speciosa variety ' Avenida Niemeyer' [22] was grown under fluorescent lighting at room temperature (~23°C). DNA was extracted from young flower buds using the Qiagen DNEasy® extraction kit. DNA was quantified using a Nanodrop® ND-1000 spectrophotometer, and its integrity was evaluated by agarose gel electrophoresis. DNA was sent to the Duke Center for Genomics and Computational Biology facility, where a SMRTBell™ long insert PacBio library (15-20Kb fragments) was prepared and then sequenced using a PacBio RSII system (P6-C4 Chemistry). PacBio reads were used directly in Organelle_PBA without any extra processing.
Organelle reference genomes were downloaded from the NCBI nucleotide database. For organelle genome coverage evaluation, the PacBio reads were mapped using BlasR [23] with the sam output format parameter. The result was piped into SAMtools for filtering of the unmapped reads [24]. Coverage was calculated using BEDtools [25], and variants, SNPs, and InDels were called using FreeBayes [26].
The program is divided into the following steps (Additional file 1: Figure S1): 0. Argument check and analysis of the input stats such as organelle reference size and number of reads.
1. Mapping of the PacBio reads to the organelle reference genome using BlasR.

Parsing of the BlasR results and selection of the
PacBio reads that map to the organelle reference. The percentage of the read length aligned to the reference can be used to filter these hits. 3. Read correction and assembly using Sprai using the reads selected in the previous step. Reads can be filtered by length before the assembly using the Sprai arguments. 4. Assembly evaluation comparing the total assembly size and the longest contig size with the reference sequence size. If the longest contig is longer than the reference genome, the script moves to step 6 (circular assembly check), otherwise it continues to step 5. 5. If the longest contig size is smaller than the reference, Organelle_PBA interprets the assembly as being fragmented. It then runs BlasR and SSPACE-LongRead with the entire read dataset to find any reads that it could not select during the BlasR mapping (Step 1). After this, it evaluates the assembly again reporting the new sizes and then moving to step 6. 6. Circular assembly check by homology search (Blast) of the assembled sequence with itself. The program also checks for a possible origin based on the reference through a homology search (Blast). If it finds any of these, it will break the contig/scaffold, reorganizing the pieces to remove the redundancy from a circular assembly and orient the assembly based on the reference genome sequence. 7. Check the completeness of the assembly.
Chloroplast genomes are composed of four parts: Long Single Copy (LSC) section, Short Single Copy (SSC), and two Inverted Repeats: IRa and IRb. The inverted repeats, IRa and IRb, are identical and sometimes are only partially assembled, so the assembly could appear to be complete with a size smaller than the reference. If this is the case, the program will move to step 8, if it is not, it will move to step 9. 8. Inverted repeat evaluation and resolution.
Organelle_PBA will map all the reads back to the assembly using BlasR to calculate the coverage for each part of the assembly. Inverted repeats appear with twice the coverage of the non-repeated region as a result of the multiple mapping sites reported by BlasR (see Results). Additionally, the program will break these regions looking for sequence homology using Blast to analyze if they present a certain level of homology reported by BlastN. If the script finds it, it will remove the redundancy and rebuild the assembly using SSPACE-LongRead. 9. Final assembly analysis and assembly statistics report printing.

Results
Mus musculus mitochondrial genome assembly The average running time for this process was 117 s. The PacBio read remapping showed that the assembly was fully covered (from~5X to~25X) with no gaps ( Fig. 1a and b). Additionally, a comparison of the assembled mitochondrial genome with the M. musculus mitochondrial reference (NC_005089.1) showed a perfect alignment with 395 SNPs distributed across the entire assembly (Fig. 1c).
To compare the performance of this approach compared with a whole dataset assembly and posterior identification of the organelle genome, we performed a whole dataset assembly using Canu and Sprai using similar configuration parameters. Canu assembled the 163,477 reads in 3970 s producing 7 contigs with a L50 = 17,784 bp. We identified a 15,206 bp mitochondrial genome contig by BlastN homology search. Sprai assembled the same dataset in 3660 s producing 5 contigs with a L50 = 5858 bp. None of the Sprai contigs were identified as the mitochrondrial genome.
Based on these results, we can conclude that >50X sequencing depth and a reference genome sequence from the same taxonomic family is recommended for the assembly of a full mammalian mitochondrial genome.
Additionally we performed the whole dataset assembly and posterior chloroplast identification by BlastN sequence homology to compare with our approach where the read identification is performed before the assembly. Canu produced 31 contigs in 6040 s with a L50 of 32,701 bp. We identified three chloroplastic contigs with lengths of 117,666, 53,847, 30,575 bp respectively. Sprai produced 2 contigs in 8591 s. The longest sequence represented the complete chloroplast genome with a length of 163,611 bp including a redundant region of 9133 bp caused by the circularity of the chloroplast genome.
Based on these results, we conclude that it is necessary to have at least 200X sequencing depth to obtain a fully assembled chloroplast genome. Considering the taxonomic distance of the reference sequence, we can suggest that almost any chloroplast genome from a species in the same subclass should be usable in selecting the chloroplast reads. We did not test the use of chloroplast reference genomes from other subclasses (e.g. an asterid reference to assemble the rosid A. thaliana chloroplast genome) because there are enough reference genomes from the same subclass publically available. The use of a whole dataset assembly with Sprai and posterior identification delivered the complete chloroplast genome sequence, although this approach used 64% more time without counting a final result curation to remove the redundant region.

Arabidopsis thaliana mitochondrial genome assembly
Sets of 5000, 10,000, 50,000, 100,000, 163,448, 490,143, and 817,099 randomly selected PacBio reads from Arabidopsis thaliana (SRA datasets: SRR1284093, SRR1284094, SRR1284095, SRR1284703 and SRR1284704) were used to test the mitochondrial genome assembly using the A. thaliana mitochondrial reference sequence (NC_001284.2). The results of the assemblies are summarized in Table 3. No complete mitochondrial genome assembly was obtained using this methodology. The mapping of the full A. thaliana dataset SRR1284093 (163,448 reads) to the A. thaliana mitochondrial genome NC_001284.2 selected 3046 reads (equivalent to 111X) but a close inspection showed an average coverage of 9X with 14,533 non-covered positions;

Discussion
Organelle_PBA is a script designed to assemble organelle genomes from long-read whole genome sequencing data by selecting the organelle reads and then mapping them to a closely related reference sequence. Animal mitochondria and plant chloroplast genomes are generally highly conserved across different lineages, so the use of a reference sequence from the same family is usually enough to select reads and then perform a focused assembly, thereby reducing the use of computational resources.
Our results indicate that coverage between 50 and 200X is usually enough to obtain a full organelle genome assembly. Our analysis shows that an average whole genome sequencing project contains~0.2% of animal mitochondrial DNA and~20% of plant chloroplast DNA, which, in most cases, is enough to reach >50X organelle genome coverage. Mapping of these reads shows that they are equally distributed across the organelle genome such that the mapping strategy employed effectively captures the full representation of the organelle genome. Nevertheless, the assembly of plant mitochondrial genomes can be difficult because they are more variable in size and genomic composition, and they are usually poorly represented in whole genome sequencing datasets. The assembly of the A. thaliana mitochondrial genome was incomplete, likely due to incomplete mapping to the reference genome. Additionally, Sprai selects only the best 20X coverage of the PacBio reads to perform the assembly, so unequal mapping could introduce bias into the Sprai read selection. We also found, through coverage analysis, two peaks of high coverage (>1000X) that probably represent highly repetitive regions, although a more detailed analysis needs to be performed to verify this hypothesis. Finally, even if assembly of an organelle genome is not the final goal of a whole genome sequencing project, there are some advantages to assembling the organelle genome prior to launching the nuclear genome assembly: 1) It can facilitate assembly of the nuclear genome by reducing the amount of data used in the whole genome assembly that in some cases can reach 25% or more (e.g. chloroplast DNA in plants); 2) It can be used as a method to evaluate the relative quality of the PacBio sequencing data by assembling a small batch of reads.

Conclusions
Organelle_PBA is a program designed to assemble organelle genomes from PacBio whole genome sequencing data. Pre-selection of the organelle DNA sequencing reads using a mapping approach facilitates the organelle genome assembly and optimizes the computational requirements. It also removes the assembly redundancy caused by a circular assembly and resolves the chloroplast genome inverted repeats. Organelle_PBA performed The mitochondrial genome assembly was considered complete when the size difference compared to the reference was <100 nucleotides