An improved genome reference for the African cichlid, Metriaclima zebra

Background Problems associated with using draft genome assemblies are well documented and have become more pronounced with the use of short read data for de novo genome assembly. We set out to improve the draft genome assembly of the African cichlid fish, Metriaclima zebra, using a set of Pacific Biosciences SMRT sequencing reads corresponding to 16.5× coverage of the genome. Here we characterize the improvements that these long reads allowed us to make to the state-of-the-art draft genome previously assembled from short read data. Results Our new assembly closed 68 % of the existing gaps and added 90.6Mbp of new non-gap sequence to the existing draft assembly of M. zebra. Comparison of the new assembly to the sequence of several bacterial artificial chromosome clones confirmed the accuracy of the new assembly. The closure of sequence gaps revealed thousands of new exons, allowing significant improvement in gene models. We corrected one known misassembly, and identified and fixed other likely misassemblies. 63.5 Mbp (70 %) of the new sequence was classified as repetitive and the new sequence allowed for the assembly of many more transposable elements. Conclusions Our improvements to the M. zebra draft genome suggest that a reasonable investment in long reads could greatly improve many comparable vertebrate draft genome assemblies. Electronic supplementary material The online version of this article (doi:10.1186/s12864-015-1930-5) contains supplementary material, which is available to authorized users.


Background
Advances in high-throughput genome sequencing have allowed relatively inexpensive genome projects to be conducted for almost any organism. Projects such as the 'Genome 10K Project' , which aims to sequence 10,000 vertebrate genomes [1], and the 'Bird 10K' project, which aims to sequence 10,500 bird species [2] have accelerated the production of draft genome sequences. Although attempts have been made to establish standards for declaring a genome sequence 'complete' [3], the quality of draft genomes varies dramatically. The limitations of using these draft genomes for downstream analyses have been documented [4,5]. Still, it is clear that such draft genomes will continue to be the basis for genetic research on many species for the foreseeable future.
Short read sequencing technologies are appealing, as the cost per base is relatively cheap [6]. However, short reads (up to several hundred bp) make the de novo assembly process more difficult when the genome contains repeats that exceed the read length, which is typical for even relatively small genomes [7]. In addition, sequencing coverage biases caused by variation in base composition and PCR amplification further complicate the task of the assembler [8,9]. Many different molecular biology and computational techniques have been developed that attempt to circumvent the problems associated with short read length, while keeping the cost of genome sequencing projects low. One technique is the use of paired-end and mate-pair jumping libraries. The power of this technique was demonstrated when a usable human draft genome assembly was produced using a combination of differently sized short read jumping libraries (180 bp to 40 kb) with the ALLPATHS-LG assembler [10].
The Assemblathon2 contest was organized as a friendly competition to assess current methods and evaluate the state of genome assembly by providing datasets of primarily short reads for three different vertebrate genomes. Assemblathon2 demonstrated that there was a lot of variability among submitted assemblies, and still plenty of room for improvement [11]. One of the three species used in the Assemblathon2 was the Lake Malawi cichlid fish, Metriaclima zebra. African cichlid fish are an ideal system for studying evolutionary mechanisms due to their phenotypic diversity and rapid speciation [12]. Draft genomes of M. zebra and four other African cichlid fish were recently published [13]. According to most assembly metrics, this M. zebra draft assembly ('M_zebra_v0') was among the best entries submitted to Assemblathon2. However, our extensive use of this assembly has revealed problems with gene models in or near assembly gaps, misassemblies encountered during the course of chromosome walks, and spurious spikes of differentiation statistics near gap and scaffold edges. These problems are not unique to this genome project, and complicate the use of many other draft genomes.
To improve the M. zebra draft assembly, we generated a 16.5× set of Pacific Biosciences SMRT (Single Molecule, Real-Time) sequencing reads. These 'long' PacBio reads can be used to improve draft assemblies by spanning gaps around repetitive regions and joining contigs and scaffolds [14]. Here we set out to improve the M_zebra_v0 genome assembly both to create a better reference assembly for the cichlid research community and to explore the improvements made possible with the addition of 16.5× of PacBio reads to even a relatively good draft vertebrate genome assembly.

Overview
Our new 'M_zebra_UMD1' assembly is based on the recently published M_zebra_v0 assembly [13], made available by the Broad Institute [15]. We identified misassemblies in the M_zebra_v0 assembly as regions poorly supported by the existing Illumina mate-pair libraries. The assembly was 'broken' at these locations. A newly generated 16.5× coverage PacBio read set was error-corrected to improve base accuracy and identify potentially chimeric reads. These corrected PacBio reads were then used to fill in gaps and to join together scaffolds in the broken M_zebra_v0 assembly. The new M_zebra_UMD1 assembly was then evaluated by comparison to the sequence of individual bacterial artificial chromosome (BAC) clones, alignment of independently assembled transcriptomes, and assembly completeness and likelihood statistics. Figure 1 provides an overview of this assembly process with several assembly statistics shown at each step. Additional details of the steps in this process are provided below.

Illumina datasets
The M_zebra_v0 assembly was originally created using seven different Illumina insert size libraries [13] as input Fig. 1 Genome assembly overview. Input datasets and the various steps involved in the assembly of M_zebra_UMD1 are diagrammed along with relevant metrics provided at each step to the ALLPATHS-LG assembler [10]. Table 1 provides details of each of the different Illumina libraries used.

REAPR consensus breaking
Recognizing Errors in Assemblies using Paired Reads (REAPR) is a tool that uses paired-read libraries to evaluate genome assembly accuracy, flag regions with potential errors, and break incorrectly joined scaffolds [16]. We ran REAPR version 1.0.17 on the M_zebra_v0 assembly using each of the libraries in Table 1 separately. First, the REAPR 'smaltmap' task was run to align each of the libraries to the M_zebra_v0 assembly using SMALT version 0.7.6. The alignments for the two separate 2-3 kb libraries listed in Table 1 were merged using the 'samtools merge' command. The REAPR 'perfectfrombam' task was run on the SMALT alignment of the short-insert fragment library to generate read-depth information and identify repetitive regions. The REAPR 'pipeline' task was then run separately for each of the jump libraries. The high-quality short-insert alignment from the 'perfectfrombam' task was supplied to the 'pipeline' task for each of the jumping libraries. Aggressive breaking ('-break a=1') was also performed as it breaks scaffolds at regions where the fragment coverage distribution is low and potentially misassembled. The output of the REAPR 'pipeline' task includes the locations where REAPR broke the M_zebra_v0 assembly. Locations in the M_zebra_v0 assembly that were broken by a majority (four or more) of the insert libraries were compiled and the M_zebra_v0 assembly broken based on this consensus. A Venn diagram of the overlap of REAPR breaks between the libraries (Fig. 2) was created using jvenn [17].
In addition to breaking the M_zebra_v0 assembly using REAPR, we also randomly broke the assembly to evaluate how well random breaks could be put back together with the PacBio reads. The M_zebra_v0 assembly was randomly broken the same number of times as the REAPR-broken assembly described above.

Pacific Biosciences SMRT sequencing
The Qiagen MagAttract HMW DNA kit was used to extract high-molecular weight DNA from a nucleated blood cell sample from a new individual from the same population used for the Broad Institute sequencing project. Size selection was performed at the University of Maryland Genomics Resource Center using a Blue Pippin pulse-field gel electrophoresis instrument. A library was constructed and 24 SMRT cells were sequenced on their PacBio RS II using the P5-C3 chemistry.

Proovread error correction
Proovread is a hybrid error correction pipeline for correcting PacBio SMRT reads using short read data [18]. This step is important as the raw PacBio subreads are only~85 % accurate [19] and contain chimeric reads at a rate of 1-2 % [20].
As shown in Fig. 1, we used the existing~60× Illumina fragment library for Proovread error correction. This Illumina library was designed so that pairs would overlap and slightly longer reads could be generated. We first trimmed and filtered these reads using Trimmomatic ver-

Gap closure and scaffolding with PBJelly
PBJelly is a pipeline for improving genome assemblies using PacBio reads [14]. PBJelly version 14.9.9 was run using the error corrected PacBio reads as described above. This set of error corrected reads also included a portion of the raw PacBio reads where there was no Illumina coverage and no error correction could be performed. This maximized the use of the PacBio reads, by using both the error corrected PacBio reads as much as possible, while still using the remaining portions that could not be corrected. The initial PBJelly 'setup' step was run with the '-minGap' parameter set to 19 to reflect the smallest gap size in the M_zebra_v0 assembly. The PBJelly 'mapping' step aligned the corrected PacBio reads to the consensus REAPR broken M_zebra_v0 assembly using BLASR [22]  following parameters: −minMatch 8 -minPctIdentity 70 -bestn 1 -nCandidates 20 -maxScore −500 -noS-plitSubreads. The PBJelly 'assembly' step was run with the '-maxWiggle' parameter set to 2000 to account for predicted gap size error in the M_zebra_v0 assembly. The other PBJelly steps ('support' , 'extraction' , 'output') were run with default parameters.

Quality assessment and validation
GMAP [23] version 2014-12-06 was used to align existing RNA-seq transcriptome assemblies of eleven M. zebra tissues. The transcriptome assemblies were created using Trinity [24] as part of the cichlid genome project [13] and made available as supplementary information [25]. Three BAC clones that were previously sequenced and assembled using Sanger sequencing technology were aligned to the existing and newly produced assemblies for validation. These published BACs correspond to several opsin gene loci: SWS2A/SWS2B/LWS (GenBank accession JF262084.1, 107.6kbp), SWS1 (GenBank accession JF262085.1, 77.6kbp), and RH2B/RH2A (GenBank accession JF262089.1, 83.5kbp) [26]. The BAC sequences were aligned to the corresponding M_zebra_v0 and M_zeb-ra_UMD1 assembly sequences using Gepard [27]  The likelihoods of the intermediate and final M_zeb-ra_UMD1 assemblies were evaluated using ALE [29]. Each of the Illumina libraries were aligned to the assemblies using Bowtie2 [30] version 2.0.2 with the '-verysensitive' preset parameter. The uncorrected PacBio reads were aligned to assemblies with BLASR version 1.3.1.127046 using the same parameters used above for PBJelly and the '-sam' option to produce a SAM file for input to ALE. ALE was then run on each of the respective alignment files to produce likelihood and mapping statistics for each library.
Summary statistics of the assemblies were compiled using the assemblathon_stats.pl script [31].

RepeatMasker comparisons
RepeatModeler [32] version open-1.0.8 was used to identify and classify de novo repeat families in each of the respective assemblies. To obtain a reasonable comparison, RepeatModeler was run using both the M_zebra_v0 and M_zebra_UMD1 assemblies separately. The consensus repeat sequences generated by RepeatModeler for each assembly were combined with the Repbase RepeatMasker library version 20140131. RepeatMasker [33] version open-4.0.5 was run with NCBI/RMBLAST version 2.2.27+ using the '-lib' option to specify the respective RepeatModeler and Repbase combined library so that repeats predicted for M_zebra_v0 were modeled using the M_zebra_v0 assembly and repeats predicted for M_zeb-ra_UMD1 were modeled using the M_zebra_UMD1 assembly.

REAPR consensus breaking identifies misassemblies in M_zebra_v0
A genetic linkage map of M. zebra consisting of 834 RAD-tag markers was previously constructed [34]. Comparison of this map to the original M_zebra_v0 assembly identified a misassembly on the largest scaffold (scaf-fold_0). Table 2 shows the alignment of scaffold_0 to markers on two separate constructed linkage groups (LG7 and LG14) within the genetic map. Based on the map data we narrowed the location of the misassembly to a 1.7Mbp region between 3,426,502 (LG14) and 5,124,400 (LG7) on scaffold_0.
Within this 1.7Mbp region there was a 19 bp gap at scaffold_0:3,622,144 where REAPR also predicted a misassembly for 5 out of the 6 Illumina insert libraries listed in Table 1. The 40 kb library was the only library where REAPR did not predict a misassembly. The 40 kb library was also the only jumping library that had mate-pairs that properly spanned this gap. REAPR predicted a misassembly at this gap for the other 5 jumping libraries either because they did not have spanning mate-pairs, had mate-pairs improperly oriented, and/or had mate-pairs aligning at a distance much different than the expected insert size. This small 19 bp gap also had no PacBio reads that spanned it. It is likely that this is the exact location of the misassembly identified by the genetic map data. In addition to this known misassembly, REAPR identified many additional putative misassemblies in the M_zebra_v0 assembly. Figure 2 shows the number of breaks that REAPR predicted using the Illumina insert libraries listed in Table 1. Inspection of paired-read mappings from the 11 kb library revealed that it was much less complex than any of the other libraries. Using this 11 kb library, REAPR broke the M_zebra_v0 assembly 35,135 times. This was far more REAPR breaks than any other library and more than twice that of the 5 kb library (14,629 breaks). We elected to remove this 11 kb library from subsequent analyses.
The number of REAPR breaks shared by 5, 4 or more, 3 or more, 2 or more and 1 or more libraries was 40, 649, 3073, 9835 and 32107 respectively (Fig. 2). To begin our reassembly process we had to choose the appropriate number of REAPR breaks of the M_zebra_v0 assembly. Breaking the assembly too few times could leave unidentified misassemblies, while breaking too many times would fragment the assembly more than necessary. PacBio provides the SMRT View tool [35] for visualizing PacBio read alignments created using their BridgeMapper SMRT Pipe module within the SMRT-Analysis software suite [36]. The BridgeMapper module creates split read alignments with BLASR that can be used to identify misassemblies. Using these tools we were able to manually inspect the PacBio split read alignments and estimate that there are~200-1000 misassemblies in the M_zebra_v0 assembly.
We also evaluated the rate of false positive breaks by comparing the number of REAPR breaks that could be re-joined with PBJelly and the corrected PacBio reads to the number of random breaks that could be re-joined with the same protocol. For the M_zebra_v0 assembly that was broken randomly, 541/649 (83.4 %) of the breaks were reassembled in the original M_zebra_v0 assembly order. In contrast, only 75 (11.6 %) of the 649 REAPR breaks were reassembled in the original M_zeb-ra_v0 order. The random breaks are reassembled in the original order about 82 % of the time across all 5 libraries ( Table 3). The percentage of REAPR breaks that are reassembled by PBJelly increases as the number of REAPR breaks increases, but is still far from the percentage of random breaks that were rejoined by PBJelly. It is clear that the consensus REAPR breaks have identified regions of the M_zebra_v0 assembly that were poorly supported and often misassembled. These regions are difficult to reassemble even with the corrected PacBio reads and likely represent complex and highly repetitive regions of the genome.
Based on the manual inspection of split read alignments and the rate of false positive breaks that were introduced we chose to break the M_zebra_v0 assembly wherever REAPR had predicted a misassembly in 4 or more of the Illumina insert libraries. This resulted in an assembly that was broken 649 times (40 breaks found in 5 or more libraries plus 609 breaks found in 4 or more libraries, Fig. 2).

Proovread error correction
We generated a 16.5× set of PacBio reads using the P5-C3 chemistry. However, PacBio reads are error prone  (80-85 % accuracy [9]) and known to contain chimeric reads at a rate higher than 1 % [20]. In addition, the SMRTbell adapter sequences are not always removed properly and may persist in up to 3 % of filtered PacBio reads depending on the sequencing protocol and library quality (Thomas Hackl, personal communication). These particular sequences are deemed "siameric" reads because they contain twin reads connected by the adapter.
To detect and clip both chimeric and siameric reads, as well as improve the base-level accuracy of the PacBio reads, we ran Proovread [18]. The~60× short-insert Illumina library was first overlapped to produce longer reads (mean overlapped read length = 154 bp,~30× coverage) which were then used for the Proovread errorcorrection (Fig. 1). Additional file 2 provides summary statistics of the PacBio reads before and after the Proovread error-correction. While the mean and N50 read length decreased, Proovread detected raw PacBio reads that were potentially chimeric and siameric at the expected rates and split them at these junctions. This resulted in the number of raw reads increasing from 3,031,205 to 3,891,278 Proovread error-corrected reads. Any portion of raw PacBio reads that had no Illumina coverage were not split and were left in their original state. There was a tradeoff between having longer PacBio reads with a small percentage of chimeric reads or somewhat shorter but error-corrected PacBio reads. We chose to remove the chimeric reads and use the set of slightly shorter and error-corrected PacBio reads, especially considering the modest 16.5× coverage and the potential for chimeric/siameric introductions into the assembly in regions of low PacBio coverage.

Gap filled assembly
Once the known and putative misassemblies were broken, and the errors in the PacBio reads were corrected, the M_zebra_v0 assembly was ready to be improved using PBJelly. Table 4 provides summary statistics of three assemblies: 1) the original M_zeb-ra_v0 draft assembly, 2) M_zebra_v0 after being broken 649 times by REAPR, 3) and the broken assembly after gap-filling with PBJelly using the corrected PacBio reads (M_zebra_UMD1).
Most of the 649 REAPR breaks occurred at gaps. REAPR typically broke the M_zebra_v0 assembly twice, once on each side of the gap, generating 326 more scaffolds. This process effectively removed the gaps between these REAPR breaks. However, many of these broken scaffolds were put back together with the corrected Pac-Bio reads in the new M_zebra_UMD1 assembly. The new assembly has 190 (5 %) fewer scaffolds relative to M_zebra_v0, and 516 (12.7 %) fewer scaffolds relative to the REAPR broken assembly. These may not seem like sizeable differences, but the M_zebra_v0 assembly was scaffolded using a~40 kb jumping library, with a mean insert size (38,038 bp) that is longer than the longest error-corrected PacBio read in our dataset (33,000 bp). Therefore, since the M_zebra_v0 assembly was already relatively well placed into scaffolds, we did not see a large reduction in the number of scaffolds. We expect that draft assemblies that do not include mate pair libraries at this scale will experience a greater improvement in scaffolding using the long PacBio reads.
The total length of the M_zebra_UMD1 assembly increased by 11.1Mbp (+1.3 %) compared to M_zebra_v0.   However, this leaves out the fact that 79.5 Mbp of gaps were filled, for a total of 90.6 Mbp of new sequence. The total length of the assembly contained in gaps decreased from 15.93 to 6.47 % of the assembly length, a 59 % improvement. The number of gaps decreased by 70 %, from 68,336 to 21,436. Further assembly metrics are provided in Additional file 3. We mapped existing transcriptome assemblies from 11 tissues of M. zebra [13] to each of the genome assemblies using GMAP. The total number of mapped exons increased by 99,085 (+2.20 %, Table 4).

Assembly completeness
To assess the completeness of the assemblies we ran CEGMA [28], which scores the presence of 248 core eukaryotic genes (CEGs) in a given assembly. Table 5 provides the CEGMA completeness report for both the original M_zebra_v0 and the new M_zebra_UMD1 assemblies. The total number of complete plus partial CEGs is the same in both assemblies (237). However, the new M_zebra_UMD1 assembly contains 7 (2.6 %) more complete CEGs than the original M_zebra_v0 assembly. This increase in complete CEGs can be attributed to filling gaps that occur within gene models. One example of this was seen in the assembly of the predicted piwi-like protein (NCBI accession XM_004544701.1). Fig. 3 shows this piwi-like RefSeq mRNA sequence aligned to the M_zebra_UMD1 assembly. When the transcriptome assemblies were mapped to the M_zebra_UMD1 assembly, it became evident that the gaps in the original M_zebra_v0 assembly had left out at least 10 of the exons in the gene.
The new M_zebra_UMD1 assembly contains an increased number of CEGs that have multiple orthologs according to CEGMA (62, increased from 58). Some of these may represent paralogs that were collapsed in the M_zebra_v0 assembly and have been separately assembled in the M_zebra_UMD1 assembly. Extrapolated across the genome, the difference in the number of genes with multiple paralogs amounts to hundreds of new genes.

Comparison with BACs from opsin loci
Three M. zebra BAC clones previously sequenced and assembled using Sanger sequencing technology [26] were used to evaluate the accuracy of the errorcorrection and gap-filling procedures. Figure 4 shows dotplot alignments of these sequenced BACs to both the M_zebra_v0 and M_zebra_UMD1 assemblies. Most of the gaps in the M_zebra_v0 assembly have been filled in the M_zebra_UMD1 assembly. Several small gaps remain in the M_zebra_UMD1 assembly, as can be seen in Fig. 4b and d. BAC clone JF262085.1 (encompassing the SWS1 opsin) was the only BAC of the three that had gaps in the original assembled BAC sequence. The incongruence in the lower left portion of the Fig. 4d dotplot represents a difference in the size of the gap between the JF262085.1 BAC and the M_zebra_UMD1 assemblies. The abnormal alignment in the upper right portion of the dotplot in Fig. 4d represents a small 20 bp gap in the M_zebra_v0 assembly that has been "overfilled" by PBJelly with 779 bases. Both of these differences likely represent some structural sequence variation between the individual fish used for the BAC, M_zeb-ra_v0 and M_zebra_UMD1 sequencing. These fish were collected from a natural population in Lake Malawi that has a small effective population size, so heterozygosity should be low, but some variation among individuals is expected.

Assembly likelihood
The assembly summary metrics provided in Table 4 indicate the new M_zebra_UMD1 assembly is better in all respects except maximum scaffold length (−21 %) and scaffold N50 (−15 %). However, these decreases in continuity are accompanied by an overall improvement in accuracy and completeness of the assembly. To further quantify the accuracy of the new assembly we ran the Assembly Likelihood Evaluation (ALE) program [29]. This tool integrates read quality, mate-pair orientation, insert size, coverage and k-mer frequencies to provide a   Table 6). Some of these regions with 0 coverage can be explained by the 55.6Mbp of gaps that remain in the M_zebra_UMD1 assembly, since ALE calculates gaps as bases with 0 coverage. The other 66Mbp of non-gap sequence with 0 coverage (121Mbp minus 55.6Mbp for the short-insert Illumina library in Table 6) is mostly covered by the PacBio library. The PacBio library had about 10Mbp of non-gap sequence with 0 coverage and this reflects regions where the library either did not have any reads by chance or where only the Illumina libraries were able to sequence through. Additional PacBio coverage will help to more precisely describe such regions.

Analysis of transposable elements and repetitive sequences
A large amount of the sequence that was added in the new M_zebra_UMD1 assembly is composed of repetitive sequences and transposable elements that were either collapsed or not assembled in the original M_zebra_v0 assembly. We analyzed the total amount of repetitive sequences in both assemblies to understand the repeat content of the sequence that was added in M_zeb-ra_UMD1. Table 7 lists several of the most abundant transposable element super families in the two assemblies. For most of the transposable element super families, the number of elements increased in the M_zebra_UMD1 assembly. Those transposable elements super families that decreased in number still increased in total bp, which means that the sequences of individual transposable element copies were longer in the M_zebra_UMD1 assembly. The assemblies of longer repeat copies can be seen for both the DNA hAT-Ac and LINE L1 transposable elements (Fig. 5). Additional file 5 provides a detailed list of hundreds of transposable elements and low complexity repeats that were annotated in both assemblies.
Compared to M_zebra_v0, the M_zebra_UMD1 assembly had fewer total lineage specific repeats identified (15,585 vs. 17,320), but a greater total amount of lineage ). Again, this shows that longer lineage specific repeats have been assembled in the M_zebra_UMD1 assembly. In terms of total repetitive sequence, the new M_zebra_UMD1 assembly contained 63.5Mbp of additional sequence that was classified as repetitive. This is consistent with the idea that most of the gaps in the original M_zebra_v0 assembly spanned sequences consisting of transposable elements and other repetitive sequences.