A high quality assembly of the Nile Tilapia (Oreochromis niloticus) genome reveals the structure of two sex determination regions
BMC Genomics volume 18, Article number: 341 (2017)
Tilapias are the second most farmed fishes in the world and a sustainable source of food. Like many other fish, tilapias are sexually dimorphic and sex is a commercially important trait in these fish. In this study, we developed a significantly improved assembly of the tilapia genome using the latest genome sequencing methods and show how it improves the characterization of two sex determination regions in two tilapia species.
A homozygous clonal XX female Nile tilapia (Oreochromis niloticus) was sequenced to 44X coverage using Pacific Biosciences (PacBio) SMRT sequencing. Dozens of candidate de novo assemblies were generated and an optimal assembly (contig NG50 of 3.3Mbp) was selected using principal component analysis of likelihood scores calculated from several paired-end sequencing libraries. Comparison of the new assembly to the previous O. niloticus genome assembly reveals that recently duplicated portions of the genome are now well represented. The overall number of genes in the new assembly increased by 27.3%, including a 67% increase in pseudogenes. The new tilapia genome assembly correctly represents two recent vasa gene duplication events that have been verified with BAC sequencing. At total of 146Mbp of additional transposable element sequence are now assembled, a large proportion of which are recent insertions. Large centromeric satellite repeats are assembled and annotated in cichlid fish for the first time. Finally, the new assembly identifies the long-range structure of both a ~9Mbp XY sex determination region on LG1 in O. niloticus, and a ~50Mbp WZ sex determination region on LG3 in the related species O. aureus.
This study highlights the use of long read sequencing to correctly assemble recent duplications and to characterize repeat-filled regions of the genome. The study serves as an example of the need for high quality genome assemblies and provides a framework for identifying sex determining genes in tilapia and related fish species.
Aquaculture plays an increasingly important role in providing sustainable seafood products and has significantly outpaced capture fisheries in the past several decades . Tilapias are among the most important farmed fishes, and tilapia production continues to expand exponentially across the globe . An important aspect of commercial production is the control of sexual differentiation. Males grow to market-size earlier than females. Females also start to reproduce at a smaller size, filling production ponds with small fish . It is therefore advantageous to grow-out only male fish. At one time, all-male populations were produced through interspecific crosses , but the strains supporting this technology have been lost or contaminated. Currently, the standard way of achieving all male or nearly all male tilapia populations is via hormonal masculinization [3, 5]. A reliable way of producing genetically all-male tilapia would allow the replacement of hormonal masculinization, which is banned in several major producing countries (although not enforced in most cases). It is therefore important to understand the genetic basis of sex determination in current aquaculture stocks.
Sex determination in tilapias is largely genetic, although environmental factors also play a role [6,7,8]. In Nile tilapia (Oreochromis niloticus), distinct XY sex determining loci have been identified on both linkage group (LG) 1 and LG23 [9, 10]. The closely related blue tilapia (O. aureus) segregates both an XY locus on LG1, and an epistatically dominant ZW locus on LG3 . Additional sex determining loci have been identified on LGs 5, 7, 13, 18 and 20 in closely related species of East African cichlid [12,13,14]. As a group, tilapias and related species of other cichlid fishes are a promising model system for understanding the gene network controlling sex determination in vertebrates.
Work to identify the genes underlying each of these sex determiners has been hampered by the incomplete nature of previous draft genome assemblies, and by the discovery that many of these sex determiners are located in large blocks of highly differentiated, and sometimes repetitive sequence. To date, the molecular genetic basis for sex determination in cichlids has been determined for only the LG23 XY locus in O. niloticus .
Although several draft genome sequences are available for cichlids, these are mostly based on short Illumina sequencing reads . The previous O. niloticus assembly was produced using ~277X coverage of Illumina reads from several libraries including a 40kbp scaffolding library. Recently duplicated and highly repetitive sequences are typically collapsed in these assemblies . Indeed some of the most interesting and perhaps evolutionarily important regions of the genome may be the most difficult to assemble accurately. Recently duplicated regions are notoriously difficult to assemble due to their repeat length and high sequence identity . The repetitive “dark-matter” part of the genome is vastly underrepresented in the majority of current genome assemblies . Attempts to assemble these regions using only short read sequencing are futile . Only long sequencing reads will produce more contiguous and complete assemblies of complex vertebrate genomes [21,22,23,24,25]. The importance of such high quality assemblies for downstream applications cannot be overemphasized.
Here we report a new assembly of the tilapia genome from long PacBio sequence reads. This assembly contains much of the missing sequence from previous assemblies, and is among the most contiguous vertebrate genome assemblies to date. We use this new assembly to further characterize the tilapia sex determining loci previously identified on LGs 1 and 3.
A homozygous clonal XX female tilapia individual  was chosen for genome sequencing. The individual was sequenced to 44X coverage using PacBio sequencing of 63 SMRT cells using the P6-C4 chemistry. This yielded 5,085,371 reads with a mean subread length of 8,747 bp and N50 read length of 11,366 bp.
An overview of the assembly process is outlined in Fig. 1. To summarize, 37 candidate de novo assemblies were generated using both the FALCON  and Canu  genome assembly packages. Multiple parameters were adjusted for both algorithms to tune the assemblies. The error correction steps of both algorithms include parameters that control alignment seed length, read length, overlap length and error rates (see Methods).
The 37 candidate assemblies were evaluated using a number of different metrics, techniques and complementary datasets. First, each of the candidate assemblies was evaluated using ALE assembly likelihood estimates  (which integrated read quality, mate-pair orientation, insert size, coverage and k-mer frequencies) based on alignment of the reads from four separate Illumina libraries and of the 44X PacBio dataset (see Methods). Candidate assemblies were also evaluated for completeness using CEGMA  and BUSCO  core gene sets, as well as by aligning existing O. niloticus RefSeq  transcripts. A set of 193,027 BAC-end sequences  representing ~29X physical clone coverage were used to assess the longer range accuracy of candidate assemblies. Finally, both a physical radiation-hybrid (RH) map consisting of 1,256 markers  and a RAD-seq genetic map consisting of 3,802 markers  were used to estimate the number of misassemblies present in each of the candidate assemblies. The results of these analyses are provided in Additional file 1.
No single candidate assembly ranked the highest for all of the evaluation metrics that were computed. Principal component analysis (PCA) was used to reduce the various assembly evaluation metrics and compare the candidate assemblies. Additional file 2 shows that the Canu assemblies tend to cluster separately from the FALCON assemblies in the PCA space. The total assembly size and number of RefSeq exons mapped explained the largest amount of variance and were correlated. These two metrics did not seem like the most important metrics to base the evaluation upon since assembly parameters could be tuned to change the total size and the estimated genome size was 1.082Gbp .
The ALE likelihood scores explain the next largest proportion of the variance. The 37 candidate assemblies were ranked by overall ALE scores for each of the five sequencing libraries. An average of the ALE ranks was then calculated. The Canu assembly (#14) that was chosen as the best among the 37 candidate assemblies showed the best average ALE ranks. In addition, Canu assembly #14 had one of the best rates of properly mapped BAC-end sequences, and possessed among the fewest misassemblies as determined by conflicts with the RH and RAD map data (Additional file 1). These results suggest that Canu assembly #14 has the best long-range accuracy while maintaining comparable short-range accuracy.
A relatively small number of sequence errors remained in the intermediate unpolished Canu #14 assembly. To correct these errors, first the raw 44X PacBio reads were aligned to the Canu assembly and Quiver was used to polish the assembly at 1,870,943 sites (see Methods). Quiver corrected 1,739,112 (92.95%) insertions, 88,037 (4.71%) substitutions and 43,794 (2.34%) deletions. Next, four Illumina libraries, totaling 277X coverage, were aligned to the intermediate Quiver-polished assembly. Based upon these alignments, Pilon polished an additional 1,101,609 sites. Pilon corrected 1,087,107 (98.68%) insertions, 12,402 (1.13%) substitutions, and 2,100 (0.19%) deletions.
Detection of misassemblies
The polished intermediate assembly showed high accuracy at the level of individual bases and with respect to the placement of paired-end sequences from ~150kbp BACs (Additional file 1). However, 32 putative inter-chromosomal misassemblies were identified by alignment to the RH and RAD maps. The RH and RAD maps both identified 21 of these inter-chromosomal misassemblies. The RAD map identified an additional 8 putative misassemblies that were not identified using the RH map (the RH map had no markers aligning to these regions), while the RH map identified an additional 3 misassemblies that were not identified using the RAD map (likewise, the RAD map had no markers aligning to these regions). The regions around each putative misassembly were inspected using the genomic resources already mentioned. Each had a characteristic signature consisting of a high density of variants in the 44X PacBio read alignments, as well as low or zero physical coverage of the 40kbp insert Illumina mate-pair library. An example of these misassembly signatures is shown in Additional file 3.
Genome wide analysis of the intermediate assembly for each of these characteristic signatures detected 110 regions of high-density PacBio variants and 376 regions of low physical coverage in the 40kbp mate-pair library. 41 regions had both a high-density of PacBio variants and low physical 40kbp mate-pair coverage. Nine of these regions showed correct alignment to both maps and therefore were not included in the set of putative misassemblies. However, two of these regions were identified by the PacBio variants and low 40kbp physical coverage in regions where there were no markers in either the RH or RAD map and added to the 32 map-based misassemblies giving a total of 34 sites of likely misassembly. Table 1 provides a summary of the putative misassemblies that were identified by the maps and sequence alignment methods.
Analysis of the repetitive elements within these regions revealed that misassembly locations were enriched for highly repetitive interspersed and nested repeats. We examined the region ~75kbp on both sides of the likely misassembly breakpoints and found that 94.51% of these regions were classified as repetitive (see Methods). These regions were enriched for several TE families. Table 2 shows the enrichment of the most common repeats and TEs within the misassembly regions. In each of these cases, the mean length of these repeats was longer within the misassembly regions. Some of the same TE families that are abundant across the whole genome (e.g., DNA-TcMar-Tc1, LINE-L2, LINE-Rex-Babar) are also present in high frequency in the misassembly regions. However, some TE families that occurred in relatively low frequency across the whole genome (e.g., DNA-Sola, LTR-ERV1, RC-Helitron, and satellite repeats) are highly enriched in the misassembly regions.
Table 3 provides the anchored size of each LG, including gaps. The new O_niloticus_UMD1 assembly anchored 868.6Mbp of the total genome (86.9%), which is 211Mbp (32%) more than was anchored in the previous “Orenil1.1” assembly (657Mbp) . When gaps are not counted, the amount of anchored, non-gap, sequence is 864Mbp (86.4%) compared to 606Mbp (60.6%) in the previous Orenil1.1 assembly. LG3 is the largest anchored LG (68.6Mbp), which agrees with cytogenetic studies that show LG3 as the largest and most repetitive chromosome in the O. niloticus genome [33, 36, 37]. Cytogenetic studies also indicate that LG7 is the second largest chromosome in the O. niloticus genome, and LG7 is the second largest anchored LG in the new “O_niloticus_UMD1” assembly.
To determine the completeness of the new O_niloticus_UMD1 assembly, the assembly was compared against two established sets of core vertebrate gene sets. Table 4 shows the number of the 248 CEGMA and the 3,023 BUSCO conserved vertebrate genes that were identified in the new assembly. The number of conserved genes identified increased for both the CEGMA and BUSCO gene sets. The number of complete single-copy BUSCOs increased by 223 (10%), while the number of complete duplicated BUSCOs increased by 26 (59%). The number of missing BUSCOs decreased by 288 (67%) in the O_niloticus_UMD1 assembly compared to the Orenil1.1 assembly.
The O_niloticus_UMD1 assembly was annotated using the NCBI RefSeq automated eukaryotic genome annotation pipeline. This same pipeline was previously used to annotate the Orenil1.1 assembly. Several additional, new transcriptome datasets (particularly gill tissues, see Methods) were available to annotate the O_niloticus_UMD1 assembly that were not available during the Orenil1.1 annotation process. A comparison of both genome assembly annotations is provided in Table 5. The O_niloticus_UMD1 assembly contains 8,238 more gene and pseudogene annotations than the Orenil1.1 assembly (27.3% increase). Similarly, the number of mRNA annotations increased markedly by 10,374 (21.7% increase). The number of partial mRNA annotations decreased from 3,050 to 393 (87.1% decrease). CDS annotations also increased overall (21.9%). The RefSeq annotation pipeline makes corrections to CDS annotations that contain premature stop-codons, frameshifts and internal gaps that would disrupt protein sequence coding. These corrections are based on transcriptome data and corrected 743 CDSs in O_niloticus_UMD1 compared to 817 previously for Orenil1.1 (9.1% decrease). The number of non-coding RNAs more than doubled in the O_niloticus_UMD1 assembly (115.5% increase).
O_niloticus_UMD1 assembly summary
Table 6 provides summary statistics for the previous O. niloticus assembly (Orenil1.1), each intermediate of the new assembly, and our new final assembly (O_niloticus_UMD1). The O_niloticus_UMD1 assembly is more contiguous, with 45% fewer contigs than the number of scaffolds in Orenil1.1. The overall size of the O_niloticus_UMD1 assembly is 1.01Gbp compared to 927Mbp of Orenil1.1. The O_niloticus_UMD1 contains only 424 gaps that were introduced in the anchoring step. These anchoring gaps amount to 4.2Mbp (0.42%) due entirely to the arbitrary 10kbp gaps placed between anchored contigs. This compares to 111.5Mbp (12.04%) of gaps in Orenil1.1. Overall, 189.5Mbp of new sequence has been assembled in O_niloticus_UMD1 that was either previously in gaps or not assembled at all in Orenil1.1.
The TE and repeat portion of the genome is vastly under underrepresented in most genome assemblies . The use of long PacBio reads allowed for the assembly of more of the repetitive regions of the O. niloticus genome. 379Mbp (37.6%) of the total O_niloticus_UMD1 assembly was annotated as repetitive. Table 7 provides a summary of the repeat and TE families that were most abundant in the assembly. The new assembly includes an additional 146Mbp (14.6%) of repetitive sequence that was either hidden in gaps or not present at all in the previous assembly. The entire repeat catalog is provided in Additional file 4.
Figure 2 provides a comparison of the repeat landscape of the Orenil1.1 and O_niloticus_UMD1 assemblies. Most notably, recently inserted (~ < 5% Kimura divergence) TEs have been assembled in far greater number in the new O_niloticus_UMD1 assembly. The overall number of repetitive elements increased at all divergence levels (218,992 more elements, Additional file 4), with most at lower divergences (165,607 additional elements at < 5% Kimura divergence). Figure 2 suggests that TE insertions less than 1% diverged are still underrepresented in the assembly.
Satellite regions represent one of the most highly repetitive regions of the genome and are often associated with centromeric and heterochromatic regions. Two tilapia-specific satellite repeats have been previously described. ONSATA is a 209 bp repeat unit and shows variability between related tilapiine species . Only 29 copies of ONSATA (comprising a total of 2,917 bp) were assembled and annotated in the original Orenil1.1 assembly. In the new O_niloticus_UMD1 assembly, 226 regions of ONSATA comprising a total of 1,386,985 bp were assembled and annotated. Many of the ONSATA regions, the longest of which was 43,805 bp on the unanchored contig908, were composed of multiple, nested ONSATA copies. ONSATB is a 1,904 bp repeat unit that is organized in tandem arrays and appears to be more conserved and perhaps under selective constraint . 48 copies of ONSATB (comprising a total of 11,036 bp) were assembled and annotated in the original Orenil1.1 assembly. In the new O_niloticus_UMD1 assembly, 1,481 copies of ONSATB (comprising a total of 2,889,496 bp) were assembled and annotated. Again, many of the ONSATB regions were composed of multiple ONSATB copies, the longest of which was 11,210 bp located near the beginning of LG12 (607,345-618,555).
TEs specific to African cichlid species have been previously sequenced and used as molecular markers to study evolutionary history and phylogenetics of African cichlids [40, 41]. Some of these African cichlid specific or “AFC” LINEs and SINEs had been previously assembled and annotated in the Orenil1.1 assembly. An additional 2.3Mbp of AFC-specific TE sequence was annotated in the new O_niloticus_UMD1 assembly. This 2.3Mbp increase was assembled across 55 fewer AFC TE copies, which resulted in longer mean length AFC TE copies. This suggests that the previous assembly contained many fragmented AFC specific TE copies.
Recently duplicated regions
Recently duplicated genes are notoriously difficult to assemble due to their high sequence identity . Using short Illumina reads to assemble these regions is a difficult task even with mate-pair sequence data across multiple spatial scales. In a previous study of the tilapia vasa gene, we identified three partial gene sequences in the Orenil1.1 assembly . We then screened a tilapia BAC library for vasa gene sequences and identified three BAC clones containing vasa sequences. The three clones came from separate restriction fingerprint contigs , and represent duplications of the ancestral vasa gene. Sanger sequencing identified a full-length vasa gene in each of these BAC clones. Figure 3a shows how the previous Orenil1.1 assembly failed to correctly assemble any of the three vasa gene copies. Figure 3b indicates how these genes were assembled from each of the BAC clones. Figure 3c details how the new O_niloticus_UMD1 assembly correctly assembles the three copies of the vasa gene corresponding to the three BAC clones.
Sex determination regions
The new O_niloticus_UMD1 assembly was used to study sequence differentiation across two sex-determining regions in tilapias. The first region is an XX/XY sex-determination region on LG1 found in many strains of tilapia [9, 34, 44,45,46,47]. We previously characterized this region by whole genome Illumina re-sequencing of pooled DNA from males and females . We realigned these sequences to the new O_niloticus_UMD1 assembly and searched for variants that were fixed in the XX female pool and polymorphic in the XY male pool. Figure 4 shows the FST and the sex-patterned variant allele frequencies for the XX/XY O. niloticus comparison across the complete Orenil1.1 and O_niloticus_UMD1 assemblies, while Fig. 5 focuses on the highly differentiated ~9Mbp region on LG1 with a substantial number of sex-patterned variants, indicative of a reduction in recombination in a sex determination region that has existed for some time .
The second sex comparison is for an ZZ/WZ sex-determination region on LG3 in a strain of O. aureus [11, 49]. This region has not previously been characterized using whole genome sequencing. For this comparison we identified variant alleles fixed in the ZZ male pool and polymorphic in the WZ female pool. Figure 6 shows the FST and the sex-patterned variant allele frequencies for this comparison across the whole O_niloticus_UMD1 assembly, while Fig. 7 focuses on the differentiated region on LG3. O. aureus LG3 contains a large ~50Mbp region of differentiated sex-patterned variants, also indicative of a reduction in recombination in the sex determination region. Figure 6 also shows this differentiation pattern on several other LGs (LG7, LG9, LG14, LG16, LG18, LG22 and LG23). It is possible that these smaller regions of sex-patterned differentiation are actually translocations in O. aureus relative to the O. niloticus genome assembly.
The overall number of sex-patterned variants was markedly increased for both sex comparisons using the new assembly. Table 8 indicates this and provides the number of sex-patterned variants in each comparison across the whole genome as well as on the respective sex-determination LG. LG3 saw the largest gain of sex-patterned variants (1,445 to 24,983 variants) due to the fact that the LG3 assembly now includes 49.3Mbp of new sequence (Table 3).
We explored the parameter space of both the FALCON and Canu genome assembly packages and produced 37 candidate assemblies (Additional file 1). Since the true sequence is not known, we had to deduce which of the candidate assemblies best represented the true sequence of the homozygous clone. We elected to assess the assemblies with a variety of metrics, and to select the assembly that scored well across all of the most important metrics.
The first metric is the overall size of the assembly, which should closely match the estimated size of the genome. The size of the O. niloticus genome has been measured by both Feulgen densitometry and bulk fluorometric assay. Five separate measurements range between 0.95–1.20 picograms or ~0.929–1.174Gbp . The average genome size of these five estimates is 1.082Gbp. The various assemblies ranged in size from 975.1Mbp to 1.07Gbp. The assembly that was chosen (#14) has a length of 1.01Gbp, which corresponds to 93.3% of the estimated size of the genome.
The second set of metrics we considered were the standard measures of assembly contiguity such as NG50, number of contigs, longest contig and mean contig size. The third set of metrics consisted of assembly likelihood (ALE) scores, which were calculated by aligning four Illumina libraries (fragment, 3kbp, 6–7kbp, and 40kbp – Table 9, Methods) as well as the 44X PacBio library against each candidate assembly. The fourth metric measured the accuracy of the assemblies at larger scales by aligning the contigs to a ~29X clone coverage library of ~150kbp BAC-end sequences  and to existing genetic and physical maps of O. niloticus [33, 34]. Alignment of the RH and RAD maps to the candidate assemblies indicated that every assembly had a relatively low and consistent number of misassemblies (Additional file 1). The fifth metric assessed the completeness of each candidate genome assembly by looking for two core eukaryotic gene sets, CEGMA  and BUSCO .
No candidate assembly ranked the best for all of these different metrics. In order to choose a preferred assembly, we used principal component analysis to organize the several scores for each assembly. The PCA analysis showed a noticeable difference between the Canu assemblies and the FALCON assemblies (Additional file 2). All of the Canu assemblies clustered together in PCA space. The FALCON assemblies fell into two separate clusters because five of the FALCON assemblies (#17, 32, 34, 35, and 36, Additional file 1) had low ALE scores and NG50s. The other FALCON assemblies tended to show overall better ALE scores for the 44X PacBio library than did the Canu assemblies. This is due to differences in the consensus accuracy between Canu and FALCON assemblies. The 44X PacBio ALE placement and insert scores were virtually the same across all candidate assemblies, but the 44X PacBio ALE k-mer scores were lower for the Canu assemblies. This suggests a slight difference in consensus between Canu and FALCON, although it is probably not noticeable after the polishing steps.
Leaving aside the five low quality FALCON assemblies, a major tradeoff in the PCA is between the size of the assembly and the PacBio ALE score. The FALCON assemblies are all smaller than the Canu assemblies, and for the reasons discussed above, have higher ALE scores for the PacBio library. We elected to focus on the Canu assemblies, where the major tradeoff is between the quality of the assembly (ALE scores, NG50, completeness) and size of the assembly (Total size, exon bp mapped). Ultimately, we chose the assembly (Canu #14) with the best overall ALE average rank. This assembly was 28.8Mbp shorter than the longest Canu assembly (#15).
Alignment of the RH and RAD maps to the candidate assemblies indicated that every assembly had a relatively similar and low number of misassemblies (Additional file 1). To correct these misassemblies in the polished version of assembly #14, the locations of misassemblies were first narrowed using the RH and RAD map data together. This typically narrowed the location of a misassembly to a region of less than 1Mbp. From there, the region around each misassembly breakpoint was inspected using alignments of the PacBio data, Illumina data, RefSeq gene set, BAC-end sequences as well as the RepeatMasker annotations. A characteristic signal of high variation in the PacBio alignments, low physical coverage in the Illumina libraries (best characterized with the largest 40kbp Illumina library), and a high density of large and nested repeats was seen in each region of misassembly. Regions of high variation in the PacBio alignments and low 40kbp physical coverage were then calculated genome-wide to investigate whether additional misassemblies might be hidden in the assembly. When considering the PacBio highly variant regions and the low physical 40kbp coverage regions individually, both sets over-estimated the number of misassembly regions. These false-positive potential misassemblies occurred in regions where there was support for correct and continuous assembly based on both RH and RAD map alignments, which together lend stronger support. Only in two cases were there regions that had high PacBio variation, low physical 40kbp coverage and no alignment of RH or RAD map data. We decided to break the assembly at these two locations as well.
A total of 868.6Mbp of the assembled contigs were anchored to the 22 LGs in O_niloticus_UMD1. Overall, 258Mbp of additional (non-gap) sequence has been anchored in the O_niloticus_UMD1 assembly (Table 3). All but two of the O_niloticus_UMD1 LGs (LG5 and LG13) are larger in size than in the previous Orenil1.1 assembly. LG5 is 2.7Mbp smaller and LG13 is 0.4Mbp smaller. It is possible that the Orenil1.1 assembly correctly assembled more of these LG5 and LG13. Alternatively, the size difference could be due to overestimates of gap sizes in the Orenil1.1 assembly and/or incorrect assignment of contigs/scaffolds to the wrong LG, which have now been correctly assigned.
It should be noted that although two markers were required to anchor and orient any contig to a particular LG, not all of the markers in the RAD map were located at distinct map positions (i.e., the map has multiple markers at the same genetic position). Therefore, in some cases (particularly involving many of the smaller and repetitive contigs that were anchored to LG3b), the orientation of contigs on LGs is ambiguous. We chose to allow anchoring of these contigs to maximize the anchoring of the many small repetitive contigs that make up LG3.
Table 5 provides the RefSeq annotation summary of both the Orenil1.1 and new O_niloticus_UMD1 assemblies. The increase in gene and pseudogene annotations is at least partly due to the fact that the O_niloticus_UMD1 assembly contains an additional 189.5Mbp of sequence that was not present in Orenil1.1 as well as the fact that additional transcriptome reads were available for RefSeq annotation of O_niloticus_UMD1. These additional annotations include protein-coding genes (2,920, 11.1% increase), non-coding RNAs (5,091, 145.1% increase) and pseudogenes (227, 67.4% increase). At the same time, there was a decrease in the number of partial mRNA (2,657, 87.1% decrease) and partial CDS (2,066, 83.7% decrease) annotations. This is most likely due to the fact that O_niloticus_UMD1 gene annotations are not disrupted by assembly gaps. The remaining partial annotations may represent recent pseudogenes that the annotation pipeline has little way of differentiating.
The NCBI RefSeq annotation pipeline corrects CDS annotations that have premature stop-codons, frameshifts and internal gaps that would disrupt protein- coding sequence. The RefSeq annotation pipeline corrected 743 CDSs in O_niloticus_UMD1 compared to 817 previously for Orenil1.1. These remaining 743 CDS annotations that required corrections may be due to incomplete polishing in the final O_niloticus_UMD1 assembly, but this number is less than the amount of corrected CDSs annotated in the smaller Orenil1.1 assembly.
The vast majority of TE families are represented by more sequence in the new assembly (Table 7 and Additional file 4). It is likely that the fragmented Orenil1.1 assembly caused there to be an inflated count of annotated TE copies in places where gaps were inserted within TE copies. The O_niloticus_UMD1 assembly has assembled TE families in longer overall copies than in Orenil1.1 It is also likely that having longer repeat copies and overall 146Mbp more repeat sequence allowed for more accurate annotation of all repeat sequences. In turn, several TE families (such as SINE tRNA-V and LINE Dong-R4, Additional file 4) have decreased in overall number in the O_niloticus_UMD1 assembly, which is likely due to these TEs being more accurately annotated as different, but related TEs. The most recent and less diverged TE copies have been assembled in far greater number in the new O_niloticus_UMD1 assembly (Fig. 2).
The two tilapia-specific satellite repeats, ONSATA  and ONSATB , have been shown to be present in high copy number. Both of these satellite repeats have previously been physically mapped using fluorescent in situ hybridization (FISH) in O. niloticus . ONSATA was found almost solely in the centromeres, while ONSATB was also scattered throughout the length of each chromosome arm. Consistent with this, we found nested ONSATA repeat segments assembled near the very ends of several anchored chromosomes (LG3b, LG4, LG8, LG14, and LG17). ONSATB nested repeat segments were found near one or both ends of several anchored chromosomes (LG2, LG3a, LG3b, LG4, LG6, LG11, LG12, LG14, LG16, LG17, LG18, LG19, LG20, and LG23). These data suggest that our assembly of these chromosomes extend into the centromeres. These satellite nested repeats were also abundant in several of the misassembled regions (Table 2) suggesting that they remain an obstacle to complete assembly of the genome.
Recently duplicated regions
As the recent vasa gene duplication in O. niloticus (Fig. 3) shows, the use of long reads has enabled the assembly of such recently duplicated regions. It is likely that there are many other recently duplicated regions that have now been assembled. This is supported by the genome completeness analysis with BUSCO that showed there were 26 additional duplicated BUSCOs out of 3023 searched (Table 4). Even though this is a small percentage of the genes analyzed (0.86%), when extrapolated over all the genes in the genome this would amount to hundreds of recently duplicated genes being assembled for the first time. The RefSeq annotation shows that the O_niloticus_UMD1 assembly contained 227 additional pseudogenes (67.4% increase from the Orenil1.1 assembly), which also supports this notion.
Sex determination regions
Manipulation of sex-determination in tilapia has important economic implications. The O_niloticus_UMD1 assembly was used to confirm the known and previously described O. niloticus ~9Mbp sex-determination region on LG1 . The size and pattern of sex differentiation on LG1 and across the genome is similar in both the Orenil1.1 and O_niloticus_UMD1 assemblies (Figs. 4 and 5). A total of 331 additional LG1 sex-patterned variants are identified in the O_niloticus_UMD1 assembly.
The sex-determination region in O. aureus is located on the large and highly repetitive LG3. Due to the fact that LG3 is highly repetitive, it was poorly assembled in Orenil1.1 and the vast amount of sex-patterned variants were previously found on unanchored contigs and scaffolds (Fig. 6a and b). An additional 23,538 LG3-specific O. aureus sex-patterned variants are identified in the O_niloticus_UMD1 assembly. Now that LG3 has been assembled and anchored into a much larger LG (68.5Mbp versus 19.3Mbp, Table 3), many of these sex-patterned variants are confirmed on LG3 (Fig. 6c and d). There still exist a substantial number of sex-patterned variants on unanchored contigs in the new assembly. The overall pattern of O. aureus sex differentiation on LG3 is characterized by several sharp transitions between low and high differentiation (e.g., ~5Mbp and ~37Mbp, Fig. 7c and d). These sharp transitions may be explained by either errors in the anchoring process or structural differences between the reference species (O. niloticus) and O. aureus. Indeed, there are several peaks of differentiation on other LGs (LG7, LG9, LG14, LG16, LG18, LG22 and LG23, Fig. 6). These may also be chromosomal translocation differences between the two species that will need to be investigated further with FISH.
This study provides a new assembly and annotation of the Nile tilapia O. niloticus (O_niloticus_UMD1), which provides a high-quality reference for the cichlid research community as well as one for studying the evolution of vertebrate genomes. The study also serves as a template for vertebrate genome assembly with current technology and describes many of genomic features that can now be represented correctly. Generation of O_niloticus_UMD1 began by comparing candidate de novo assemblies systematically comparing them to select a single best assembly. A small number of misassemblies present in this candidate assembly were identified using several different datasets and subsequently corrected. The final anchored O_niloticus_UMD1 assembly remained very contiguous with a contig NG50 of 3.1Mbp and 86% of contigs anchored to LGs. The number of annotated genes increased 27.3% from the previous assembly of O. niloticus. Additionally, a vast amount of repetitive sequences (~146Mbp) were added in the O_niloticus_UMD1 assembly, many of which represent very recent TEs. Finally, the O_niloticus_UMD1 assembly was used to better characterize two large sex-determination regions. The first is a ~9MBp region in O. niloticus and the second is a ~50Mbp region in the related species O. aureus. Further characterization of these sex-determination regions will have important economic implications for farmed tilapia.
PacBio sequencing was performed on a new individual from the same XX homozygous clonal line used for the previous whole genome sequencing of O. niloticus . This mitogynogenetic line was developed and maintained at the University of Stirling, UK . All working procedures complied with the UK Animals (Scientific Procedures) Act .
The Qiagen MagAttract HMW DNA kit was used to extract high-molecular weight DNA from a nucleated blood cell sample of the female “F11D_XX” individual. Size selection was performed at the Genomics Resource Center, Institute for Genome Sciences using a Blue Pippin pulse-field gel electrophoresis instrument. A library was constructed and 63 SMRT cells were sequenced on their PacBio RS II instrument using the P6-C4 chemistry.
Both Canu  (version 1.0) and FALCON  (versions 0.3.0 and 0.5.0) were run to generate candidate de novo genome assemblies. The wide range of parameters tested for both algorithms are provided in Additional file 5. The final assembly (#14), chosen based on the evaluation and likelihood calculations (see below), was run using Canu with the following relevant parameters: ‘minReadLength = 7000 minOverlapLength = 2000 MhapSensitivity = high genomeSize = 1 g errorRate = 0.025 -pacbio-raw’.
Assembly accuracy measurements
Assembly summary metrics were calculated using the assemblathon_stats.pl script . Illumina libraries generated previously  were aligned to each candidate de novo assembly using Bowtie2 (version 2.2.5 in ‘--very-sensitive’ mode). The four different insert size Illumina libraries used are presented in Table 9.
For each SRA run, raw reads were downloaded from NCBI using the ‘fastq-dump’ program from the SRA Toolkit  (version 2.5.2). Raw fastq files were combined for each insert size group and Trimmomatic  (version 0.32) was run on the combined fastq files. The 101 bp fragment library and 3kbp library reads were each trimmed with the following Trimmomatic settings:
‘ILLUMINACLIP:TruSeq2-PE.fa:2:30:10 SLIDINGWINDOW:4:20 LEADING:10 TRAILING:10 CROP:101 HEADCROP:0 MINLEN:80’
The 36 bp 6-7kbp library reads were trimmed with the following settings:
‘ILLUMINACLIP:TruSeq2-PE.fa:2:30:10 SLIDINGWINDOW:4:20 LEADING:10 TRAILING:10 CROP:36 HEADCROP:0 MINLEN:31’
The 76 bp 40kbp library reads were trimmed with the following settings:
‘ILLUMINACLIP:TruSeq2-PE.fa:1:30:10 SLIDINGWINDOW:10:20 LEADING:5 TRAILING:10 CROP:76 HEADCROP:4 MINLEN:70’
For the fragment library, the trimmed and filtered reads were next overlapped with FLASH  (version 1.2.11) using the following parameters: ‘-m 20 -× 0.15 -z’. The samtools  (version 1.1) ‘view’ and ‘sort’ commands were used to convert the Bowtie2 SAM outputs to BAM format. The Picard  (version 2.1.0) ‘MarkDuplicates’ program was run on each of these Bowtie2 alignments with ‘REMOVE_DUPLICATES = true’.
The Assembly Likelihood Estimator (ALE)  was then run on each of these filtered BAM files to generate likelihood statistics for each candidate Canu and FALCON de novo assembly for each Illumina library. Additionally, to generate ALE scores for the raw PacBio data aligned to each assembly, the 44X raw PacBio reads were aligned using BLASR  (version 184.108.40.206046) with the following parameters: ‘-minMatch 8 -minPctIdentity 70 -bestn 1 -nCandidates 10 -maxScore -500 -nproc 40 -noSplitSubreads –sam’. ALE was then run on these BLASR alignments as well.
A set of O. niloticus paired BAC-end sequences  were aligned against each candidate assembly using BLAST [59,60,61] (version 2.3.0+). The top hit with an E-value less than 1e-150 were kept and then assigned a category of alignment relative to the candidate assemblies according to the details described previously  and briefly explained for Additional file 1.
To evaluate the completeness of the candidate assemblies, BUSCO  (version 1.22) was run (in ‘-m genome’ mode) using the ‘vertebrata’ lineage-specific profile library. CEGMA  (version 2.5) was also run on each of the candidate assemblies. CEGMA was run optimized for vertebrate genomes (option ‘--vrt’) and relied on GeneWise (version 2.4.1), HMMER (version 3.1b1), and NCBI BLAST+ (version 2.3.0+) using the provided set of 248 CEGs.
Principal component analysis
The following metrics were calculated and culled for each of the 37 candidate assemblies: Total ALE score for the aligned Illumina fragment, 3kbp, 6-7kbp, and 40kbp libraries; Total ALE score for the aligned PacBio library; Total number of complete CEGs as defined by CEGMA; Longest contig; NG50; Total assembly size (bp); Total number of RefSeq exon bp mapped. O. niloticus RefSeq transcripts  (release 70) were aligned to each of the candidate assemblies using GMAP  (version 2015-07-23) and exon bp mapped were calculated from the output GFF3 file. R version 3.2.3 was used to perform the PCA analysis using the ‘prcomp’ function with ‘center = TRUE, scale = TRUE’ and to create plots with the ‘biplot’ function.
Polishing the assembly
SMRT-Analysis  (version 220.127.116.11936) was used for polishing the Canu #14 assembly using the 44X raw PacBio reads. First, each SMRT cell was separately aligned to the unpolished Canu assembly using pbalign (version 0.2.0.138342) with the ‘--forQuiver’ flag. Next, cmph5tools.py (version 0.8.0) was used to merge and sort (with the ‘--deep’ flag) the pbalign.h5 output files for each SMRT cell. Finally, Quiver (GenomicConsensus version 0.9.2 and ConsensusCore version 0.8.8) was run on the merged and sorted pbalign output to produce an initial polished assembly.
Pilon  (version 1.18) was run on the intermediate Quiver-polished assembly produced above. Again, Bowtie2 (version 2.2.5 in ‘--very-sensitive’ mode) was used to align Illumina reads to this intermediate assembly for Pilon polishing. The fragment library alignment was supplied to Pilon with ‘--unpaired’ while the other 3 insert library alignments were specified with ‘--jumps’. Additionally, Pilon was run with the following parameters: ‘--changes --vcf --chunksize 40000000 --fix all’.
The 44X coverage raw PacBio reads were aligned to the Quiver- and Pilon-polished Canu #14 assembly using BLASR  (version 18.104.22.168046) with the same parameters as mentioned above. Variants were called using FreeBayes  (version v1.0.2-33-gdbb6160-dirty). To facilitate FreeBayes processing, regions of the polished assembly were broken into 500kbp chunks using the FreeBayes “fasta_generate_regions.py” script. The separate VCF output files were then concatenated using the VCFtools  ‘vcf-concat’ program. The FreeBayes utility ‘vcffilter’ was used to filter these variants for quality greater than 10 (‘-f “QUAL > 10” ’). VCFtools was then used to compute variant density by specifying ‘--SNPdensity 10000’ to calculate variant density in 10kbp windows. Highly variant regions were flagged if there were more than 1 variant per 1kbp over a 10kbp window.
The 40kbp mate-pair Illumina reads of the same homozygous inbred O. niloticus line  were downloaded from the NCBI SRA (SRR071595, SRR071598, and SRR071611). Trimmomatic  (version 0.32) was run to remove adaptor sequences and to trim/quality filter these reads. The relevant parameters for Trimmomatic were ‘PE -phred33 ILLUMINACLIP:TruSeq2-PE.fa:1:30:10 SLIDINGWINDOW:10:20 LEADING:5 TRAILING:10 CROP:76 HEADCROP:4 MINLEN:70’. The trimmed and filtered reads were combined and aligned to the polished assembly using BWA mem  (version 0.7.12-r1044) with the ‘-M’ flag. The Picard  (version 2.1.0) ‘SortSam’ program was used to convert the SAM output to BAM (‘SORT_ORDER = coordinate’) and the Picard ‘MarkDuplicates’ program was used to identify duplicate reads. The physical coverage of the 40kbp mate-pairs was calculated on a per-contig basis using a series of piped samtools  (version 1.1) and bedtools (version v2.26.0) commands using the following template, where ‘contig’ and ‘contig_size’ are the specific contig and its respective size: ‘samtools sort -no < (samtools view -bh -F 2 -q 1 40 kb.bam contig) tmp | bamToBed -i stdin -bedpe | cut -f 1,2,6 | sort -k 1,1 | bedtools genomecov -i stdin -g < (echo -e “contig\tcontig_size\n”) -bga -pc | grep ^contig > output’. Regions within 200kbp of the start or end of a contig were then excluded from this analysis. Regions below 20× physical coverage of 40kbp mate-pair reads were flagged.
Regions of high variant density within 20kbp of each other, based on raw PacBio alignments, were merged using the bedtools ‘merge’ program (‘-d 20000’). The same merging of windows was performed for regions of low physical coverage based on the 40kbp mate-pair library. The bedtools ‘intersect’ program was then used to determine regions of high-density PacBio variants and low 40kbp mate-pair physical coverage that overlapped by at least 80% in the high-density PacBio variants merged windows (‘-f 0.8’).
Regions of both high-density PacBio variants and low 40kbp mate-pair physical coverage were compared to the alignments of the RH map and RAD map to confirm or contradict the putative misassemblies. Putative misassembled regions were manually inspected using the BLASR and BWA alignments using IGV . In addition to these tracks, both RefSeq  (release 70) O. niloticus transcripts aligned to the polished Canu assembly using GMAP  (version 2015-07-23) and RepeatMasker  repeat annotations were considered when defining the exact location of a misassembly. Break locations were chosen so that they did not occur within RefSeq transcripts or within single repeat annotations. The REAPR  (version 1.0.18) ‘break’ program was used to break and fix the polished Canu assembly by providing the determined break locations.
Anchoring with chromonomer
Chromonomer  (version 1.05) was first used to anchor the polished and misassembly-corrected assembly using the RH map for O. niloticus . This initial anchored assembly was then subsequently anchored again with a RAD map for O. niloticus . BWA mem (version 0.7.12-r1044) was used in both Chromonomer runs to create the input SAM file by aligning respective map marker sequences to the appropriate intermediate assembly. A minimum of two markers were required to anchor a contig to a particular LG. Gaps of 10kbp were placed between anchored contigs using ‘--join_gap_size 10000’ in Chromonomer. Several RH linkage groups required manual placement were fixed by replacing their entries in the SAM file used by Chromonomer. The RH map LGs that were not anchored using the RAD map (“LOD4.9-RH10- LG10”, “LOD6.5-RH17- LG15”, and “LOD5.7-RH31- LG3”) were manually placed onto the final LGs by using the additional mapping data provided in the previous publication, ‘Additional file 4 . Data S4’ of  which integrated FISH mapping of BAC markers and an previous genetic map . Three RH LGs also had to be fixed as they contained a number of repetitive markers, which was causing them to be anchored to incorrect linkage groups in the RAD map (“LOD4.5-RH5- LG9,” LOD6.9-RH6-LG5.rev”, and “LOD5.1-RH8-LG13”).
To further evaluate the candidate assemblies described above, the Chromonomer output file ‘problem_scaffolds.tsv’ was used to count the number of contigs in each assembly that had multiple markers that mapped to two or more separate linkage groups.
The O_niloticus_UMD1 assembly was submitted to NCBI to perform the Eukaryotic Genome Annotation Pipeline . This automated pipeline masks the assembly, and aligns existing transcript, protein, RNA-seq, and curated RefSeq sequences to it. Gene prediction based on these alignments is performed and the best gene models are selected among the RefSeq and predicted models which are then made available as the annotation release. The O_niloticus_UMD1 assembly was annotated as “annotation release 103”  using software version 7.2 on December 5 2016, while the previous “annotation release 102”  consisted of the Orenil1.1 annotation using software version 6.4 on July 30 2015. The Orenil1.1 annotation used 1,319,429,488 reads available and the O_niloticus_UMD1 used 2,295,445,708 reads available at the times of the respective annotations. The newer transcriptome datasets were derived from testis, ovary, liver, and gill tissues. Only the gill tissue was not present in the annotation of Orenil1.1. The numbers in Table 5 were extracted from these summaries.
The annotation of repetitive elements was run on several of the intermediate assemblies as well as the final O_niloticus_UMD1 assembly. For each of these assemblies, RepeatModeler  (version open-1.0.8) was first used to identify and classify de novo repeat families present in each assembly. These de novo repeats were then combined (separately for each assembly) with the RepBase-derived RepeatMasker libraries . RepeatMasker  (version open-4.0.5) was then run on each of these assemblies using NCBI BLAST+ (version 2.3.0+) as the engine (‘-e ncbi’) and specifying the combined repeat library (‘-lib’). The more sensitive slow search mode (‘-s’) was used.
Analysis of duplicated Vasa regions
The vasa transcript (NCBI accession AB032467.1) was aligned to three assembled BAC clones (NCBI accessions AB649031-AB649033) corresponding to the three copies of vasa present in the O. niloticus genome  using GMAP  (version 2015-07-23). The vasa transcript was also aligned to the scaffolds of the Orenil1.1 assembly and the final anchored O_niloticus_UMD1 assembly. IGV was used to generate images displaying the transcript alignments of the duplicated vasa genes.
Sex comparisons were run on the O_niloticus_UMD1 assembly for two species of tilapia, O. niloticus and O. aureus. The O. niloticus sequence data used in this study was previously described . The O. aureus individuals used were F1 individuals derived from a stock originally provided by Dr. Gideon Hulata (Institute of Animal Science, Agricultural Research Organization, The Volcani Center, Bet Dagan, Israel) and maintained at University of Maryland. These animal procedures were conducted in accordance with University of Maryland IACUC Protocol #R-10-74. A total of 58 O. niloticus XY males, 33 O. niloticus XX females, 22 O. aureus ZZ males and 22 O. aureus WZ females were pooled separately, sheared to ~500 bp on a Covaris shearer, and sequenced on an Illumina HiSeq 2000. The reads from each pool were separately mapped to O_niloticus_UMD1 using BWA mem (v0.7.12). The alignments were sorted and duplicates were marked with Picard (v2.1.0). Alignments were converted into an mpileup file using Samtools (v0.1.18) and subsequently into a sync file using Popoolation2 (v1201) . Estimates of FST and analyses of sex-patterned variants (SNPs and short deletions that are fixed or nearly fixed in the homogametic sex and in intermediate frequency in the heterogametic sex) were carried out using Sex_SNP_finder_GA.pl (https://github.com/Gammerdinger/sex-SNP-finder). For the O. niloticus sex comparison, the XX females were set to be the homogametic sex. For the O. aureus comparison, the ZZ males were set to be the homogametic sex.
Assembly likelihood estimate
Bacterial artificial chromosome
Principal component analysis
Restriction site associated DNA
FAO. World Review of Fisheries and Aquaculture. 2012. Available from: http://www.fao.org/docrep/016/i2727e/i2727e01.pdf.
FAO. Cultured Aquatic Species Information Programme. In: FAO Fisheries and Aquaculture Department. Available from: http://www.fao.org/fishery/culturedspecies/Oreochromis_niloticus/en.
Mair GC, Abucay JS, Skibinski DOF, Abella TA, Beardmore JA. Genetic manipulation of sex ratio for the large-scale production of all-male tilapia Oreochromis niloticus. Can J Fish Aquat Sci. 1997;54(2):396–404.
Hickling C. The Malacca tilapia hybrids. J Genet. 1960;57:1–10.
Little D, Hulata G. Strategies for tilapia seed production. In: Beveridge M, McAndrew B, editors. Tilapias: biology and exploitation. 2000. p. 267–326.
Baroiller JF, D’Cotta H, Bezault E, Wessels S, Hoerstgen-Schwark G. Tilapia sex determination: Where temperature and genetics meet. Comp Biochem Physiol A Mol Integr Physiol. 2009;153:30–8.
Wessels S, Sharifi RA, Luehmann LM, Rueangsri S, Krause I, Pach S, et al. Allelic variant in the anti-müllerian hormone gene leads to autosomal and temperature-dependent sex reversal in a selected nile tilapia line. PLoS One. 2014;9(8):e104795.
Palaiokostas C, Bekaert M, Khan MG, Taggart JB, Gharbi K, McAndrew BJ, et al. A novel sex-determining QTL in Nile tilapia (Oreochromis niloticus). BMC Genomics. 2015;16(1):1–10.
Lee BY, Coutanceau JP, Ozouf-Costaz C, D’Cotta H, Baroiller JF, Kocher TD. Genetic and physical mapping of sex-linked AFLP markers in Nile tilapia (Oreochromis niloticus). Mar Biotechnol (NY). 2011;13(3):557–62.
Eshel O, Shirak A, Weller JI, Hulata G, Ron M. Linkage and physical mapping of sex region on LG23 of Nile Tilapia (Oreochromis niloticus). G3. 2012;2(1):35–42.
Lee B-YB, Hulata G, Kocher TD. Two unlinked loci controlling the sex of blue tilapia (Oreochromis aureus). Heredity. 2004;92(6):543–9.
Ser JR, Roberts RB, Kocher TD. Multiple interacting loci control sex determination in lake Malawi cichlid fish. Evolution. 2010;64(2):486–501.
Roberts RB, Ser JR, Kocher TD. Sexual conflict resolved by invasion of a novel sex determiner in Lake Malawi cichlid fishes. Science. 2009;326(5955):998–1001.
Parnell NF, Streelman JT. Genetic interactions controlling sex and color establish the potential for sexual conflict in Lake Malawi cichlid fishes. Heredity. 2013;110(3):239–46.
Li M, Sun Y, Zhao J, Shi H, Zeng S, Ye K, et al. A Tandem duplicate of anti-müllerian hormone with a missense SNP on the Y chromosome is essential for male sex determination in Nile Tilapia, Oreochromis niloticus. PLoS Genet. 2015;11(11):1–23.
Brawand D, Wagner CE, Li YI, Malinsky M, Keller I, Fan S, et al. The genomic substrate for adaptive radiation in African cichlid fish. Nature. 2014;513:375–81.
Conte MA, Kocher TD. An improved genome reference for the African cichlid, Metriaclima zebra. BMC Genomics. 2015;16(1):724.
Denton JF, Lugo-Martinez J, Tucker AE, Schrider DR, Warren WC, Hahn MW. Extensive error in the number of genes inferred from draft genome assemblies. PLoS Comput Biol. 2014;10(12):e1003998.
Copetti D, Wing RA. The dark side of the genome: revealing the native transposable element/repeat content of eukaryotic genomes. Mol Plant. 2016;9(12):1664–6.
Alkan C, Sajjadian S, Eichler EE. Limitations of next-generation genome sequence assembly. Nat Methods. 2011;8(1):61–5.
Chaisson MJP, Huddleston J, Dennis MY, Sudmant PH, Malig M, Hormozdiari F, et al. Resolving the complexity of the human genome using single-molecule sequencing. Nature. 2014;517:608–11.
Chin C-S, Peluso P, Sedlazeck FJ, Nattestad M, Concepcion GT, Clum A, et al. Phased diploid genome assembly with single molecule real-time sequencing. Nat Methods. 2016;13:1050–4.
Seo J-S, Rhie A, Kim J, Lee S, Sohn M, Kim C-U, et al. De novo assembly and phasing of a Korean human genome. Nature. 2016;538(7624):243–7.
Shi L, Guo Y, Dong C, Huddleston J, Yang H, Han X, et al. Long-read sequencing and de novo assembly of a Chinese genome. Nat Commun. 2016;7(1):12065.
Vij S, Kuhl H, Kuznetsova IS, Komissarov A, Yurchenko AA, Van Heusden P, et al. Chromosomal-level assembly of the asian seabass genome using long sequence reads and multi-layered scaffolding. PLoS Genet. 2016;12(4):e1005954.
Sarder MRI, Penman DJ, Myers JM, McAndrew BJ. Production and propogation of fully inbred clonal lines in the Nile Tilapia (Oreochromis niloticus L.). J Exp Zool. 1999;284(6):675–85.
Koren S, Walenz BP, Berlin K, Miller JR, Phillippy AM. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. bioRxiv. 2016;71282.
Clark SC, Egan R, Frazier PI, Wang Z. ALE: a generic assembly likelihood evaluation framework for assessing the accuracy of genome and metagenome assemblies. Bioinformatics. 2013;29(4):435–43.
Parra G, Bradnam K, Korf I. CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics. 2007;23(9):1061–7.
Simão FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM. BUSCO: Assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 2015;31(19):3210–2.
O’Leary NA, Wright MW, Brister JR, Ciufo S, Haddad D, McVeigh R, et al. Reference sequence (RefSeq) database at NCBI: Current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2016;44(D1):D733–45.
Soler L, Conte MA, Katagiri T, Howe AE, Lee B-Y, Amemiya C, et al. Comparative physical maps derived from BAC end sequences of tilapia (Oreochromis niloticus). BMC Genomics. 2010;11(1):636.
Guyon R, Rakotomanga M, Azzouzi N, Coutanceau JP, Bonillo C, D’Cotta H, et al. A high-resolution map of the Nile tilapia genome: a resource for studying cichlids and other percomorphs. BMC Genomics. 2012;13:222.
Palaiokostas C, Bekaert M, Khan MGQ, Taggart JB, Gharbi K, McAndrew BJ, et al. Mapping and validation of the major sex-determining region in Nile Tilapia (Oreochromis niloticus L.) using RAD sequencing. PLoS One. 2013;8(7):1–9.
Gregory TR. Animal genome size database. 2016. Available from: http://www.genomesize.com.
Poletto AB, Ferreira I a, Cabral-de-Mello DC, Nakajima RT, Mazzuchelli J, Ribeiro HB, et al. Chromosome differentiation patterns during cichlid fish evolution. BMC Genet. 2010;11(1):50.
Ferreira IA, Poletto AB, Kocher TD, Mota-Velasco JC, Penman DJ, Martins C. Chromosome evolution in African cichlid fish: contributions from the physical mapping of repeated DNAs. Cytogenet Genome Res. 2010;129(4):314–22.
Franck J, Wright J, McAndrew B. Genetic variability in a family of satellite DNAs from tilapia (Pisces: Cichlidae). Genome. 1992;35(5):719–25.
Franck JPC, Wright JM. Conservation of a satellite DNA sequence (SATB) in the tilapiine and haplochromine genome (Pisces: Cichlidae). Genome. 1993;36(1):187–94.
Takahashi K, Terai Y, Nishida M, Okada N. A novel family of short interspersed repetitive elements (SINEs) from cichlids: the patterns of insertion of SINEs at orthologous loci support the proposed monophyly of four major groups of cichlid fishes in Lake Tanganyika. Mol Biol Evol. 1998;15(4):391–407.
Terai Y, Takahashi K, Nishida M, Sato T, Okada N. Using SINEs to probe ancient explosive speciation: “Hidden” radiation of African cichlids? Mol Biol Evol. 2003;20(6):924–30.
Fujimura K, Conte MA, Kocher TD. Circular DNA intermediate in the duplication of Nile tilapia vasa genes. PLoS One. 2011;6(12):e29477.
Katagiri T, Kidd C, Tomasino E, Davis JT, Wishon C, Stern JE, et al. A BAC-based physical map of the Nile tilapia genome. BMC Genomics. 2005;6:89.
Cnaani A, Lee BY, Zilberman N, Ozouf-Costaz C, Hulata G, Ron M, et al. Genetics of sex determination in tilapiine species. Sex Dev. 2008;2(1):43–54.
Lee BY, Penman DJ, Kocher TD. Identification of a sex-determining region in Nile tilapia (Oreochromis niloticus) using bulked segregant analysis. Anim Genet. 2003;34(5):379–83.
Ezaz MT, Harvey SC, Boonphakdee C, Teale AJ, McAndrew BJ, Penman DJ. Isolation and physical mapping of sex-linked AFLP markers in Nile tilapia (Oreochromis niloticus L.). Mar Biotechnol. 2004;6(5):435–45.
Gammerdinger WJ, Conte MA, Baroiller J-F, D’Cotta H, Kocher TD. Comparative analysis of a sex chromosome from the blackchin tilapia, Sarotherodon melanotheron. BMC Genomics. 2016;17(1):808.
Gammerdinger WJ, Conte MA, Acquah EA, Roberts RB, Kocher TD. Structure and decay of a proto-Y region in Tilapia, Oreochromis niloticus. BMC Genomics. 2014;15(1):975.
Campos-Ramos R, Harvey SC, Masabanda JS, Carrasco LAP, Griffin DK, McAndrew BJ, et al. Identification of putative sex chromosomes in the blue tilapia, Oreochromis aureus, through synaptonemal complex and fish analysis. Genetica. 2001;111(1–3):143–53.
Ferreira IA, Martins C. Physical chromosome mapping of repetitive DNA sequences in Nile tilapia Oreochromis niloticus: Evidences for a differential distribution of repetitive elements in the sex chromosomes. Micron. 2008;39(4):411–8.
Parliament of the United Kingdom. The Animals (Scientific Procedures) Act 1986, revised.
Bradnam K. assemblathon2-analysis/assemblathon_stats.pl. Available from: https://github.com/ucdavis-bioinformatics/assemblathon2-analysis/blob/master/assemblathon_stats.pl.
NCBI SRA Toolkit. Available from: https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software#header-global.
Bolger AM, Lohse M, Usadel B, Planck M, Plant M, Mühlenberg A. Trimmomatic : a flexible trimmer for illumina sequence data. Bioinformatics. 2014;30(15):2114–20.
Magoč T, Salzberg SL. FLASH: fast length adjustment of short reads to improve genome assemblies. Bioinformatics. 2011;27(21):2957–63.
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The sequence alignment/map format and SAMtools. Bioinformatics. 2009;25(16):2078–9.
Picard Tools - By Broad Institute. Available from: http://broadinstitute.github.io/picard.
Chaisson MJ, Tesler G. Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC Bioinformatics. 2012;13(1):238.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215(3):403–10.
Altschul SF, Madden TL, Schaffer A, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PS I-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25(17):3389–402.
Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009;10(1):421.
Wu TD, Watanabe CK. GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics. 2005;21(9):1859–75.
PacificBiosciences/SMRT-Analysis. Available from: https://github.com/PacificBiosciences/SMRT-Analysis.
Walker BJ, Abeel T, Shea T, Priest M, Abouelliel A, Sakthikumar S, et al. Pilon: An integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS One. 2014;9(11):e112963.
Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing. arXiv Prepr arXiv:1207.3907. 2012;9.
Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, et al. The variant call format and VCFtools. Bioinformatics. 2011;27(15):2156–8.
Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv Prepr. 2013;0(0):1–2.
Thorvaldsdóttir H, Robinson JT, Mesirov JP. Integrative Genomics Viewer (IGV): High-performance genomics data visualization and exploration. Brief Bioinform. 2013;14(2):178–92.
Smit, AFA, Hubley, R & Green P. RepeatMasker Open-4.0. 2010.
Hunt M, Kikuchi T, Sanders M, Newbold C, Berriman M, Otto TD. REAPR: a universal tool for genome assembly evaluation. Genome Biol. 2013;14(5):R47.
Catchen J, Amores A. Chromonomer. Available from: http://catchenlab.life.illinois.edu/chromonomer/.
Lee B-Y, Lee W-J, Streelman JT, Carleton KL, Howe AE, Hulata G, et al. A second-generation genetic linkage map of tilapia (Oreochromis spp.). Genetics. 2005;170(1):237–44.
NCBI. The NCBI Eukaryotic Genome Annotation Pipeline. Available from: https://www.ncbi.nlm.nih.gov/genome/annotation_euk/process/.
NCBI. NCBI Oreochromis niloticus Annotation Release 103. Available from: https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Oreochromis_niloticus/103/.
NCBI. NCBI Oreochromis niloticus Annotation Release 102. Available from: https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Oreochromis_niloticus/102/.
Smit, AFA, Hubley R. RepeatModeler Open-1.0. 2010.
Bao W, Kojima KK, Kohany O. Repbase update, a database of repetitive elements in eukaryotic genomes. Mob DNA. 2015;6(1):11.
Kofler R, Pandey RV, Schlötterer C. PoPoolation2 : Identifying differentiation between populations using sequencing of pooled DNA samples (Pool-Seq). Bioinformatics. 2011;27(24):3435–6.
We thank Luke Tallon and Naomi Sengamalay at the Genomics Resource Center, Institute for Genome Sciences for providing a high quality PacBio library and sequence reads. We thank Gideon Hulata of the Agricultural Research Organization in Israel for providing the original stocks of O. aureus that we maintained for this study. We thank Christopher Hill at University of Washington for helpful guidance on assembly parameters. We thank the members of the Kocher and Carleton labs at UMD for providing thoughtful advice during the course of the project. We thank Karen Carleton for reading and providing thoughtful feedback on the manuscript. We thank Christos Palaiokostas (Roslin Institute) and Michaël Bekaert (University of Stirling) for providing the marker sequences of the RAD map. We thank Lucile Soler (Bioinformatics Infrastructure for Life Sciences) for BAC-end analysis scripts. We thank the NCBI RefSeq curators for annotation of the O_niloticus_UMD1 genome assembly. We acknowledge the University of Maryland supercomputing resources (www.glue.umd.edu/hpcc/) made available in conducting the research reported in this paper.
This work was supported by the United States Department of Agriculture under project number MD.W-2014-05906 to improve and close the tilapia genome via PacBio sequencing. The work was also funded by National Institutes of Health project R01-EY024639 providing partial salary support for MAC.
Availability of data and materials
Sequencing data is available via NCBI using the accessions provided below.
Female O. niloticus pool: SRR1606304
Male O. niloticus pool: SRR1606298
Female O. aureus pool: SRR5121055
Male O. aureus pool: SRR5121056
44X O. niloticus PacBio reads: SRP093160
O_niloticus_UMD1 Assembly: MKQE00000000
MAC and TDK conceived and designed study. DJP provided the blood sample from the homozygous clonal line of O. niloticus and KLB optimized and performed the HMW DNA extraction. MAC performed the genome assembly, evaluation and analyses. WJG and TDK raised specimens, performed DNA extraction and prepared sequencing library construction of the O. aureus samples. WJG organized map data and performed the computational analysis of the sex comparisons. MAC and TDK wrote the paper. All authors read and approved the manuscript.
The authors declare that they have no competing interests.
Consent for publication
Ethics approval and consent to participate
All working animal procedures complied with the UK Animals (Scientific Procedures) Act  under project license number PPL 60/4397 and were performed at the University of Stirling, UK. The animal used for this study was developed and maintained at the University of Stirling, UK.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended assembly metrics of the 37 candidate assemblies. Assembly statistics for Canu assemblies (#1-16), FALCON assemblies (#17-37), total and composite ALE scores for each assembly with rankings, and CEGMA/BUSCO results for each assembly. BAC-end types are defined as: type 1 BAC-ends have only a single read-pair aligned, type 2 BAC-ends have both read-pairs aligned in correct orientation, type 3 BAC-ends have incorrect orientation or mapping distance, and type 4 BAC-end read-pairs match separate contigs. (XLSX 134 kb)
PCA analysis of candidate assemblies. PCA analysis of the 37 candidate assemblies (assembly numbers listed in Additional file 1 and Additional file 5) is composed of total size (bp), exon bp mapped, complete CEGMA CEGs, NG50, longest contig, and overall ALE scores (Illumina fragment, 3kbp, 6–7kbp, 40kbp and 44X PacBio libraries). (PDF 5 kb)
Example misassembly signature. An example misassembly identified by both RH and RAD maps showing the characteristic signature of high variation in the 44X PacBio read alignments as well as low coverage in the 40kbp Illumina mate-pair library and high density of repetitive elements. (PNG 37 kb)
Extended repeat annotations of O_niloticus_UMD1 and Orenil1.1. RepeatMasker results for each TE family and other repeats for both assemblies and the differences between the two. (XLSX 60 kb)
Candidate assembly parameters. Parameters used for the 37 candidate assemblies, Canu and FALCON assembly parameters are separated. (XLSX 12 kb)
About this article
Cite this article
Conte, M.A., Gammerdinger, W.J., Bartie, K.L. et al. A high quality assembly of the Nile Tilapia (Oreochromis niloticus) genome reveals the structure of two sex determination regions. BMC Genomics 18, 341 (2017). https://doi.org/10.1186/s12864-017-3723-5