A high-quality de novo genome assembly based on nanopore sequencing of a wild-caught coconut rhinoceros beetle (Oryctes rhinoceros)

Background An optimal starting point for relating genome function to organismal biology is a high-quality nuclear genome assembly, and long-read sequencing is revolutionizing the production of this genomic resource in insects. Despite this, nuclear genome assemblies have been under-represented for agricultural insect pests, particularly from the order Coleoptera. Here we present a de novo genome assembly and structural annotation for the coconut rhinoceros beetle, Oryctes rhinoceros (Coleoptera: Scarabaeidae), based on Oxford Nanopore Technologies (ONT) long-read data generated from a wild-caught female, as well as the assembly process that also led to the recovery of the complete circular genome assemblies of the beetle’s mitochondrial genome and that of the biocontrol agent, Oryctes rhinoceros nudivirus (OrNV). As an invasive pest of palm trees, O. rhinoceros is undergoing an expansion in its range across the Pacific Islands, requiring new approaches to management that may include strategies facilitated by genome assembly and annotation. Results High-quality DNA isolated from an adult female was used to create four ONT libraries that were sequenced using four MinION flow cells, producing a total of 27.2 Gb of high-quality long-read sequences. We employed an iterative assembly process and polishing with one lane of high-accuracy Illumina reads, obtaining a final size of the assembly of 377.36 Mb that had high contiguity (fragment N50 length = 12 Mb) and accuracy, as evidenced by the exceptionally high completeness of the benchmarked set of conserved single-copy orthologous genes (BUSCO completeness = 99.1%). These quality metrics place our assembly ahead of the published Coleopteran genomes, including that of an insect model, the red flour beetle (Tribolium castaneum). The structural annotation of the nuclear genome assembly contained a highly-accurate set of 16,371 protein-coding genes, with only 2.8% missing BUSCOs, and the expected number of non-coding RNAs. The number and structure of paralogous genes in a gene family like Sigma GST is lower than in another scarab beetle (Onthophagus taurus), but higher than in the red flour beetle (Tribolium castaneum), which suggests expansion of this GST class in Scarabaeidae. The quality of our gene models was also confirmed with the correct placement of O. rhinoceros among other members of the rhinoceros beetles (subfamily Dynastinae) in a phylogeny based on the sequences of 95 protein-coding genes in 373 beetle species from all major lineages of Coleoptera. Finally, we provide a list of 30 candidate dsRNA targets whose orthologs have been experimentally validated as highly effective targets for RNAi-based control of several beetles. Conclusions The genomic resources produced in this study form a foundation for further functional genetic research and management programs that may inform the control and surveillance of O. rhinoceros populations, and we demonstrate the efficacy of de novo genome assembly using long-read ONT data from a single field-caught insect. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-022-08628-z.

Background Adult coconut rhinoceros beetles, Oryctes rhinoceros L. (Coleoptera: Scarabaeidae), feed by boring into the crown of coconut palms. This damages growing tissue and significantly reduces coconut yields and can lead to the death of trees. Native to southeast Asia, this pest was accidentally introduced into Samoa in 1909 [1], and it has since spread across the tropical Pacific, bringing a significant threat to the livelihoods of the peoples of Pacific island nations for whom the coconut palm ('the tree of life') is an important source of food, fibre and timber. Invasive populations of O. rhinoceros have been suppressed over the past 60 years through management approaches that included the release of a biocontrol agent, Oryctes rhinoceros nudivirus (OrNV) [2]. However, a highly damaging infestation by O. rhinoceros in Guam in 2007 was not controlled with OrNV, and the beetle's subsequent expansion to other Pacific Islands including Papua New Guinea, Hawaii, Solomon Islands, and most recently Vanuatu and New Caledonia [3][4][5][6], suggests potential changes in this biological system [7] that require new approaches to management, including the isolation and deployment of highly virulent OrNV strains for specific O. rhinoceros genotypes [8].
Genome sequencing has enabled better understanding of population outbreaks, invasion and adaptation mechanisms in insect pests [9,10]. Functional and comparative genomics studies are identifying new targets for control and the implementation of integrated pest management strategies [11]. Draft genome assembly is generally a good starting point for relating genome function to organismal biology, but the production of this genomic resource for agricultural pests has lagged behind that of some other insects [11,12]. A recent project aiming to tackle this lag is the Ag100Pest Initiative, led by the United States Department of Agriculture's Agricultural Research Service (USDA-ARS), that is set to produce reference quality genome assemblies for the top 100 arthropod agricultural pests in the USA, with nearly one third of species belonging to Coleoptera [13].
Draft genome assemblies are very useful for population genomic analyses, enabling the design of, for example, optimal protocols for reduced genome representation sequencing [14]. However, draft genome assemblies that are highly fragmented, incomplete or misassembled have limited use for functional genomic studies. Transcriptome assemblies are useful for studying functionally and sufficiently transcribed parts of the genome, but only complete and accurate genome assemblies provide information on non-transcribed regions (e.g. promoters, enhancers) that can have important influences on gene expression and, ultimately, economically-important phenotypes [13]. In addition, different types of nontranslated RNAs (e.g. microRNAs, lncRNAs) are often not detected in transcriptome studies but are included in complete and accurate genome assemblies. These can help us understand how insect pests interact and respond to their hosts, pathogens, the environment and they can reveal new targets for novel genetic control measures (e.g. RNAi [15], gene drives [16,17]).
Obtaining high-quality genome assemblies is often challenging in insects [12], particularly from short-read sequencing data (e.g. Illumina) for species with high levels of DNA polymorphism and repetitive genomic elements [18]. These issues are further compounded for insects of small physical size or for partial specimens, as they may require whole genome amplification or the pooling of several individuals to obtain enough DNA for library preparations. Different methods of whole genome amplification vary in their ability to preserve specific genetic variation and can be biased against regions with high GC-content, smaller and low-abundance DNA fragments [19]. They can also create chimeric fragments and amplify contaminating DNA that can be erroneously integrated into the target assembly. Pooling of individuals is preferably done with individuals from a line that has undergone inbreeding to reduce genetic variation, but many pest species cannot be colonised in the laboratory. Moreover, for those insects that can be lab-reared, intensive inbreeding procedures such as full-sib mating for tens of generations may not reduce heterozygosity in all parts of the genome (e.g. [20]). The pooling of wild-caught samples is particularly problematic given the possibility of combining cryptic species or biotypes, which would impact assembly quality and lead to spurious biological conclusions. When presented with a highly heterozygous genome or a pool of diverse haplotypes, the standard assembly process tends to report a heterozygous region as alternative contigs (instead of collapsing them into a single haplo-contig) and is unable to resolve multiple paths between homo-and heterozygotic regions, producing a highly fragmented assembly with an erroneously inflated total size [21]. Such assemblies cause problems in genome annotation and downstream analyses, giving fragmented gene models, wrong gene copy numbers, and broken synteny. They also preclude linkage mapping and genome-wide association studies.
The development of long-read sequencing technologies is revolutionizing the production of contiguous and complete insect genome assemblies [18], but their requirement for large quantities of input DNA have complicated their application to single-insect assemblies. However, new low-input protocols were recently demonstrated for Pacific Biosciences (PacBio) long-read sequencing, producing high-quality single-insect genome assemblies for the mosquito Anopheles coluzzii [22] and spotted lanternfly Lycorma delicatula [23]. A chromosome-level assembly was recently reported for a single outbred Drosophila melanogaster generated using a combination of long-read sequences from Oxford Nanopore Technologies (ONT), Illumina short-read sequences and Hi-C data [24]. However, the small size of this insect necessitated genome amplification to prepare the sequencing libraries, and the final assembly was ~ 20% smaller than the canonical reference genome for D. melanogaster [24].
Here, we present a high-quality de novo genome assembly based on ONT long-read data from a single wild-caught adult female of the coconut rhinoceros beetle (O. rhinoceros, NCBI:txid72550). The amount of DNA extracted from this large insect was sufficient to prepare multiple ONT libraries without genome amplification. Data from just one flow cell were enough to produce a high-quality draft assembly of the beetle's nuclear genome, and data from four MinION flow cells enabled the assembly that is among the most accurate and complete of the published Coleopteran genomes, as well as the assembly of its mitochondrial genome [25], and the genome of the biocontrol agent Oryctes rhinoceros nudivirus (OrNV) [26] that had infected the individual we analysed.

ONT library preparation and sequencing
We used a customized Solid-phase Reversible Immobilization (SPRI) bead-based protocol to extract high molecular weight (HMW) DNA from an O. rhinoceros female (see Materials and methods, Fig. 1A-C). Given the large size of the insect, we achieved high quantity (~ 10 μg) and quality HMW DNA (Supplemental Fig. 1), that we size-selected with the Circulomics XS kit (Fig. 1C), and prepared four standard ligation-based ONT libraries. Each library was sequenced on a MinION Flow Cell (Fig. 1D), yielding between 896,000 and 1.48 million raw reads. After basecalling with Guppy v.3.2.4, we obtained a total 29.4 Gb of sequence data with 89.8% passing the QC filtering (Phred score ⪆8). 26.4 Gb of high-quality data with the read length N50 of ~ 11.3 kb were used for downstream analyses (Supplemental Table 1), and the longest recorded read that passed the QC filtering was 143.6 kb. For the second round of analyses, we used the newer base-caller version, Guppy v4.2.2, which improved the yield of high-quality reads (a total of 29.5 Gb of data, 92.1% passing the QC filtering, Phred score ⪆8). These 27.2 Gb of high-quality reads had a length N50 of ~ 11.2 kb and were used for the main downstream analyses (Supplemental Table 1). The longest read that passed the QC filtering in this dataset version was ~ 148.4 kb.

Genome assembly and quality assessment
Because we expected the long-read data (LR) to contain some percentage of mitochondrial, bacterial and other contaminant DNA reads, we first ran the longread assembler Flye version 2.5 ( Fig. 2A) in metagenome mode that accommodates a highly non-uniform coverage of genomic fragments and is sensitive to under-represented sequences [27]. The initial draft assembly graph (S4-i-v1-g, Fig. 2B) consisted of 512 nodes with N50 length of 7.9 Mb and total assembly size of 370.4 Mb. This initial draft assembly graph was then screened for the mitochondrial genome sequence, expecting a circular node 11 kb to 22 kb in size (based on a typical mitogenome size in insects [28]), and a disproportionately high depth of coverage (given that there are tens/hundreds of copies of the mitochondrial genome per nuclear genome copy in each cell). We identified one node with such characteristics: edge_110 (Fig. 2D) was 21,039 bp in length and had a median coverage of 10,292X, showing the NCBI 'blastn' match with the mitochondrial genome assembly sequences (complete or partial) of beetles and other insects. Another circular node (edge_371) (Fig. 2C) with a high depth of coverage (1196X) was 126,204 bp in length, which we identified through the NCBI 'blastn' search as the Oryctes rhinoceros nudivirus (OrNV), a double-stranded DNA virus used as a biocontrol agent against O. rhinoceros [29]. Both nodes were removed from the draft assembly graph and analysed separately ( Fig. 2E-F), and their detailed characterization is described elsewhere [25,26].
Given the potential for ONT basecalling to introduce systemic indel errors in the homopolymer regions of the ONT-based assemblies [30], we used Pilon [31], BWA-MEM aligner [32] and more accurate Illumina Whole Genome Sequence data to remove small indels in the initial linearised draft assembly ( Fig. 2G-I). We used the previously generated Illumina short reads from a whole-genome sequencing library that we prepared using the NebNext Ultra DNA II Kit (New England Biolabs, USA) with DNA extracted from another O. rhinoceros female that was collected from the same geographic location. The short-fragment Illumina library ( Fig. 1F-G) contained ~ 39.4 Gb of 150 bp paired end read data. We point out that Illumina sequencing library intended for the polishing of an ONT-based assembly would ideally be prepared from the same individual that was used to generate the long-read data. This would allow not only the correction of indels but also the correction of SNPs in the assembly consensus sequences. For the experiments with small-bodied insects that yield limited amounts of DNA, we recommend using the Low Molecular Weight (LMW) DNA found in the supernatant of the ONT library preparation mix (LMW depletion step, Fig. 1E). It is also worth noting that the indel error correction with the Illumina short reads has limitations in repetitive regions of the assembly, where short reads cannot be accurately aligned. For polishing, we used 92.4% of the Illumina reads that aligned to the initial genome assembly (S4-i-v1). Of the remaining reads, 6.1% aligned to the mitogenome and 0.2% to the OrNV genome, leaving 1.3% of the short-reads unaligned. The resulting polished initial genome assembly version S4-i-v2 consisted of 427 fragments (6 scaffolds and 421 contigs, Fig. 2J), with the fragment N50 length of 9.8 Mb, the longest fragment of 18.9 Mb, and a total assembly size of 369.2 Mb (34.9% GC content).
A quantitative assessment of the initial assembly's accuracy and completeness was done through the benchmarking analysis of conserved genes, as implemented in BUSCO [33]. Using the BUSCO collection of 2124 genes from the endopterygota database (endoptery-gota_odb10), we found that the initial polished assembly (S4-i-v2) contained 97.9% complete genes, with 97.2% occurring as single copies and only 0.9% missing. In comparison, BUSCO analysis of the unpolished assembly version (S4-i-v1) recovered only 65.1% genes as complete and 19.6% as missing, revealing the substantial impact of the uncorrected indel errors on gene prediction and detection (Supplemental Table 2).
To further improve the assembly quality, we used the latest available version of the base-caller Guppy (v4.2.2) in high accuracy mode, and the latest available version of the long-read assembler Flye (v2.8.2) to generate multiple draft assemblies ( Fig. 3A-B) by increasing the minimum read overlap parameter for each assembly from 5 kb to 10 kb in increments of 500 bases. The Illumina shortreads were aligned against each draft assembly using BWA-MEM (Fig. 3C), and the resulting alignments were further utilised to polish indels within each draft assembly (Fig. 3D). This iterative process produced a collection of 11 polished draft assemblies (Fig. 3E), and each was assessed for contiguity (assembly-stats "https:// github. com/ sanger-patho gens/ assem bly-stats ") and completeness (BUSCO) (Fig. 3F) (Supplemental Table 2). The best overall assembly (S4-7k-1v2) was produced with a minimal read overlap of 7 kb, and this parameter value was used to repeat the assembly, polishing and assessment two additional times (producing S4-7k-2v2 and S4-7k-3v2). The best of these three versions (S4-7k-2v2) was selected for further processing. We then removed the OrNV and mitochondrial sequences from the assembly (published previously [25,26]), and this version were used to remove erroneous indels in homopolymers (E, F) to produce analysis-ready assemblies [25,26]. The remainder of the draft assembly (G) is linearized (H) and short reads [SR] are used to remove erroneous indels (I) in each scaffold, producing an initial polished nuclear genome assembly for Oryctes rhinoceros (J) (S4-7k-2v3) was further analysed with DIAMOND [34] and MEGAN [35,36] in order to identify potential contaminant fragments. All assembly sequences that were not classified within Arthropoda in this pipeline were additionally checked against the NCBI's online databases of nucleotide (nt/nr) and non-redundant protein sequences (nr) to identify the origin of a putative contaminant sequence (Fig. 3G). Given that none of the analysed sequences had a significant BLAST hit to a taxon other than Coleoptera, we did not consider them as contaminants and did not remove them from the final genome assembly (S4-74-2v3, Fig. 3H). This final assembly consisted of 1013 fragments (6 scaffolds and 1007 contigs), with the fragment length N50 of 10.7 Mb and the longest fragment (contig_6) of 32.7 Mb (Table 1, Supplemental  Table 2).
The size of our final O. rhinoceros nuclear genome assembly (S4-7k-2v3, GenBank assembly accession: GCA_020654165.1) was 377.4 Mb, which is very similar to the latest assembly for the congeneric beetle O. borbonicus (371.60 Mb in ungapped length, NCBI accession:   Table 4). Of note is that the original assembly for O. borbonicus, generated with the shortread Illumina technology, was first reported to be 518 Mb [37], but refinement with the 10X Genomics data led to a 28% reduction in size (removal of more than 140 Mb). The inflated size of the initial assembly was explained as a consequence of an incorrect haploidization of the assembly i.e., divergent haplotypes were assembled separately across many parts of the genome [38]. This exemplifies the difficulties of the assembly process based on the short-read sequencing of samples that have high genome-wide variability. Conversely, our O. rhinoceros assembly indicates that the correct haploidization is not problematic for long-read assemblers like Flye [27], particularly when the long-read data are generated from a single insect.

Comparison with other available nuclear genome assemblies in Coleoptera
A recent 'state of the field' overview of insect genome assemblies [18] reports that this biological resource has been significantly underrepresented in Coleoptera (i.e. few genome assemblies are produced relative to the species richness), but that long-read sequencing is revolutionizing the creation of high-quality assemblies across insect groups [18]. We analysed 39 representative nuclear genome assemblies in the Coleoptera (out of 41 accessed from NCBI's GenBank in October 2020) and found that one third were generated with data that included long-read sequences (nine assemblies with PacBio, four with ONT). For a total set of 39 analysed assemblies (Fig. 4), the mean fragment N50 was 6.9 Mb (median: 298.9 kb, SD: 19.9 Mb) and the mean BUSCO completeness was 88.4% (median: 92.4%, SD: 14.3%). These quality metrics are above the average for a set of 601 assemblies from 20 insect orders (N50: 1.1 Mb, BUSCO completeness: 87.5%, [18]).
Our O. rhinoceros assembly had the highest assembly accuracy and completeness among 39 benchmarked Coleopteran genomes, having only 0.5% missing BUS-COs (10 out of 2124 core genes) and 0.4% fragmented BUSCOs (9 out of 2124 core genes) (Fig. 4). A genome assembly from another member of the family Scarabaeidae, Onthophagus taurus, had the same number of missing BUSCOs but twice as many duplicated genes (2.7%), and a substantially lower assembly contiguity, with scaffold (fragment) L50 of 160 versus 12 in O. rhinoceros (Supplemental Table 4).
Inspection of other beetle genomes that have also been assembled using ONT data confirms that this technology facilitates production of assemblies with high contiguity and completeness (Supplemental Tables 3 and 4). However, two examples in true weevils, Rhynchophorus ferrugineus and Listronotus bonariensis (Curculionidae), reveal that low completeness and accuracy (evident as low number of single copy BUSCOs) exist in ONT-based assemblies that have both low and high contiguity (Supplemental Tables 3 and 4).
Until recently, ONT's requirement for large amount of input DNA precluded full utilization of this technology in small-bodied insects, but a recent example of the ONT-based assembly from a single Drosophila [39] indicates that genome assemblies of high completeness (96.9% complete BUSCOs), albeit partial genome length (85%), could be achieved with this technology even for very small insects. Considering that the limiting factor of ONT technology is the density of available nanopores per flow cell, the sequencing yield could be improved by having more shorter DNA fragments rather than fewer long ones when the amount of input DNA is small. We also note that the improvements in the later versions of both the ONT basecaller Guppy and the long-read assembler Flye are reflected in a substantially better draft assembly prior to any indel polishing (see Fig. 4: S4-7k-2v1 versus the equivalent non-polished assembly S4-i-v1 that was produced with the older software versions).

Structural annotation and quality assessment
To delineate protein-coding genes, we used the BRAKER pipeline (Fig. 3J) which enables an automated training of the gene prediction tools (GeneMark-EX and AUGUS-TUS) with the extrinsic evidence from the RNA-Seq experiments [40][41][42][43][44][45][46]. We used the publicly-available RNA-seq data that cover different life stages of O. rhinoceros, from early instar larva, late instar larva, pupa, and the adult stage (NCBI accession: PRJNA486419; [47]), which is expected to maximize the probability of capturing the sequences of the entire set of expressed genes in this organism. To check data quality from these RNA-seq samples, we first aligned the reads against our genome assembly with the splice-aware aligner HISAT2 [48], and used these alignments to produce a genomeguided transcriptome with Trinity [49]. The assembled transcriptome had a very high BUSCO completeness (97.5%), indicating that the source RNA-seq dataset provides an excellent training set for gene prediction. Along with these aligned RNA-seq reads, the BRAKER pipeline was supplied with the final genome assembly (S4-7k-2v3) that had the repetitive regions (transposons and simple repeats) soft-masked on 32.7% of the assembly sequences (using the repeat detector Red [50]). The gene prediction algorithm produced a set of 16,375 protein-coding genes with a total of 20,072 transcripts. Our results match the available data for other members of Coleoptera; for example, 16,538 genes were reported for the bull-headed dung beetle Onthophagus taurus (Scarabaeidae) [51,52], and the latest reference annotation for the red flour beetle Tribolium castaneum (Tenebrionidae) reports 16,593 genes with a total of 18,536 transcripts [53].
We also delineated the non-protein-coding RNAs, using tRNAscan-SE [54] and Infernal [55] with the Rfam database [56,57] (Fig. 3L). The annotation produced predictions for 18 tRNA-like pseudogenes, one selenocysteine tRNA gene, and 13 unknown isotypes. The number of tRNA genes predicted in O. rhinoceros (392) is highly congruent with another scarab beetle, O. taurus, that has 395 predicted tRNA genes [51,52]. Our annotation with all predicted protein-coding genes, as well as non-protein-coding genes (including rRNA, miRNA) and other features is provided as a gff3 file (Fig. 3M) (Additional file 1).
The benchmarking analysis (Fig. 3K) indicated that our structural annotation of protein-coding genes in O. rhinoceros assembly is of high quality, with only 2.8% of BUSCOs missing. Somewhat higher missingness obtained for the annotated gene set when compared to the assembly (2.8% vs. 0.5% missing BUSCOs) can be explained by the similar completeness of RNAseq dataset (1.8% missing BUSCOs) that was used by the annotation pipeline to guide gene predictions. It could also indicate that the annotation pipeline, which uses multiple sources of evidence, has generated slightly inferior gene models for a set of single-copy orthologs than the single-predictor approach that BUSCO takes when working directly on the assembly sequences [58]. Such differences have been reported, for example, in the BUSCO assessment of 15 Anopheles mosquito genomes and their annotated gene sets [58].

Predicted gene models recover the correct phylogenetic placement of O. rhinoceros
In addition to the BUSCO metrics, we assessed the quality of our predictions of single-copy orthologous genes through a phylogenetic analysis. We used the largest data source for beetle phylogenetics to date, generated by Zhang et al. [59], that includes partial sequences of 95 nuclear protein-coding genes from 373 beetle species and 10 outgroup taxa. Out of 95 genes used by Zhang et al. [59], 94 genes were identified in our structural annotation, and one gene that was missing from the annotation was identified in our assembly. The concatenated alignment supermatrix consisted of 24,542 amino acids (Additional file 2) and the phylogeny was estimated using the maximum likelihood method in RAxML-ng [60] (Additional file 3). The resulting phylogeny was well resolved, and O. rhinoceros was correctly grouped with two other members of the Dynastinae subfamily (branch support = 100%, Fig. 5A), further confirming the high quality of our predictions of single-copy genes.

Possible expansion of sigma GST genes in Scarabaeidae
We wanted to check if paralogous genes are also correctly predicted in our annotation. Sigma Glutathione-S-Transferase genes (Sigma GSTs) belong to an ancient gene family and one of six classes of cytosolic GSTs in Although seven sigma GST genes were previously reported in T. castaneum [37], only five were detected in its current genome annotation [53]. Sigma GST class is shaded in blue. Other GST classes (omega, delta, theta, epsilon, zeta) are found as divergent and highly-supported branches Filipović et al. BMC Genomics (2022) 23:426 insects, that were previously reported to have undergone Oryctes-specific expansion [37]. Meyer and colleagues found 12 Sigma GST paralogs in their O. borbonicus assembly, while the genomes of four other insects they analysed, including two beetles (Tribolium castaneum and Dendroctonus ponderosae), did not have more than seven paralogues in this GST class [37]. Based on this pattern, they hypothesized that the expansion of Sigma GST genes occurred specifically in the beetle lineage containing Oryctes species. Assuming that initial O. borbonicus assembly contained divergent haplotypes that were not correctly haploidized [38], the likelihood of an erroneous inference of gene duplications in this taxon is high. Our O. rhinoceros assembly and annotation recovered nine Sigma GST genes grouped on two contigs (Fig. 5B, Additional files 4 and 5). We then analysed genome annotations of two other beetles whose assemblies also showed very high BUSCO completeness (> 98%), O. taurus and T. castaneum, as well as the annotation of Drosophila melanogaster that is considered a gold standard for this genomic resource in insects. Sigma GST is found in only one copy in D. melanogaster, while five paralogs were detected in T. castaneum and 12 in O. taurus (Fig. 5B), Based on this limited taxon sampling, there is an indication that sigma GST family expansion occurred in the Scarabaidae lineage, as both O. taurus and O. rhinoceros (Scarabaeidae) contain more sigma GST genes than T. castaneum (Tenebrionidae), and these sigma duplications might have an important role in eliminating the by-products of oxidative stress [61]. However, more genome assemblies and annotations of very high accuracy and completeness are needed across Coleoptera to be able to confidently infer evolutionary expansion of gene families in this insect order.

Application of genomic resources for O. rhinoceros management RNAi target discovery
RNA interference (RNAi) is a promising new approach for insect pest control, particularly for beetles that exhibit a robust environmental RNAi response [62,63]. RNAi is a highly-specific gene-silencing mechanism in which double-stranded RNA (dsRNA) directs cleavage of complementary endogenous mRNA. When targeting essential insect genes, RNAi causes rapid mortality and could be developed into a control tool that is integrated with other pest management tactics. Through the mining of our O. rhinoceros assembly and annotation, we identified orthologs of all 30 genes (Supplemental Table 5) that were experimentally validated as effective RNAi targets in T. castaneum, ten of which were also validated in Diabrotica v. virgifera and four in Brassicogethes aeneus [64]. The strongest candidates for initial testing in O. rhinoceros are orthologs of D. melanogaster's Prp19, Spt5 and RPII-215 (Supplemental Table 5), as they exhibited > 79% mortality upon injection or feeding with dsRNA in at least two of the three tested beetles [65].

Investigating interactions with a biocontrol agent
The assembled and annotated genome of O. rhinoceros provides an excellent opportunity to get genomewide insight into the interaction between this insect pest and its control agent, Oryctes rhinoceros nudivirus (OrNV). For example, differences in the pattern on genome-wide expression can be traced between insects that have been experimentally infected with OrNV and the control group (non-infected) via transcriptome analysis. This approach for identifying putative infection-responsive genes has been used to study the interaction between one of the most important crop pests, the diamond-back moth Plutella xylostella, and the fungal insect pathogens, Beauveria bassiana and Metarhizium anisopliae, that have been widely used as insecticides [66]. For this type of a study, having access to a high-quality genome annotation is very important, as it has been shown that quality of a genome annotation strongly influences the inference of gene expression [67]. Identifying key O. rhinoceros genes that respond to OrNV infection could narrow a search for the causal genomic changes underlying the suspected attenuation of OrNV pathogenicity against this beetle. Namely, the resurgence and spread of O. rhinoceros over the last decade is hypothesized to be driven by the emergence of the virus-tolerant beetle populations and/ or less virulent OrNV strains. The molecular basis for this suspected change in the beetle-OrNV interaction could reside in the regulation of small interfering RNAs (siRNAs) that are a known part of the insect immune response to viral infections [8,68]. Our annotation contains the predictions for various non-protein-coding RNAs, laying a good foundation for further in-depth characterization of these regulatory genomic elements in O. rhinoceros.

Conclusions
We provide a highly contiguous and accurate nuclear genome assembly and structural annotation for an important invasive pest of palm trees, the scarab beetle O. rhinoceros. The assembly is based on the ONT sequencing of a single wild female, further demonstrating the utility of long-reads (and ONT sequencing in particular) in generating high-quality de novo genome assemblies from field specimens. Along with our structural annotation, this genomic resource opens up avenues for further biological discoveries aiming to improve the management of this pest, from the functional studies of interactions with the existing biocontrol agents, to the development of new control solutions via RNAi tools.

Field collection and DNA isolation
Oryctes rhinoceros adults were collected from a pheromone trap (Oryctalure, P046-Lure, ChemTica Internacional, S. A., Heredia Costa Rica) on Guadalcanal, Solomon Islands in January 2019 and preserved in 95% ethanol. High-molecular weight (HMW) DNA was extracted from a single female using a customized paramagnetic (SPRI) bead-based protocol. Specifically, we dissected pieces of tissue from four legs and the thorax, avoiding the abdomen to minimize the proportion of gut microbiota in the total DNA extract (Fig. 1A). We incubated approximately 50 mm 3 of tissue in each of the eight 1.7 mL eppendorf tubes with 360 μL ATL buffer, 40 μL of proteinase K (Qiagen Blood and Tissue DNA extraction kit) for 3 h at room temperature, while rotating end-overend at 1 rpm. Four hundred microliters of AL buffer were added to each reaction and incubated for 10 min at room temperature, followed by the addition of 8 μL of RNase A and incubation for 5 minutes at room temperature. To remove the tissue debris, each tube was spun down for 1 min at 16,000 rcf and 600 μL of homogenate was transferred to a fresh tube. Six hundred microliters of the SPRI bead solution were added to each homogenate and incubated for 30 min while rotating at end-over-end at 1 rpm. After two washes with 75% ethanol, DNA in each tube was eluted in 50 μL of TE buffer. All eight elutions were combined and DNA quality was assessed on the 4200 Tapestation system (Agilent) and with the Qubit broadrange DNA kit (Fig. 1B). Finally, we used the Circulomics Short Read Eliminator XS kit to enrich the DNA elution with fragments longer than 10 kb (i.e. High Molecular Weight, HMW, DNA, Fig. 1C).

ONT library preparation and sequencing
One microgram of the size-selected HMW DNA was used as the starting material for the preparation of each ONT library, following the manufacturer's guidelines for the Ligation Sequencing Kit SQK-LSK109 (Oxford Nanopore Technologies, Cambridge UK). Four libraries were sequenced on four R9.4.1 flow cells using the Min-ION sequencing device and the ONT MinKNOW Software (Oxford Nanopore Technologies, Cambridge UK) (Fig. 1C).

Genome assembly
High accuracy base-calling from the raw ONT data was computed with Guppy v3.2.4 (for the initial assembly) and Guppy v4.2.2 (for the final assembly). The initial genome assembly (S4-i-v1) was produced with Flye version 2.5 [27] using the following input parameters: the approximate genome size (--genome-size) of 430 Mb, based on the size of an initial genome assembly in a related species O. borbonicus [37] two iterations of polishing (--iterations 2), aimed at correcting a small number of extra errors based on the improvements on how reads align to the corrected assembly; a minimum overlap between two reads (--min-overlap 5000) of 5000 bp; and a metagenome mode (--meta) to allow for the recovery of mitochondrial, symbiont, pathogen and other "contaminant" genomes, given that this mode is sensitive to highly variable coverage and under-represented sequences [27]. Flye version 2.8.2 was used during the iterative process for the final genome assembly (S4-7k-2v), with the parameter '--min-overlap' ranging from 5000 bp to 10,000 bp in 500 bp increments while keeping other parameters (--genome-size, --iterations, --meta) unchanged.

Identification of pathogens, symbionts, contaminants
Screening of the circular nodes with a disproportionately high coverage in the initial genome assembly graph identified the OrNV and mitogenome, and they were removed from further analyses. A linearized set of the remaining putative genome assembly sequences (contigs and scaffolds) were locally compared against the NCBI non-redundant protein (nr) database using DIAMOND [34] version 0.9.24 in 'blastx' mode. The NCBI database was downloaded from ftp. ncbi. nih. gov/ blast/ db/ FASTA/. The results obtained with DIAMOND were analysed with the metagenome analyser tool MEGAN [36]. Any sequence not classified within Arthropoda was also checked against the NCBI's online database of nucleotide (nt/nr) and non-redundant protein sequences (nr) to identify the origin of a suspected contaminant sequence.

Polishing of the genome assembly with Illumina reads
Indel errors in the homopolymer regions represent inherent basecalling errors of the ONT platform [30]. To remove putative indel errors in the draft assembly, we used the genome polishing program Pilon version 1.23 [31] that was supplied with the spliced-aware alignments of the Illumina reads from one whole-genome sequencing library. DNA for this Illumina library originates from a female beetle collected in the same location as the female used for the ONT sequencing. Because Illumina and ONT data did not come from the same individual, we only performed indel polishing. The Illumina sequences were produced on a HiSeq X10 platform by Novogene (Beijing, China) using the 150 bp paired-end chemistry, and were processed in Trimmomatic [69] to remove Illumina adapters, and trim and filter each read based on the minimum phred score of 20.

Evaluation of genome assemblies
The completeness of the initial genome assembly (S4-i-v3) was evaluated using: (a) alignment of DNA-seq data, (b) alignment of RNA-seq data, and (c) the recovery of the benchmarking universal single copy orthologs (BUS-COs) [33]. We used the BWA-MEM aligner with default settings and recorded the percentage of mapped Illumina reads from the whole-genome sequencing dataset (Illumina DNA library described above) and four independently-generated RNA-seq datasets from the beetle's four life stages [47] (NCBI SRA Accession: PRJNA486419) that were combined prior to alignment with the beetle genome assembly. The number of recovered universal single-copy orthologs (SCOs) was obtained using the "genome autolineage" mode in BUSCO version 4.0.6, that first searched the databases 'eukaryota_odb10' (7 species, 255 SCOs), and 'endopterygota_odb10' (56 species, 2124 SCOs). To perform the comparative benchmarking, the same BUSCO analysis was done for 39 representative assemblies in the Coleoptera out of 41 that were available in the NCBI's GenBank in October 2020 (Supplemental Table 4). Two Coleoptera genomes (for Protaetia brevitarsis GCA_004143645.1, and Alaus oculatus GCA_009852465.1) were excluded due to a persistent BUSCO analysis failure with their assembly files.

Structural annotation
To perform the structural annotation of the final genome assembly, we used the independently-generated RNA-seq datasets from the beetle's four different life stages (NCBI SRA Accession: PRJNA486419) [47]. The RNA-seq reads were pruned of the Illumina adapters and aligned against our genome assembly with the splice-aware aligner HISAT2 (Fig. 3I). The quality and completeness of these RNA-seq data were assessed through the transcriptome assembly in Trinity version 2.10.0 [49,70], using the default settings in two modes: de novo and genomeguided assembly. To avoid incorporating the extraneous RNA sequences into the de novo transcriptome assembly, we used only those reads that were mapped with HISAT2 [48] to our S4-i-v3 genome assembly. The completeness of each transcriptome assembly was evaluated with BUSCO, using the 'auto-lineage' mode. The final genome assembly (S4-7k-2v3) and the splice-aware alignments (from HISAT2) were used for the genome-guided transcriptome assembly using the BRAKER pipeline version 2.1.4 (https:// github. com/ Gaius-Augus tus/ BRAKER/ relea ses/ tag/ v2.1.4). Annotation of the non-coding RNA genes was done with tRNAscan-SE version 2.0.6 [54,71] and Infernal version 1.1.3 [55] against the Rfam database v14.2 [56,57] that was available on Sep 72,020 (ftp:// ftp. ebi. ac. uk/ pub/ datab ases/ Rfam/ 14.2/).

Phylogenetic analysis using 95 genes across all major lineages of Coleoptera
The reported nucleotide alignment supermatrix with the sequences from 95 genes in 373 Coleoptera and 10 outgroup taxa (from Zhang et al. [59]) was partitioned into 95 separate alignments, each of which was then translated into amino acid sequences. Blastp was used to find their orthologs in O. rhinoceros annotation, identifying 94 genes. One remaining ortholog was found in O. rhinoceros assembly using blastx. Each of the 95 gene transcript sequences in O. rhinoceros was then aligned against the original amino acid alignment (from Zhang et al) using CLUSTAL Omega and all 95 separate alignments were then concatenated into the resulting alignment matrix with 24,542 amino acids. Maximum likelihood tree was inferred using RAxML-ng version 1.0.2 [60] with parameters: --model Blosum62 --opt-branches on --opt-model on --tree pars{10}, rand{10} --all --bs-trees autoMRE{200} --bs-cutoff 0.03 on the unpartitioned alignment (given that Zhang et al. [59] report high congruency between partitioned and non-partitioned datasets). The final nexus tree file is available in the Supplementary Data (Additional file 4). The tree visualization was done in FigTree [72].

Analysis of the sigma GST gene family
Genes from the Sigma Glutathione-S-Transferase family in O. rhinoceros were identified using blastp match (E-value << e-5) between the protein translated coding DNA sequences (CDS) of Drosophila's Gluthatione-Stransferase S1 gene (GstS1) and all of the protein translated CDS derived from our annotation. The protein sequences of the identified genes were then searched in the O. rhinoceros assembly using blastn, in case some sigma GST genes are missing from our annotation. We also extracted all CDS from genes that had a GST term in the annotation of two Coleoptera (O. taurus (Scarabaidae) and T. castaneum (Tenebrionidae) with the highest BUSCO score for genome assembly and annotation) and D. melanogaster, which cover other classes of GSTs (omega, delta, epsilon, theta, zeta). Nucleotide sequences identified and extracted across all four taxa (total of 137 sequences including D. melanogaster sina gene (sina) as an outgroup) were aligned using Clustal Omega [73], and