The Rhododendron Plant Genome Database (RPGD): a comprehensive online omics database for Rhododendron

Background The genus Rhododendron L. has been widely cultivated for hundreds of years around the world. Members of this genus are known for great ornamental and medicinal value. Owing to advances in sequencing technology, genomes and transcriptomes of members of the Rhododendron genus have been sequenced and published by various laboratories. With increasing amounts of omics data available, a centralized platform is necessary for effective storage, analysis, and integration of these large-scale datasets to ensure consistency, independence, and maintainability. Results Here, we report our development of the Rhododendron Plant Genome Database (RPGD; http://bioinfor.kib.ac.cn/RPGD/), which represents the first comprehensive database of Rhododendron genomics information. It includes large amounts of omics data, including genome sequence assemblies for R. delavayi, R. williamsianum, and R. simsii, gene expression profiles derived from public RNA-Seq data, functional annotations, gene families, transcription factor identification, gene homology, simple sequence repeats, and chloroplast genome. Additionally, many useful tools, including BLAST, JBrowse, Orthologous Groups, Genome Synteny Browser, Flanking Sequence Finder, Expression Heatmap, and Batch Download were integrated into the platform. Conclusions RPGD is designed to be a comprehensive and helpful platform for all Rhododendron researchers. Believe that RPGD will be an indispensable hub for Rhododendron studies.

traditional ornamental flowers [8]. The breeding history began with gardening enthusiasts in Western countries in the late eighteenth century [9]. Currently, there are over 28,000 cultivars of Rhododendron [10], which are widely cultivated in many regions such as Asia, America, and Europe [6]. Most wild rhododendrons are found in regions with temperate climates, high rainfall, humid atmosphere, and organic acid soils with low nutrient composition [11]. Furthermore, most varieties are derived through crossbreeding by gardening enthusiasts according to their preference for ornamental traits. In general, breeding goals have previously been focused mostly on ornamental characteristics rather than adaptability and resistance, resulting in a disconnect between existing varieties and market demands. Therefore, a challenge for Rhododendron breeding is the development of varieties capable of adapting to environments with cold winters, hot summers, lower rainfall and humidity, and less optimal soils [12].
Additionally, the genus Rhododendron has a long history in traditional medicine [7]. Phytochemists have demonstrated interest in Rhododendron species due to their abundance of secondary metabolites [13]. Currently, approximately 200 compounds, mostly flavonoids and diterpenoids, have been isolated from Rhododendron. Some of the isolates have demonstrated intriguing bioactivity [14,15]. For example, diterpenoids isolated from the flowers, roots, and fruits of R. molle exhibit significant anticancer, antiviral, antinociceptive, immunomodulatory, and sodium channel antagonistic activities.
With the rapid development of sequencing and genomic editing technology, molecular design breeding has become a more efficient and accurate plant breeding method [16]. Elucidation of the genetic mechanisms associated with ornamental traits (flower color, flower shape, etc.), adaptability, resistance, secondary metabolism, etc. will be a helpful and necessary foundation for more practical Rhododendron breeding. A great deal of omics data concerning Rhododendron have been accumulated to date and several rhododendron genomes have been sequenced. The R. delavayi genome sequence was released in 2017 [17], R. williamsianum in 2019 [18], and R. simsii in 2020 [19]. In addition, relevant transcriptomic data have also been published in recent years [20][21][22][23][24]. Progress in the development of highthroughput sequencing technology has greatly accelerated studies on Rhododendron [17][18][19][20][21][22][23][24]. These large genomic data sets provide a new perspective for understanding biological traits such as ornamentation, adaptability, resistance, and secondary metabolism for breeders and phytochemists alike.
Rhododendron omics data sets are currently distributed in public databases that are easily accessible [25,26]. However, processing these data is a considerable challenge for research groups with limited bioinformatics experience. To address this problem, we have constructed a comprehensive database for data storage, categorization, online analysis, and visualization of Rhododendron omics data sets.
Here, we present the Rhododendron Plant Genome Database (RPGD; http://bioinfor.kib.ac.cn/RPGD/), a data center for Rhododendron functional genomics researchers. The database integrates the three released genome sequences, expression profiles, functional annotations, gene family ontologies, simple sequence repeats, chloroplast genome assemblies, and gene homology information. We have also incorporated bioinformatics tools such as BLAST, JBrowse, Flanking Sequence Finder, Genome Synteny Browser, Ortholog Gene Finder, Expression Heatmap, and Batch Download into the user interface. The interface is designed to be simple and user-friendly. We suggest that RPGD will be of great convenience as a "one-stop shop" to a wide range of Rhododendron researchers.

Transcriptomic data
All publicly available RNA-Seq datasets in the NCBI Sequence Read Archive (SRA) database, including data from two projects and 19 samples, were obtained. One transcriptomics project was related to drought stress (4 samples) while the other was related to the flower bud in different dormancy statuses (15 samples) [23] ( Table 1). Both projects focused on R. delavayi.
We processed and analyzed the RNA-Seq datasets by a standard pipeline method. First, we used the SRA Toolkit [27] to convert the data format to FASTQ and lowquality reads were removed from raw reads by Trimmomatic [28]. We then employed Tophat2 [29] to map all clean reads onto the reference genome (R. delavayi) with default parameters, which were assembled using Cufflinks (version 2.2.1) using the reference genome as a guide [30]. Combined transcriptome assemblies were generated using Cuffmerge. Based on the alignments, the read counts of each gene were calculated and normalized to fragments per kilobase of transcript per million mapped fragments (FPKM) values in Cuffdiff. Mean and standard errors of the FPKM values were derived for the biological replicates.

Gene model and function annotation
A total of 89,496 protein-coding genes were collected from the downloaded data mentioned in the genomic data, including 32,938 from R. delavayi, 23,559 from R. williamsianum, and 32,999 from R. simsii. The protocol for annotating protein-coding genes is described as follows. Firstly, protein-coding genes were annotated using two software packages, eggNOG-mapper [31,32] and InterProScan with default parameters [33]. Then, the results from the two different tools were combined and redundant annotations were removed to obtain complete and precise GO annotations using homemade scripts. The protein sequences were aligned against the NCBI non-redundant (nr), UniProt (Swiss-Prot and TrEMBL), and Arabidopsis protein (TAIR) databases using the BLASTP command of DIAMOND with an E-value cutoff of 1e − 5 [34]. The BLASTP results against the UniProt and TAIR databases were then fed to the AHRD program (https://github.com/groupschoof/AHRD) to obtain concise, precise, and informative gene function descriptions. All BLASTP results are shown on the detailed gene page. All of these protein sequences were further compared against the InterPro database using InterProScan to identify functional domains [33]. As a result, the genes from R. delavayi were functionally annotated to 805,276 on GO database and 77,221 on InterPro. The R. williamsianum gene were functionally annotated to 687,600 on GO and 60,834 on InterPro. The R. simsii genes were functionally annotated to 785, 704 on GO and 81,654 on InterPro (Table 1).

Transcription factors and transcriptional regulators
The iTAK package was used to identify transcription factors (TFs) and transcriptional regulators (TRs) in the three Rhododendron genomes and all candidates were classified into different gene families using the default parameters [35]. Thus, R. delavayi contains 1662 TFs and 442 TRs, R. williamsianum contains 1261 TFs and 361 TRs, and R. simsii contains 1740 TFs and 416 TRs (Table 1).

Orthologous/paralogs group
OrthoFinder [36,37] was employed to identify orthologous and paralogous genes by using default parameters among R. delavayi, R. williamsianum, R. simsii, Actinidia chinensis [38], Camellia sinensis [39] and Arabidopsis thaliana [40]. In total, 18,048 orthologous groups were identified. To ensure that the inference of orthologous genes was sufficiently accurate, we extracted 985 groups of single-copy orthologs to construct the "Orthologous Groups" module (Table 1). We also used Ortho-Finder to search for pairwise homologous genes between the three Rhododendron genomes and A. thaliana respectively [36,37]. We considered the genes of each orthologous group as belonging to one gene family and mapped gene family information from A. thaliana to R. delavayi (4168 gene families), R. williamsianum (3546 gene families), and R. simsii (3742 gene families).

Simple sequence repeats
Simple sequence repeats (SSRs) were identified in R. delavayi, R. williamsianum and R. simsii by MISA with default parameters; the total number were 361,268, 230, 013, and 358,705, respectively [41] (Table 1). We also used Primer3 with default parameters to design primers for SSRs and the primers can be displayed on the SSR detail page [42].

Chloroplast genomes
We also collected full-length chloroplast genomes of R. delavayi and R. pulchrum from the NCBI database [43][44][45]. RPGD hosts two complete chloroplast genome assemblies of R. delavayi. One of them is 193,798 bp in length, and 123 genes were annotated, including 80 protein-coding genes, 35 tRNA genes, and 8 rRNA genes [43]. The other is 202,169 bp in length, a total of 137 genes were found, including 88 protein-coding genes, 41 tRNAs, and 8 rRNAs [44]. The chloroplast genome of R. pulchrum is 136,249 bp in length, and it contains 73 genes, comprising 42 protein-coding genes, 29 tRNA genes, and 2 rRNA genes [45] (Table 1).
Syntenic relationships among R. delavayi, R. williamsianum and R. simsii We identified syntenic blocks and homologous gene pairs in the three Rhododendron genomes. Protein sequences were first aligned against each other (pairwise comparisons) using BLASTP with an E-value cutoff of 1e − 5 [46]. Based on the BLASTP results and gene positions, syntenic blocks were determined using MCScanX with default parameters [47]. A total of 2913 syntenic blocks and 55,590 homologous genes were identified (Table 1) with detail presented in the "Tools/Genome Synteny" module. Users should note that the current assembly of draft genomes and annotations might affect the results of syntenic relationships, and we will update the data when new versions become available.

Implementation
RPGD was constructed using the LAMP framework, including Apache2 (a free and open-source cross-platform web server software; https://www.apache.org/), MariaDB (a relational database management system; https:// mariadb.org/), and PHP (a popular general-purpose f Genomic synteny blocks. g Homologous genes information in 6 organisms and BLASTP results against the nr-NCBI, UniProt and TAIR databases. h Gene/mRNA/CDS/protein sequences scripting language; https://www.php.net/). All data were stored on a Linux platform with the MariaDB database to facilitate efficient management, search, and display. The web pages were built using HTML5, CSS3, JavaScript, and Bootstrap3 (a free and open-source CSS framework directed at responsive, mobile-first front-end web development; https://getbootstrap.com/docs/3.3/). The Bootstraptable (an extended Bootstrap table with radio, checkbox, sort, pagination, extensions, and other added features; https://bootstrap-table.com/) and jQuery (a JavaScript library designed to simplify HTML DOM tree traversal and manipulation; http://jquery.com, version 3.4.1) were used to display the query results dynamically. Presentation of the diagram was made by Echart (a free, powerful charting and visualization library offering a way of easily adding intuitive, interactive, and highly customizable charts; https:// echarts.apache.org/zh/index.html).

Browsing RPGD
Users can browse all data in RPGD easily on the "Browse" page, including genome statistics, gene models, gene function annotations, SSRs, genome syntenic blocks, gene expression profiles, gene families and transcription factor information from R. delavayi, R. williamsianum and R. simsii, respectively. The information described above is presented in tabular form on the web page using a Bootstrap-table plug. Additionally, a detailed information page for a specific gene can be accessed by clicking the gene ID hyperlink. Information about each gene is displayed on a detailed page, including the gene summary, exons, gene structure (in JBrowse), GO, family, expression, homology, and sequence information.

Searching RPGD
A series of search tools are presented on the navigation menu "Search", such as "Gene", "Genome", "Gene Ontology", "Gene Family", "Gene Expression", "Transcription Factor", "Chloroplast Genome" and "SSR" to help users more easily find data of interest to them. (i). "Search Gene": RPGD provides four different ways to search genes including gene ID, AHRD descriptions, InterPro, GO accession, and GO term. The response is a dynamic table that contains all genes associated with the entered search terms, and the list of those genes can be downloaded as a TXT file for further analysis. Additionally, the details of the genes can be viewed by clicking the gene ID hyperlink. (ii). "Search Genome": users can use scaffold/chromosome ID to search the scaffold/ chromosome information. The results are divided into a list, a table, and a chromosome viewer. The list shows basic information about the chromosome, including the species, chromosome ID, and the length of the chromosome. The table displays information about all genes on the chromosome. The chromosome viewer is embedded in JBrowse to display the chromosome profile. (iii). "Search GO": users can use gene ID, GO accession, and GO term to query GO information of a gene. The responses are a set of genes annotated with the queried functions. Similarly, users can download the list of genes and click the gene ID hyperlink to review gene details. (iv). "Search Family": users can find genes with gene family names specified by the user. A list of genes related to this gene family are generated as the response. Users can also download the list of genes and click the gene ID hyperlink to view gene details. (v). "Search Gene Expression": users can input gene ID of interest to search their expression patterns based on currently provided transcriptomics results. The output is a line chart that shows graphically the expression level and can be downloaded locally for further analysis. (vi). "Search Transcription Factor": users can search for transcription factor genes by clicking transcription factor names. The responses are a list of genes annotated as transcription factors. Users can also download the list of genes and click the gene ID hyperlink to view gene details. (vii). "Search Chloroplast Genome": users can use the gene or product name to find the information from chloroplast genes. The response is a list of detailed information about the entered keywords. In addition, the list returned contains a number of hyperlinks which allow user to view the details about that chloroplast gene at NCBI. (viii). "Search SSR": RPGD provides SSR location, SSR type (monomer to hexamer) and SSR motif to query the SSR detailed information, including SSR ID, type, motif, size, and location. Users can click the SSR ID hyperlink to view SSR primer information. Examples are displayed below each search field that can be clicked to autofill the search keywords on every search page.

BLAST
BLAST is a sequence similarity searching program frequently used for bioinformatics queries [46]. ViroBLAST [48], a useful and user-friendly tool for online data analysis, was integrated into RPGD (Fig. 2a). Users can input their sequence of interest or upload their sequence files to perform BLASTN, BLASTP, BLASTX, tBLASTN, and tBLASTX against a whole genome, CDS, or peptide library.

JBrowse
A key mission of RPGD is to help users browse genomic data in detail. Therefore, JBrowse [49], a fast, scalable, and widely used genome browser built completely with JavaScript and HTML5, was embedded in RPGD to visualize genomic information (Fig. 2b). In RPGD, JBrowse hosts different tracks, including genome sequence, gene models, SSRs, and transcriptome-aligned BAM files of R. delavayi, R. williamsianum, and R. simsii, respectively. In addition, we will integrate other data styles, such as single-nucleotide polymorphisms (SNPs), as they become available.

Flanking sequence finder
The flanking sequences of genes often contain a wealth of information including regulatory elements and promoters. To aid in research of flanking sequences, we utilized gene annotations and genome data to develop a useful tool -"Flanking Sequence Finder". Researchers can find and download flanking sequences by inputting gene ID and specifying the length of the desired flanking sequences.

Genome syntenic browser
To view genome syntenic blocks and homologous gene pairs between the three Rhododendron genomes, we constructed the "Genome Syntenic Browser" module using AJAX, JavaScript and Echart. Users can browse the genome syntenic blocks or search for a specific block they want to query. Users can retrieve syntenic blocks by selecting a chromosome and subject genome together.
This module returns an image to displaying all syntenic blocks for every paired query and subject genome (Fig. 3a) and a full list of the syntenic blocks. For each syntenic block, users can jump to a new page by clicking on the block ID hyperlink which contains an image to display the homologous gene pairs (Fig. 3b). The full list of genes is also provided with links to the "data hub" interface to detail the gene information for each gene (Fig. 1).

Orthologous groups
A common task in routine bioinformatics analysis is the identification of homologous genes. Users can input gene IDs to find orthologous groups in R. delavayi, R. williamsianum, R. simsii, as well as A. chinensis, C. sinensis, and A. thaliana. The details of the homologous genes are be presented in a table, which also provides links to "data hub" page for each gene (Fig. 1).

Expression heatmap
RPGD not only stores gene expression profiles derived from RNA-Seq datasets but also provides an "Expression Heatmap" module ( Fig. 2c). "Expression Heatmap" can be used to retrieve the gene expression patterns of a group of genes from different samples. The output is a heatmap that graphically shows expression levels and can be downloaded locally for further analysis.

GO and KEGG enrichment analysis
Functional enrichment analysis is a powerful method for mining gene data, providing further insight into what biological processes these genes may be involved in. To help users to capture biological information of genes, we construct the GO and KEGG enrichment analysis tools base on the functional annotation mentioned above and clusterProfiler R package [50]. Users can input a list of interested genes to perform the enrichment analysis (Fig. 2d). The results returned the significantly enriched functional categories.

Download and batch download
All the data in RPGD were available for users to download, including genome assembly (FASTA), gene prediction (GFF3), gene function annotation (TXT), complete chloroplast genome (FASTA), gene family data (CSV), orthologous groups data (CSV), simple sequence repeat data (TXT), gene expression data (CSV), and other related data can also be downloaded in this module.
"Batch Download" is provided for users to export custom datasets or bulk download datasets from RPGD. Users can download multiple types of sequences (gene, CDS, PEP, flanking sequence and gene expression profile) by inputting a list of genes.

Conclusions
RPGD is dedicated to providing a comprehensive database of Rhododendron omics data. The current implementation of RPGD integrates important data including genome sequence assemblies, gene expression profiles, functional annotations, gene families, transcription factors, homologous genes, simple sequence repeats, and chloroplast genome assemblies. It also provides a series of tools for online data analysis and visualization. The integration of these data and tools makes RPGD a valuable database. We intend to continue updating the datasets when new data are released. For instance, our team will release a novel Rhododendron genome (R. irroratum) and its phenotypic datasets, including breeds, genotypes, and phenotypes in the near future. Additionally, we will continue to develop and integrate tools for functional, evolutionary, and network analysis. We hope that researchers will take advantage of these resources and also provide comments and suggestions for improving RPGD. Believe that RPGD will be in indispensable hub for Rhododendron studies.  3 Genome synteny viewer. a Syntenic blocks displayed in a circus plot. The darkslategray circle represents the query chromosome. Besides, the same color represents the same chromosome, and different circles of the same color represent different syntenic blocks located on the same chromosome. Additionally, the lines between darkslategray and other colors represent syntenic blocks identified between the two genomes. By the way, all blocks of this chromosome can be made to disappear from the image by clicking on the color that represents that chromosome. b Detailed view of a specific synteny block. The two gray lines represent the chromosomes of different species, and the red areas represent homologous gene pairs