Bolbase: a comprehensive genomics database for Brassica oleracea

Background: Brassica oleracea is a morphologically diverse species in the family Brassicaceae and contains a group of nutrition-rich vegetable crops, including common heading cabbage, cauliflower, broccoli, kohlrabi, kale, and Brussels sprouts. This diversity, along with its phylogenetic membership in a group of three diploid and three tetraploid species and the recent availability of genome sequences within Brassica, provides an unprecedented opportunity to study intra- and inter-species divergence and evolution in this species and its close relatives.

Description: We have developed a comprehensive database, Bolbase, which provides access to the B. oleracea genome data and comparative genomics information. The whole genome of B. oleracea is available, including nine fully assembled chromosomes and 1,848 scaffolds, with 45,758 predicted genes, 13,382 transposable elements, and 3,581 non-coding RNAs. Comparative genomics information is available, including syntenic regions among B. oleracea, Brassica rapa and Arabidopsis thaliana, synonymous (Ks) and non-synonymous (Ka) substitution rates between orthologous gene pairs, gene families or clusters, and differences in quantity, category, and distribution of transposable elements on chromosomes. Bolbase provides useful search and data mining tools, including a keyword search, a local BLAST server, and a customized GBrowse tool, which can be used to extract annotations of genome components, identify similar sequences and visualize syntenic regions among species. Users can download all genomic data and explore comparative genomics in a highly visual setting.

Conclusions: Bolbase is the first resource platform for the B. oleracea genome and for genomic comparisons with its relatives, and thus it will help the research community to better study the function and evolution of Brassica genomes as well as enhance molecular breeding research. This database will be updated regularly with new features, improvements to genome annotation, and new genomic sequences as they become available. Bolbase is freely available at http://ocri-genomics.org/bolbase.

The A. thaliana genome has undergone two whole genome duplication events (α and β) within the crucifer lineage and one more ancient genome triplication event (γ) shared with most dicots (asterids and rosids) [5]. The Brassica and Arabidopsis lineages diverged from a common ancestor about 20 million years ago (MYA) after the α events [6], and a whole genome triplication event occurred subsequently in the Brassica ancestor 13-17 MYA [7]. The two representative Brassica diploids, B. rapa and B. oleracea, separated from each other about 3.75 MYA [8]. The genetic system of Brassica species, particularly of those described by the "triangle of U" (the relationship between three diploids and three synthetic tetraploids) [1], provides an unprecedented opportunity to study inter-species hybridization, polyploidization, genome evolution and its role in plant speciation. The genome of B. rapa (A genome) has been sequenced and made available in the BRAD database [9]. Recently, we finished the genome assembly of B. oleracea (C genome) and submitted the data to NCBI. These primary genomic data will facilitate structural, functional, and evolutionary analyses of Brassica genomes, as well as those of other Brassicaceae.
There now exist several public databases for B. oleracea genome sequence data, including Brassica Genome Gateway (http://brassica.bbsrc.ac.uk/), Brassica.info (http://www.brassica.info/resource/databases.php), and AAFC Comparative Genome Viewer (http://brassica.agr.gc.ca/navigation/viewer_e.shtml). These databases present only partial genomic data for B. oleracea, such as QTLs, ESTs and cloned genes. To better access, search, visualize, and understand the genome sequences, annotation, structure, and evolution of the B. oleracea genome, we developed a comprehensive web-based database, Bolbase (http://ocri-genomics.org/bolbase), which includes genome sequence data and comparative genomics information. This user-friendly database will serve as an infrastructure for researchers to study the molecular function of genes, comparative genomics, and evolution in closely related Brassicaceae species as well as promote advances in molecular breeding within Brassica (Figure 1).

Construction and content
The genome of B. oleracea capitata (line 02-12) was sequenced by next-generation sequencing technologies combined with 454 and Sanger sequencing. In total, a 540-Mb draft assembly, representing 85% of the estimated 630-Mb genome, was generated and submitted to NCBI. In Bolbase, we collected the complete sequence assembly, including nine pseudomolecular chromosomes, 1,848 scaffolds, and all genome components, comprising 45,758 predicted protein-coding genes, 13,382 transposable elements, and 3,581 non-coding RNAs. For each annotated genomic component, we supplied detailed annotations and crosslinks to publicly available databases. Moreover, we provided a comprehensive analysis of synteny among B. oleracea, B. rapa, and A. thaliana using data from BRAD (http://brassicadb.org/brad/, v1.0) [9] and TAIR (http://www.arabidopsis.org, TAIR9) [10], respectively.

Genomic component
A total of 45,758 predicted genes with annotations were collected in Bolbase (Table 1). Putative genes with a variety of architectonic types, such as gene families, orthologous groups, and tandem arrays, and their locations on pseudomolecular chromosomes and scaffolds were included in Bolbase. Each putative gene was annotated using public databases or web service sites to obtain a comprehensive functional overview (Figure 2). A total of 13,382 transposable elements in B. oleracea were deposited in Bolbase, including two major classes: retrotransposons (Class I transposons) and DNA transposons (Class II transposons). Additional categories, such as long terminal repeat retrotransposons (LTR-RTs), long interspersed nuclear elements (LINEs), short interspersed nuclear elements (SINEs), Tc1-Mariner, hAT, Mutator, Pong, PIF-Harbinger, CACTA, Helitron, and miniature inverted repeat transposable elements (MITEs) were hierarchically listed. Moreover, information on different superfamilies and families of LTR-RT elements was also provided. Bolbase compiled 3,581 noncoding RNAs by their conserved motifs and sequence

Gene clusters
Clusters of genes with similar functions evolve through tandem, segmental, or whole genome duplication and are remarkably important for genome evolution and trait establishment. The gene cluster section in Bolbase is composed of gene families, orthologous groups, and tandem duplicated arrays. First, HMMER v3.0 software was employed to detect gene family members using HMM profiles from the Pfam database [11,12]. Second, OrthoMCL 2.0 software was used to classify orthologous groups with an E-value ≤ 1e-05 and an inflation parameter of 1.5; all B. oleracea genes were divided into 21,509 ortholog groups [13]. Third, tandem duplicated genes were classified using the BLASTP program with an E-value cutoff ≤ 1e-20, where one unrelated gene within a tandem array was allowed. Approximately 1,825 tandem arrays with 2 to 12 genes each were detected and saved in Bolbase.
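The tandem-array rule described above can be illustrated with a minimal sketch. It assumes that genes are supplied in chromosomal order and that the set of homologous pairs passing the BLASTP cutoff has already been computed. It also reads "one unrelated gene allowed" as at most one intervening gene between two homologous members. The function name, the inputs, and this reading are assumptions made for illustration, not the pipeline used to build Bolbase.

```python
def tandem_arrays(ordered_genes, homolog_pairs, max_spacers=1):
    """Group neighbouring homologous genes into tandem arrays.

    ordered_genes: gene IDs in chromosomal order (one chromosome).
    homolog_pairs: set of frozenset({a, b}) for gene pairs that passed
                   the similarity cutoff (assumed precomputed).
    max_spacers:   unrelated genes tolerated between two members
                   (assumption: 1, following the text).
    """
    n = len(ordered_genes)
    parent = list(range(n))  # union-find over gene positions

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Link every homologous pair that lies within the allowed window.
    for i in range(n):
        for j in range(i + 1, min(i + max_spacers + 2, n)):
            pair = frozenset((ordered_genes[i], ordered_genes[j]))
            if pair in homolog_pairs:
                parent[find(j)] = find(i)

    # Collect connected groups; keep only groups with two or more genes.
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(ordered_genes[i])
    return [members for members in groups.values() if len(members) >= 2]
```

For example, with genes `g1..g5` in order and homologous pairs `g1`/`g2` and `g2`/`g4`, the sketch reports one array containing `g1`, `g2`, and `g4`, with `g3` tolerated as the single spacer.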

Syntenic regions
To better understand evolutionary history and species divergence, syntenic regions between A. thaliana and Brassica species were identified using the MCScanX software and manual curation, and they can be visualized and used in Bolbase [14] (Figure 3). Orthologous gene pairs were first identified as best reciprocal hits from an all-against-all BLAST search between species with an E-value cutoff ≤ 1e-10 [15]. Then, MCScanX was employed to identify syntenic regions, using the parameters e = 1e-20, u = 1, and s = 5, which required a minimum of five consecutive orthologous gene pairs in the collinear regions. In total, 558 syntenic regions, including 22,413 gene pairs, were classified between B. oleracea and A. thaliana, and 1,034 syntenic regions containing 24,422 gene pairs were defined between B. oleracea and B. rapa. These data can be freely accessed and visualized (Table 2, Additional file 1). Moreover, nonsynonymous (Ka) and synonymous (Ks) substitution rates of orthologous gene pairs were calculated and provided.
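The best-reciprocal-hit step can be sketched as follows. This is a minimal illustration that assumes the BLAST results are available in the standard 12-column tabular layout (query ID in column 1, subject ID in column 2, E-value in column 11, bit score in column 12) and that the best hit is the one with the highest bit score. The function names and the tie-breaking rule are assumptions, not the authors' scripts.

```python
def best_hits(blast_lines, max_evalue=1e-10):
    """Return {query: best subject} from tabular BLAST lines.

    Hits with an E-value above max_evalue are ignored; among the
    remaining hits the one with the highest bit score wins.
    """
    best = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a, max_evalue=1e-10):
    """Return sorted (a, b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b, max_evalue)
    reverse = best_hits(b_vs_a, max_evalue)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)
```

The resulting pairs would then serve as anchors for collinearity detection; the sketch does not attempt to reproduce MCScanX itself.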

Utility
Bolbase provides a user-friendly interface to facilitate the retrieval of information. Five main functional units (browse, synteny, search, document, and help) were integrated into Bolbase. From those units, users can browse genomic and comparative genomic information for B. oleracea and its relatives or retrieve comprehensive genomic component annotations, their locations on pseudomolecular chromosomes, and genome sequences. These genomic data can also be downloaded in bulk. Therefore, Bolbase will facilitate studies on genome variation and genomic structure differentiation within and between species. Here we describe some main functions of the interface.

Browsing genomic components and syntenic regions
The genomic component web interface of Bolbase is organized by component type. Each of the main navigation tabs focuses on a specific component to allow users to retrieve information from the database. This functional unit is contained in "Browse" on the main navigation bar. The putative gene tab is organized by gene families, orthologous groups, tandem arrays, and gene locations on pseudomolecular chromosomes or scaffolds. Repeat element and non-coding RNA tabs are organized by types, categories, or superfamilies. In particular, Bolbase provides detailed function annotations for every putative gene that can be divided into four units: (i) basic information (Figure 2A); (ii) protein sequence features (Figure 2B); (iii) gene clusters, including orthologous groups and tandem duplicated arrays (Figure 2C); and (iv) syntenic analyses including orthologs in B. rapa and A. thaliana, as well as corresponding syntenic regions and triplicated blocks (Figure 2D). Basic information consists of gene identifier, location, model structure (intron/exon boundary, number, length, etc.), and coding nucleotide and protein/peptide sequences. The protein sequence features unit displays in detail the conserved protein domains or motifs predicted by InterProScan [16]. Additionally, putative genes were annotated and compared with different databases, including Gene Ontology (GO) [17], Swiss-Prot [18], TrEMBL [18] and Kyoto Encyclopedia of Genes and Genomes (KEGG) [19].
To better visualize the collinear relationship between species, the syntenic regions in B. oleracea, B. rapa, and A. thaliana are drawn on chromosomal images produced by Perl scripts, and statistical analyses of gene pairs between species are shown as scatter plots. The syntenic regions between any target chromosome and those of other species will appear when the chromosome is selected, revealing gene pairs in each region and their Ka, Ks and Ka/Ks values.
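As a small illustration of how the Ka, Ks and Ka/Ks values listed for each gene pair relate to one another, the sketch below computes the ratio and attaches the conventional reading (a ratio below 1 is usually taken to suggest purifying selection, and a ratio above 1 positive selection). The function names, the input layout, and the tolerance band around 1 are assumptions chosen for illustration; they are not taken from Bolbase.

```python
def ka_ks_ratio(ka, ks):
    """Return Ka/Ks, or None when Ks is zero and the ratio is undefined."""
    if ks <= 0:
        return None
    return ka / ks


def selection_label(ratio, tolerance=0.1):
    """Map a Ka/Ks ratio to its conventional reading.

    tolerance is an arbitrary illustrative band around 1, not a value
    from the paper.
    """
    if ratio is None:
        return "undefined"
    if ratio < 1 - tolerance:
        return "purifying"
    if ratio > 1 + tolerance:
        return "positive"
    return "near-neutral"


def summarize_region(gene_pairs):
    """Turn (gene_a, gene_b, ka, ks) tuples into rows with ratio and label."""
    rows = []
    for gene_a, gene_b, ka, ks in gene_pairs:
        ratio = ka_ks_ratio(ka, ks)
        rows.append((gene_a, gene_b, ratio, selection_label(ratio)))
    return rows
```

Pairs with Ks equal to zero are reported as undefined rather than being given an infinite ratio, which keeps them visible in a table without distorting any plot axes.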

Keyword search
The keyword search is a powerful search engine to retrieve useful information, such as sequences, annotations, and homologous genes. These functional units are contained in the "Search" section on the main navigation bar. This section mainly includes putative gene, transposable element, and non-coding RNA search pages. Putative gene searching will provide users with detailed annotations, orthologous genes, and/or tandem arrays, if they exist. By inputting a GO term, an InterPro entry, or a KEGG pathway entry, researchers can retrieve a group of putative genes in the B. oleracea genome. Different types, categories, and superfamilies of transposable elements can be screened in the transposable element search page. The non-coding RNA search page is designed to help users compile information on these genetic elements. The different types or categories of non-coding RNA can also be searched on this page.
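Retrieving a group of genes from an annotation term amounts to a lookup in an annotation table. The sketch below illustrates the idea with a hypothetical single-table schema. It uses Python's built-in sqlite3 module only so that the example runs on its own. The table name, the column names, and the gene identifiers are invented for illustration and do not describe the actual Bolbase database.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical annotation table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE annotation (gene_id TEXT, source TEXT, term TEXT)"
    )
    conn.executemany(
        "INSERT INTO annotation VALUES (?, ?, ?)",
        [
            # Gene IDs are placeholders, not real Bolbase identifiers.
            ("gene_0001", "GO", "GO:0003677"),
            ("gene_0002", "GO", "GO:0003677"),
            ("gene_0002", "GO", "GO:0005634"),
            ("gene_0003", "InterPro", "IPR001005"),
        ],
    )
    return conn


def genes_for_term(conn, source, term):
    """Return the sorted gene IDs annotated with the given term."""
    rows = conn.execute(
        "SELECT DISTINCT gene_id FROM annotation "
        "WHERE source = ? AND term = ? ORDER BY gene_id",
        (source, term),
    )
    return [row[0] for row in rows]
```

Calling `genes_for_term(build_demo_db(), "GO", "GO:0003677")` returns the two placeholder genes carrying that term. The parameterized query keeps user-supplied keywords out of the SQL text.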

Orthologous genes and syntenic regions search
Through comparative analyses among species, researchers can further understand the genomes of B. oleracea and its relatives. Orthologous genes in conserved syntenic regions can be displayed using a localized GBrowse_syn software by inputting a gene name, as indicated in Figure 3 [20,21]. This functional unit is contained in the "Search" section on the main navigation bar. Here, we use the B. oleracea  Figure 2E,F). By selecting a chromosome from one species, syntenic regions in the other species can be visualized as a comparative chromosomal image, and lists of syntenic regions are displayed with their chromosomal positions. When the hyperlink for the target region is clicked, the syntenic regions in other species will be displayed.

Sequence similarity search
The similarity search page, which embeds customized BLAST software, serves users interested in homologous genes or regions. This functional unit is contained in the "Search" section on the main navigation bar. Users can supply a nucleic acid or amino acid sequence by uploading or directly pasting it to search against the available databases. Thus, this function allows quick comparisons and annotations of user query sequences using the data deposited in Bolbase. BLAST hits are returned with hyperlinks to the genes, enabling users to quickly acquire annotations from the database.

Discussion
Although a few Brassica databases existed previously, Bolbase is the first comprehensive database with a focus on the B. oleracea genome and comparisons with its relatives. The deposited sequences and relatively accurate annotations will allow users to retrieve and download important information to further their interests in both functional and comparative genomics studies. Compared to other databases of B. oleracea genomic data, Bolbase supplies more detailed genomic annotations from public databases to allow users to analyze them more thoroughly. Syntenic regions and orthologous genes, which are useful resources for comparative and evolutionary analysis, can be explored in a highly visual style. Additionally, the user-friendly interface provides users with quick and comprehensive information. The search tools allow multi-channel searching and will be improved in the future based on user feedback. We continue to update and expand the database by adding data from other Brassica species as they become available.

Conclusions
We have developed Bolbase, a comprehensive and searchable database of the B. oleracea genome. Bolbase is the primary resource platform for the B. oleracea genome and for genomic comparisons with its relatives, and its functions are not available in other public databases of Brassica species. To assist researchers and breeders in using the B. oleracea genomic information efficiently, Bolbase will be regularly updated with new genome annotations and the results of comparisons with newly sequenced genomes as they become available. We hope that Bolbase will provide a valuable resource for the study of the functional and evolutionary aspects of Brassica genomes and for further exploration of the evolutionary relationships within the Brassica genus and the crucifer lineage.
Other requirements: Apache, PHP, MySQL, GD, SVG, GBrowse. These data are freely available without restrictions for use by academics. Please log in to the 'Help' page on the Bolbase homepage or email Dr. Shengyi Liu (liusy@oilcrops.cn) to request data subsets of interest.

Additional file
Additional file 1: Summary of syntenic regions in Brassica oleracea, Brassica rapa, and Arabidopsis thaliana. In this Excel file, the "A. thaliana-B.oleracea_aligns" sheet is a summary of syntenic regions between the B. oleracea and A. thaliana genomes; the "B.oleracea-B. rapa_aligns" sheet is a summary of syntenic regions between the B. oleracea and B. rapa genomes; and the "A.thaliana-B.rapa_aligns" sheet is a summary of syntenic regions between the B. rapa and A. thaliana genomes.