SNUGB: a versatile genome browser supporting comparative and functional fungal genomics

Background Since the complete genome sequence of Saccharomyces cerevisiae was released in 1996, genome sequences of over 90 fungal species have become publicly available. The heterogeneous formats of genome sequences archived in different sequencing centers have hampered the integration of the data for efficient and comprehensive comparative analyses. The Comparative Fungal Genomics Platform (CFGP) was developed to archive these data in a single standardized format that can support multifaceted and integrated analyses of the data. To facilitate efficient data visualization and utilization within and across species based on the architecture of CFGP and associated databases, a new genome browser was needed. Results The Seoul National University Genome Browser (SNUGB) integrates various types of genomic information derived from 98 fungal/oomycete (137 datasets) and 34 plant and animal (38 datasets) species, graphically presents germane features and properties of each genome, and supports comparison between genomes. The SNUGB provides three different forms of the data presentation interface (diagram, table, and text) and six different display options to support visualization and utilization of the stored information. Information for individual species can be quickly accessed via a new tool named the taxonomy browser. In addition, SNUGB offers four useful data annotation/analysis functions, including 'BLAST annotation.' The modular design of SNUGB makes it easy to adopt in support of other comparative genomic platforms and facilitates continuous expansion. Conclusion The SNUGB serves as a powerful platform supporting comparative and functional genomics within the fungal kingdom and also across other kingdoms. All data and functions are available at the SNUGB web site.


Background
As the number of sequenced genomes rapidly increases, search and comparison of sequence features within and between species have become an integral part of most biological inquiries. To facilitate use of the sequenced genomes, numerous bioinformatics tools have been developed; among these, genome browsers play an essential role by providing various means for viewing genome sequences and annotated features (e.g., chromosomal position and context of individual genes, protein/nucleotide sequences, exon/intron structures, and promoters) via graphical and text interfaces. Widely utilized genome browsers include: (i) Ensembl http://www.ensembl.org/, which is specialized for mammalian genomics and comparative genomics [1], (ii) UCSC Genome Browser http://genome.ucsc.edu/, which archives genome sequences of 30 vertebrate and 24 nonvertebrate species [2], (iii) GBrowse http://gmod.org/wiki/Gbrowse, a widely used component-based genome browser [3], and (iv) Map Viewer http://www.ncbi.nlm.nih.gov/projects/mapview at the National Center for Biotechnology Information (NCBI), which covers a large number of organisms [4]. A new genome browser based on the Google map engine, called the X::Map Genome Browser http://xmap.picr.man.ac.uk/ [5], contains genomes of three mammalian species and is specialized for supporting microarray analyses based on the Affymetrix platform [6].
Since the complete S. cerevisiae genome sequence was released in 1996, more than 90 fungal/oomycete species have been sequenced, with many additional species currently being sequenced [7]. A few sequencing centers, such as the Broad Institute http://www.broad.mit.edu/ and the JGI http://www.jgi.doe.gov/, have sequenced most of the fungal genomes and provide their own genome browsers to support data visualization and utilization. Although they use standardized formats, such as FASTA and GFF3, for data presentation and distribution, each center uses its own data formats for sequences, annotation data, and other chromosomal information. In addition, some of the sequenced fungal genomes lack certain data, such as exon positions. These problems have hampered the integration and visualization of all available genome sequences via a single genome browser. As a solution to this problem, a group at Duke University http://fungal.genome.duke.edu/ installed the open-source browser GBrowse [3] after reannotating genome sequences of 42 fungal species from multiple sequencing centers with their own annotation pipeline, which consists of several gene prediction programs; large-scale evolutionary analyses were conducted based on the archived genomes, demonstrating the usefulness of unified and standardized data formats [8].
A large number of sequenced fungal genomes have provided opportunities to compare genome sequences and features at multiple taxon levels, revealing potential mechanisms underpinning fungal evolution and biology [8][9][10][11][12][13][14][15][16][17][18]; however, due to the complexity and vast scale of the resulting data, presentation of these data in an easily accessible format is challenging. To overcome this limitation, both the database construction and the pipeline/tools for comparative analyses should be carefully designed. One good example is the e-Fungi project http://www.e-fungi.org.uk/ [19], which archives genome sequences of 34 fungal and 2 oomycete species and supports various queries via the web interface. Comparative fungal genomics studies have been conducted using e-Fungi [9,11]. The Yeast Gene Order Browser (YGOB; http://wolfe.gen.tcd.ie/ygob/) [20] archives genome sequences of species belonging to the subphylum Saccharomycotina and provides a graphical gene order browser, which helps dissect the evolutionary history of genome changes during yeast speciation [21]. Although these platforms provide useful tools and data, they cover only certain fungal genomes, and their support for user-friendly access to sequence information and graphical presentation of data is limited.
The Comparative Fungal Genomics Platform (CFGP; http://cfgp.snu.ac.kr/) was established to archive all publicly available fungal and oomycete genome sequences using a unified data format and to support multifaceted analyses of the stored data via a newly developed user interface named the Data-driven User Interface [7]. Currently, CFGP archives genome sequences of 92 fungal and 6 oomycete species (137 different datasets) and also carries genome sequences of 55 plant, animal and bacterial species (56 datasets). Taking advantage of the data warehouse and functionalities in CFGP, several databases specialized for certain gene families or functional groups have been constructed, one of which is the Fungal Transcription Factor Database (FTFD; http://ftfd.snu.ac.kr/) [22]. This database identifies and classifies all fungal transcription factors and provides a phylogenomic platform supporting analyses of individual transcription factor families [23]. In addition, the Fungal Cytochrome P450 Database (FCPD; http://p450.riceblast.snu.ac.kr/) [24], the Fungal Secretome Database (FSD; http://fsd.snu.ac.kr/; Choi et al., unpublished), and the Fungal Expression Database (FED; http://fed.snu.ac.kr/; Park et al., unpublished) have been constructed or are currently being constructed. The CFGP was also used to manage high-throughput experimental data and link them to corresponding genes [25,26] and to maintain the Phytophthora database http://www.phytophthoradb.org/ [27].
To support comparative genomics analyses using CFGP and offer tools for versatile data visualization, we developed a new genome browser named the Seoul National University Genome Browser (SNUGB; http://genomebrowser.snu.ac.kr/). We chose to develop a new genome browser instead of adopting one of the existing browsers in part because adoption would have required conversion of the data archived in CFGP into new formats, and because the existing browsers do not support the integration of additional databases, such as the InterPro and customized homologous gene databases available through SNUGB. We also wanted a browser based on the architecture of CFGP and associated databases so that we would be able to quickly present updated contents in these resources and seamlessly integrate new tools for data processing, visualization, and/or utilization.
The SNUGB currently covers genome sequences and associated information for 92 fungal and 6 oomycete species (137 datasets), which is the largest collection among the fungal genome browser services available on the web. These 92 fungal species cover four phyla and one subphylum based on a recently revised fungal taxonomy framework [28] (Tables 1, 2, and 3). It also houses genome sequences of 12 plant, 18 insect, and 3 nematode species as well as human genome sequences (38 datasets) to support comparison of fungal genomes with those in other kingdoms (Table 4). The taxonomy browser implemented in the SNUGB provides two easy ways to access genome sequences of specific species. The SNUGB provides lists of putative orthologous genes of all fungal ORFs and a tool for comparing the genomic contexts of any orthologous genes among chosen species. In addition, SNUGB displays the InterPro terms assigned to each ORF as well as the genomic regions to which expressed sequence tags (ESTs) are matched. With these functionalities, SNUGB will serve as a powerful platform supporting comprehensive fungal comparative genomics.

Data processing via an automated pipeline and the function of Positional Database
Positional information of functional/structural units that are present on individual contigs/chromosomes, such as the start and stop sites of ORFs and exons/introns, was collected from the data warehouse of CFGP and stored in the Positional Database of SNUGB. New types of data, such as Simple Sequence Repeats (SSRs) on the genome, can be easily added to the Positional Database for visualization via SNUGB. Along with the positional information, the data type (e.g., ORF), primary key, and any additional information for each datum were saved into partitioned tables, which were designed to speed up data retrieval. Through the primary key, SNUGB can display detailed information on each datum (e.g., sequences) stored at external sources. Considering the large number of available fungal genome sequences and those that are currently being sequenced, a standardized pipeline for data extraction and management is needed, in addition to this data standardization scheme, to organize the data and to ensure orderly expansion of SNUGB.
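The storage scheme described above can be illustrated with a minimal in-memory sketch. This is not the SNUGB schema, which is not given here; the class names, the field names, and the choice to partition by contig and data type are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PositionRecord:
    """One positional entry (hypothetical layout, not the SNUGB schema)."""
    contig: str
    start: int          # 1-based, inclusive
    end: int            # 1-based, inclusive
    data_type: str      # e.g. "ORF", "exon", "SSR"
    primary_key: str    # key used to look up details stored elsewhere
    extra: str = ""     # any additional information


class PositionStore:
    """Keeps records in partitions keyed by (contig, data_type)."""

    def __init__(self):
        self._partitions = defaultdict(list)

    def add(self, record):
        self._partitions[(record.contig, record.data_type)].append(record)

    def query(self, contig, start, end, data_type):
        """Return records of one type overlapping [start, end], sorted by start."""
        partition = self._partitions.get((contig, data_type), [])
        hits = [r for r in partition if r.start <= end and r.end >= start]
        return sorted(hits, key=lambda r: r.start)
```

Because a query touches only the partition for the requested contig and data type, adding a new data type (for example SSRs) does not slow down retrieval of ORFs in this sketch.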
The pipeline developed for SNUGB processes each genome dataset via the following steps. Firstly, once whole genome sequences are deposited in the data warehouse of CFGP, the integrity of genome information, such as the positional information of functional/structural units, is inspected, and several properties of the whole genome, such as the length and the GC content, are calculated. Secondly, the GC content, AT-skew, and CG-skew are calculated using 50-bp sliding windows with 20-bp steps. Thirdly, for each gene, three types of sequence information are generated based on the genome annotation information: coding sequences (sequences from the start to stop codon without introns), gene sequences (sequences from the start to stop codon with introns), and, if transcript information is available, transcript sequences (sequences from the transcription start site to the end site without intron sequences). Fourthly, all data generated in the previous steps are transferred into the Positional Database to support graphical representation of these features. Fifthly, if the genome has chromosomal map information, such as a genetic map or an optical map, this information is converted into a standardized format and stored in SNUGB for graphical representation via the Chromosome Viewer. Lastly, after all ORFs in the genome are analyzed with InterProScan [29], the genomic position of each predicted domain is calculated and stored in the Positional Database.
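The second step of the pipeline can be sketched as follows. The window and step sizes (50 bp and 20 bp) come from the text; the skew formulas, (A − T)/(A + T) and (C − G)/(C + G), are the conventional definitions and are assumed here, as are the function name and the handling of ambiguous bases.

```python
def window_stats(seq, window=50, step=20):
    """Yield (start, gc_content, at_skew, cg_skew) for each full window.

    start is 1-based. Ambiguous bases such as N are ignored (an assumption).
    """
    seq = seq.upper()
    results = []
    for offset in range(0, len(seq) - window + 1, step):
        chunk = seq[offset:offset + window]
        a, t, g, c = (chunk.count(base) for base in "ATGC")
        called = a + t + g + c
        gc_content = (g + c) / called if called else 0.0
        at_skew = (a - t) / (a + t) if (a + t) else 0.0
        cg_skew = (c - g) / (c + g) if (c + g) else 0.0
        results.append((offset + 1, gc_content, at_skew, cg_skew))
    return results
```

With these settings consecutive windows overlap by 30 bp, which smooths the resulting curves at the cost of storing one value per 20 bp of sequence.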

Modular design of SNUGB facilitates its application
To facilitate the efficient implementation of SNUGB in diverse genomics platforms, a modular design was used for its application programming interface (API). Through the API, a diagram showing genome features in a selected region can be created using only their chromosomal positions and display options. Four recent publications illustrate the utility of this design: T-DNA Analysis Platform (TAP; http://tdna.snu.ac.kr/) provides the GC content and AT skew around T-DNA insertion sites on the chromosomes of Magnaporthe oryzae via a mini genome browser supported by SNUGB [25]. The chromosomal distribution pattern of T-DNA insertion sites in M. oryzae http://atmt.snu.ac.kr/ was also displayed using SNUGB [26]. Fungal Cytochrome P450 Database (FCPD; http://p450.riceblast.snu.ac.kr/) [24] employs SNUGB to present the chromosomal distribution pattern and contexts of cytochrome P450 genes in fungal genomes. Two databases, FED http://fed.snu.ac.kr/ and FSD http://fsd.snu.ac.kr/, utilize SNUGB for presenting the genomic context of the region matched to EST and secreted proteins, respectively. Moreover, Systematical Platform for

Properties of the fungal/oomycete genomes archived in SNUGB
Among the 98 fungal/oomycete species (137 genome datasets) covered by SNUGB, 77 species (111 genome datasets; 81%) belong to the phylum Ascomycota (Tables 1 and 2), and 10 species (14 genome datasets; 10%) belong to the phylum Basidiomycota (Table 3). In contrast, the phyla Chytridiomycota and Microsporidia are represented only by one (2 datasets) and two species (both belong to the subphylum Mucoromycotina), respectively (Table 3). Six oomycete genomes, derived from Phytophthora, Hyaloperonospora, and Pythium species, are available for comparison with fungal genomes (Table 3). Although oomycetes belong to the kingdom Stramenopila and show closer phylogenetic relationships to algae and diatoms than to fungi [31], they have traditionally been grouped with fungi because of their morphological similarities.
The datasets that cover the whole genome (121 out of the 137 datasets) were analyzed to investigate genome properties. The average size of the genomes, measured by adding the lengths of all scaffolds together, is 31.42 Mb, which is about one-seventeenth of the average for plant genomes (547.41 Mb in the phylum Streptophyta) and one-seventh of that for insect genomes (215.36 Mb in the phylum Arthropoda) (Figure 1A). The fungal/oomycete genome sizes range from 2.5 Mb (Encephalitozoon cuniculi) to 228.5 Mb (Phytophthora infestans); the genome of E. cuniculi is smaller than that of Escherichia coli (4.6 Mb) [32], while the genome of P. infestans is much larger than the genomes of Arabidopsis thaliana (119.2 Mb) [33] and Caenorhabditis elegans (100.5 Mb) [34], indicating no clear relationship between genome size and organismal complexity [35]. With regard to the average genome sizes in different taxon groups, the phylum Microsporidia, known as ancestral fungi, shows the smallest average size (4.28 Mb), while oomycetes show the largest (102.83 Mb) (Figure 1A). In the phylum Basidiomycota, which is large and very diverse, the variation in genome size within each of the represented subphyla is the highest in the fungal kingdom: the ratios of the standard deviation to the average length in the three subphyla Agaricomycotina, Pucciniomycotina, and Ustilaginomycotina are 71.95%, 86.93%, and 57.46%, respectively (Figure 1B). The subphylum Pucciniomycotina displays the largest size with large variation (Figure 1A and 1B), while the two subphyla Saccharomycotina and Taphrinomycotina, belonging to the phylum Ascomycota, exhibit relatively low degrees of variation (Figure 1B), probably because only closely related species have been sequenced. Although the average genome sizes varied from group to group, ANOVA and TukeyHSD tests (P < 0.05) showed that only the difference between fungi and oomycetes was significant (Figure 1A).
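The ratio of the standard deviation to the average length used above is the coefficient of variation. A minimal sketch of the calculation follows; whether the sample or the population standard deviation was used is not stated, so the sample form (n − 1 denominator) is an assumption, and the function name is hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(sizes_mb):
    """Return 100 * SD / mean for a list of genome sizes in Mb.

    Uses the sample standard deviation (n - 1 denominator); this choice
    is an assumption, not something stated in the text.
    """
    if len(sizes_mb) < 2:
        raise ValueError("need at least two genome sizes")
    return 100.0 * stdev(sizes_mb) / mean(sizes_mb)
```

For three illustrative (not measured) genome sizes of 10, 20, and 30 Mb, the mean is 20 Mb and the sample standard deviation is 10 Mb, so the function returns 50%.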
The GC content of fungal genomes ranges from 32.523% (Pneumocystis carinii in the subphylum Taphrinomycotina) to 56.968% (Phanerochaete chrysosporium in the subphylum Agaricomycotina), while the GC content of plant and insect genomes ranges from 29.638% to 46.850% (Figure 1C). Although the coding regions exhibit higher GC contents than the rest of the genome, there is no relationship between the proportion of ORFs on the genome and the GC content of the whole genome (linear regression; R² = 0.04; Figure 1C and 1D).
The number of total proteins encoded by each organism was once considered to reflect the organism's characteristics [36]. Based on the size of total proteomes, all sequenced fungal and oomycete species were divided into three groups: the medium group contains the subphylum Pezizomycotina in Ascomycota and the subphyla Agaricomycotina and Pucciniomycotina in Basidiomycota; the small group includes the three subphyla Saccharomycotina, Taphrinomycotina, and Ustilaginomycotina and the phylum Microsporidia; and the large group has the subphylum Mucoromycotina and the phylum Oomycota (ANOVA and TukeyHSD; P < 0.05; Figure 1E). This grouping shows that the number of total ORFs does not correlate with taxonomic positions at the phylum level; at the subphylum level, however, the correlation was high. For example, the subphyla Saccharomycotina and Taphrinomycotina can be distinguished from Pezizomycotina based on this character. The ORF density classified the sequenced species into three distinct groups, Oomycetes, Microsporidia and the rest, through ANOVA and TukeyHSD tests (P < 0.05; Figure 1F). Taken together, these three indicators can be used to divide fungal subphyla/phyla. For example, the subphylum Pezizomycotina shows medium levels of ORF number and ORF density, while the subphylum Saccharomycotina displays a low ORF number but an ORF density comparable to that of the subphylum Pezizomycotina. Both the number of ORFs and the ORF

Comparison of genome sequences of multiple isolates within species
For 14 fungal species, two or more strains have been sequenced (Table 5). For some species, such as Fusarium graminearum, additional isolate(s) were sequenced only at a low coverage (e.g., 0.4× coverage for the second strain of F. graminearum); however, even such low coverage provided some insights into the evolution of pathogenicity in this important cereal pathogen [37]. Except for Aspergillus niger, Histoplasma capsulatum, and Paracoccidioides brasiliensis, all strains within the same species showed less than 1 Mb variation in genome size (Table 5). It is possible that the 3.2 Mb difference between two A. niger strains is in part due to different sequencing coverage: the coverage of ATCC1015 was 8.9× while that of CBS513.88 was 7× http://genome.jgi-psf.org/Aspni1/Aspni1.info.html [38]. The differences among three P. brasiliensis genomes, ranging from 29.1 Mb to 33.0 Mb, may reflect their distinct phylogenetic positions [39]. The differences among five H. capsulatum genomes may be due to a combination of different levels of sequencing coverage http://www.broad.mit.edu/annotation/genome/histoplasma_capsulatum/Info.html and different geographical origins [40]. Three isolates of H. capsulatum and P. brasiliensis showed approximately 1% difference in the GC content, whereas the degree of GC content variation among 11 strains of Coccidioides posadasii was only 0.5%. Four Cryptococcus neoformans strains, representing three different serotypes (A, B and D), showed around 0.3% variation in the GC content, and within a serotype (two serotype D strains) the difference was only 0.043% [41]. Isolates of Candida albicans, Saccharomyces bayanus, and Batrachochytrium dendrobatidis showed only 0.01% variation in the GC content. These intraspecific variations of genome properties can be compared in detail via SNUGB.
[42,43], will further accelerate the rate of fungal genome sequencing, emphasizing the importance of frequently updating SNUGB.
With the aid of the developed pipeline, SNUGB will be updated whenever new fungal genome sequences are publicly released with annotation information. A notice of updated genomes will be posted on the SNUGB web site.

Taxonomy browser
To support selection of species of interest based on their taxonomic positions, a web-based tool named the taxonomy browser was developed. Given the anticipated increase in comparisons of genome sequences and features across multiple species to investigate evolutionary questions at the genome scale, such a tool is needed to give users of SNUGB an overview of the taxonomic positions of the sequenced species and their evolutionary relationships with other fungi, and to assist them in selecting appropriate species for comparative analyses.
The taxonomy browser provides two methods for accessing the data archived in SNUGB. The first is a text search using the species name (Figure 2A). When a user begins typing a species name in the text box, the full name is completed automatically to speed up the search. The other method uses the taxonomic hierarchy (i.e., tree of life). When a user clicks a specific taxon (e.g., phylum), the taxonomy browser presents all subgroups within the chosen taxon for further selection (Figure 2B).
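The two access methods can be sketched in a few lines. This is only an illustration of the idea, not the SNUGB implementation: the function names, the sorted-index approach to name completion, and the tuple representation of lineages are assumptions.

```python
from bisect import bisect_left


def build_index(names):
    """Sort names case-insensitively so prefix matches are contiguous."""
    return sorted((name.lower(), name) for name in names)


def autocomplete(index, prefix, limit=10):
    """Return up to `limit` species names starting with `prefix`."""
    key = prefix.lower()
    i = bisect_left(index, (key,))
    matches = []
    while i < len(index) and index[i][0].startswith(key) and len(matches) < limit:
        matches.append(index[i][1])
        i += 1
    return matches


def subgroups(lineages, taxon_path):
    """Return the taxa directly below `taxon_path` in the hierarchy.

    Each lineage is a tuple running from kingdom down to species.
    """
    depth = len(taxon_path)
    return sorted({
        lineage[depth]
        for lineage in lineages
        if lineage[:depth] == tuple(taxon_path) and len(lineage) > depth
    })
```

A usage example with four species named in this paper:

```python
lineages = [
    ("Fungi", "Ascomycota", "Pezizomycotina", "Magnaporthe oryzae"),
    ("Fungi", "Ascomycota", "Pezizomycotina", "Fusarium graminearum"),
    ("Fungi", "Ascomycota", "Saccharomycotina", "Saccharomyces cerevisiae"),
    ("Fungi", "Basidiomycota", "Agaricomycotina", "Phanerochaete chrysosporium"),
]
index = build_index(lineage[-1] for lineage in lineages)
autocomplete(index, "sac")                  # name completion
subgroups(lineages, ("Fungi", "Ascomycota"))  # drill down one level
```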

Chromosome viewer and Contig/ORF browser
Three different methods can be used to access genomic information. For species with chromosomal map data (21 species), the chromosomal maps can be displayed via the Chromosome viewer (Figure 3A). The following color scheme is used to denote the level of completeness: i) chromosomes constructed using genetic or optical map data (with gaps) are shown in blue (Chromosomes 1 to 7 of M. oryzae; Figure 3A), ii) chromosome maps based on a combination of sequences and genetic/optical map data are shown in pink (e.g., chromosomes of A. niger), and iii) unassigned contigs (labeled as Chromosome Ex of M. oryzae; Figure 3A) are shown in light blue. For species without chromosomal map information, SNUGB provides the contig and ORF browsers, which display the names of contigs and ORFs, respectively, and allow users to search for them by name (Figures 3B and 3C).

Graphical Browser with six different display formats
Gene annotation information in a selected area of a chromosome or contig, such as transcripts, ORFs, exon/intron structures, and InterPro domains [29], can be displayed in three formats: i) the 'single' format shows these features as bars; ii) the 'squish' format displays them via color-coded diagrams without description; and iii) the 'pack' format presents them as small color-coded icons with description (Figure 4A). These graphical formats are also used by the UCSC Genome Browser [2]. In addition, the GC content and AT/CG skew information for individual chromosomes can be displayed in three formats: i) color-coded bar graph, ii) line, and iii) dotted lines along with a description of data (Figure 4B). For species with EST data (Table 1), the genomic region corresponding to each EST sequence can be displayed along with ORFs and InterPro domains to help users identify predicted gene structures and expressed regions (see Figure 4A). Presentation of these data is supported by the Fungal Expression Database http://fed.snu.ac.kr/.

Table browser and Text browser
Although graphical presentation of genomic features helps users view global patterns, the graphical browser does not provide sequences or a list of the elements present in a chosen area. To provide such information, we developed two additional tools named the table browser and the text browser. The table browser provides a list of the names and chromosomal/contig positions of all elements present in a selected region in CSV format, which can be opened in Excel (Figure 5A). The text browser provides the sequences in a selected region. If ORFs exist in the region, exons and introns are presented using different colors and cases; this function is useful for designing primers and transferring selected sequences to a different data analysis environment (Figure 5B). Additionally, all InterPro domains present on each ORF are displayed as special characters under the corresponding sequences so that putative functional domains can be easily recognized at the sequence level. The table and text browsers can display regions of up to 50 kb.
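The two outputs described above can be approximated with a short sketch. The column names, the function names, and the choice of upper case for exons and lower case for introns are assumptions; the text states only that different colors and cases are used, and color is omitted here.

```python
import csv
import io


def region_to_csv(features):
    """Serialize (name, contig, start, end) tuples as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["name", "contig", "start", "end"])  # hypothetical header
    for feature in features:
        writer.writerow(feature)
    return buffer.getvalue()


def mark_exons(sequence, exons):
    """Return the sequence with exons in upper case and the rest in lower case.

    `exons` holds (start, end) pairs, 1-based and inclusive, relative to
    the start of `sequence`.
    """
    chars = list(sequence.lower())
    for start, end in exons:
        chars[start - 1:end] = sequence[start - 1:end].upper()
    return "".join(chars)
```

For example, `mark_exons("atgcccgtaagtttag", [(1, 6), (15, 16)])` gives `ATGCCCgtaagtttAG`, which makes exon boundaries visible when choosing primer sites.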

Kingdom-wide identification of the putative orthologues of individual fungal proteins via BLAST and comparison of the genomic contexts and properties of homologous proteins among species via the Session History function
To identify putative orthologues of individual fungal proteins, BLAST searches with each of the 924,343 fungal proteins against all proteins were performed using an e-value of 1e-5 as the cut-off. The 'BLAST annotation' tab shows a list of putative orthologues of a chosen gene product in other species with their BLAST e-values (see Figure 6A). To compare the genomic contexts around the orthologous genes between species or among multiple species, users can store the genomic contexts of the genes using the Session History function, in which the stored genomic contexts can be displayed in one screen (Figure 6B). In each session, other information, such as the GC content and InterPro terms, can also be presented to further support the comparison.
[Figure: BLAST annotation to catalog homologous proteins (panels A and B)]
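The e-value filter can be illustrated with a short sketch. It assumes the searches were saved in the standard 12-column tabular BLAST output (query ID in the first column, subject ID in the second, e-value in the eleventh); the text does not say which output format was used, and the function name and the rule of keeping the lowest e-value per subject are likewise assumptions.

```python
def putative_orthologues(tabular_text, query_id, cutoff=1e-5):
    """Return [(subject_id, evalue), ...] for one query, best hits first.

    Expects tab-separated lines in the 12-column tabular BLAST layout.
    Hits with an e-value above `cutoff` and self-hits are discarded.
    """
    best = {}
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if query != query_id or subject == query_id or evalue > cutoff:
            continue
        # Keep the lowest e-value when a subject is hit more than once.
        best[subject] = min(evalue, best.get(subject, evalue))
    return sorted(best.items(), key=lambda item: item[1])
```

The returned list corresponds to what a tab such as 'BLAST annotation' would show for one gene product: the matching proteins and their e-values, ordered from the strongest hit downwards.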

Additional functionalities of SNUGB
The 'flexible-range-select' function allows users to select a chromosomal segment by clicking the mouse at the start site and moving it over the desired segment; the selected area is displayed as a shaded box, and a subsequent click displays an enlarged view of the selected segment (Figure 3A). Through the 'high-resolution-diagram' function, users can obtain a high-resolution image (more than 3,000 pixels in width) showing various features on a whole chromosome, such as ORFs, InterPro terms, and GC content. This image can be downloaded as an image file via both the graphical genome browser and the session storage function.
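A range selection of this kind comes down to converting two horizontal pixel positions into genomic coordinates. The sketch below shows one way to do this; the function name, its parameters, and the rounding rule are hypothetical, since the conversion used by SNUGB is not described.

```python
def pixels_to_region(x_press, x_release, view_start, view_end, width_px):
    """Map a horizontal drag on the diagram to a genomic interval.

    `view_start` and `view_end` are the 1-based, inclusive coordinates
    currently displayed across `width_px` pixels. The drag may go in
    either direction and is clamped to the diagram.
    """
    left, right = sorted((x_press, x_release))
    left = max(0, left)
    right = min(width_px, right)
    span = view_end - view_start + 1
    start = view_start + (left * span) // width_px
    end = view_start + (right * span) // width_px - 1
    return start, max(start, end)
```

For a view of bases 1 to 10,000 drawn 1,000 pixels wide, each pixel covers 10 bp, so a drag from pixel 100 to pixel 300 selects bases 1,001 to 3,000 regardless of the drag direction.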

Conclusion
The SNUGB supports efficient and versatile visualization and utilization of rapidly increasing fungal genome sequence data, as well as those from selected organisms in other kingdoms, to address various types of questions at the genome scale. Properties and features of the archived fungal genomes are available for viewing and comparison in SNUGB. The taxonomy browser helps users easily access the genomes of individual species and provides taxonomic positions of chosen species, and the chromosome map function shows the whole genome of selected species. The graphical browser, table browser, and text browser present a global view of genomic contexts in a selected chromosomal region and support analyses of sequences in the region. The 'BLAST annotation' provides lists of putatively orthologous proteins in the fungal kingdom and facilitates comparison of the genomic contexts of their genes across multiple species. The SNUGB also allows users to manage their own work histories via the SNUGB web site.

Availability and requirements
All data and functionalities described in this paper can be freely accessed through the SNUGB web site at http://genomebrowser.snu.ac.kr/. The source code, a set of programs, and the database structure of SNUGB will be publicly released in the future, once the packaging of SNUGB for public release is finalized.

Authors' contributions
JP and YHL planned and managed this project, KJ designed the web site, KJ, JP, BP, KA, JYC, and JHC implemented various functions in SNUGB, JP, JYC, SIK, and DC processed genome sequences, and JP, SK and YHL wrote the manuscript.