Genome Browsers
We began developing BGD upon release of the bovine genome assembly Btau_2.0. Btau_2.0 was the first bovine assembly in which contigs were assembled in scaffolds. Our first step was to set up a GBrowse genome browser with a MySQL database serving as the backend [4]. GBrowse allows for simultaneous viewing of all data sets associated with a particular region of the genome. Although it was premature to annotate Btau_2.0, we presented the BGD GBrowse at several international conferences to create interest and attract potential annotators from the research community.
BGD now includes GBrowse sites for the newer assemblies, Btau_3.1 and Btau_4.0. BGD maintains GBrowse sites for each assembly on scaffold or chromosome coordinate systems. Each chromosome-coordinate-based GBrowse has a track showing ordered scaffolds, with links to the corresponding scaffolds in the scaffold-coordinate-based GBrowse. Although the GMOD Chado schema [5] is compatible with GBrowse and setting up a Chado database was our next step, we chose to maintain separate MySQL databases for each implementation of GBrowse to improve query performance. The MySQL databases are routinely synchronized with the Chado PostgreSQL databases.
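Linking a chromosome-coordinate view to the corresponding scaffold view implies translating a chromosome position into a scaffold position. The following is a minimal sketch of such a translation, not the BGD implementation; the `Placement` record, the scaffold names and the layout values are hypothetical, and positions are assumed to be 1-based.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Tuple


class Placement(NamedTuple):
    """Hypothetical record describing where one scaffold sits on a chromosome."""
    scaffold_id: str   # illustrative scaffold name
    chrom_start: int   # 1-based chromosome position of the scaffold's first placed base
    length: int        # scaffold length in bases
    strand: int        # +1 if placed forward, -1 if reverse-complemented


def chrom_to_scaffold(placements: List[Placement], pos: int) -> Tuple[str, int]:
    """Map a 1-based chromosome position to (scaffold_id, 1-based scaffold position).

    The placements list must be sorted by chrom_start and must not overlap.
    """
    starts = [p.chrom_start for p in placements]
    i = bisect_right(starts, pos) - 1
    if i < 0:
        raise ValueError(f"position {pos} precedes the first scaffold")
    p = placements[i]
    offset = pos - p.chrom_start
    if offset >= p.length:
        raise ValueError(f"position {pos} falls in a gap between scaffolds")
    # On a reverse-placed scaffold, coordinates count down from the scaffold end.
    local = offset + 1 if p.strand > 0 else p.length - offset
    return p.scaffold_id, local


# Illustrative layout: two scaffolds separated by a 100-base gap.
layout = [
    Placement("ScaffoldA", 1, 1000, +1),
    Placement("ScaffoldB", 1101, 500, -1),
]
print(chrom_to_scaffold(layout, 1101))
```

A link from the ordered-scaffold track could then be built from the returned scaffold identifier and local coordinate.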
Genome Database and Community Annotation System
BGD relies heavily on software produced by the GMOD project [6]. In addition to GBrowse and the Chado database schema, we have incorporated the Apollo Annotation Editor [7], and XORT and GMODTools for bulk data exchange in and out of our Chado database, respectively. We employ the PostgreSQL database management system structured with the Chado schema for the complete sets of assembly and genomic feature data. Chado uses controlled vocabulary (CV) terms from the Sequence Ontology (SO) [8]. Although Chado was originally designed for use by FlyBase [9], it has since been deployed by several other model organism databases.
Before BGD was developed, curators at FlyBase had been the primary users of the Apollo annotation software, and data exchange between annotators and the database occurred only through flat XML files. We were the first research group to implement the Apollo system for the annotation of a mammalian genome. An initial concern was the large size of genes in mammals, due to long introns. For any one gene, Apollo would be required to load a longer segment of chromosome and hold more data in memory, potentially a challenge for personal computers with little memory. To minimize the amount of required memory, we chose to develop the tools to annotate features on scaffolds instead of whole chromosomes.
Our first Chado database contained data from Btau_3.1, the assembly that the BGSAC used for annotation. An independent database was later developed for Btau_4.0. One of our motivations for using Chado has been the feasibility of directly sending data to remote Apollo software clients on users' workstations to help with annotation. Early in the development of this community annotation system we encountered technical challenges, such as identifying SO terms for computed gene models that were compatible between Chado and Apollo. Discontinuity of funding for Apollo resulted in discrepancies between CV terms used by Apollo and those used by Chado and other GMOD components. For example, many GMOD components load protein features into Chado using the term "protein", but the SO term is "polypeptide". We approached this SO mismatch problem by trial and error: creating generic feature format (GFF) files using different SO terms, loading the GFF into the database, and checking whether or not the track was displayed correctly in Apollo. Renewed support for Apollo has since mitigated most of the issues we initially encountered. We have attained Chado-Apollo compatibility by modeling computational gene data as a three-tiered hierarchy, with the features of CV term type "gene" at the highest level of the hierarchy. Each "gene" feature has one or more features of CV term type "mRNA" and each "mRNA" feature in turn has one or more features of CV term type "exon", as well as one feature of CV term type "protein". We have also modified the way exons and mRNA features are modeled. One way to minimize the size of the Chado database is to allow mRNA features to share exons when appropriate. However, we discovered errors when dumping FASTA formatted sequences using GMODTools, so we have not allowed exon features to be shared between different mRNA features.
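The three-tiered hierarchy can be illustrated with a short script that writes GFF3 lines for one gene. This is a sketch under stated assumptions rather than the code used for BGD: the function names, identifiers, source column and coordinates are hypothetical, and the protein feature is simply given the span of its mRNA. Each mRNA receives its own exon features, so exons with identical coordinates are never shared between transcripts, and the protein term is a parameter so that alternative SO terms can be tried.

```python
from typing import Dict, List, Tuple

Exon = Tuple[int, int]  # (start, end), 1-based inclusive


def gff3_line(seqid: str, ftype: str, start: int, end: int,
              strand: str, attrs: Dict[str, str]) -> str:
    """Format one nine-column GFF3 feature line."""
    attr_text = ";".join(f"{key}={value}" for key, value in attrs.items())
    return "\t".join([seqid, "example_source", ftype, str(start), str(end),
                      ".", strand, ".", attr_text])


def gene_to_gff3(seqid: str, gene_id: str, strand: str,
                 transcripts: Dict[str, List[Exon]],
                 protein_term: str = "protein") -> List[str]:
    """Emit gene -> mRNA -> (exon, protein) features with unshared exons."""
    coords = [c for exons in transcripts.values() for exon in exons for c in exon]
    lines = [gff3_line(seqid, "gene", min(coords), max(coords), strand,
                       {"ID": gene_id})]
    for mrna_id, exons in transcripts.items():
        start = min(s for s, _ in exons)
        end = max(e for _, e in exons)
        lines.append(gff3_line(seqid, "mRNA", start, end, strand,
                               {"ID": mrna_id, "Parent": gene_id}))
        # Exon IDs are derived from the mRNA ID, so each exon has exactly one parent.
        for n, (s, e) in enumerate(sorted(exons), 1):
            lines.append(gff3_line(seqid, "exon", s, e, strand,
                                   {"ID": f"{mrna_id}.exon{n}", "Parent": mrna_id}))
        # Simplification: the protein feature reuses the mRNA span.
        lines.append(gff3_line(seqid, protein_term, start, end, strand,
                               {"ID": f"{mrna_id}.protein", "Parent": mrna_id}))
    return lines


# Two hypothetical transcripts that have the same first exon coordinates.
example = gene_to_gff3("Scaffold1", "gene1", "+",
                       {"mRNA1": [(100, 200), (300, 400)],
                        "mRNA2": [(100, 200), (500, 600)]})
print("\n".join(example))
```

Calling `gene_to_gff3(..., protein_term="polypeptide")` produces the same hierarchy with the SO term, which is the kind of variation the trial-and-error approach compared.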
Our system allows experts in bovine biology to contribute to annotation of the bovine genome using the Apollo Annotation Editor, installed on users' desktops. The system is composed of the main Chado database, an intermediate Chado database to hold pre-reviewed submissions, the Apollo Annotation Editor, an Apollo-Chado adapter (included in the Apollo package), a community annotation web portal for user authentication and coordination of annotation efforts, XORT (a component of GMOD) to load manual annotations in Chado-XML format into the Chado databases, and GMODTools to dump annotation data from Chado. Although we initially developed the annotation system for Btau_3.1, we now only support annotation of Btau_4.0. Routine maintenance on the Chado PostgreSQL databases includes running the PostgreSQL VACUUM command after data are loaded (which greatly improves database performance in our hands), and regular database backups using the PostgreSQL command pg_dump.
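The maintenance routine could be scripted along the following lines. This is a sketch only: the database name, the dated output filename and the use of the `vacuumdb` command-line wrapper around VACUUM are assumptions, not details of the BGD setup. By default the script only prints the commands it would run.

```python
import subprocess
from datetime import date
from typing import List


def vacuum_command(dbname: str) -> List[str]:
    """Build the argument list for vacuumdb, the command-line wrapper for VACUUM."""
    return ["vacuumdb", dbname]


def backup_command(dbname: str, day: date) -> List[str]:
    """Build a pg_dump call that writes a dated plain-SQL dump (filename is illustrative)."""
    outfile = f"{dbname}_{day.isoformat()}.sql"
    return ["pg_dump", "--file", outfile, dbname]


def run_maintenance(dbname: str, day: date, dry_run: bool = True) -> List[List[str]]:
    """Vacuum the database, then back it up; with dry_run the commands are only printed."""
    commands = [vacuum_command(dbname), backup_command(dbname, day)]
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)  # raises CalledProcessError on failure
    return commands


# "chado_example" is a hypothetical database name.
planned = run_maintenance("chado_example", date(2010, 1, 1))
```

Passing `dry_run=False` executes the commands and assumes that the PostgreSQL client tools are installed and that connection settings are supplied through the environment.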
Because default configuration files were designed for FlyBase, Apollo must be configured for each project separately. We have modified the chado-adapter.xml file to provide Apollo with connection information for our server and Chado database, and to describe the data available in our database. We have also created the bovine.tiers file, which defines the different data tracks available and also allows for additional settings to control how the data are displayed. The tiers file also allows incorporation of URLs for features in the gene evidence tracks, allowing users to obtain more information about the features. In BGD, the homolog alignment tracks are linked to their source webpage at NCBI, Ensembl or UniProt. Ab initio gene predictions and consensus gene models are linked to data on the BGD server via CGI scripts. The bovine.style file contains the name of the tiers file and the organism name for the species to be annotated. In addition to these settings, the bovine.style file also contains a series of pre-generated comments that can be used by annotators to describe gene models. These "canned comments" have been useful for standardizing annotation notes, and have been customized for the bovine annotation project. We have also updated the Apollo chado.style and apollo.cfg files to include the style information for the new species.
Our community annotation web portal consists of a set of CGI scripts to authenticate users and accept uploaded annotations, as well as a MySQL database that maintains user information and serves as a back-end for annotation query web pages. The portal has allowed users to register, log in, download Apollo software configuration files and tutorials, and sign up to annotate priority genes. Annotation submission pages have allowed users to either upload Chado-XML files exported from Apollo or enter annotation information manually into web forms. Users can search for submitted annotations by user name or user-submitted information, such as gene name, gene family or keywords. Users can also view all submissions and edit their own submissions.
Data Exchange During Initial Annotation of the Bovine Genome
During the BGSAC annotation project, data were exchanged within BGD and with the bovine research community as described in Figure 1. Computational results were formatted into GFF3 and loaded using XORT into the Chado PostgreSQL database. In addition, a subset of the GFF3 was loaded into the GBrowse MySQL databases. The Chado PostgreSQL database supplied data to the Gene Pages and to the Apollo annotation editor.
Annotators accessed computational gene evidence by starting Apollo and entering the following information in the startup menu of the Apollo client software: 1) the BGD server hostname and 2) either an OGS identifier for a gene of interest or a scaffold identifier and coordinates designating the region of interest. Apollo then accessed the assembly and gene feature data from the Chado PostgreSQL database for the specified region. The annotator edited existing gene models or created new gene models and saved the results in Chado-XML format using the File pull-down menu in Apollo. The annotator then logged in to the BGD community annotation portal and uploaded the Chado-XML file to the BGD server. The uploaded files were saved in a secure directory on the BGD server, with the user id and a timestamp appended to the filename. Upon upload, the Chado-XML file was processed by a Perl script which used the Perl DBI module to load the information into the annotation portal MySQL database so that the data would be immediately visible on the annotation portal website, with a temporary id consisting of the user id and automatically incremented digits. Periodically, a BGD curator used XORT to load the Chado-XML files into the intermediate Chado database for pre-reviewed annotations. The curator then used GMODTools to retrieve the annotations from the Chado database as GFF3 and FASTA sequence files. The curator first performed automated checks on the GFF3 and FASTA coding sequences to flag potential conflicting annotations and coding sequences that have stop codons for further inspection and revision. The checked manual annotations were loaded into the main Chado database after being assigned BGD identifiers and incorporated into a new release of the OGS.
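One of the automated checks, flagging coding sequences that contain stop codons, might look like the sketch below. It is not the curators' script: it assumes that the check targets in-frame stop codons before the final codon, that the standard genetic code applies, and that the FASTA records hold coding sequences starting in frame. The record names and sequences are invented.

```python
from typing import Dict, List

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {record name: upper-case sequence}."""
    records: Dict[str, str] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records


def internal_stop_codons(cds: str) -> List[int]:
    """Return 1-based codon indices of in-frame stops, excluding the final codon."""
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    return [n for n, codon in enumerate(codons[:-1], 1) if codon in STOP_CODONS]


def flag_cds(records: Dict[str, str]) -> Dict[str, List[int]]:
    """Return only the records that contain internal stop codons."""
    flagged = {}
    for name, seq in records.items():
        stops = internal_stop_codons(seq)
        if stops:
            flagged[name] = stops
    return flagged


# Invented records: "bad" has a TAG at the second codon.
fasta_text = ">ok\nATGAAATAA\n>bad\nATGTAGAAATAA\n"
flagged = flag_cds(parse_fasta(fasta_text))
print(flagged)
```

Records returned by `flag_cds` would be candidates for the further inspection and revision described above.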
Data Exchange in the Next-generation BGD
We have made several improvements to the annotation system since the BGSAC annotation project and publication of the bovine genome (Figure 2), including support for direct writebacks from Apollo to the intermediate Chado database, database auditing and support for installing and launching Apollo using Java Web Start. Support for direct database writebacks allows users to upload their annotations directly from Apollo to the intermediate Chado database. This increases annotator efficiency and improves user experience by eliminating the need to save and upload Chado-XML files. It also allows users to view the work of other annotators, which greatly reduces redundancy. To support database auditing and rollbacks, we extended the Chado schema to include an audit module, composed of several tables and database triggers. These triggers record any changes in the database to the audit module tables, allowing BGD administrators to monitor annotator activity. In the future, the audit module will also allow BGD administrators to roll back changes. Support for Java Web Start allows users to install, configure and open Apollo by simply clicking a hyperlink to a Java Network Launching Protocol (.jnlp) file on the BGD website. This eliminates the need for annotators to manually install and configure Apollo. Every time Apollo-Webstart is launched by a user, it checks BGD servers for updated Apollo Java Archive (.jar) files, and updates the user's Apollo configuration files (e.g. chado-adapter.xml, apollo.cfg, bovine.tiers) even after the user has installed Apollo.
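The idea behind a trigger-based audit module can be shown with a small self-contained example. This sketch uses an in-memory SQLite database through Python's standard sqlite3 module rather than PostgreSQL, so the trigger syntax differs from what a Chado installation would need; the simplified `feature` table and the `feature_audit` table are hypothetical and do not reproduce the BGD audit schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Simplified stand-in for a Chado table (hypothetical columns).
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    name TEXT,
    type TEXT
);

-- Audit table: one row per change made to the feature table.
CREATE TABLE feature_audit (
    audit_id INTEGER PRIMARY KEY AUTOINCREMENT,
    feature_id INTEGER,
    action TEXT,
    old_name TEXT,
    new_name TEXT,
    changed_at TEXT DEFAULT CURRENT_TIMESTAMP
);

CREATE TRIGGER feature_insert_audit AFTER INSERT ON feature
BEGIN
    INSERT INTO feature_audit (feature_id, action, new_name)
    VALUES (NEW.feature_id, 'INSERT', NEW.name);
END;

CREATE TRIGGER feature_update_audit AFTER UPDATE ON feature
BEGIN
    INSERT INTO feature_audit (feature_id, action, old_name, new_name)
    VALUES (OLD.feature_id, 'UPDATE', OLD.name, NEW.name);
END;

CREATE TRIGGER feature_delete_audit AFTER DELETE ON feature
BEGIN
    INSERT INTO feature_audit (feature_id, action, old_name)
    VALUES (OLD.feature_id, 'DELETE', OLD.name);
END;
""")

# Simulate annotator activity on an invented gene model.
conn.execute("INSERT INTO feature (name, type) VALUES ('geneA', 'gene')")
conn.execute("UPDATE feature SET name = 'geneA_v2' WHERE name = 'geneA'")
conn.execute("DELETE FROM feature WHERE name = 'geneA_v2'")

audit_rows = conn.execute(
    "SELECT action, old_name, new_name FROM feature_audit ORDER BY audit_id"
).fetchall()
for row in audit_rows:
    print(row)
```

Because each audit row keeps the old value, a rollback could in principle be implemented by replaying the audit rows in reverse order.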
Gene Pages
To display details for each OGS gene, we deployed a novel web application based on Chado on Rails [10], a framework for developing web applications that use Chado databases. Data for any OGS gene of interest are retrieved from Chado using the Ruby-on-Rails (RoR) application, which follows the "model-view-controller" (MVC) pattern. The model component is composed of an object-relational mapping of the Chado database schema in RoR, allowing tables in the Chado database to be manipulated using Rails objects. The controller component provides the logic to retrieve genes and their associated data, and the view components provide the html templates to display the data retrieved from Chado. In addition to providing pre-computed information about genes, each gene page contains a link to a wiki page on which research community members can enter information about genes, associated literature, and suggestions for correcting gene models.
BLAST
BGD features a website for BLAST [11] similarity searches implemented using the NCBI standalone WWW BLAST server software. BGD allows BLAST searching of twenty-six different sequence databases, including the genome assembly on scaffold and chromosome coordinate systems, the Official Gene Sets, and each set of automated gene predictions. We have modified the BLAST output page to provide each hit identifier with a hyperlink to the corresponding sequence information and each hit alignment with a hyperlink to the corresponding genomic region in GBrowse, where a track for the alignment is displayed alongside the other evidence tracks in the region. To accomplish this, we created a CGI script that reads and reformats BLAST output. Hits to OGSv2 protein or coding DNA sequence records are linked to the gene page record for the relevant OGSv2 gene model. Other hits are linked to a sequence at an external database, such as GenBank, GenPept, Ensembl, or to sequence data from the BGD Chado database (for OGS, ab initio and GLEAN gene models).
To display BLAST hits to the assembly in GBrowse dynamically, we have leveraged the built-in distributed annotation system (DAS) [4] feature of GBrowse. Our CGI script creates a DAS track using HSP coordinates in the BLAST output and submits the track to GBrowse. The track is displayed as an "External Annotation Track" labeled with an arbitrary numerical identifier. The track is maintained for the user in browser cookies so that results of multiple BLAST searches may be accumulated and viewed simultaneously. Using built-in GBrowse functionality, the user also can edit and download tab-delimited coordinate files for each DAS track.
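Turning HSP coordinates into track features can be sketched as follows. The example assumes the standard 12-column tabular BLAST output and writes GFF3-style lines with the SO term "match"; the BGD CGI script may read a different BLAST output format and submit the track to GBrowse in a different form, and the query and chromosome names here are invented.

```python
from typing import List


def hsps_to_gff3(blast_tabular: str, source: str = "BLAST") -> List[str]:
    """Convert 12-column tabular BLAST rows into GFF3 feature lines on the subject."""
    features: List[str] = []
    for row in blast_tabular.splitlines():
        if not row.strip() or row.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = row.split("\t")
        query, subject = cols[0], cols[1]
        sstart, send = int(cols[8]), int(cols[9])
        bitscore = cols[11]
        # Minus-strand hits are reported with subject start greater than subject end.
        strand = "+" if sstart <= send else "-"
        start, end = min(sstart, send), max(sstart, send)
        attrs = f"ID=hsp{len(features) + 1};Name={query}"
        features.append("\t".join([subject, source, "match", str(start), str(end),
                                   bitscore, strand, ".", attrs]))
    return features


# Two invented HSPs for one query; the second lies on the minus strand.
blast_rows = (
    "query1\tChr1\t98.5\t200\t3\t0\t1\t200\t5000\t5199\t1e-100\t360\n"
    "query1\tChr1\t95.0\t100\t5\t0\t201\t300\t9100\t9001\t1e-40\t150\n"
)
track = hsps_to_gff3(blast_rows)
print("\n".join(track))
```

Each output line carries the subject sequence name and coordinates, which is the information a browser needs to place the hit next to the other evidence tracks.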
Content Management
BGD uses Drupal [12], an open source content management system for web pages. All content is stored in a MySQL database and each block of content (e.g. a home page, sidebar or banner) is referred to as a node. This allows web content in the forms of nodes to be edited and combined in a modular way. Drupal offers many advantages over static HTML pages, including easier site editing and maintenance and easy theme creation and adjustment. Drupal's modular system has a required set of core modules and additional optional modules that may be used to expand functionality on the site. The large user base frequently contributes new modules and themes back to the Drupal website. Once downloaded, new themes and modules can be quickly enabled using the web-based interface.
BGD has used several modules to extend the capabilities of the base Drupal installation. The FCKeditor [13] is a WYSIWYG text editor that simplifies formatting content. The IMCE module [14] is a file upload/browser module that, when integrated with FCKeditor, makes it very simple to upload images and quickly add them to pages using a graphical interface. The Content Construction Kit (CCK) [15] is a set of modules that enables the addition of custom fields to nodes. For example, BGD has a custom content type called "News" which is used for news items. The CCK Views module can be used to create a special type of node called a block that only displays content tagged as News. This block is used to display the news items on the front page, in addition to the content on the front page. The InsertFrame module [16] extends the HTML iframe tag by pre-calculating the page height and automatically setting it for the displayed page. These Drupal modules have offered a mechanism to seamlessly integrate CGI scripts and other dynamically generated content into the BGD theme without additional coding. Drupal site themes are divided into two sections: one or more cascading style sheets (CSS), and PHP templates. Style sheets specify everything from header colors and font styles to the characteristics of the menu items. Pages are rendered from the PHP templates, which contain the logic for the organization and display of the node's content. This allows different content to be displayed on a page based on predefined conditions, such as whether a user is logged in as an administrator, collaborator, or guest. Simply editing the CSS and template files can therefore change an entire site's appearance, eliminating the need to make page-by-page modifications.