Marine Genomics: A clearing-house for genomic and transcriptomic data of marine organisms
© McKillen et al; licensee BioMed Central Ltd. 2005
Received: 19 August 2004
Accepted: 10 March 2005
Published: 10 March 2005
The Marine Genomics project is a functional genomics initiative developed to provide a pipeline for the curation of Expressed Sequence Tags (ESTs) and gene expression microarray data for marine organisms. It provides a unique clearing-house for marine-specific EST and microarray data and is currently available at http://www.marinegenomics.org.
The Marine Genomics pipeline automates the processing, maintenance, storage and analysis of EST and microarray data for an increasing number of marine species. It currently contains 19 species databases (over 46,000 EST sequences) that are maintained by registered users from local and remote locations in Europe and South America in addition to the USA. A collection of analysis tools is implemented. These include a pipeline upload tool for EST FASTA files, sequence trace files and microarray data, an annotative text search, automated sequence trimming, sequence quality control (QA/QC) editing, sequence BLAST capabilities and a tool for interactive submission to GenBank. Another feature of this resource is the integration with a scientific computing analysis environment implemented in MATLAB.
The conglomeration of multiple marine organisms with integrated analysis tools enables users to focus on comprehensive descriptions of transcriptomic responses to typical marine stresses. This cross-species data comparison and integration enables users to conduct their research within a marine-oriented data management and analysis environment.
Large collections of ESTs enable the assembly of nucleotide sequence contigs and, if the genomic sequence is available, provide a means of mapping the species genome, ultimately assisting in gene and pathway discovery. Multivariate statistical analysis of these collections is used for microarray design. The main feature of the Marine Genomics project is that all data is accessible to the public (non-curator viewers) as a curated clearing-house for genomic and transcriptomic data of marine organisms. Accordingly, Marine Genomics includes a tool for automated EST submission to NCBI's GenBank to assist in integrating data and annotation results with a wider public resource. One of the primary goals of the clearing-house is to ensure the successful submission to NCBI of all processed and curated data contained in the Marine Genomics databases. New species databases are welcomed in order to build a comprehensive marine repository. The conglomeration of multiple marine organisms with integrated analysis tools enables users to focus on comprehensive descriptions of transcriptomic responses to typical marine stresses such as water pollution and algal blooms, effects of climate change such as altered pH and increased carbon dioxide levels [1, 2], as well as localized phenomena such as coral bleaching [3] and viral infections (crustacean and fisheries diseases [4]).
Construction and content
The Marine Genomics pipeline is a web-based software environment. The open-source scripting language PHP, version 4.2.2, is used at the front end for user interface development and runs on Apache web server 1.3.33. MATLAB version 6.5 release 13 (http://www.mathworks.com/products/matlab/), a scientific and engineering computational language, is employed for statistical data analysis and some of the more intensive computational processes. The open-source database application PostgreSQL, version 7.2.3 (http://www.postgresql.org), is used for all data storage. The entire system is developed and run on servers configured with the open-source operating system Red Hat Linux version 7.3 (http://www.redhat.com).
Site divisions and support functions
The site is functionally divided into two main upload functions, one for EST data and one for microarray data. These upload tools enable users to add EST and microarray data to their species PostgreSQL database, where the data are accessed and manipulated by various processing tools. The microarray upload also allows users to warehouse MIAME- and MAGE-compliant microarray data. Other site support functions include an EST annotative text search, a stand-alone BLAST [5] function, and user authentication and curation control.
Marine Genomics currently contains 19 different marine species databases. Species currently included are: Anas platyrhynchos (mallard), Crassostrea gigas (Pacific oyster), Callinectes sapidus (blue crab), Crassostrea virginica (eastern oyster), Eubalaena glacialis (Northern Atlantic right whale), Fundulus species (killifish), Homarus americanus (American Atlantic lobster), Karenia brevis (red tide algae), Leucoraja erinacea (little skate), Litopenaeus setiferus (white shrimp), Litopenaeus stylirostris (blue shrimp), Litopenaeus vannamei (white shrimp), Montastraea annularis (lobed star coral), Oculina varicosa (stony coral), Porites porites (clubbed finger coral), Palaemonetes pugio (daggerblade grass shrimp), Squalus acanthias (spiny dogfish) and Tursiops truncatus (bottlenose dolphin). Each of these species databases undergoes a cross-BLAST for sequence similarities. New databases are added regularly upon user request.
In order to maintain the most current BLAST results, both against GenBank and against the internal Marine Genomics databases, all databases are re-BLASTed on a quarterly basis.
EST pipeline implementation
Users can reference the Marine Genomics process flow guide on the homepage of the website for an overview of the Marine Genomics EST and microarray processes. Currently the pipeline accepts both FASTA and text sequence files as well as electropherogram trace files. The user interface allows the user to upload sequences both as zipped batches and as individual uploads. The phred and phd2fasta programs [6, 7] are used to convert trace files into a readable text format. The files are then stored for back-up. Once submitted to the pipeline, each sequence undergoes a number of QA/QC procedures and subsequently becomes available for curation and user-initiated submission to NCBI's GenBank.
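The trace-to-FASTA conversion step can be sketched as follows. This is an illustrative Python sketch, not the published pipeline (which is implemented in PHP and MATLAB); the directory and file names are hypothetical, while the phred/phd2fasta flags follow those tools' standard documentation.

```python
def trace_to_fasta_commands(trace_dir, phd_dir, fasta_out, qual_out):
    """Build the two shell commands that convert electropherogram trace
    files into FASTA sequence and quality files via phred and phd2fasta.

    phred reads chromatograms from trace_dir and writes one .phd.1 file
    per trace into phd_dir; phd2fasta then collects those PHD files into
    a single FASTA sequence file and a matching quality file.
    """
    phred_cmd = ["phred", "-id", trace_dir, "-pd", phd_dir]
    phd2fasta_cmd = ["phd2fasta", "-id", phd_dir, "-os", fasta_out, "-oq", qual_out]
    return phred_cmd, phd2fasta_cmd
```

In a real pipeline these command lists would be passed to a process runner (e.g. `subprocess.run`) with the trace batch unpacked into `trace_dir` beforehand.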
EST quality control and sequence processing (QA/QC)
Cross-match is employed to mask vector content in the uploaded sequence files; the masked vector is then automatically removed by a Marine Genomics trimming tool.
The collars (user-specified regions of the vector adapters) chosen by the user at file upload are used for a final vector screening, in an attempt to ensure all vector sequence is removed before submission to the species databases. This allows the user some control in specifying the end of the vector sequence and thus adds an extra layer of vector screening and removal.
Additional automated QA/QC steps include:
- Poly-A tail removal.
- Size control: sequences shorter than 50 nucleotide bases are flagged.
- N-content control: sequences with an N-content of greater than 3 bases in 10 are flagged.
- Non-DNA content control: non-DNA sequences are flagged to prevent any possible file upload contamination.
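The flagging rules above can be sketched in a few lines. This is a hedged illustration, not the production code: the poly-A strip here is a simple terminal-run removal, and the "3 Ns in 10" rule is interpreted as any 10-base window (the article does not specify windowed versus overall N-content).

```python
import re

def qc_flags(seq, min_len=50):
    """Apply the QA/QC checks described in the text and return
    (trimmed_sequence, list_of_flags). Thresholds follow the article:
    <50 bases -> too-short, >3 Ns per 10 bases -> high-N, non-DNA
    characters -> non-DNA."""
    seq = seq.upper()
    flags = []
    # non-DNA content control: anything outside A/C/G/T/N is suspect
    if re.search(r"[^ACGTN]", seq):
        flags.append("non-DNA")
    # poly-A tail removal (illustrative: strip a long terminal A run)
    trimmed = re.sub(r"A{8,}$", "", seq)
    # size control
    if len(trimmed) < min_len:
        flags.append("too-short")
    # N-content control: flag if any 10-base window holds more than 3 Ns
    for i in range(max(1, len(trimmed) - 9)):
        if trimmed[i:i + 10].count("N") > 3:
            flags.append("high-N")
            break
    return trimmed, flags
```

A flagged sequence would be held back from GenBank submission until a curator reviews or edits it, as described below.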
EST curation and submission to NCBI
Sequence viewing: Public viewers have access to the sequence, BLAST results and processing details, such as when the sequence was uploaded and last modified, as well as the tissue of origin and NCBI accession number.
Sequence curation: Curators can review/edit trimmed ESTs as well as delete and submit the individual sequence to GenBank.
Automated BLAST: Sequences automatically undergo a BLAST [5] search against NCBI's GenBank. GenBank databases currently used include the non-redundant protein database of CDS translations (nr) with BLASTx, the non-redundant nucleotide database (nt) with BLASTn, and the EST database (dbEST) with BLASTn. The sequences also undergo BLAST against all local Marine Genomics databases.
Interactive sequence submission to GenBank: Once each curator-uploaded sequence batch has been appropriately run through the QA/QC process, the curator receives an automated email allowing them to submit their ESTs to GenBank's dbEST database. The email also lists which sequences will not be included in the submission due to problem-flagging. The curator has the option of reviewing sequences that may have been left out of the submission process, such that they can be edited appropriately and then later submitted.
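The three automated GenBank searches described above could be driven as in the sketch below, which builds legacy NCBI `blastall` command lines. This is an assumption-laden illustration: the article does not publish the pipeline's actual invocation details, and the output file naming is hypothetical; only the program/database pairings come from the text.

```python
# (program, database) pairs from the article: BLASTx vs nr, BLASTn vs nt,
# BLASTn vs dbEST (preformatted NCBI database name assumed to be "est").
SEARCHES = [
    ("blastx", "nr"),
    ("blastn", "nt"),
    ("blastn", "est"),
]

def blast_commands(query_fasta, out_prefix):
    """Build one legacy blastall command line per GenBank search."""
    cmds = []
    for program, db in SEARCHES:
        cmds.append([
            "blastall", "-p", program, "-d", db,
            "-i", query_fasta,
            "-o", f"{out_prefix}.{program}.{db}.txt",
        ])
    return cmds
```

Each command would be executed per uploaded batch, with the result files parsed for top hits to display alongside the sequence record.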
Contigs are assembled using the CAP3 program [8] after ESTs have been curated, enabling users to determine unique transcripts/genes. Curators can currently request the addition of this functionality for their particular species, after which the contig data is made accessible on the web.
Sequences are annotated with Gene Ontology terms by searching them against the Gene Ontology database [9] using BLAST [5]. Species-specific Gene Ontology (GO) term distributions are currently reported in a pie chart on each species entry page and within each EST sequence page. A cross-species GO data summary is also reported alongside the listing of all current Marine Genomics species. The addition of GO annotation is ongoing.
Microarray pipeline implementation
The microarray pipeline is a more recent addition to Marine Genomics than the EST pipeline and is still under development, particularly in the build-up of analysis tools. The MATLAB statistics and bioinformatics libraries are of critical importance for the fast development and deployment of advanced analysis procedures. Accordingly, a suite of multivariate statistical analysis tools, developed in MATLAB, is being employed to assist in microarray design. An optimal cDNA microarray probe selection algorithm combining different clustering methods and contig information was implemented to assist microarray design [10]. The procedure can also be used for multiple-species microarray design, which critically benefits from the marine-specific nature of the EST databases maintained. EST probe selection for microarray design is available upon request, and the clustering and probe selection output is available on each species entry page of the website.
A MIAME-compliant [11] Excel template is provided for user download (http://www.marinegenomics.org/~mckilldj/MIAME.xls) to ensure the data remains compliant. This file can be filled in and exported from Microsoft Excel as a tab-delimited MIAME data file for upload alongside the corresponding MAGE data file. Currently Marine Genomics accepts MAGE data for storage and analysis from in-house microarray experiments run at the Hollings Marine Laboratory in Charleston, South Carolina.
Microarray data upload and warehousing of experimental results data
Data upload: The microarray pipeline accepts a MIAME-compliant text file containing experiment-specific information (lab notes, etc.). It also accepts a text data file output from the microarray laser scanner (containing specific spot locations and intensities), which is parsed into MAGE-compliant data for warehousing.
Data warehousing: The uploaded tab-delimited files are parsed and stored in a relational PostgreSQL database. Currently warehoused microarray data is accessible to the public through specific species links.
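The parse-and-store step can be sketched as below. The column names used here are hypothetical (actual scanner output formats vary, and the production parser targets MAGE-compliant fields in PHP); the sketch only shows the general shape of turning a tab-delimited scanner file into typed records ready for relational insertion.

```python
import csv
import io

def parse_scanner_file(text):
    """Parse a tab-delimited scanner output into a list of spot records.
    Column names ("Block", "Row", "Column", "Ch1 Intensity",
    "Ch2 Intensity") are illustrative assumptions, not a published schema."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    rows = []
    for row in reader:
        rows.append({
            "block": int(row["Block"]),
            "row": int(row["Row"]),
            "col": int(row["Column"]),
            "ch1": float(row["Ch1 Intensity"]),
            "ch2": float(row["Ch2 Intensity"]),
        })
    return rows
```

Each record would then map onto one row of a spot-intensity table keyed by experiment, with the MIAME file populating a companion experiment-description table.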
Data download: Microarray data can be accessed and downloaded as web pages or as Excel-compatible files. Marine Genomics also includes a MATLAB-centric microarray access feature consisting of an m-file, mgma_get.m. This function will automatically list all available microarray data and download complete microarray content directly into the user's MATLAB workspace. This makes an entire microarray dataset, or a selection of datasets, available in a MATLAB Bioinformatics Toolbox-specific format for analysis by the functions within that toolbox (e.g. maimage(mgma_get(15), 'Ch1 Intensity')). This m-file is made freely available at http://www.marinegenomics.org/mgma_get.m.
The Marine Genomics infrastructure was developed as a clearing-house of functional genomics data for marine organisms. It currently includes tools to upload, preprocess, cross-reference, annotate, store and submit EST data to NCBI. It also includes corresponding microarray design, upload and storage tools, with development of analysis tools underway.
Usage of the Marine Genomics infrastructure has been increasing rapidly, with more species databases being added and the number of sequences growing even faster. Furthermore, Marine Genomics integrates the microarray entries with the MATLAB environment, for which there are several commercial and public libraries ("toolboxes") for microarray analysis.
Current development goals include the continued addition of ontology and contig information for all species. Another goal is the ability to parse and process multiple microarray platforms, so that users have the flexibility of uploading data output from their own microarray platforms. Finally, special care will be given to speedily exporting expression data as MAGE-XML for incorporation into NCBI's GEO databases, rather than having that data exclusively retained in Marine Genomics.
The ultimate purpose of Marine Genomics is to assist in submitting quality data to the NCBI GenBank and GEO databases. To that end, the Marine Genomics pipeline and tools have been assembled to provide a medium for functional genomics work in a marine biology environment.
Availability and contacts
This work was partially supported by the National Science Foundation (EPS0083102 & MCB0315393), the South Carolina Sea Grant Consortium (R/MT-6), and the South Carolina Department of Natural Resources. This is publication #20 from the Marine Biomedicine and Environmental Sciences at the Medical University of South Carolina. YAC acknowledges the support of the South Carolina Sea Grant (NA16RG2250).
1. Sabine CL, Feely RA, Gruber N, Key RM, Lee K, Bullister JL, Wanninkhof R, Wong CS, Wallace DWR, Tilbrook B, Millero FJ, Peng TH, Kozyr A, Ono T, Rios AF: The Oceanic Sink for Anthropogenic CO2. Science 2004, 305:367-371. doi:10.1126/science.1097403.
2. Takahashi T: Carbon Dioxide. Science 2004, 305:352-353. doi:10.1126/science.1100602.
3. Brown B: Adaptations of Reef Corals to Physical Environmental Stress. Advances in Marine Biology 1997, 31:121-277.
4. Chapman RW, Browdy CL, Salvin S, Prior S, Wenner E: Sampling and evaluation of white spot syndrome virus in commercially important Atlantic penaeid shrimp stocks. Diseases of Aquatic Organisms 2004, 59:179-185.
5. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 1997, 25:3389-3402. doi:10.1093/nar/25.17.3389.
6. Ewing B, Green P: Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Research 1998, 8:186-194.
7. Ewing B, Hillier L, Wendl MC, Green P: Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Research 1998, 8:175-185.
8. Huang X, Madan A: CAP3: a DNA sequence assembly program. Genome Research 1999, 9:868-877. doi:10.1101/gr.9.9.868.
9. Gene Ontology Consortium: The Gene Ontology (GO) database and informatics resource. Nucleic Acids Research 2004, 32:258-261. doi:10.1093/nar/gkh036.
10. Chen YA, McKillen DJ, Wu S, Jenny MJ, Chapman R, Gross PS, Warr GW, Almeida JS: Optimal cDNA microarray design using expressed sequence tags for organisms with limited genomic information. BMC Bioinformatics 2004, 5:191-203. doi:10.1186/1471-2105-5-191.
11. Brazma A, Hingamp P, Quackenbush J, Sherlock G, Spellman P, Stoeckert C, Aach J, Ansorge W, Ball CA, Causton HC, Gaasterland T, Glenisson P, Holstege FCP, Kim IF, Markowitz V, Matese JC, Parkinson H, Robinson A, Sarkans U, Schulze-Kremer S, Stewart J, Taylor R, Vilo J, Vingron M: Minimum information about a microarray experiment (MIAME) – towards standards for microarray data. Nature Genetics 2001, 29:365-371. doi:10.1038/ng1201-365.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.