CASCAD: a database of annotated candidate single nucleotide polymorphisms associated with expressed sequences
© Guryev et al; licensee BioMed Central Ltd. 2005
Received: 11 October 2004
Accepted: 27 January 2005
Published: 27 January 2005
With the recent progress made in large-scale genome sequencing projects, a vast amount of novel data is becoming available. Comparative sequence analysis, exploiting sequence information from various resources, can be used to uncover hidden information, such as genetic variation. Although enormous numbers of SNPs for a wide variety of organisms have been submitted to NCBI dbSNP and annotated in most genome assembly viewers, such as Ensembl and the UCSC Genome Browser, these platforms do not easily allow for extensive annotation and incorporation of experimental data supporting the polymorphism. Such information is, however, very important for selecting the most promising and useful candidate polymorphisms for use in experimental setups.
The CASCAD database is designed for presentation and query of candidate SNPs that are retrieved by in silico mining of high-throughput sequencing data. Currently, the database provides collections of laboratory rat (Rattus norvegicus) and zebrafish (Danio rerio) candidate SNPs. The database stores detailed information about raw data supporting the candidate, extensive annotation and links to external databases (e.g. GenBank, Ensembl, UniGene, and LocusLink), verification information, and predictions of a potential effect for non-synonymous polymorphisms in coding regions. The CASCAD website allows search based on an arbitrary combination of 27 different parameters related to characteristics like candidate SNP quality, genomic localization, and sequence data source or strain. In addition, the database can be queried with any custom nucleotide sequences of interest. The interface is crosslinked to other public databases and tightly coupled with primer design and local genome assembly interfaces in order to facilitate experimental verification of candidates.
The CASCAD database discloses detailed information on rat and zebrafish candidate SNPs, including the raw data underlying their discovery. An advanced web-based search interface (http://cascad.niob.knaw.nl) provides universal access to the database content and supports a variety of queries for many types of research that use single nucleotide polymorphisms.
Single nucleotide polymorphisms (SNPs) are the most common form of genetic variation within species. As a result, SNPs are now becoming the most popular type of marker in genetic association and mapping studies. SNPs are also most likely to be the molecular basis for the majority of phenotypic variation in (outbred) populations. In particular, SNPs in regulatory and protein-coding regions can have an effect on gene expression levels and protein activity, respectively. The phenotypic differences observed between selected (sub)strains in model organisms may be the result of specific (combinations of) naturally occurring polymorphisms. Hence, a comprehensive inventory of SNPs, including extensive annotation, will be extremely valuable in the search for functional polymorphisms.
There is often a vast unexplored potential in large sequence datasets that have been collected for other purposes, for example, EST and whole genome sequencing (WGS) projects. In an effort to address these two issues, we have developed an in silico candidate SNP mining pipeline that uses all publicly available sequence data for a specific organism, and designed a database, CASCAD (CAscad SNP CAndidates Database), that allows storage of a wide variety of primary source data, cross-annotation to other databases, and analysis parameters for SNPs associated with expressed sequences.
Construction and content
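Table: Input data (number of sequence reads) for the CASCAD pipeline and number of predicted candidate SNPs. Only fragments of the table body survive: the values 19,813,313 and 11,588,394 (row and column labels not recoverable) and the row label "Candidate SNPs predicted" (values not recoverable).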
In addition to primary sequence data analysis, the effect of all SNPs on protein coding capacity was evaluated, and non-synonymous SNPs were categorized into classes reflecting the severity of the polymorphism using a BLOSUM-based score. The predicted missense SNPs were analyzed with the SIFT and PolyPhen programs, which use not only substitution information but also phylogenetic conservation and protein structural information to predict a potential effect of the polymorphism on protein function.
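The coding-effect evaluation described above can be illustrated with a minimal sketch. It is not the CASCAD implementation: the function names, the severity class names, and the score thresholds are hypothetical, and only a handful of BLOSUM62 scores are included for illustration (the source says only that a "BLOSUM-based score" was used, not which matrix or cut-offs).

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# A few BLOSUM62 scores, for illustration only (assumption: the real
# pipeline would use a complete substitution matrix).
BLOSUM62_SUBSET = {
    frozenset("IV"): 3,
    frozenset("DE"): 2,
    frozenset("KR"): 2,
    frozenset("RC"): -3,
    frozenset("LD"): -4,
}


def classify_snp(codon, position, alt_base):
    """Classify a substitution at a 0-based position within a codon.

    Returns (effect, ref_aa, alt_aa), where effect is one of
    'synonymous', 'missense', 'nonsense', or 'stop-lost'.
    """
    codon = codon.upper()
    alt_codon = codon[:position] + alt_base.upper() + codon[position + 1:]
    ref_aa, alt_aa = CODON_TABLE[codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        effect = "synonymous"
    elif alt_aa == "*":
        effect = "nonsense"
    elif ref_aa == "*":
        effect = "stop-lost"
    else:
        effect = "missense"
    return effect, ref_aa, alt_aa


def severity_class(ref_aa, alt_aa, conservative_min=1, radical_max=-3):
    """Bin a missense change by substitution score (thresholds are hypothetical)."""
    score = BLOSUM62_SUBSET.get(frozenset((ref_aa, alt_aa)))
    if score is None:
        return "unscored"  # pair not present in the illustrative subset
    if score >= conservative_min:
        return "conservative"
    if score <= radical_max:
        return "radical"
    return "moderate"


effect, ref_aa, alt_aa = classify_snp("CGT", 0, "T")  # Arg -> Cys
print(effect, ref_aa, alt_aa, severity_class(ref_aa, alt_aa))
```

A substitution-matrix score captures only the general exchangeability of two residues, which is why tools that also consider conservation and structure are applied to the missense candidates afterwards.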
Utility and discussion
For many applications, it is important to be able to distinguish between SNP candidates by their characteristics, as these may be predictive of the verification success rate or carry biologically relevant information. Non-confirmed candidate polymorphisms may represent variants that are uncommon in a given population, but also sequencing errors (all types of sequences), RNA editing events, and reverse transcriptase errors (EST reads). To minimize the contribution of false positives, one can exclude polymorphisms that are supported by only a single read for either allele, as is common in many in silico discovery pipelines. Although this is a valid approach when selecting SNPs for population or association genetics, one could inadvertently discard many rare variants that may be associated with phenotypic variation, for example by affecting protein structure or function. Information on such polymorphisms can be very useful when mapping disease or QTL alleles.
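The read-support trade-off described above can be sketched as a simple filter. This is an illustrative example, not code from the CASCAD pipeline: the `Candidate` class, its fields, and the `min_reads_per_allele` parameter are assumed names.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical record for a candidate SNP and its supporting reads."""
    snp_id: str
    allele_reads: dict  # allele -> number of reads supporting it


def filter_by_read_support(candidates, min_reads_per_allele=2):
    """Keep candidates where every allele has at least the given read support.

    min_reads_per_allele=2 mimics the stringent setting (no allele may rest
    on a single read); min_reads_per_allele=1 keeps rare variants as well.
    """
    return [
        c for c in candidates
        if len(c.allele_reads) >= 2
        and min(c.allele_reads.values()) >= min_reads_per_allele
    ]


candidates = [
    Candidate("cand_1", {"A": 5, "G": 4}),  # both alleles well supported
    Candidate("cand_2", {"C": 7, "T": 1}),  # one allele seen in a single read
]

stringent = filter_by_read_support(candidates, min_reads_per_allele=2)
relaxed = filter_by_read_support(candidates, min_reads_per_allele=1)
print([c.snp_id for c in stringent])
print([c.snp_id for c in relaxed])
```

Exposing the threshold as a parameter, rather than fixing it at discovery time, lets the same candidate set serve both marker selection and searches for rare functional variants.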
We have developed our database to fulfill the needs of any particular SNP application by providing control over every parameter we used in the polymorphism discovery step.
Applications of the CASCAD database include queries for potentially deleterious SNPs in a specific genomic region of interest (for example, a QTL interval), design of SNP-based mapping panels using RFLP or any other technology, and identification of informative SNPs for fine-mapping. Custom sequences can be provided to search for known SNPs in any sequence of interest. In addition, the CASCAD pipeline can be used to build a candidate SNP database for any model organism of interest for which sufficient sequencing data are available.
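As a rough sketch of the first application, a query for candidate SNPs within a genomic interval can be assembled from an arbitrary combination of criteria. The table name `candidate_snp`, its columns, and the example rows below are invented for illustration and do not reflect the actual CASCAD schema; an in-memory SQLite database stands in for the MySQL backend.

```python
import sqlite3

# Hypothetical mapping from search criteria to SQL conditions.
CONDITIONS = {
    "chromosome": "chromosome = ?",
    "start": "position >= ?",
    "end": "position <= ?",
    "effect": "effect = ?",
    "min_allele_reads": "min_allele_reads >= ?",
}


def build_query(criteria):
    """Combine any subset of the supported criteria into one parameterized query."""
    clauses, params = [], []
    for key, value in criteria.items():
        if key not in CONDITIONS:
            raise ValueError(f"unsupported criterion: {key}")
        clauses.append(CONDITIONS[key])
        params.append(value)
    sql = "SELECT snp_id FROM candidate_snp"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql + " ORDER BY position", params


def search(conn, **criteria):
    """Run the query and return the matching candidate identifiers."""
    sql, params = build_query(criteria)
    return [row[0] for row in conn.execute(sql, params)]


# Toy data set (made-up identifiers and coordinates).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE candidate_snp (snp_id TEXT, chromosome TEXT, "
    "position INTEGER, effect TEXT, min_allele_reads INTEGER)"
)
conn.executemany(
    "INSERT INTO candidate_snp VALUES (?, ?, ?, ?, ?)",
    [
        ("snp1", "10", 1500, "missense", 3),
        ("snp2", "10", 2500, "synonymous", 5),
        ("snp3", "10", 3500, "nonsense", 1),
        ("snp4", "2", 2000, "missense", 4),
        ("snp5", "10", 4000, "missense", 2),
    ],
)

# Missense candidates inside a hypothetical QTL interval on chromosome 10.
print(search(conn, chromosome="10", start=1000, end=5000, effect="missense"))
```

Values are passed as bound parameters rather than interpolated into the SQL string, so user-supplied search terms cannot alter the query structure.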
The main purpose of the CASCAD database is to provide flexible access to candidate single nucleotide polymorphisms that were predicted using a computational approach from publicly available sequence data of the rat and zebrafish. The resulting database is crosslinked to the most common public databases and can be queried for SNPs using accession numbers, sequence context, and SNP characteristics, as well as parameters specific to the SNP discovery process, allowing stringent or relaxed conditions suitable for different types of applications.
Availability and requirements
The database is freely accessible through the website http://cascad.niob.knaw.nl. Programs, scripts, MySQL database dumps, and instructions for setting up a species-specific SNP database can be obtained from the authors upon request.
This work was supported by the Dutch Ministry of Economic Affairs through the Innovation Oriented Research Program on Genomics, grant #IGE010017.
- Guryev V, Berezikov E, Malik R, Plasterk RHA, Cuppen E: Single nucleotide polymorphisms associated with rat expressed sequences. Genome Res. 2004, 14: 1438-1443. 10.1101/gr.2154304.
- Primer design interface. [http://primers.niob.knaw.nl]
- Berezikov E, Plasterk RHA, Cuppen E: GENOTRACE: cDNA-based local GENOme assembly from TRACE archives. Bioinformatics. 2002, 18: 1396-1397. 10.1093/bioinformatics/18.10.1396.
- Ng PC, Henikoff S: SIFT: Predicting amino acid changes that affect protein function. Nucleic Acids Res. 2003, 31: 3812-3814. 10.1093/nar/gkg509.
- Ramensky V, Bork P, Sunyaev S: Human non-synonymous SNPs: Server and survey. Nucleic Acids Res. 2002, 30: 3894-3900. 10.1093/nar/gkf493.
- Picoult-Newberg L, Ideker TE, Pohl MG, Taylor SL, Donaldson MA, Nickerson DA, Boyce-Jacino MB: Mining SNPs from EST databases. Genome Res. 1999, 9: 167-174.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.