UASIS: Universal Automatic SNP Identification System
© Poo et al; licensee BioMed Central Ltd. 2011
Published: 30 November 2011
SNPs (Single Nucleotide Polymorphisms), the most common genetic variations between human beings, are believed to be a promising route towards personalized medicine. As more and more research on SNPs is conducted, non-standard nomenclatures may cause problems. The most serious issue is that researchers cannot cross-reference among different SNP databases, so more resources and time are required to track SNPs. This could be detrimental to the entire academic community.
UASIS (Universal Automated SNP Identification System) is a web-based server for SNP nomenclature standardization and translation at the DNA level. Three utilities are available: UASIS Aligner, Universal SNP Name Generator and SNP Name Mapper. UASIS efficiently maps SNPs from different databases, including dbSNP, GWAS, HapMap and JSNP, into a uniform view using a proposed universal nomenclature and state-of-the-art alignment algorithms. UASIS is freely available at http://www.uasis.tk with no log-in required.
UASIS is a helpful platform for SNP cross-referencing and tracking. By providing an informative, unique and unambiguous nomenclature, which uses the unique position of a SNP, we aim to resolve the ambiguity of the SNP nomenclatures currently practised. Our universal nomenclature is a good complement to mainstream SNP notations such as rs# and the HGVS guidelines. UASIS acts as a bridge connecting heterogeneous representations of SNPs.
A SNP, or Single Nucleotide Polymorphism, is defined as a bi-allelic polymorphism at a single base with a frequency of more than 1% in the population [1, 2]. Around 90% of genome variations are SNPs, which have proven to be of great value for medical diagnostics and the development of pharmaceutical products. They can also help identify multiple genes associated with complex diseases such as cancer and diabetes [4–6].
Table 1: Alternative names of a SNP
Table 2: Non-conventional names of a SNP
proteinB, Gly564Val; proteinB, Bly544Val; 0014 FH NAPLES
There are many reasons for the existence of differing nomenclatures. Although the Human Genome Variation Society (HGVS) has recommended widely used guidelines for mutation notation, researchers in each laboratory have a strong emotional attachment to their own naming system. Research articles that first report novel SNPs do not always follow the HGVS guidelines, and the final genomic sequence is compiled over many separate entries. Previous nomenclatures sometimes persist for historical reasons. For example, rs28942082 is still recorded as "FH NAPLES" or "Bly544Val" in OMIM (see Table 2).
Unambiguous and correct descriptions of SNPs in databases and in the literature are of utmost importance, not least because mistakes and uncertainties may lead to errors in clinical diagnosis. HGVS nomenclature guidelines were proposed as early as 1998 and extended later on [17, 18]. The guidelines have since been improved regularly (http://www.hgvs.org/mutnomen/). However, the mere existence of the guidelines is not sufficient. The standardization of SNP identification is far from complete [11, 12, 14].
It is clear that dbSNP is becoming a major center for deposition of SNPs from various sources. The SNP nomenclature of dbSNP, rs#, is unique, clear and stable. It has been widely adopted and heavily referenced in the literature. JSNP, GWAS, HapMap and PharmGKB provide corresponding rs# when displaying their own records. We highly respect its authority.
It is noted that the overlap of SNPs among recognized databases is very low (around 1%). JSNP reported only 20.9% identity compared to dbSNP. Researchers have to submit their SNPs to dbSNP before they can get an rs#. However, some SNPs discovered in a research or diagnostic laboratory may never even be reported in any publication or database. Some SNPs face considerable delays in their public release due to commercial agreements, legal considerations or ethical reasons [6, 19]. They are unlikely to be assigned identifiers that can be used uniformly later on. Even within dbSNP itself, many rs#'s have been abandoned due to regular clustering. These identifiers may have been cited in publications, leading to confusion and ambiguity.
Another candidate is the HGVS mutation nomenclature guidelines, which are largely adopted by researchers and enforced by some journals. The format is "<Accession Number>.<version number>(<Gene symbol>):<sequence type>.<mutation>". However, it is not universally applied as a standard, since it is complex and not unique. Table 1 gives five alternative names that are legal for a SNP, where the coordinate systems are based on different reference sequences. Because the mutation position is defined relative to a reference sequence, and reference sequences evolve with each new version, the names are unstable. More effort is thus required to translate data in published papers and databases between different versions of reference sequences [21, 22]. Finally, the names may be too long and complex to remember and communicate.
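To make the structure of the HGVS format quoted above concrete, the following minimal sketch splits a name into its components with a regular expression. The function name `parse_hgvs` and the sample name `NM_000000.1(GENE1):c.100G>A` are made up for illustration and do not refer to a real record; the pattern covers only the simple layout quoted above, not the full HGVS recommendations.

```python
import re

# Pattern for "<accession>.<version>(<gene>):<sequence type>.<mutation>".
# This is an illustrative simplification, not a complete HGVS grammar.
HGVS_PATTERN = re.compile(
    r"^(?P<accession>[A-Za-z]+_?\d+)"  # accession number, e.g. NM_000000
    r"\.(?P<version>\d+)"              # version number of the reference sequence
    r"\((?P<gene>[^)]+)\)"             # gene symbol in parentheses
    r":(?P<seq_type>[a-z])"            # sequence type, e.g. g, c or p
    r"\.(?P<mutation>.+)$"             # mutation description
)


def parse_hgvs(name):
    """Split an HGVS-style name into its fields (hypothetical helper)."""
    match = HGVS_PATTERN.match(name.strip())
    if match is None:
        raise ValueError("not in the expected HGVS-style layout: %r" % name)
    return match.groupdict()


# Placeholder name, invented for this example only.
parts = parse_hgvs("NM_000000.1(GENE1):c.100G>A")
print(parts["accession"], parts["version"], parts["gene"], parts["mutation"])
```

Because the accession and its version are part of the name, the same variant described against two reference sequences, or two versions of one sequence, produces different strings. This is the non-uniqueness and instability discussed above.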
Current SNP nomenclatures, including rs#, are mostly arbitrary combinations of letters and digits maintained by manual curation. The major problem is that they are not informative and are only valid within a single database. Automatic ways of mapping SNPs based on their names are rare. One way is to search the available databases separately and then compare the retrieved records manually. For example, given only SNP names, we are unable to answer simple questions such as: What SNPs have been discovered on gene CHR1 (chromosome 5, locus 26648951..26653073)? or What diseases have been found to be closely associated with rs28942082? HGVS nomenclature is searchable and informative, but suffers from complexity and non-uniqueness.
With different nomenclatures, it is difficult to cross-reference SNPs among the various databases. Research based on data from only one SNP database will lead to an incomplete compilation of variants and inadequate genomic analysis. For researchers who track SNPs through literature scanning, it is very difficult to gain a global picture from the overwhelming number of publications, since SNPs are not uniformly searchable in the literature. It is also not possible to search by position or polymorphism information. Tracking SNPs therefore becomes a tough data mining challenge that consumes considerable resources and time. From the discussion above, we believe that the existing SNP nomenclatures do not provide a universal standard.
Tremendous efforts have been made to keep SNP data uniform. Besides the continuous development of the HGVS nomenclature guidelines, SNP databases are integrating data from more sources.
GWAS, previously HGVbaseG2P, is one of the largest SNP databases [23, 24]. It gathers information on SNPs from the literature, from its own and collaborative discovery efforts, and from unsolicited submissions. It exchanges core data with dbSNP regularly. The pharmacogenomics knowledge base (PharmGKB) allows cross-referencing against dbSNP, JSNP and HapMap, as well as other sources such as the UCSC Genome Browser.
Some applications focus on retrieving SNPs that fulfil certain criteria such as locus and haplotype tagging. SNPper is a web-based platform to search and export SNP records from dbSNP. TAMAL (Technology And Money Are Limiting) provides a query portal to the latest versions of five SNP sources (HapMap, Perlegen, Affymetrix, dbSNP and the UCSC genome browser). It helps to select SNPs that are likely to be involved in the genetic determination of human complex traits. LS-SNP annotates the coding non-synonymous SNPs (nsSNPs) in dbSNP that result in a mutation in the protein. Other works place emphasis on intragenic SNPs.
Among previous works, the Mutalyzer sequence variation nomenclature checker and SNP-Converter are similar to the work described here. These two applications aim to support the HGVS nomenclature guidelines. Mutalyzer checks whether a SNP name follows the HGVS guidelines. Furthermore, it is capable of generating legal identifiers given the pivotal features of a SNP. SNP-Converter converts arbitrary SNP names into HGVS names by exploring certain gene databases to determine the correct locus. It treats the integration process as a knowledge mining task. SNP-Converter is based on a complete SNP notation in XML format, acting as an ontology, to create a uniform semantic environment [3, 30].
From the discussion above, it is clear that dbSNP is an important database that cannot be ignored by any application. However, it takes considerable effort to translate nomenclatures among the SNP databases. To overcome the shortcomings of the rs# and HGVS nomenclatures, we propose a universal nomenclature and UASIS (Universal Automated SNP Identification System). We believe our nomenclature is a good complement to rs# and HGVS, acting as a bridge connecting various databases, including private and unpublished ones. A system of nomenclature has to strike a compromise between the convenience and simplicity required for everyday use and the need for adequate definition of the concepts involved. In 2006, the Human Variome Project Meeting gathered leading representatives to discuss key problems in the field of human gene variation. The meeting gave 96 recommendations. Two of them, regarding "Nomenclatures and Standards", are:
4. Develop tools to accurately translate and search earlier nomenclature systems into successor systems.
6. The most current genome build be unambiguously adapted as the reference sequence, and that a standard be developed for the submission of all variant data that includes both a genome coordinate as well as sufficient flanking sequence to map the variation independently.
Universal SNP nomenclature
Reference genome: HG(numeric version). Complete human reference genome by UCSC; '19' is the version number.
Chromosome: 1..22, X, Y
Nucleotide: A, C, G, T, N (N for an unclear nucleotide)
Substitution: alleles are 'G' and 'A'
Insertion: 'A' is inserted
Deletion: 'T' is deleted
Compared to the HGVS guidelines, we fix the coordinate system to the whole human genome, and we give only one position without "_", since we consider only single bi-allelic mutations. The first advantage is that it allows for succinct comparison using the accession numbers. The nomenclature is based on the human reference genome and not on arbitrary reference sequences, which results in unique identifiers. Currently, all SNPs are given the same prefix "HG19". Secondly, the name is unambiguous, informative and stable, since it contains all the information needed to define a SNP uniquely. More importantly, UASIS nomenclature gives names that are searchable and comparable. If universally adopted, it would help SNP tracking in the literature.
Another difference is the representation of mutations. The HGVS guidelines use a ">" symbol to mean "changed to". Here we simply list all possible alleles delimited by a "/". "A/T" means that the major allele could be either "A" or "T". Normally the first allele is the one on the reference genome. This definition is chosen for simplicity: determining the frequency of alleles requires more effort in the laboratory, and the results may differ between populations or laboratory tests. For SNPs which have more than two alleles, the ">" symbol loses its clarity, leading to ambiguity. The "/" syntax is also used by other browser viewers. We nevertheless recommend that the leftmost allele be the major allele.
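As a rough sketch of how a position-based name of this kind could be composed automatically, the function below joins the components described above: a genome-build prefix, a chromosome, a single position and a "/"-delimited allele list. The function name `make_universal_name` and the colon-separated layout of the output string are assumptions made for illustration, not the exact UASIS syntax. Only substitutions are handled, and the sample coordinate is arbitrary.

```python
VALID_CHROMOSOMES = {str(n) for n in range(1, 23)} | {"X", "Y"}
VALID_NUCLEOTIDES = set("ACGTN")  # N stands for an unclear nucleotide


def make_universal_name(build, chromosome, position, alleles):
    """Compose a position-based SNP name (hypothetical layout, substitutions only)."""
    chromosome = str(chromosome).upper()
    if chromosome not in VALID_CHROMOSOMES:
        raise ValueError("unknown chromosome: %r" % chromosome)
    position = int(position)
    if position < 1:
        raise ValueError("position must be a positive genome coordinate")
    alleles = [str(a).upper() for a in alleles]
    if len(alleles) < 2 or any(a not in VALID_NUCLEOTIDES for a in alleles):
        raise ValueError("need at least two single-nucleotide alleles")
    # Assumed separator layout: HG<build>:chr<chromosome>:<position>:<alleles>
    return "HG%s:chr%s:%d:%s" % (build, chromosome, position, "/".join(alleles))


# Arbitrary coordinate and alleles, chosen only to show the output shape.
name = make_universal_name(19, 5, 26650000, ["G", "A"])
print(name)  # HG19:chr5:26650000:G/A
```

Since every field comes from the genome coordinate and the observed alleles, two laboratories describing the same substitution against the same build would obtain the same string without consulting any database.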
The most important advantage of UASIS nomenclature is that, unlike rs#, it does not depend on any particular database. A SNP can be named automatically, regardless of the database maintaining it or the contig the SNP is derived from. Researchers do not need to submit to a particular database to get identifiers. Using UASIS, they get names instantaneously without waiting for manual approval. Although dbSNP designates an ss# once a SNP is submitted, the ss# suffers from problems similar to those of rs#. For private SNPs that cannot be published for various reasons, UASIS nomenclature is a better choice.
UASIS nomenclature is not intended to replace the rs#, since rs# already has significant influence on SNP nomenclatures and rs#'s are simple, unique and stable. Rather, UASIS nomenclature is a good complement to rs#, playing a role similar to that of ss#. We believe that it offers more than ss# and that it will benefit the whole process of SNP standardization. One disadvantage of our notation is that it depends on the human reference genome. That is an unavoidable trade-off given the benefits of our universal nomenclature. However, HG19 is considered "finished" by the Genome Reference Consortium, and we expect the human genome to be updated much less frequently in future.
UASIS is a web-based server system (http://www.uasis.tk) for annotating novel SNPs and cross-referencing among databases instantaneously. Utility tools are available, including UASIS Aligner and Universal SNP Name Generator. For newly discovered SNPs, UASIS Aligner performs efficient sequence alignment and checks whether the polymorphism has been deposited in the main databases, including GWAS, dbSNP, JSNP and HapMap. In addition, for each mutation, UASIS provides an identifier based on our proposed nomenclature as described above. These identifiers can be used immediately. In this way, researchers are free to map SNPs among various nomenclatures. More databases, such as PharmGKB, are currently being integrated into UASIS. Universal SNP Name Generator and SNP Name Mapper take in information about a SNP and perform cross-checking among the main databases.
UASIS has been available at http://www.uasis.tk since August 2010. It is implemented in PHP and MySQL, and designed for various types of web browser. Detailed information on the use of UASIS is provided online at the website.
Efficiency and accuracy are critical for real-time systems like UASIS. Among current alignment tools, Bowtie and BWA perform best. Mah et al. conducted rigorous experiments comparing the popular alignment tools MAQ, SOAP2, BWA and Bowtie, with BLAST results as the benchmark. The results showed that MAQ could not handle reads longer than 76bp and that SOAP2 was memory inefficient. Bowtie and BWA are able to align thousands of sequences every second. Both tools are built on the Burrows-Wheeler Transform (BWT) and the FM-index. Bowtie is optimized for short reads of around 35 base pairs, which is the output read length of the NGS (Next Generation Sequencing) platforms Illumina Solexa and SOLiD. It supports up to 3 mismatches by enumerating all possible permutations. This strategy makes it ultra fast, but it does not support gapped alignment. BWA employs roughly the same idea but implements gapped alignment.
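For readers unfamiliar with the BWT and the FM-index mentioned above, the toy sketch below shows the core idea of exact-match backward search. It is not the implementation used by Bowtie or BWA: it builds the transform naively from sorted rotations, recounts characters on every step instead of using precomputed occurrence tables, and handles neither mismatches nor gaps. The function names are chosen for this example only.

```python
def bwt(text):
    """Burrows-Wheeler Transform built from sorted rotations (illustration only)."""
    text += "$"  # unique terminator that sorts before A, C, G, T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_string, pattern):
    """Count exact occurrences of pattern using FM-index style backward search."""
    # first[c]: number of characters in the text that sort strictly before c.
    first, total = {}, 0
    for c in sorted(set(bwt_string)):
        first[c] = total
        total += bwt_string.count(c)

    def occ(c, i):
        # Number of occurrences of c in bwt_string[:i] (naive recount).
        return bwt_string[:i].count(c)

    lo, hi = 0, len(bwt_string)  # current range of sorted suffixes
    for c in reversed(pattern):  # extend the match one character to the left
        if c not in first:
            return 0
        lo = first[c] + occ(c, lo)
        hi = first[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


transformed = bwt("GATTACA")
print(transformed)                           # ACTGA$TA
print(count_occurrences(transformed, "TA"))  # 1
print(count_occurrences(transformed, "A"))   # 3
```

Each pattern character narrows a range in the sorted suffixes, so the number of search steps depends on the read length rather than on the genome size.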
NGS techniques are producing longer and longer reads, for instance 454 (around 400bp) and Illumina (a few hundred base pairs). Bowtie and BWA are sufficient to perform long-read alignment if there are just a few mismatches. Bowtie supports queries up to 1024bp. Mah et al. have shown that BWA and Bowtie achieve high accuracy and efficiency for reads up to 1024bp. UASIS therefore works well for NGS sequences. In this study, we conducted experiments on reads up to 512bp. For reads longer than 1024bp, we will explore more specific alignment tools in future, such as BWA-SW.
Query sequences are uploaded and aligned to the human reference genome by executing Bowtie or BWA. UASIS then checks whether the query SNP exists in dbSNP, GWAS, JSNP or HapMap by inspecting the allele position. UASIS is very responsive since the alignment tools are efficient.
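The position-based check described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the UASIS code: it assumes a forward-strand, ungapped alignment, a 1-based alignment start and a 0-based offset of the SNP within the read. The dictionary `KNOWN_SNPS`, its placeholder identifiers and the function names are hypothetical stand-ins for a local repository of database records.

```python
# Hypothetical local repository: (chromosome, 1-based position) -> name per database.
# The identifiers below are placeholders, not real records.
KNOWN_SNPS = {
    ("5", 26650000): {"dbSNP": "rs_placeholder_1", "JSNP": "jsnp_placeholder_1"},
}


def snp_genome_position(alignment_start, offset_in_read):
    """Genome coordinate of the SNP for a forward-strand, ungapped alignment."""
    return alignment_start + offset_in_read


def cross_reference(chromosome, alignment_start, offset_in_read, known=KNOWN_SNPS):
    """Return the SNP position and any names already recorded at that position."""
    position = snp_genome_position(alignment_start, offset_in_read)
    return position, known.get((chromosome, position), {})


position, names = cross_reference("5", 26649990, 10)
print(position, names)
```

An empty result means that no record is stored at that coordinate, in which case only a position-based name would be available for the SNP.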
Alignment evaluation on simulated & real SNPs
A total of 94771 reads were simulated from the human genome (Build 37.1) using the MetaSim package, following the error pattern of Sanger reads. In addition, 72241 flanking sequences were downloaded from dbSNP (ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/rs_fasta/) and JSNP (http://snp.ims.u-tokyo.ac.jp/map/Dump/). For Bowtie, we used the options "--best -k 2 -v 3", meaning that it reports at most two hits, allowing up to three mismatches, in decreasing order of quality. For BWA, the options were "-n 3 -o 3", meaning that the edit distance is at most three and there are at most three gap openings.
For both datasets, all three tools proved reliable. As the read length grows, the accuracy improves. Bowtie generated a higher error rate since it does not support gapped alignment, but it was very efficient, taking less than 4 seconds to process.
UASIS was also briefly introduced at CBAS-SYMBIO 2010, held in Singapore. Approximately 30 people outside the UASIS group have tested it.
Differing SNP nomenclatures have been a major concern for a long time. UASIS (Universal Automated SNP Identification System) proposes an informative, unique and unambiguous nomenclature that serves as a good complement to the present methods of identifying SNPs. The universal nomenclature is important for naming newly discovered or unpublished SNPs. Its most significant advantage is that it provides a bridge to cross-reference SNP identifiers among various databases. UASIS Aligner is a utility that performs pairwise sequence alignment and cross-referencing in real time (<20s). Through Universal Name Generator and SNP Name Mapper, SNPs from dbSNP, GWAS, JSNP and HapMap can be mapped to one another. More databases are being integrated into UASIS. UASIS not only helps to achieve uniform notation of SNPs in the literature, but also aids in determining accurate SNP genotypes and haplotypes.
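The aligner settings quoted above could be assembled into command lines as in the sketch below. The index prefix, reference file and read file names are placeholders, and the helper functions are hypothetical; only the option values come from the text. Bowtie spells the long option with two dashes (`--best`), and `-f` is added here on the assumption that the reads are in FASTA format. The commands are only built and printed, not executed.

```python
import shlex


def bowtie_command(index_prefix, reads_fasta, output_path):
    """Bowtie call: best hits first, at most 2 hits, up to 3 mismatches."""
    return ["bowtie", "--best", "-k", "2", "-v", "3", "-f",
            index_prefix, reads_fasta, output_path]


def bwa_aln_command(reference_fasta, reads_path):
    """BWA call: edit distance at most 3, at most 3 gap openings."""
    return ["bwa", "aln", "-n", "3", "-o", "3", reference_fasta, reads_path]


# Placeholder file names, chosen for illustration only.
bowtie_line = shlex.join(bowtie_command("hg19_index", "reads.fa", "hits.map"))
bwa_line = shlex.join(bwa_aln_command("hg19.fa", "reads.fq"))
print(bowtie_line)
print(bwa_line)
```

Passing the argument lists to `subprocess.run` would execute them, provided both tools are installed and the reference has been indexed beforehand.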
Project name: UASIS (Universal Automated SNP Identification System)
Project home page: http://www.uasis.tk with no requirement of log-in.
Hardware specifications: Dell T710, quad-core 2.4 GHz Xeon E5620 processor, 8 GB RAM
Operating system(s): Ubuntu Server 10.04, kernel 2.6.32
Programming language: C++ and PHP web interface, accessible from various browsers, including Mozilla Firefox, Chrome, Internet Explorer, etc.
Database: MySQL 5.1.41, storing 68382797 SNP records from dbSNP and JSNP
UASIS focuses on the standardization of single nucleotide polymorphisms. Currently we do not handle more complicated variations, including reversions, deletions/insertions of multiple bases, rearrangements and CNVs (Copy Number Variations), because the nomenclatures of these variations require much more effort to reach a consensus. To the best of our knowledge, there are no efficient approaches to discover these variations instantaneously. We leave this topic to future studies.
Users are able to perform batch processing through UASIS Aligner, subject to constraints. The maximum upload size is 5MB, and only files with the extensions fa, fas, fast, fasta, fq and fastq are allowed. If the query format is incorrect, UASIS Aligner will report an error message or list the results as "Not Aligned".
A third limitation is the synchronization between UASIS and the SNP databases. At present we store the relationships between the nomenclatures of dbSNP, JSNP, HGVS and HapMap in a local repository. Whenever these databases update their records, we have to update our local copy manually. For GWAS, we fetch the webpage through its online query system and then extract the necessary information. In this case, if the query system changes, we have to change the code accordingly. From our observations, none of these databases has undergone major changes in the past half year. We believe that UASIS is relatively stable.
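The batch-upload constraints described above (a 5MB size limit and a fixed list of file extensions) can be sketched as a simple pre-check. This is an illustrative sketch rather than the server's actual code, which is described as a PHP web interface. The function name `check_upload` and the messages are hypothetical, and interpreting 5MB as 5 × 1024 × 1024 bytes is an assumption.

```python
import os

MAX_UPLOAD_BYTES = 5 * 1024 * 1024  # assumes "5MB" means 5 * 1024 * 1024 bytes
ALLOWED_EXTENSIONS = {"fa", "fas", "fast", "fasta", "fq", "fastq"}


def check_upload(filename, size_bytes):
    """Return (accepted, reason) for a candidate batch upload."""
    extension = os.path.splitext(filename)[1].lstrip(".").lower()
    if extension not in ALLOWED_EXTENSIONS:
        return False, "unsupported file extension"
    if size_bytes > MAX_UPLOAD_BYTES:
        return False, "file exceeds the 5MB upload limit"
    return True, "ok"


print(check_upload("reads.fastq", 1024))
print(check_upload("reads.txt", 1024))
print(check_upload("reads.fa", 6 * 1024 * 1024))
```

A check of this kind only screens the file name and size; malformed sequence content would still have to be caught later, for example by reporting the read as "Not Aligned".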
Dr. Poo and Dr. Mah proposed the novel idea of universal nomenclature, reviewed and compared various SNP nomenclatures. Mr. Cai completed the universal nomenclature, developed the web server system, conducted the experiments and drafted this paper.
This work was supported by Singapore Ministry of Education Academic Research Fund (AcRF) [T1 251RES0911]; and Agency for Science, Technology and Research (A*STAR) JCO Grant [JCOAG04_FG03_2009].
This article has been published as part of BMC Genomics Volume 12 Supplement 3, 2011: Tenth International Conference on Bioinformatics – First ISCB Asia Joint Conference 2011 (InCoB/ISCB-Asia 2011): Computational Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/12?issue=S3.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.