Tracking chromosomal positions of oligomers - a case study with Illumina's BovineSNP50 beadchip
© Schmitt et al; licensee BioMed Central Ltd. 2010
Received: 14 August 2009
Accepted: 1 February 2010
Published: 1 February 2010
High density genotyping arrays have become established as a valuable research tool in human genetics. Currently, more than 300 genome wide association studies were published for human reporting about 1,000 SNPs that are associated with a phenotype. Also in animal sciences high density genotyping arrays are harnessed to analyse genetic variation. To exploit the full potential of this technology single nucleotide polymorphisms (SNPs) on the chips should be well characterized and their chromosomal position should be precisely known. This, however, is a challenge if the genome sequence is still subject to changes.
We have developed a mapping strategy and a suite of software scripts to update the chromosomal positions of oligomer sequences used for SNP genotyping on high density arrays. We describe the mapping procedure in detail so that scientists with moderate bioinformatics skills can reproduce it. We furthermore present a case study in which we re-mapped 54,001 oligomer sequences from Ilumina's BovineSNP50 beadchip to the bovine genome sequence. We found in 992 cases substantial discrepancies between the manufacturer's annotations and our results. The software scripts in the Perl and R programming languages are provided as supplements.
The positions of oligomer sequences in the genome are volatile even within one build of the genome. To facilitate the analysis of data from a GWAS or from an expression study, especially with species whose genome assembly is still unstable, it is recommended to update the oligomer positions before data analysis.
High-density genotyping arrays have become established as a valuable research tool in human genetics. Currently, more than 300 genome-wide association studies of humans were published, reporting about 1,000 SNPs that are associated with a phenotype . Also in animal sciences, high-density genotyping arrays are harnessed to analyze genetic variation [2, 3]. To exploit the full potential of this technology, SNPs on the chips should be well characterized and especially their chromosomal position should be precisely known. However, this is a challenge, if the SNPs are not so-called reference SNPs and if the genome sequence is still subject to changes. If reference identifiers (rs-IDs) are known, the position of oligomers can be updated comfortably through Biomart . Otherwise, oligomer position tracking is possible via aligning the oligomer sequences to the genome sequence. In this work, we describe such a mapping strategy in detail so that a scientist with moderate bioinformatics knowledge can reproduce it. Furthermore, we present a case study in which we mapped 54,001 oligomer sequences from Illumina's BovineSNP50 beadarray [5, 6] to the bovine genome.
Comparison between Illumina's and this study's annotations.
unique chromosome, but no unique position
neither unique chromosome nor position
Chromosomal assignments of SNPs.
Chromosomal positions of SNPs.
Oligomers were mapped to a chromosome by Illumina but not by our mapping procedure (122 cases) or vice versa (107 cases)
Oligomers mapped to one and the same chromosome by Illumina and by our procedure but mapped to a unique locus by Illumina and not by us (355 cases) or vice versa (74 cases)
Oligomers mapped to two different chromosomes by Illumina and by our procedure (five cases)
Oligomers mapped to one and the same chromosome by Illumina and by our procedure but mapped to loci more than two bases apart (329 cases)
Altogether, there were substantial changes for 992 SNPs (1.8%). Finally, we checked if oligomer sequences were represented on the chip more than once (Additional file 3: ComparingOligoSeqs.r). We found 18 cases, where one and the same oligomer sequence was assigned to two different oligomer identifiers (Additional file 4: Eighteen duplicate oligomer sequences). These 18 SNP pairs can be used to assess the genotyping quality. We also compared all oligomer sequences against their reverse-complementary sequences using the R-library seqinr but found no case of duplication.
The underlying genome build was Btau4.0 for both Illumina's annotation and for our mapping procedure, which makes it difficult to find reasons for the discrepancies. One reason could be that minor changes are incorporated into the genome sequence by the genome database curators without altering the build version. Another reason could be that the mapping methods differ. Although the extent of deviations might appear small, incorrect positions of oligomers can lead to artifacts in subsequent analyses like linkage disequilibrium studies or the establishment of candidate gene lists. The given examples illustrate that SNP oligomers that are mapped to different genomic locations may lead to completely functionally different inferences based on gene annotations. It is to be expected that further adjustment is necessary as the bovine genome proceeds to be completed. Currently, 90% of the bovine sequence is assigned to the 29 autosomes and the X chromosome .
Our case study has shown that discrepancies between the initially determined positions and the positions determined at a later time can occur. These discrepancies can make the correct association of an oligomer with a gene difficult. Above all, the classification of an SNP as e.g. non-synonymous coding, synonymous coding or as splice site SNP is critically dependent on its exact chromosomal position.
In a case study with Illumina's BovineSNP50 beadchip we found that almost 2% of the oligomer positions deviated substantially between the annotations given by the manufacturer and determined in our mapping procedure. Given the relatively easy realization of the mapping procedure described here, it is recommended to verify the manufacturer's data and adjust them, if necessary.
We furthermore would like to point out that the verification of results obtained by SNP or expression arrays can be considerably facilitated if the oligomer sequences are made available to the scientific community.
We downloaded the bovine genome Btau 4.0  on June 22, 2009 using a Perl-script (Additional file 5: DownloadGenome.pl) to the file GenomeBtauEnsemlb54.fasta. Altogether, the sequences for 29 autosomes, the X-chromosome, one mitochondrial and 11,869 unmapped sequences were downloaded. The bovine Y-chromosome sequence was not available. The total length of the downloaded genome sequence was 2,917,974,530 base pairs, of which 283,544,868 (9.72%) were from unmapped sequences. 108 contig sequences (total length 231,114 base pairs) were not considered, because they were neither assigned to a chromosome nor to an unmapped sequence. Next, we built index files from the genome sequences for the megablast search with the executable formatdb, available from the NCBI C++ toolkit http://www.ncbi.nlm.nih.gov/BLAST/download.shtml. The blast index files were created with the command: formatdb -t "GenomeBtau4" -i GenomeBtau4.fasta -p F -o T -V T -n GenomeBtau4, where the options mean:
-t: database name
-i: fasta input file to be formatted into BLAST database format
-p F: input file containing the chromosome nucleotide sequence
-V T: check for non-unique string IDs in the database
-n: base name for BLAST files
We obtained the file containing the oligomer sequences (BovineSNP50_B.csv, last updated on June 4, 2008) from the ftp login, https://www.illumina.com/ftp.ilm. Access to this site is restricted to customers. The file containing the oligomer positions (BovineSNP50_Final_SNPs_54001.zip) was downloaded from http://www.illumina.com/pages.ilmn?ID=256. We will refer to these two files as "oligomer sequence file" and "oligomer position file", respectively. In the oligomer position file, a "0" was interpreted as not assigned to a chromosome or a position. Before obtaining a file containing the oligomer sequences in fasta format, we commented out metadata lines (first seven and last 53 lines) in the oligomer sequence file. The function write.dna from the R-package APE was applied in an R-script (Additional file 6: BuildOligoSeqDB.r) to build a fasta file containing 54,001 oligomer sequences of length 50 base pairs (BovineSNP50OligoSeqs.fasta). In 4,275 cases, oligomer identifiers had to be made consistent. (Oligomer names starting with NFGL-NGS in the oligomer sequence file started with ARS-NFGL-NGS in the oligomer position file). The fasta file was subsequently searched against the bovine genome using the program megablast  as follows: megablast -i BovineSNP50OligoSeqs.fasta -d GenomeBtau4 -p 95 -v 1 -b 1 -m 9 -D 3 -F F -o OligosVsBovineGenome.megablast, where the options mean:
-p 95: identity percentage cut-off 95%
-v: number of database sequences to show one-line descriptions for
-b 0: prevent the printing of full alignments
-F F: do not filter query sequence for low complexity regions
-m 9: produce tabular output with comment lines
-D 3: produce tab-delimited output in one line format
-o OligosVsBovineGenome.megablast: name the output file OligosVsBovineGenome.megablast
We would like to thank Illumina's TechSupport for valuable advice and Jake Litke for comments on the manuscript. This work was supported by the German Ministry of Education and Research for financial support through GenoTrack (grant No. 0315134B).
- Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, Collins FS, Manolio TA: Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci USA. 2009, 106: 9362-9367. 10.1073/pnas.0903103106.PubMed CentralPubMedView ArticleGoogle Scholar
- Hayes BJ, Bowman PJ, Chamberlain AJ, Savin K, van Tassell CP, Sonstegard TS, Goddard ME: A validated genome wide association study to breed cattle adapted to an environment altered by climate change. PLoS ONE. 2009, 4: e6676-10.1371/journal.pone.0006676.PubMed CentralPubMedView ArticleGoogle Scholar
- Cole JB, VanRaden PM, O'Connell JR, Van Tassell CP, Sonstegard TS, Schnabel RD, Taylor JF, Wiggans GR: Distribution and location of genetic effects for dairy traits. J Dairy Sci. 2009, 92: 2931-2946. 10.3168/jds.2008-1762.PubMedView ArticleGoogle Scholar
- Smedley D, Haider S, Ballester B, Holland R, London D, Thorisson G, Kasprzyk A: BioMart-biological queries made easy. BMC Genomics. 2009, 10: 22-10.1186/1471-2164-10-22.PubMed CentralPubMedView ArticleGoogle Scholar
- Van Tassell CP, Smith TP, Matukumalli LK, Taylor JF, Schnabel RD, Lawley CT, Haudenschild CD, Moore SS, Warren WC, Sonstegard TS: SNP discovery and allele frequency estimation by deep sequencing of reduced representation libraries. Nat Methods. 2008, 5: 247-252. 10.1038/nmeth.1185.PubMedView ArticleGoogle Scholar
- Matukumalli LK, Lawley CT, Schnabel RD, Taylor JF, Allan MF, Heaton MP, O'Connell J, Moore SS, Smith TP, Sonstegard TS, Van Tassell CP: Development and characterization of a high density SNP genotyping assay for cattle. PLoS ONE. 2009, 4: e5350-10.1371/journal.pone.0005350.PubMed CentralPubMedView ArticleGoogle Scholar
- R Development Core Team: R: A Language and Environment for Statistical Computing. 2009, R Foundation for Statistical Computing, Vienna, Austria, [ISBN 3-900051-07-0], [http://www.R-project.org]Google Scholar
- Charif D, Lobry J: SeqinR 1.0-2: a contributed package to the R project for statistical computing devoted to biological sequences retrieval and analysis. Structural approaches to sequence evolution: Molecules, networks, populations, Biological and Medical Physics, Biomedical Engineering. Edited by: U Bastolla HR, Porto M, Vendruscolo M. 2007, New York: Springer Verlag, 207-232. [ISBN: 978-3-540-35305-8]View ArticleGoogle Scholar
- Elsik CG, Tellam RL, Worley KC, Gibbs RA, Muzny DM, Weinstock GM, Adelson DL, Eichler EE, Elnitski L, Guigó R, Hamernik DL, Kappes SM, Lewin HA, Lynn DJ, Nicholas FW, Reymond A, Rijnkels M, Skow LC, Zdobnov EM, Schook L, Womack J, Alioto T, Antonarakis SE, Astashyn A, Chapple CE, Chen HC, Chrast J, Câamara F, Ermolaeva O, Henrichsen CN, Hlavina W, Kapustin Y, Kiryutin B, Kitts P, Kokocinski F, Landrum M, Maglott D, Pruitt K, Sapojnikov V, Searle SM, Solovyev V, Souvorov A, Ucla C, Wyss C, Anzola JM, Gerlach D, Elhaik E, Graur D, Reese JT, Edgar RC, McEwan JC, Payne GM, Raison JM, Junier T, Kriventseva EV, Eyras E, Plass M, Donthu R, Larkin DM, Reecy J, Yang MQ, Chen L, Cheng Z, Chitko-McKown CG, Liu GE, Matukumalli LK, Song J, Zhu B, Bradley DG, Brinkman FS, Lau LP, Whiteside MD, Walker A, Wheeler TT, Casey T, German JB, Lemay DG, Maqbool NJ, Molenaar AJ, Seo S, Stothard P, Baldwin CL, Baxter R, Brinkmeyer-Langford CL, Brown WC, Childers CP, Connelley T, Ellis SA, Fritz K, Glass EJ, Herzig CT, Iivanainen A, Lahmers KK, Bennett AK, Dickens CM, Gilbert JG, Hagen DE, Salih H, Aerts J, Caetano AR, Dalrymple B, Garcia JF, Gill CA, Hiendleder SG, Memili E, Spurlock D, Williams JL, Alexander L, Brownstein MJ, Guan L, Holt RA, Jones SJ, Marra MA, Moore R, Moore SS, Roberts A, Taniguchi M, Waterman RC, Chacko J, Chandrabose MM, Cree A, Dao MD, Dinh HH, Gabisi RA, Hines S, Hume J, Jhangiani SN, Joshi V, Kovar CL, Lewis LR, Liu YS, Lopez J, Morgan MB, Nguyen NB, Okwuonu GO, Ruiz SJ, Santibanez J, Wright RA, Buhay C, Ding Y, Dugan-Rocha S, Herdandez J, Holder M, Sabo A, Egan A, Goodell J, Wilczek-Boney K, Fowler GR, Hitchens ME, Lozado RJ, Moen C, Steffen D, Warren JT, Zhang J, Chiu R, Schein JE, Durbin KJ, Havlak P, Jiang H, Liu Y, Qin X, Ren Y, Shen Y, Song H, Bell SN, Davis C, Johnson AJ, Lee S, Nazareth LV, Patel BM, Pu LL, Vattathil S, Williams RL, Curry S, Hamilton C, Sodergren E, Wheeler DA, Barris W, Bennett GL, Eggen A, Green RD, Harhay GP, Hobbs M, Jann O, Keele JW, Kent MP, Lien S, McKay SD, McWilliam S, Ratnakumar A, Schnabel RD, Smith T, Snelling WM, Sonstegard TS, Stone RT, Sugimoto Y, Takasuga A, Taylor JF, Van Tassell CP, Macneil MD, Abatepaulo AR, Abbey CA, Ahola V, Almeida IG, Amadio AF, Anatriello E, Bahadue SM, Biase FH, Boldt CR, Carroll JA, Carvalho WA, Cervelatti EP, Chacko E, Chapin JE, Cheng Y, Choi J, Colley AJ, de Campos TA, De Donato M, Santos IK, de Oliveira CJ, Deobald H, Devinoy E, Donohue KE, Dovc P, Eberlein A, Fitzsimmons CJ, Franzin AM, Garcia GR, Genini S, Gladney CJ, Grant JR, Greaser ML, Green JA, Hadsell DL, Hakimov HA, Halgren R, Harrow JL, Hart EA, Hastings N, Hernandez M, Hu ZL, Ingham A, Iso-Touru T, Jamis C, Jensen K, Kapetis D, Kerr T, Khalil SS, Khatib H, Kolbehdari D, Kumar CG, Kumar D, Leach R, Lee JC, Li C, Logan KM, Malinverni R, Marques E, Martin WF, Martins NF, Maruyama SR, Mazza R, McLean KL, Medrano JF, Moreno BT, Moré DD, Muntean CT, Nandakumar HP, Nogueira MF, Olsaker I, Pant SD, Panzitta F, Pastor RC, Poli MA, Poslusny N, Rachagani S, Ranganathan S, Razpet A, Riggs PK, Rincon G, Rodriguez-Osorio N, Rodriguez-Zas SL, Romero NE, Rosenwald A, Sando L, Schmutz SM, Shen L, Sherman L, Southey BR, Lutzow YS, Sweedler JV, Tammen I, Telugu BP, Urbanski JM, Utsunomiya YT, Verschoor CP, Waardenberg AJ, Wang Z, Ward R, Weikard R, Welsh TH, White SN, Wilming LG, Wunderlich KR, Yang J, Zhao FQ: The genome sequence of taurine cattle: a window to ruminant biology and evolution. Science. 2009, 324: 522-528. 10.1126/science.1169588.PubMed CentralPubMedView ArticleGoogle Scholar
- Paradis E, Claude J, Strimmer K: APE: Analyses of Phylogenetics and Evolution in R language. Bioinformatics. 2004, 20: 289-290. 10.1093/bioinformatics/btg412.PubMedView ArticleGoogle Scholar
- Zhang Z, Schwartz S, Wagner L, Miller W: A greedy algorithm for aligning DNA sequences. J Comput Biol. 2000, 7: 203-214. 10.1089/10665270050081478.PubMedView ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.