- Methodology article
- Open Access
Tracking chromosomal positions of oligomers - a case study with Illumina's BovineSNP50 beadchip
BMC Genomicsvolume 11, Article number: 80 (2010)
High density genotyping arrays have become established as a valuable research tool in human genetics. Currently, more than 300 genome wide association studies were published for human reporting about 1,000 SNPs that are associated with a phenotype. Also in animal sciences high density genotyping arrays are harnessed to analyse genetic variation. To exploit the full potential of this technology single nucleotide polymorphisms (SNPs) on the chips should be well characterized and their chromosomal position should be precisely known. This, however, is a challenge if the genome sequence is still subject to changes.
We have developed a mapping strategy and a suite of software scripts to update the chromosomal positions of oligomer sequences used for SNP genotyping on high density arrays. We describe the mapping procedure in detail so that scientists with moderate bioinformatics skills can reproduce it. We furthermore present a case study in which we re-mapped 54,001 oligomer sequences from Ilumina's BovineSNP50 beadchip to the bovine genome sequence. We found in 992 cases substantial discrepancies between the manufacturer's annotations and our results. The software scripts in the Perl and R programming languages are provided as supplements.
The positions of oligomer sequences in the genome are volatile even within one build of the genome. To facilitate the analysis of data from a GWAS or from an expression study, especially with species whose genome assembly is still unstable, it is recommended to update the oligomer positions before data analysis.
High-density genotyping arrays have become established as a valuable research tool in human genetics. Currently, more than 300 genome-wide association studies of humans were published, reporting about 1,000 SNPs that are associated with a phenotype . Also in animal sciences, high-density genotyping arrays are harnessed to analyze genetic variation [2, 3]. To exploit the full potential of this technology, SNPs on the chips should be well characterized and especially their chromosomal position should be precisely known. However, this is a challenge, if the SNPs are not so-called reference SNPs and if the genome sequence is still subject to changes. If reference identifiers (rs-IDs) are known, the position of oligomers can be updated comfortably through Biomart . Otherwise, oligomer position tracking is possible via aligning the oligomer sequences to the genome sequence. In this work, we describe such a mapping strategy in detail so that a scientist with moderate bioinformatics knowledge can reproduce it. Furthermore, we present a case study in which we mapped 54,001 oligomer sequences from Illumina's BovineSNP50 beadarray [5, 6] to the bovine genome.
51,870 of 54,001 (96%) oligomer sequences had a single megablast hit and could therefore be assigned to both a chromosome and a position unambiguously (Fig. 1). 1,611 oligomer sequences (3%) had no hits, and 520 (1%) had multiple hits. 114 of these 520 oligomer sequences had a clear best hit and could therefore be assigned both to a chromosome and a position. 360 had hits with an equally high bitscore on one and the same chromosome but at different positions. They were assigned to a chromosome but not to a position. 46 had hits with an equally high bitscore but on different chromosomes. They were assigned neither to a chromosome nor to a position. Altogether, we mapped 51,984 (51,870 + 114; 96.3%) oligomer sequences to a unique chromosomal position, 360 (0.7%) to a chromosome but not to a position, and 1,657 (1,611 + 46; 3.1%) neither to a chromosome nor to a position (Fig. 1; Additional file 1: Annotation file for 54,001 oligomer sequences of the Bovine50SNP beadchip). For comparison, the corresponding figures according to Illumina were 52,255 (96.8%), 74 (0.1%), and 1,672 (3.1%) for oligomers with a unique position, for oligomers with a chromosome but without position, and for oligomers without chromosome and position, respectively (Table 1). We wrote an R-script  to perform an extensive comparison between Illumina's and our mapping annotation (Additional file 2: Diagnostics.r). In 107 cases, Illumina assigned oligomers to a chromosome, while our mapping procedure did not. In 122 cases, the situation was vice versa (Table 2). Next, we examined all 52,222 SNPs assigned to a unique chromosome by both Illumina and our mapping procedure. In five cases (SNP-IDs Hapmap24571-BTA-154685, BTB-01222201, BTB-01798822, INRA-443, and BTB-01386016), the oligomers were assigned a different chromosome by Illumina and our mapping procedure. In 355 cases, SNPs were given a unique locus in Illumina's annotation but not by this mapping procedure; in 74 cases, the situation was vice versa (Table 3). We then examined if the positions provided by Illumina and by our mapping procedure differed in case the chromosomal annotation was consistent and, if so, to what extent. 51,788 SNPs had a unique chromosomal position, according to both Illumina and this mapping procedure. In 329 (0.7%) cases, the positions differed by three or more base pairs. In 566 (1.0%) cases, the positions were exactly identical, and in 50,367 (93.3%) cases, the positions differed by one base pair (Table 2). This can probably be attributed to differences in the mapping procedures applied by Illumina and by us. We consider the following cases substantial discrepancies between Illumina's annotation and our results:
Oligomers were mapped to a chromosome by Illumina but not by our mapping procedure (122 cases) or vice versa (107 cases)
Oligomers mapped to one and the same chromosome by Illumina and by our procedure but mapped to a unique locus by Illumina and not by us (355 cases) or vice versa (74 cases)
Oligomers mapped to two different chromosomes by Illumina and by our procedure (five cases)
Oligomers mapped to one and the same chromosome by Illumina and by our procedure but mapped to loci more than two bases apart (329 cases)
Altogether, there were substantial changes for 992 SNPs (1.8%). Finally, we checked if oligomer sequences were represented on the chip more than once (Additional file 3: ComparingOligoSeqs.r). We found 18 cases, where one and the same oligomer sequence was assigned to two different oligomer identifiers (Additional file 4: Eighteen duplicate oligomer sequences). These 18 SNP pairs can be used to assess the genotyping quality. We also compared all oligomer sequences against their reverse-complementary sequences using the R-library seqinr but found no case of duplication.
The underlying genome build was Btau4.0 for both Illumina's annotation and for our mapping procedure, which makes it difficult to find reasons for the discrepancies. One reason could be that minor changes are incorporated into the genome sequence by the genome database curators without altering the build version. Another reason could be that the mapping methods differ. Although the extent of deviations might appear small, incorrect positions of oligomers can lead to artifacts in subsequent analyses like linkage disequilibrium studies or the establishment of candidate gene lists. The given examples illustrate that SNP oligomers that are mapped to different genomic locations may lead to completely functionally different inferences based on gene annotations. It is to be expected that further adjustment is necessary as the bovine genome proceeds to be completed. Currently, 90% of the bovine sequence is assigned to the 29 autosomes and the X chromosome .
Our case study has shown that discrepancies between the initially determined positions and the positions determined at a later time can occur. These discrepancies can make the correct association of an oligomer with a gene difficult. Above all, the classification of an SNP as e.g. non-synonymous coding, synonymous coding or as splice site SNP is critically dependent on its exact chromosomal position.
In a case study with Illumina's BovineSNP50 beadchip we found that almost 2% of the oligomer positions deviated substantially between the annotations given by the manufacturer and determined in our mapping procedure. Given the relatively easy realization of the mapping procedure described here, it is recommended to verify the manufacturer's data and adjust them, if necessary.
We furthermore would like to point out that the verification of results obtained by SNP or expression arrays can be considerably facilitated if the oligomer sequences are made available to the scientific community.
We downloaded the bovine genome Btau 4.0  on June 22, 2009 using a Perl-script (Additional file 5: DownloadGenome.pl) to the file GenomeBtauEnsemlb54.fasta. Altogether, the sequences for 29 autosomes, the X-chromosome, one mitochondrial and 11,869 unmapped sequences were downloaded. The bovine Y-chromosome sequence was not available. The total length of the downloaded genome sequence was 2,917,974,530 base pairs, of which 283,544,868 (9.72%) were from unmapped sequences. 108 contig sequences (total length 231,114 base pairs) were not considered, because they were neither assigned to a chromosome nor to an unmapped sequence. Next, we built index files from the genome sequences for the megablast search with the executable formatdb, available from the NCBI C++ toolkit http://www.ncbi.nlm.nih.gov/BLAST/download.shtml. The blast index files were created with the command: formatdb -t "GenomeBtau4" -i GenomeBtau4.fasta -p F -o T -V T -n GenomeBtau4, where the options mean:
-t: database name
-i: fasta input file to be formatted into BLAST database format
-p F: input file containing the chromosome nucleotide sequence
-V T: check for non-unique string IDs in the database
-n: base name for BLAST files
We obtained the file containing the oligomer sequences (BovineSNP50_B.csv, last updated on June 4, 2008) from the ftp login, https://www.illumina.com/ftp.ilm. Access to this site is restricted to customers. The file containing the oligomer positions (BovineSNP50_Final_SNPs_54001.zip) was downloaded from http://www.illumina.com/pages.ilmn?ID=256. We will refer to these two files as "oligomer sequence file" and "oligomer position file", respectively. In the oligomer position file, a "0" was interpreted as not assigned to a chromosome or a position. Before obtaining a file containing the oligomer sequences in fasta format, we commented out metadata lines (first seven and last 53 lines) in the oligomer sequence file. The function write.dna from the R-package APE was applied in an R-script (Additional file 6: BuildOligoSeqDB.r) to build a fasta file containing 54,001 oligomer sequences of length 50 base pairs (BovineSNP50OligoSeqs.fasta). In 4,275 cases, oligomer identifiers had to be made consistent. (Oligomer names starting with NFGL-NGS in the oligomer sequence file started with ARS-NFGL-NGS in the oligomer position file). The fasta file was subsequently searched against the bovine genome using the program megablast  as follows: megablast -i BovineSNP50OligoSeqs.fasta -d GenomeBtau4 -p 95 -v 1 -b 1 -m 9 -D 3 -F F -o OligosVsBovineGenome.megablast, where the options mean:
-p 95: identity percentage cut-off 95%
-v: number of database sequences to show one-line descriptions for
-b 0: prevent the printing of full alignments
-F F: do not filter query sequence for low complexity regions
-m 9: produce tabular output with comment lines
-D 3: produce tab-delimited output in one line format
-o OligosVsBovineGenome.megablast: name the output file OligosVsBovineGenome.megablast
The output file contained 102,434 hits, i. e. about twice the number of oligomer sequences. The megablast output was processed using an R-script (Additional file 7: ProcessMegablastOutput.r) in the following way: Only hits on one of the autosomes or the X-chromosome with an alignment of length 49 base pairs or longer were accepted (53,039 hits). If an oligomer sequence produced just one hit, the position of the end of that hit in chromosomal co-ordinates (plus 1, if oligomer sequence and chromosomal sequence were in the same orientation, otherwise minus 1) was taken as the chromosomal position of the SNP. If an oligomer sequence had no hit, both chromosome and position were assigned NA (not available). In the case of more than one hit, the following strategy of disambiguation was applied: If there was only one best hit of the SNP oligomer in terms of bitscore, its position and chromosome were recorded (case 1 in Fig. 1; Fig. 2). If the hit was neither on an autosome nor on the X-chromosome, NA was assigned to both chromosome and position. Otherwise, if there were several hits with an equally high bitscore on the same chromosome but at different positions that chromosome was assigned to the oligomer, but NA to its position (case 2 in Fig. 1; Fig. 3). Otherwise, if there were several hits with an equally high bitscore on different chromosomes, NA was assigned to both its position and its chromosome (case 3 in Fig. 2; Fig. 4).
Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, Collins FS, Manolio TA: Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci USA. 2009, 106: 9362-9367. 10.1073/pnas.0903103106.
Hayes BJ, Bowman PJ, Chamberlain AJ, Savin K, van Tassell CP, Sonstegard TS, Goddard ME: A validated genome wide association study to breed cattle adapted to an environment altered by climate change. PLoS ONE. 2009, 4: e6676-10.1371/journal.pone.0006676.
Cole JB, VanRaden PM, O'Connell JR, Van Tassell CP, Sonstegard TS, Schnabel RD, Taylor JF, Wiggans GR: Distribution and location of genetic effects for dairy traits. J Dairy Sci. 2009, 92: 2931-2946. 10.3168/jds.2008-1762.
Smedley D, Haider S, Ballester B, Holland R, London D, Thorisson G, Kasprzyk A: BioMart-biological queries made easy. BMC Genomics. 2009, 10: 22-10.1186/1471-2164-10-22.
Van Tassell CP, Smith TP, Matukumalli LK, Taylor JF, Schnabel RD, Lawley CT, Haudenschild CD, Moore SS, Warren WC, Sonstegard TS: SNP discovery and allele frequency estimation by deep sequencing of reduced representation libraries. Nat Methods. 2008, 5: 247-252. 10.1038/nmeth.1185.
Matukumalli LK, Lawley CT, Schnabel RD, Taylor JF, Allan MF, Heaton MP, O'Connell J, Moore SS, Smith TP, Sonstegard TS, Van Tassell CP: Development and characterization of a high density SNP genotyping assay for cattle. PLoS ONE. 2009, 4: e5350-10.1371/journal.pone.0005350.
R Development Core Team: R: A Language and Environment for Statistical Computing. 2009, R Foundation for Statistical Computing, Vienna, Austria, [ISBN 3-900051-07-0], [http://www.R-project.org]
Charif D, Lobry J: SeqinR 1.0-2: a contributed package to the R project for statistical computing devoted to biological sequences retrieval and analysis. Structural approaches to sequence evolution: Molecules, networks, populations, Biological and Medical Physics, Biomedical Engineering. Edited by: U Bastolla HR, Porto M, Vendruscolo M. 2007, New York: Springer Verlag, 207-232. [ISBN: 978-3-540-35305-8]
Elsik CG, Tellam RL, Worley KC, Gibbs RA, Muzny DM, Weinstock GM, Adelson DL, Eichler EE, Elnitski L, Guigó R, Hamernik DL, Kappes SM, Lewin HA, Lynn DJ, Nicholas FW, Reymond A, Rijnkels M, Skow LC, Zdobnov EM, Schook L, Womack J, Alioto T, Antonarakis SE, Astashyn A, Chapple CE, Chen HC, Chrast J, Câamara F, Ermolaeva O, Henrichsen CN, Hlavina W, Kapustin Y, Kiryutin B, Kitts P, Kokocinski F, Landrum M, Maglott D, Pruitt K, Sapojnikov V, Searle SM, Solovyev V, Souvorov A, Ucla C, Wyss C, Anzola JM, Gerlach D, Elhaik E, Graur D, Reese JT, Edgar RC, McEwan JC, Payne GM, Raison JM, Junier T, Kriventseva EV, Eyras E, Plass M, Donthu R, Larkin DM, Reecy J, Yang MQ, Chen L, Cheng Z, Chitko-McKown CG, Liu GE, Matukumalli LK, Song J, Zhu B, Bradley DG, Brinkman FS, Lau LP, Whiteside MD, Walker A, Wheeler TT, Casey T, German JB, Lemay DG, Maqbool NJ, Molenaar AJ, Seo S, Stothard P, Baldwin CL, Baxter R, Brinkmeyer-Langford CL, Brown WC, Childers CP, Connelley T, Ellis SA, Fritz K, Glass EJ, Herzig CT, Iivanainen A, Lahmers KK, Bennett AK, Dickens CM, Gilbert JG, Hagen DE, Salih H, Aerts J, Caetano AR, Dalrymple B, Garcia JF, Gill CA, Hiendleder SG, Memili E, Spurlock D, Williams JL, Alexander L, Brownstein MJ, Guan L, Holt RA, Jones SJ, Marra MA, Moore R, Moore SS, Roberts A, Taniguchi M, Waterman RC, Chacko J, Chandrabose MM, Cree A, Dao MD, Dinh HH, Gabisi RA, Hines S, Hume J, Jhangiani SN, Joshi V, Kovar CL, Lewis LR, Liu YS, Lopez J, Morgan MB, Nguyen NB, Okwuonu GO, Ruiz SJ, Santibanez J, Wright RA, Buhay C, Ding Y, Dugan-Rocha S, Herdandez J, Holder M, Sabo A, Egan A, Goodell J, Wilczek-Boney K, Fowler GR, Hitchens ME, Lozado RJ, Moen C, Steffen D, Warren JT, Zhang J, Chiu R, Schein JE, Durbin KJ, Havlak P, Jiang H, Liu Y, Qin X, Ren Y, Shen Y, Song H, Bell SN, Davis C, Johnson AJ, Lee S, Nazareth LV, Patel BM, Pu LL, Vattathil S, Williams RL, Curry S, Hamilton C, Sodergren E, Wheeler DA, Barris W, Bennett GL, Eggen A, Green RD, Harhay GP, Hobbs M, Jann O, Keele JW, Kent MP, Lien S, McKay SD, McWilliam S, Ratnakumar A, Schnabel RD, Smith T, Snelling WM, Sonstegard TS, Stone RT, Sugimoto Y, Takasuga A, Taylor JF, Van Tassell CP, Macneil MD, Abatepaulo AR, Abbey CA, Ahola V, Almeida IG, Amadio AF, Anatriello E, Bahadue SM, Biase FH, Boldt CR, Carroll JA, Carvalho WA, Cervelatti EP, Chacko E, Chapin JE, Cheng Y, Choi J, Colley AJ, de Campos TA, De Donato M, Santos IK, de Oliveira CJ, Deobald H, Devinoy E, Donohue KE, Dovc P, Eberlein A, Fitzsimmons CJ, Franzin AM, Garcia GR, Genini S, Gladney CJ, Grant JR, Greaser ML, Green JA, Hadsell DL, Hakimov HA, Halgren R, Harrow JL, Hart EA, Hastings N, Hernandez M, Hu ZL, Ingham A, Iso-Touru T, Jamis C, Jensen K, Kapetis D, Kerr T, Khalil SS, Khatib H, Kolbehdari D, Kumar CG, Kumar D, Leach R, Lee JC, Li C, Logan KM, Malinverni R, Marques E, Martin WF, Martins NF, Maruyama SR, Mazza R, McLean KL, Medrano JF, Moreno BT, Moré DD, Muntean CT, Nandakumar HP, Nogueira MF, Olsaker I, Pant SD, Panzitta F, Pastor RC, Poli MA, Poslusny N, Rachagani S, Ranganathan S, Razpet A, Riggs PK, Rincon G, Rodriguez-Osorio N, Rodriguez-Zas SL, Romero NE, Rosenwald A, Sando L, Schmutz SM, Shen L, Sherman L, Southey BR, Lutzow YS, Sweedler JV, Tammen I, Telugu BP, Urbanski JM, Utsunomiya YT, Verschoor CP, Waardenberg AJ, Wang Z, Ward R, Weikard R, Welsh TH, White SN, Wilming LG, Wunderlich KR, Yang J, Zhao FQ: The genome sequence of taurine cattle: a window to ruminant biology and evolution. Science. 2009, 324: 522-528. 10.1126/science.1169588.
Paradis E, Claude J, Strimmer K: APE: Analyses of Phylogenetics and Evolution in R language. Bioinformatics. 2004, 20: 289-290. 10.1093/bioinformatics/btg412.
Zhang Z, Schwartz S, Wagner L, Miller W: A greedy algorithm for aligning DNA sequences. J Comput Biol. 2000, 7: 203-214. 10.1089/10665270050081478.
We would like to thank Illumina's TechSupport for valuable advice and Jake Litke for comments on the manuscript. This work was supported by the German Ministry of Education and Research for financial support through GenoTrack (grant No. 0315134B).
AOS designed the strategy, RHB verified the results and tested the scripts, AOS and GAB wrote the manuscript. All authors read and approved the final manuscript.