- Open Access
xMAN: extreme MApping of OligoNucleotides
© Li et al.; licensee BioMed Central Ltd. 2008
- Published: 20 March 2008
The ability to rapidly map millions of oligonucleotide fragments to a reference genome is crucial to many high throughput genomic technologies.
We propose an intuitive and efficient algorithm, titled ex treme MA pping of OligoN ucleotide (xMAN), to rapidly map millions of oligonucleotide fragments to a genome of any length. By converting oligonucleotides to integers hashed in RAM, xMAN can scan through genomes using bit shifting operation and achieve at least one order of magnitude speed increase over existing tools. xMAN can map the 42 million 25-mer probes on the Affymetrix whole human genome tiling arrays to the entire genome in less than 6 CPU hours.
In addition to the speed advantage, we found the probe mapping of xMAN to substantially improve the final analysis results in both a spike-in experiment on ENCODE tiling arrays and an estrogen receptor ChIP-chip experiment on whole human genome tiling arrays. Those improvements were confirmed by direct ChIP and real-time PCR assay. xMAN can be further extended for application to other high-throughput genomic technologies for oligonucleotide mapping.
- Query Sequence
- Hash Table
- Tiling Array
- Average Probe Spacing
- Tiling Microarrays
The ability to rapidly map millions of oligonucleotide fragments to a reference genome is crucial to many high throughput genomic technologies. For example, Affymetrix, Nimblegen, and Agilent have recently developed oligonucleotide arrays to tile all the non-repetitive genomic sequences of complex eukaryotic genomes. Since the arrays were usually designed based on an older genome assembly, it is important to remap all the probes to the newest genome assembly or transcriptome annotation during data analysis , under the assumption that the current genome and transcriptome are more precise than the earlier ones. It is not uncommon for a probe to map to multiple locations in the genome. As a result, the probe could give rise to unexpected behaviour if information on its genome copy number is unknown or ignored. Other examples are the next-generation sequencing approaches to annotate novel transcripts [2, 3] and regulatory elements . In these applications, millions of short oligonucleotide fragments are generated and mapped to a reference genome to identify the transcript or cis-element clusters.
The current algorithms available for fast sequence similarity search such as BLAST , MegaBLAST , BLAT  and MUMmer  were not specifically designed for mapping millions of query sequences. As a result, using such algorithms often take very long time, even on computer clusters with fast processors and big memory. For example, in the recent MicroArray Quality Control project , it took approximately two days to map the half a million probes on the Affymetrix U133 expression microarray to the RefSeq mRNA database of approximately 70 MB size on a Beowulf cluster with 248 AMD Opteron Dual Processor nodes. This translates into inhibitive time and resources required to map the 42 million probes on Affymetrix genome tiling microarrays to complete mammalian genomes.
We propose an intuitive and efficient method xMAN (namely ex treme MA pping of OligoN ucleotide) for the rapid mapping of millions of query oligonucleotide fragments to the reference genome of any given length. xMAN differs significantly from existing algorithms. First, instead of indexing the reference genome which is memory expensive, xMAN transforms all the query sequences into integers and stores them in RAM as a hash table. Secondly, when scanning through a genome of any size for query mapping, xMAN hinges on bit shifting operation over sliding windows to boost its search speed. In this paper, we will explain xMAN's underlying algorithm, compare its performance with other methods, and discuss its application in tiling microarray data analysis.
Tiling microarray probes remapping
Tiling microarrays have probes that cover essentially the entire non-redundant genome in an unbiased fashion. Such arrays have diverse applications, including chromatin-immunoprecipitation coupled with DNA microarray analysis (ChIP-chip), comparative genome hybridization, empirical detection of novel transcripts and polymorphism discovery . The average nucleotides spacing between centres of neighbouring probes defines the tiling ‘resolution’. There are several tiling array platforms with different probe length, resolution, and manufacturing characteristics. We focus our study on Affymetrix tiling arrays since they have the highest probe density, with approximate 42 million 25-mer probes covering the non-repetitive human genome at 35 bp resolution. The sheer amount of raw data generated on these arrays poses challenges for data analysis.
Despite extensive efforts to design statistical algorithms to analyze Affymetrix tiling microarray [11–16], potential probe mapping problems exist and might have serious downstream consequences. Early assessments  on Affymetrix expression arrays revealed that remapping probe sets to the newest genome could create as much as 50% discrepancy in predicted differentially expressed genes, regardless of the analysis methods used. Therefore, it is crucial to remap all the tiling array probes to the current genome to ensure more precise downstream analyses. Furthermore, one primary objective in microarray analysis is to minimize probe cross-hybridization. Most tiling microarrays are designed based on repeat-masked genome, which still contains many repetitive elements including tandem repeats  with period longer than 12 bp and segmental duplications . Affymetrix tiling probe selection operates in a local fashion, which does not check whether a probe matches elsewhere in the whole genome (S. Cawley, pers. commun.). It does map all the probes to the genome afterwards, although only to the repeat-masked genome which could be problematic. This approach also sometimes maps the same probe to multiple locations in a short genomic region. When calculation on the region is performed assuming all the probe measures are independent, this mapping is likely to inflate the p-value of the region and create a false positive. These problems could be potentially addressed by taking into account each probe's copy number  and filtering out repetitive probes  in the analysis.
xMAN probe mapping of various Affymetrix tiling arrays to the most recent genome assembly
Affymetrix tiling arrays
Human ENCODE 1.0a
Human Tiling 1.0 & 2.0b
Mouse Tiling 1.0 & 2.0c
Arabidopsis thaliana 1.0d
Similar situation was observed when xMAN was applied to remap several other Affymetrix tiling arrays for human, mouse and Arabidopsis (Table 1).
Based on the above observation, xMAN used the following rules to remove probe redundancy in the human genome tiling array BPMAP: 1) all the probes with copy number more than 10 are filtered out; 2) the same 25-mer is mapped only once within a 1 kb window along the genome; 3) the remaining probes with multiple copy numbers are not mapped unless it is at least 30bp apart from the previous probe (the average probe spacing for the array is 35 bp). The correct whole genome probe copy number is stored with each probe's 25-mer sequence, regardless of whether it is mapped to a certain region or not by rules 2) and 3).
Comparison with existing tools
Many tools are available for sequence similarity search, among which BLAST , MegaBLAST , BLAT  and MUMmer  are probably the most widely used. We compared the performance of xMAN with these tools in the same Linux computer to map the 42 million human genome tiling probes to the human genome. BLASTN  scans for short matches (usually 11-mer) in the genome and extends those matches into high-scoring pairs (HSPs). MegaBLAST utilizes a greedy algorithm  to search for nucleotide sequence alignment and is optimized for aligning slightly different sequences. BLAT  indexes non-overlapping K-mers in the genome and hashes them inside the computer RAM, then scans linearly through the query sequence. MUMmer  adopts a suffix tree data structure for rapid sequence alignment, but is memory intensive. MUMmer 3.0 uses approximately 17 bytes for each nucleotide in the reference genome, thus requires ~51 GB (3 G* 17) of RAM to create the human genome data structure. We do not have access to a computer that meets this RAM requirement, thus did not include MUMmer in our comparison.
It took BLAST 2,482 minutes, BLAT 19 minutes, and MegaBLAST (with an optimal word size of 20) 0.8 minute to search the first 10 thousand probes against the human genome. Since search time is approximately proportional to the query size, extrapolating these numbers predicts that the three algorithms will take about 173,740, 1,330 and 56 CPU hours to map all 42 million probes, respectively. MegaBLAST appears to be a proper solution to this specific probe-mapping problem, and is extremely efficient with this word size of 20. Nevertheless, its advantage diminishes when mapping shorter fragments with smaller word size. For instance, MegaBLAST requires almost 2 CPU minutes to search 10 thousand 18-mer probes against the genome with word size of 12, i.e. 140 CPU hours for the 42 million 18-mer probes. In comparison, word size in xMAN, which is the minimal length of an identical match, is always equal to the length of the query oligonucleotide and has no effect on the searching time. In any case, xMAN needs less than 6 CPU hours in completing the 42 million probe mapping, and is at least an order of magnitude faster than other popular algorithms.
Impact of the updated probe mapping on the tiling array analysis
We aim to systemically investigate the impact of the updated probe mapping on tiling array data analysis. We recently developed a Model-based Analysis of Tiling arrays (MAT) algorithm  to reliably detect ChIP-enriched regions on Affymetrix tiling arrays. MAT employs a linear model to estimate the baseline probe behaviour based on probe sequence and copy number. To our knowledge, it is the only algorithm that considers probe copy number information in tiling array analysis. As our goal here is to assess the pure effect of probe mapping on tiling array data analysis, we only used MAT to investigate the impact of xMAN probe mapping. Since the Affymetrix BPMAP does not contain probe copy number information, we used 1 copy for every probe.
Whole genome ER ChIP-chip results based on either Affymetrix or xMAN probe mapping under different FDR thresholds
FDR thresholds (%)
Percentage of Shared Regionsc
As many genome sequencing projects continuously update the genome assembly, and high-throughput sequencing/microarray technologies frequently introduce millions of oligonucleotides, algorithm for fast mapping of oligonucleotides to the newest genome is needed. We introduce an intuitive and effective algorithm xMAN, which is optimized for mapping millions of oligonucleotide fragments to the genome simultaneously and is at least an order of magnitude faster than other popular algorithms. It also works on mapping long oligonucleotide probes from NimbleGen and Agilent, and can be further adopted to map the short sequence tags from high throughput sequencing technologies. xMAN can also be used to convert probe genome coordinate between closely related species for analysis, e.g. when Chimpanzee DNA is hybridized to human tiling arrays.
Although the LiftOver program from UCSC can convert genome coordinates, it relies on library files which are not available for all the possible genome assemblies. xMAN, on the other hand, could be used for any coordinate conversion when the query and reference genome sequences are available. In addition, LiftOver only transfers coordinates to a new version without the ability to find other matches of the query DNA sequence. For example, the previously mentioned probe on chromosome 19 which matches ~100 thousand genome locations, will only be converted to a single new genome coordinate by LiftOver.
Most index-based sequence similarity search programs involve two major stages, a heuristic search stage to locate potential similar blocks (anchors) and an alignment stage to combine the anchors. Since xMAN only finds exact matches of short fragments, it simplifies the index-based method by eliminating the second stage. xMAN encodes the query sequences into hash table in RAM and linearly scans through the genome for the exact matches. Query size only affects hash table generation and output writing time, but not genome scanning time. xMAN does require RAM to hash all the query sequences, which is much smaller than the genome size and often available with current computing capacity. Besides, when query sequences are too big, they can be split to multiple files, so xMAN can be carried out sequentially on each smaller query files.
Using xMAN to generate probe mapping BPMAP is important for tiling array data analysis. It not only converts the probe coordinates to newer genome assemblies, but also removes many redundant probes, and allows algorithms such as MAT to consider and control probe cross-hybridization effect. During the BPMAP comparison analysis on ER ChIP-chip data, we not only removed several false-positive ChIP-regions residing in highly repetitive sequences
Hash table for query sequences
xMAN seeks to eliminate the time-consuming disk access operations by creating a query sequence hash table that resides entirely in the computer RAM. Each nucleotide is encoded as two bits, i.e. A: 00; C: 01; G: 10; T: 11 (denoted as BaseIndex), thus any N-mer sequence can be converted into a 2N-bits integer, which is a key in the hash table. In order to avoid encoding ambiguities, all the query sequences are restricted to have the same length. The value of each key in the hash table is a feature associated with the corresponding sequence. The feature can be the X and Y coordinates of a given probe on the microarray, or the position of a sequence in the query file. Redundant query sequences (e.g. if multiple array spots have the same probe sequence) share the same key, with multiple features (e.g. different X and Y coordinates on the array). The memory requirement is proportional to the total size of the query sequence.
Scanning the genome
xMAN scans the reference genome using a sliding window of size N (query sequence length) at steps of size 1. To accelerate search speed, xMAN also encodes each nucleotide in the genome sequence in 2 bits, and takes advantage of the fastest bit-shifting operation for the encoding, as described below:
Denote the 2N bits integer from the previous window as A
Reset the two left-most bits of A to 0 (i.e. remove the left most base in the previous window)
Left-shift A by 2 bits
Read one more base and add its BaseIndex to A to form the new 2N bits integer for the current window
If this new A is a key in the query hash table, xMAN stores the corresponding chromosome, strand and genomic position.
Step one base 5′ to 3′, and repeat steps 1-5 until the end of the genome
The scanning is linear in time to the reference genome size. At the end, for each query sequence, xMAN has all its genome copy number and positions, which will be output to a tab-separated values (tsv) file. Query sequences no longer match to the genome are also output as if they are in a chromosome called “NOmatch”. Furthermore, the following xMAN running statistics will be reported. Please see Fig. 5 for an example.
NumUniqSeq: Number of unique sequences in the query
NumSeq.MEntries: Number of sequences with multiple entries in the query
NumQueryEntries: Number of entries in the query file
NumSeq.MGenomeMatches: Number of sequences with multiple genomic matches
NumSeq. NoGenomeMatch: Number of sequences with no match in the genome
xMAN is implemented in open source Python, and is freely available at http://chip.dfci.harvard.edu/~wli/xMAN. It requires the query oligonucleotide file and the subject genome file(s). The query file can be in either in Affymetrix binary probe mapping (BPMAP) format or plain-text tab-separated values (tsv) format. The first column in the tsv file must be the sequence and the other columns (optional) could be feature(s) associated with the sequence. The reference genome files must be in Fasta format and can be downloaded from http://genome.ucsc.edu/.
This project was funded by NIH grant 1R01 HG004069-01. The authors would like to thank Kevin Struhl, Clifford A. Meyer and Jun Song for the helpful discussions and insights.
This article has been published as part of BMC Genomics Volume 9 Supplement 1, 2008: The 2007 International Conference on Bioinformatics & Computational Biology (BIOCOMP'07). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/9?issue=S1.
- Dai M, Wang P, Boyd AD, Kostov G, Athey B, Jones EG, Bunney WE, Myers RM, Speed TP, Akil H: Evolving gene/transcript definitions significantly alter the interpretation of GeneChip data. Nucleic Acids Res. 2005, 33: e175-10.1093/nar/gni179.PubMedPubMed CentralView ArticleGoogle Scholar
- Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen YJ, Chen Z: Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005, 437: 376-380.PubMedPubMed CentralGoogle Scholar
- Cheng J, Kapranov P, Drenkow J, Dike S, Brubaker S, Patel S, Long J, Stern D, Tammana H, Helt G: Transcriptional Maps of 10 Human Chromosomes at 5-Nucleotide Resolution. Science. 2005, 308 (5725): 1149-1154. 10.1126/science.1108625.PubMedView ArticleGoogle Scholar
- Wei CL, Wu Q, Vega VB, Chiu KP, Ng P, Zhang T, Shahab A, Yong HC, Fu Y, Weng Z: A global map of p53 transcription-factor binding sites in the human genome. Cell. 2006, 124: 207-219. 10.1016/j.cell.2005.10.043.PubMedView ArticleGoogle Scholar
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215: 403-410.PubMedView ArticleGoogle Scholar
- Zhang Z, Schwartz S, Wagner L, Miller W: A greedy algorithm for aligning DNA sequences. J Comput Biol. 2000, 7: 203-214. 10.1089/10665270050081478.PubMedView ArticleGoogle Scholar
- Kent WJ: BLAT--the BLAST-like alignment tool. Genome Res. 2002, 12: 656-664. 10.1101/gr.229202. Article published online before March 2002.PubMedPubMed CentralView ArticleGoogle Scholar
- Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, Salzberg SL: Versatile and open software for comparing large genomes. Genome Biol. 2004, 5: R12-10.1186/gb-2004-5-2-r12.PubMedPubMed CentralView ArticleGoogle Scholar
- Shi L, Reid LH, Jones WD, Shippy R, Warrington JA, Baker SC, Collins PJ, de Longueville F, Kawasaki ES, Lee KY: The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nat Biotechnol. 2006, 24: 1151-1161. 10.1038/nbt1239.PubMedView ArticleGoogle Scholar
- Mockler TC, Chan S, Sundaresan A, Chen H, Jacobsen SE, Ecker JR: Applications of DNA tiling arrays for whole-genome analysis. Genomics. 2005, 85: 1-15. 10.1016/j.ygeno.2004.10.005.PubMedView ArticleGoogle Scholar
- Ji H, Wong WH: TileMap: create chromosomal map of tiling array hybridizations. Bioinformatics. 2005, 21: 3629-3636. 10.1093/bioinformatics/bti593.PubMedView ArticleGoogle Scholar
- David L, Huber W, Granovskaia M, Toedling J, Palm CJ, Bofkin L, Jones T, Davis RW, Steinmetz LM: A high-resolution map of transcription in the yeast genome. Proc Natl Acad Sci U S A. 2006, 103: 5320-5325. 10.1073/pnas.0601091103.PubMedPubMed CentralView ArticleGoogle Scholar
- Cawley S, Bekiranov S, Ng HH, Kapranov P, Sekinger EA, Kampa D, Piccolboni A, Sementchenko V, Cheng J, Williams AJ: Unbiased mapping of transcription factor binding sites along human chromosomes 21 and 22 points to widespread regulation of noncoding RNAs. Cell. 2004, 116: 499-509. 10.1016/S0092-8674(04)00127-8.PubMedView ArticleGoogle Scholar
- Johnson WE, Li W, Meyer CA, Gottardo R, Carroll JS, Brown M, Liu XS: Model-based analysis of tiling-arrays for ChIP-chip. Proc Natl Acad Sci U S A. 2006, 103: 12457-12462. 10.1073/pnas.0601180103.PubMedPubMed CentralView ArticleGoogle Scholar
- Li W, Meyer CA, Liu XS: A hidden Markov model for analyzing ChiP-chip experiments on genome tiling arrays and its application to p53 binding sequences. Bioinformatics. 2005, 21 (Suppl 1): i274-i282. 10.1038/nbt1053.PubMedView ArticleGoogle Scholar
- Keles S, van der Laan MJ, Dudoit S, Cawley SE: Multiple testing methods for ChIP-Chip high density oligonucleotide array data. J Comput Biol. 2006, 13: 579-613. 10.1089/cmb.2006.13.579.PubMedView ArticleGoogle Scholar
- Benson G: Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 1999, 27: 573-580. 10.1093/nar/27.2.573.PubMedPubMed CentralView ArticleGoogle Scholar
- Bailey JA, Yavor AM, Massa HF, Trask BJ, Eichler EE: Segmental duplications: organization and impact within the current human genome project assembly. Genome Res. 2001, 11: 1005-1017. 10.1101/gr.GR-1871R.PubMedPubMed CentralView ArticleGoogle Scholar
- Carroll JS, Meyer CA, Song J, Li W, Geistlinger TR, Eeckhoute J, Brodsky AS, Keeton EK, Fertuck KC, Hall GF: Genome-wide analysis of estrogen receptor binding sites. Nat Genet. 2006, 38: 1289-1297. 10.1038/ng1901.PubMedView ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.