UPS 2.0: unique probe selector for probe design and oligonucleotide microarrays at the pangenomic/genomic level
© Chen et al. 2010
Published: 2 December 2010
Skip to main content
© Chen et al. 2010
Published: 2 December 2010
Nucleic acid hybridization is an extensively adopted principle in biomedical research, in which the performance of any hybridization-based method depends on the specificity of probes to their targets. To determine the optimal probe(s) for detecting target(s) from a sample cocktail, we developed a novel algorithm, which has been implemented into a web platform for probe designing. This probe design workflow is now upgraded to satisfy experiments that require a probe designing tool to take the increasing volume of sequence datasets.
Algorithms and probe parameters applied in UPS 2.0 include GC content, the secondary structure, melting temperature (Tm), the stability of the probe-target duplex estimated by the thermodynamic model, sequence complexity, similarity of probes to non-target sequences, and other empirical parameters used in the laboratory. Several probe background options, Unique probe within a group , Unique probe in a specific Unigene set , Unique probe based onthe pangenomic level , and Unique Probe in the user-defined genome/transcriptome , are available to meet the scenarios that the experiments will be conducted. Parameters, such as salt concentration and the lower-bound Tm of probes, are available for users to optimize their probe design query. Output files are available for download on the result page. Probes designed by the UPS algorithm are suitable for generating microarrays, and the performance of UPS-designed probes has been validated by experiments.
The UPS 2.0 evaluates probe-to-target hybridization under a user-defined condition to ensure high-performance hybridization with minimal chance of non-specific binding at the pangenomic and genomic levels. The UPS algorithm mimics the target/non-target mixture in an experiment and is very useful in developing diagnostic kits and microarrays. The UPS 2.0 website has had more than 1,300 visits and 360,000 sequences performed the probe designing task in the last 30 months. It is freely accessible at http://array.iis.sinica.edu.tw/ups/
Nucleic acid hybridization is an extensively adopted principle in biomedical research. Several methods have been derived based on this principle to detect targets using sequence-specific probes (e.g., northern blot, southern blot, and in situ hybridization). The lab protocol for probe hybridization is further optimized and miniaturized into microarray format to detect transcriptional activity of thousands of genes simultaneously . The performance of these widely adopted methods depends on the specificity of probes to their targets. The selection of suitable oligonucleotide probes remains a bottleneck in the microarray workflow . To find the best, unique probe(s) for detecting target(s) from a sample cocktail, we developed an algorithm and implemented this algorithm in a probe design web platform, the Unique Probe Selector (UPS) . The parameters used in UPS include GC content, GC clamps, the duplex stability estimated by thermodynamic theory model, the secondary structure of probes, a low-complexity mask, and other empirical preferences of wet-lab researchers. Thus, this probe design tool is able to overcome the problem of background noise during hybridization.
Several tools are available for probe design. OligoArray 2.1 [4, 5], GoArrays , OligoPicker , ArrayOligoSelector , YODA , ProbeSelect , and Picky  are stand-alone tools that run in command mode and provide unique probe candidates for input sequences under the constraints of GC%, Tm, absence of low complexity, and position near the 3' end. OligoWiz2  is a java-based server-client solution for probe design in a graphical user interface. Oligodb  and OligoArrayDb  are databases providing pre-calculated probes for human/mouse transcriptome and sequenced Archaea/Bacteria/Eukaryote species, respectively. Golfier et al.  developed an informatics tool called SOL, which updates mouse sequences in the NCBI/RefSeq semi-automatically and outputs a collection of mouse oligonucleotide probe candidates. Researchers can set up an SOL and refine the parameters when working on probe design tasks for other species. Although most of these tools are useful when designing nucleotide probes, few consider background noise of cross-hybridization, or require user skills to set up and optimize the tool for an experiment. Thus, we implemented a user-friendly web-based tool UPS  for designing probes with small likelihoods of forming non-specific duplexes to the other submitted targets in the same query batch or to the sequence set of an NCBI/Unigene-listed species, and upgrades its capability for a customized reference sequence set in the current version.
Recent advances in sequencing methods, such as next generation sequencing (NGS) technology, allow researchers to decrypt gene contexts of an unknown genome faster and cheaper than ever before. These methods run millions of sequencing reactions in parallel, generating short sequence reads on the gigabase-scale, and harvest assembled contigs by algorithms. These methods have been successfully applied for genome re-sequencing and have been expended to resolve novel genomes and multiple genomes in a mixture (the approach of metagenome). An application for sequencing a transcriptome to resolve the gene context of an unknown genome is also available. Since both the time and cost of sequencing a transcriptome or genome are markedly reduced, researchers are willing to use a sequencing approach to leap over the bottlenecks when conducting a classical genome/EST library sequencing project. Thus, the collection of non-target sequences of a hybridization experiment has an increased possibility of its availability, even when the target organism is not a well-studied model species. The NGS assemblies can also be raw material for designing probes for fabricating arrays such as 16S rRNA gene-based diagnostic arrays . Several widely used tools, such as ORMA , ARB , and PRIMROSE , are available for classification and phylogenetic analysis of bacterial species. These tools conduct the probe design process when the collection of molecular phylogenetic markers (i.e., 16S rRNA) is updated along with the renewal of the sequence database. The primer design strategy of each tools are different; it is difficult in choosing probes provided from different tools.
To reduce the likelihood of non-target hybridization, we designed a workflow for finding the best probes of input sequences based on uniqueness of probe sequence and probe-to-target duplex stability. The Nearest-Neighbor model  and blastn were applied to identify probe candidates for each input sequence with a minimal likelihood of cross-hybridization. First, probes with single, di-, or tri- nucleotide repeats are excluded as described previously. Next, the similarity of probes for a non-target sequence is evaluated. The following criteria developed by Li et al.  are used to further exclude unsuitable oligonucleotides: 1) overall similarity of a probe to its non-target sequences in the same submitted batch should be <85%; and, 2) a continuous identical stretch of a probe to non-target sequences must not exceed 17 bases. Both criteria integrated with blastn are called in silico hybridization in this work. Furthermore, the uniqueness of probes is assessed using the frequency table built from the reference sequence set given in one of three customized options. The simplest condition is finding unique probes for each sequence among an input sequence set (the “ Unique probe within a group ” option). The second option, “ Unique probe in the specific Unigene , ” is for devising unique probes for each submitted sequence with the smallest chance of cross-hybridizing the selected Unigene set. To meet the needs of researchers conducting metagenomic studies, this work developed a new option, " Unique Probe based on the pangenomic level ", which allows users to design probes using the environmental nucleotide sequences (Env_NT) or the non-redundant NCBI nucleotides database (NCBI_NT) as background reference set. Via another novel option, “ Unique Probe in the user-defined organism , ” users can design probes by adopting uploaded sequences as the background reference. The file size of the uploaded reference sequence set can be up to 30 Mb, and the sequence length of a single sequence entry is up to 5 million bases. Probe candidates passing the criteria (e.g., Tm cutoff, GC%, the proximity of 3' end, in silico hybridization (sequence complexity and homology to non-target sequence), are ranked by the E-value of the probe-target duplex (blastn) using the scoring matrix built from the customized reference sequence. Additional details of in silico hybridization can be found on the UPS 2.0 online HELP page (http://array.iis.sinica.edu.tw/ups/help1.html).
Using common primer sets, such as a consensus primer set of small subunit rRNA gene region, a bundle of PCR products can be amplified from a sample with mixed microbial genomes. To detect the presence of a particular species, a dot blotting assay can be contrived based on a small set of probes with good discrimination power. Using the UPS option “Unique Probe within a group,” a user can find unique probes for each sequence with minimal likelihood of matching non-target sequences in the uploaded sequence set. Such a probe set is suitable for detecting malicious pathogens in environmental samples or food contaminants. If target sequences are mixed with various genomes, such as sludges from sanitary sewers, one can use the option “Unique Probe based on the pangenomic level” to assign all known environmental sequences as the reference background. The assay can be started with a PCR using degenerated primer sets to amplify target sequences, following by an hybridization on a small array with sequence-specific probes to detect the presence of specific bacteria species.
The spatiotemporal activity of a pathogenic genome is a critical issue. To design probes for detecting pathogen sequences in a specimen from a particular organism in the Unigene species list (e.g., human) or a closely related one, we may set pathogen sequences as the input set in the option “Unique Probe in a specific Unigene set” and assign the sequence set of the species (Homo_sapiens, in this case) as the reference background. Then, UPS 2.0 will find suitable probes for identifying pathogen sequences with low false-positive hits on host sequences.
Unigene collects EST clusters from more than 100 species; however, partially or unpublished sequence assemblies from species other than those in Unigene list may be available in individual labs. To design probes working in a biopsy from a particular organism, such as detecting pathogen transcripts, one can assign the background by uploading a set of reference sequences (a file in FASTA format) of this species using the option “Unique Probe in the user-defined organism.” Thus, pathogen-specific probes will be selected using UPS criteria, which ensure reduced likelihood of cross-hybridization of a probe to a non-target sequence.
UPS 2.0 is an updated from the previous platform, which is a probe design tool for identifying the unique segment in each submitted sequence in a query dataset, or the unique segment of input sequences in the user-defined background. With the flexible and intuitive interface in UPS 2.0, users can easily customize their probe design parameters by following the user-friendly online instruction and obtain probes with quality information. Probes designed in UPS 2.0 can be utilized to detect a target in a classical hybridization-based assay. Notably, probes from such batch-wise query meet the requirements of microarray probe design. In UPS 2.0, Unigene data are updated, and UPS 2.0 can take the user-uploaded sequence set as a background reference. To date, more than 100 species are listed in the current Unigene dataset and this number continues to increase. The new option allows users to upload a reference set other than the Unigene list. These improvements expand the choice of a user-defined reference during probe design.
The UPS workflow mimics the condition of target/non-target mixture in an actual experiment to be carried out in a lab bench. This is beneficial when developing a kit for molecular diagnosis or typing. Thus, probes designed by UPS 2.0 are particularly suitable for detecting infectious agents, such as bacteria and viruses, with little chance of cross hybridization of host transcripts, or for detecting a specially selected gene list (gene signatures) using small arrays with minimal background noise. A single entry in the user-uploaded reference set can be 5×106 bps, meaning that most prokaryote genomes can serve as the working probe background in UPS 2.0.
UPS has run stably for over 2 years. More than 1,300 visitors have submitted their queries and about 360,000 sequences have been submitted to UPS 2.0. The performance of UPS will be continually improved by applying a relatively more sophisticated algorithm and reducing calculation time. The spectrum of reference backgrounds has been expanded by incorporating published metagenomic datasets, environmental nr, and the entire genome sequence set in the public domain into customized options to optimize oligonucleotide probe design for experiments on the pangenomic/genome scale.
To improve the performance of UPS 2.0, the hardware was upgraded on symmetrical multi-processor (SMP) PCs equipped with quad CPUs (Intel Xeon E5420 2.5GHz) and 16 GB RAM. The LAPP structure (SUSE Linux Enterprise Server 10, Apache v.2.04, PostgreSQL v.8.2.4, PHP v.5.1.0) facilitates web access, file uploading, format checking, result storing/downloading, and job controlling, and supports e-mail notifying functions.
All calculations in the core algorithm for suffix arrays, constraints for probe selection (GC%, low complexity masking, distance from the 3' end), free energy for a probe’s secondary structure, melting temperature estimated in a thermodynamic model, and in silico hybridization based on the NCBI blastn  with several checking rules, i.e., percentage of similarity and maximum overlapped bases of continued sequences between probe and possible targets, are performed in Boland Delphi 2006 on an MS Windows 2003 server. To provide the most updated background reference, an automatic procedure for rebuilding suffix arrays for each species from the updated Unigene databases is currently scheduled.
UPS parameters were described in Chen et al . Some probe parameters are adjustable with increased flexibility in UPS 2.0 in response to user requests. Briefly, probe length is adjustable (range, 30 - 120 mers in length). The salt concentration is set at 0.58M and is adjustable (range 0 - 1M) to fit the ionic strength in the user-defined condition. Probe candidates with a low complexity segment(s), such as five or more continual nucleotides and continuous di-/tri-nucleotide repeats, are excluded. The acceptable GC% of probes is 30 - 70% empirically and can be adjusted by users. Further constraints on probe features, GC% and the lower-bound of Tm, are available in the output browsing interface. The melting temperature of probes is calculated based on the Nearest-Neighbor model , and the secondary structure of a designed probe is calculated via the Perl program UNAFold.pl . Thermodynamic theory was integrated to estimate the ΔG of probe-target duplexes, probe-non-target duplexes, and the probe secondary structure.
The authors would like to thank the National Science Council of the Republic of China, Taiwan, for financially supporting this research under Contract Nos. NSC 98-2221-E-001-018 and NSC 98-3112-B-400-010. Yu-Bin Wang, Jan-Hrong Lu, and Chu-Fang Lo from National Taiwan University are appreciated for their works on experimental validation and Ted Knoy for his editorial assistance.
This article has been published as part of BMC Genomics Volume 11 Supplement 4, 2010: Ninth International Conference on Bioinformatics (InCoB2010): Computational Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/11?issue=S4
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.