Automated design of genomic Southern blot probes
BMC Genomics volume 11, Article number: 74 (2010)
Southern blotting is a DNA analysis technique that has found widespread application in molecular biology. It has been used for gene discovery and mapping and has diagnostic and forensic applications, including mutation detection in patient samples and DNA fingerprinting in criminal investigations. Southern blotting has been employed as the definitive method for detecting transgene integration and successful homologous recombination in gene targeting experiments.
The technique employs a labeled DNA probe to detect a specific DNA sequence in a complex DNA sample that has been separated by restriction digestion and gel electrophoresis. Critically, for the technique to succeed, the probe must be unique to the target locus so that it does not cross-hybridize to other endogenous DNA within the sample.
Investigators routinely employ a manual approach to probe design. A genome browser is used to extract DNA sequence from the locus of interest, which is searched against the target genome using a BLAST-like tool. Ideally a single perfect match is obtained to the target, with little cross-reactivity caused by homologous DNA sequence present in the genome and/or repetitive and low-complexity elements in the candidate probe. This is a labor-intensive process, often requiring several attempts to find a suitable probe for laboratory testing.
We have written an informatic pipeline to automatically design genomic Southern blot probes that specifically attempts to optimize the resultant probe. It employs a brute-force strategy: many candidate probes of acceptable length are generated in the user-specified design window, all are searched against the target genome, and the candidates are then scored and ranked by uniqueness and repetitive DNA element content. Using these in silico measures we can automatically design probes that we predict to perform as well as, or better than, our previous manual designs, while considerably reducing design time.
We went on to experimentally validate a number of these automated designs by Southern blotting. The majority of probes we tested performed well, confirming our in silico prediction methodology and the general usefulness of the software for automated genomic Southern probe design.
Software and supplementary information are freely available at: http://www.genes2cognition.org/software/southern_blot
Southern blotting is a DNA analysis technique that allows one to detect a specific DNA sequence in a complex DNA sample. Gel electrophoresis is used to size-separate restriction-digested DNA, which is then transferred (or blotted) to a solid support such as a filter for probing and detection by radioactive or luminescent labelling.
The method has found widespread application throughout molecular biology. It has been used for gene discovery and mapping. It also has diagnostic and forensic applications, such as mutation detection in patient samples and DNA fingerprinting in criminal investigations. It has been employed as the definitive method for detecting transgene integration, and successful homologous recombination in gene targeting experiments that ablate or modify a gene's function in vivo.
For the technique to succeed one needs to identify a probe sequence that is unique within the genome for the gene or locus of interest so that it does not cross-hybridize with other endogenous DNA sequences present in the sample. Like others we have routinely used a manual approach to design and test our probes, which is labor intensive and usually requires trial of different probes before the desired result is obtained. It is therefore highly desirable to have bioinformatic tools that aid in the design process and can also optimize the probes.
The typical manual approach is to choose a probe of at least 300 bp in length, to ensure efficient labeling in the random priming reaction, and in practice probes of 500-1000 bp are generally employed. Following identification of the genomic locus one wishes to probe, the DNA sequence from a genome browser (such as Ensembl) is examined for repetitive sequence elements, as these can result in an intense background smear on hybridization that obscures single-copy gene hybridization signals. The test probe is then searched against the genome using BLAST or other means and the results inspected. One hopes to obtain a single perfect match to the target locus on the genome, with little or no cross-reactivity to other loci. If this is not the case, one has to return to the genome browser and move and/or shorten the sequence before repeating the BLAST search. With each genome search taking several minutes, this is a time-consuming exercise and is unlikely to yield the best possible probe.
Clearly this method is amenable to bioinformatic automation. Many programs already exist for oligonucleotide probe discovery, principally in the area of microarray design. These programs are generally designed to find probes of less than 100 bp, rendering them inapplicable for the considerably longer Southern blot probes. To address this need we have written a system to find (near) unique probes in a specified region of a genome, which contain little or no repetitive DNA sequence, and also to design PCR primers to facilitate the recovery of the probes from cellular DNA for subsequent Southern blotting. We went on to experimentally validate a number of these designs by Southern blotting in the mouse genome.
Given user-supplied chromosomal coordinates, and a desirable size range for the Southern blot probe (default 500-1300 bp), we used a tiling approach to generate many possible probes in the specified design window. The program starts from the maximum allowable probe length, tiling the window by moving by a small percentage of the probe length each time (default 5%). Once this is completed the probe length is reduced by 50 bases (configurable) and the window re-tiled, generating more candidate probes. The process is repeated until the minimum probe length is reached (see Figure 1). The candidate probes are searched against the target genome using the Exonerate pairwise sequence alignment program.
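The tiling procedure can be illustrated with a short Python sketch. This is not the published Perl implementation; the function name, the 1-based inclusive coordinate convention, and the integer rounding of the step size are assumptions made for illustration only. The defaults (500-1300 bp, 5% step, 50-base decrement) are those stated in the text.

```python
from typing import Iterator, Tuple


def tile_candidates(window_start: int, window_end: int,
                    min_len: int = 500, max_len: int = 1300,
                    step_percent: int = 5,
                    length_decrement: int = 50) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) of candidate probes, 1-based inclusive (assumed)."""
    window_len = window_end - window_start + 1
    # Start from the longest allowable probe that still fits in the window.
    length = min(max_len, window_len)
    while length >= min_len:
        # Move by a small percentage of the current probe length each time.
        # Integer division is an assumption about how the step is rounded.
        step = max(1, length * step_percent // 100)
        last_start = window_end - length + 1
        for start in range(window_start, last_start + 1, step):
            yield start, start + length - 1
        # Shorten the probe and re-tile the window.
        length -= length_decrement


if __name__ == "__main__":
    candidates = list(tile_candidates(1, 3000))
    print(len(candidates), "candidate probes; first:", candidates[0])
```

Under these assumptions a 3 kb window yields on the order of 900 candidates with the default settings, with shorter probe lengths contributing more tiles because their step size is smaller.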
We calibrated the method using a set of manually-designed genomic probes that we have previously successfully employed for Southern blotting (see Table 1). These had an average length of ~800 bp and when searched with Exonerate (with parameters --model affine:local --score 150) all of these produced a perfect match to their genomic locus (as would be expected) and a number of additional lower-scoring alignments to other loci. On average these second-best matches spanned 17.5 ± 5.8% (mean ± standard error) of the probe length, with 74.6 ± 3.2% DNA sequence identity (n = 8). From the scores of the on-target or 'self-hit' and the highest-scoring off-target locus alignments we calculated a score ratio as a measure of uniqueness of the candidate probe. Our calibration probes averaged 19.5 ± 3.6 (n = 8). This score ratio depends on both the length and the sequence identity of the two matches.
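As a minimal sketch of the uniqueness measure, the score ratio can be computed as the self-hit score divided by the best off-target score. The direction of the ratio is inferred from the reported values (calibration mean of 19.5), and the function name and the treatment of probes with no off-target alignment are assumptions, not details confirmed by the text.

```python
import math
from typing import Sequence


def score_ratio(self_hit_score: float,
                off_target_scores: Sequence[float]) -> float:
    """On-target alignment score divided by the best off-target score.

    Returns math.inf when no off-target alignment was reported, which is
    one possible way to represent a probe with a single genomic hit.
    """
    if self_hit_score <= 0:
        raise ValueError("self-hit score must be positive")
    if not off_target_scores:
        return math.inf
    return self_hit_score / max(off_target_scores)


if __name__ == "__main__":
    # Alignment scores below are made up for illustration.
    print(score_ratio(4000, [200, 180]))
    print(score_ratio(4000, []))
```

Because only alignments scoring at least the search threshold are reported, a probe whose weaker matches all fall below that threshold would be treated as unique in this sketch.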
Comparing the probe sequences to a version of the genomic assembly that has been screened for repeats and low-complexity regions by RepeatMasker and DUST allows us to estimate the repetitive DNA content of individual probes. Our calibration probes contained 18.2 ± 10.8% such DNA.
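One simple way to estimate this content, assuming the screened assembly is soft-masked (masked bases in lower case, the usual convention), is to count lower-case bases in the probe sequence. The function below is an illustrative sketch with a hypothetical name, not code from the pipeline.

```python
def masked_fraction(soft_masked_seq: str) -> float:
    """Fraction of bases that are soft-masked (lower case) in a sequence."""
    # Ignore anything that is not a letter, e.g. whitespace or line breaks.
    bases = [b for b in soft_masked_seq if b.isalpha()]
    if not bases:
        return 0.0
    return sum(b.islower() for b in bases) / len(bases)


if __name__ == "__main__":
    # Toy sequence: 4 of 12 bases are masked.
    print(masked_fraction("ACGTACGTacgt"))
```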
Considering these results we chose a minimum score ratio of 10 and a maximum combined repetitive and low-complexity base content of 5% as the minimum requirements for probe acceptance (configurable). Candidate probes reaching these criteria that were completely overlapped by a longer and better-scoring probe are considered redundant and removed from the passing set.
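The acceptance and redundancy rules can be sketched as follows. The `Candidate` class and function names are hypothetical, and "better-scoring" is interpreted here as a higher score ratio, which is an assumption; the cut-offs are the stated defaults.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    start: int
    end: int
    score_ratio: float
    masked_fraction: float  # 0.05 means 5% repetitive/low-complexity bases

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def passes(c: Candidate, min_ratio: float = 10.0,
           max_masked: float = 0.05) -> bool:
    """Apply the minimum score ratio and maximum masked-content cut-offs."""
    return c.score_ratio >= min_ratio and c.masked_fraction <= max_masked


def is_redundant(c: Candidate, others: List[Candidate]) -> bool:
    """True if a longer, better-scoring candidate completely overlaps c."""
    return any(o.start <= c.start and o.end >= c.end
               and o.length > c.length
               and o.score_ratio > c.score_ratio
               for o in others)


def select_probes(candidates: List[Candidate], min_ratio: float = 10.0,
                  max_masked: float = 0.05) -> List[Candidate]:
    """Return passing, non-redundant candidates, best score ratio first."""
    passing = [c for c in candidates if passes(c, min_ratio, max_masked)]
    kept = [c for c in passing if not is_redundant(c, passing)]
    return sorted(kept, key=lambda c: (-c.score_ratio, c.masked_fraction))
```

In this sketch redundancy is judged only against other passing candidates, so a short probe is never discarded in favour of a longer one that itself fails the cut-offs.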
With the number of genome searches to be carried out potentially taking several hours for each Southern blot probe design, we thought that employing a single program and computer to complete the whole task was unlikely to achieve a reliable and timely solution. Instead we decided to use a database to store and retrieve the design information for each probe, and subsequently to hold the results of the many genome searches carried out for candidate probes. Multiple processors and cores, as available from a compute cluster, are employed to perform the genome searching, reducing the real time taken to test the probe designs in silico. When all the searches are complete, the whole set of genome-search results is analyzed to find the best probe candidates.
A MySQL database (12 tables) was designed for this purpose together with a set of Perl data objects and SQL adaptor classes to allow programs to write and retrieve from the database. These follow the Ensembl API and schema design where one creates a set of classes representing the core objects in the system, in this case probe designs, candidate probe sequences belonging to a design to be tested, and their matches to the target genome, partnered by a set of complementary adaptor classes that hold the cognate SQL necessary for storing and retrieving these from the database. Changes to the database schema can then be made without impact on the object classes.
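The object/adaptor separation described above can be illustrated with a small Python sketch. It uses the standard-library sqlite3 module in place of MySQL so that it runs without a server, and the class, table, and column names are hypothetical rather than taken from the published schema.

```python
import sqlite3
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CandidateProbe:
    """Plain data object: knows nothing about SQL."""
    design_id: int
    start: int
    end: int
    sequence: str
    dbid: Optional[int] = None


class CandidateProbeAdaptor:
    """Holds all SQL for storing and fetching CandidateProbe objects."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS candidate_probe ("
            " candidate_probe_id INTEGER PRIMARY KEY,"
            " design_id INTEGER NOT NULL,"
            " seq_start INTEGER NOT NULL,"
            " seq_end INTEGER NOT NULL,"
            " sequence TEXT NOT NULL)")

    def store(self, probe: CandidateProbe) -> CandidateProbe:
        cur = self.conn.execute(
            "INSERT INTO candidate_probe"
            " (design_id, seq_start, seq_end, sequence) VALUES (?, ?, ?, ?)",
            (probe.design_id, probe.start, probe.end, probe.sequence))
        self.conn.commit()
        probe.dbid = cur.lastrowid
        return probe

    def fetch_all_by_design_id(self, design_id: int) -> List[CandidateProbe]:
        rows = self.conn.execute(
            "SELECT candidate_probe_id, design_id, seq_start, seq_end,"
            " sequence FROM candidate_probe WHERE design_id = ?"
            " ORDER BY candidate_probe_id", (design_id,)).fetchall()
        return [CandidateProbe(design_id=r[1], start=r[2], end=r[3],
                               sequence=r[4], dbid=r[0]) for r in rows]
```

Because only the adaptor contains SQL, a renamed column or an added table would require changes in the adaptor alone, which is the property the design aims for.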
We then decomposed the task into three principal steps:
create_probe_search takes the user-specified chromosomal coordinates for the design and generates many candidate probes at the granularity governed by the window tiling parameters, storing the design specification and the candidate probe sequences in the database. Use is made of the Ensembl API to extract the DNA sequence (or Slice) from the genome assembly covering the probe design window, then subsequently to extract sub-sequences to generate each candidate probe sequence. These sequences are grouped into batches (or jobs) for efficient searching with Exonerate in step 2.
run_probe_search takes the set of sequences specified by a particular job, searches them against the target genome, and parses the Exonerate output results, storing the hits, including their scores, location, and masked sequence content (as ascertained from the soft-masking) in the database. run_probe_search is not launched interactively but is initiated by submit_probe_search that utilizes the LSF job scheduling system to run many separate instances of run_probe_search in parallel to complete the genome searches required for a probe design.
analyse_probe_search is the final step in the probe design process. It checks that all the genome-searching jobs for the probe design have been completed successfully, then fetches the alignment results from the database, applying the specified cut-off criteria for score-ratio and repetitive/low-complexity DNA content. These are used to separate and rank the sequences into unique, passed and failed groups. Redundant (but passed) sequences are filtered into a fourth bin.
If none of the sequences in the probe design pass at the specified criteria, the cut-offs are automatically relaxed to find the best (but poorly-scoring) probes in what is likely to be a portion of the genome in which it is difficult to design Southern blot probes. Primer3 is then used to generate primers for recovery of the passed candidate probes, run using the BioPerl-Run wrapper. Chosen primers can be manually checked for potentially confounding polymorphisms if required, by searching dbSNP.
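The text does not state how the cut-offs are relaxed, so the following is only a sketch of one possible scheme: the score-ratio cut-off is multiplied by a factor and the masked-content cut-off is raised by a fixed increment until at least one candidate passes. The function name, the relaxation schedule, and the round limit are all assumptions.

```python
from typing import List, Tuple

# Each candidate is summarised as (score_ratio, masked_fraction).
Metrics = Tuple[float, float]


def select_with_relaxation(candidates: List[Metrics],
                           min_ratio: float = 10.0,
                           max_masked: float = 0.05,
                           ratio_factor: float = 0.8,
                           masked_increment: float = 0.05,
                           max_rounds: int = 10
                           ) -> Tuple[List[Metrics], float, float]:
    """Return (passing candidates, ratio cut-off used, masked cut-off used)."""
    rounds = 0
    while True:
        passing = [c for c in candidates
                   if c[0] >= min_ratio and c[1] <= max_masked]
        if passing or rounds == max_rounds:
            break
        # Hypothetical schedule: loosen both criteria a little each round.
        min_ratio *= ratio_factor
        max_masked = min(1.0, max_masked + masked_increment)
        rounds += 1
    # Best score ratio first, ties broken by lower masked content.
    passing.sort(key=lambda c: (-c[0], c[1]))
    return passing, min_ratio, max_masked
```

Returning the cut-offs that were actually used lets the report flag a design as having passed only under relaxed criteria.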
Static web output is generated for user inspection (see Figure 2). This includes the probe design window coordinates, counts of the candidate probes generated and subsequently placed in each bin and results of the quality assurance checks that each genome search generates an 'on target' hit to the correct position on the genome for the candidate probe sequence. A graphic plot is rendered showing the frequency of occurrence of each base position in the set of sequences found in the unique, passed or failed probe groups across the design window.
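The per-base frequencies behind such a plot can be computed by counting, for every position in the design window, how many probes of a given group cover it. The sketch below does this with a difference array; the function name and the 1-based inclusive coordinates are assumptions, and rendering of the graphic itself is omitted.

```python
from typing import Iterable, List, Tuple


def base_frequencies(window_start: int, window_end: int,
                     probes: Iterable[Tuple[int, int]]) -> List[int]:
    """Count how many probes cover each base position of the window."""
    size = window_end - window_start + 1
    diff = [0] * (size + 1)
    for start, end in probes:
        if start < window_start or end > window_end or start > end:
            raise ValueError(f"probe {start}-{end} lies outside the window")
        diff[start - window_start] += 1      # coverage begins here
        diff[end - window_start + 1] -= 1    # coverage ends after 'end'
    counts: List[int] = []
    running = 0
    for delta in diff[:size]:
        running += delta
        counts.append(running)
    return counts


if __name__ == "__main__":
    # Two toy probes in a 10-base window.
    print(base_frequencies(1, 10, [(1, 5), (3, 8)]))
```

Running the function separately on the unique, passed, and failed groups gives one series per group across the same window.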
Four accessory scripts are also provided. create_probe_search_db_tables creates the MySQL database tables for the design pipeline. delete_probe_design removes a probe design from the database should it have been wrongly specified, or is no longer needed. delete_job_results removes the results of a particular batch of Exonerate genome searches from the database should an error have occurred, allowing the job to be resubmitted, and finally, get_probe_search_cpu_time calculates the total time to execute the searches for a given probe design.
Each of these programs reads its customizable parameters from an .ini-type configuration file.
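As an illustration of this style of configuration, the sketch below parses an .ini fragment with Python's standard configparser module. The pipeline itself uses the Perl Config::IniFiles module; the section and key names here are hypothetical, and only the default values are taken from the text.

```python
import configparser
from typing import Dict, Union

# Hypothetical configuration; key names are not those of the real pipeline.
EXAMPLE_INI = """
[probe_design]
min_probe_length = 500
max_probe_length = 1300
tile_step_percent = 5
length_decrement = 50
min_score_ratio = 10
max_masked_percent = 5
"""


def load_design_params(ini_text: str) -> Dict[str, Union[int, float]]:
    """Parse design parameters, falling back to the documented defaults."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    section = parser["probe_design"]
    return {
        "min_probe_length": section.getint("min_probe_length", fallback=500),
        "max_probe_length": section.getint("max_probe_length", fallback=1300),
        "tile_step_percent": section.getint("tile_step_percent", fallback=5),
        "length_decrement": section.getint("length_decrement", fallback=50),
        "min_score_ratio": section.getfloat("min_score_ratio", fallback=10.0),
        "max_masked_percent": section.getfloat("max_masked_percent",
                                               fallback=5.0),
    }


if __name__ == "__main__":
    print(load_design_params(EXAMPLE_INI))
```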
Results and Discussion
To date we have designed 124 probes using flanking regions in about 60 genes that we chose to perturb by gene targeting in mouse embryonic stem cells. Given a ~3 kb window in which to search for a Southern blot probe and a desirable length range for the final probe of 500-1300 bp, the tiling strategy outlined produces on average ~900 candidate probes (when used with the default granularity) to search against the genome (see Figures 1 and 3).
In total 103/124 (83%) of these designs passed by the criteria above of score ratio ≥ 10 and repetitive/low-complexity DNA content ≤ 5%. On average the best candidate probes for each design were 818.1 ± 25.0 bp long and contained only 4.1 ± 1.0% repetitive and low complexity DNA, the latter being significantly lower than our manually-designed calibration set (p < 0.05, Student's t-test, Table 2).
Additionally, by such brute-force searching and scoring of genomic probes, in exactly half the design cases (62) it was possible to find one or more unique probes amongst those tested. These had a single (i.e. ideal) hit to their target genomic locus with no cross-reactivity to other loci. It is worth noting that none of the calibration probes gave this 'ideal' result when evaluated using the same Exonerate search parameters.
The remaining probes (97/124) that passed our empirical cut-off criteria had an average score ratio of 23.7 ± 1.3, which was not significantly different from the experimental calibration set (p > 0.05; Student's t-test).
In order to confirm the system does design effective Southern blot probes we experimentally tested 16 of the in silico designs. Blots were performed on mouse genomic DNA extracted from embryonic stem cell lines in order to confirm that homologous recombination had occurred, thus correctly targeting the gene to be ablated, as part of our high-throughput mouse knockout and molecular neurobiological phenotyping programme. 13/16 of the probes tested gave a usable signal upon blotting; the remainder gave a smear likely indicative of non-specific probe binding, or no resolvable signal. Representative Southern blots are shown in Figure 4.
We have developed an automated system for the effective design of Southern blot probes. Many candidate probes that lie in a given genomic window are searched against the target genome in a brute-force approach to finding the best probe in the locus, as assessed by uniqueness and repetitive DNA sequence content. Using these in silico measures we can automatically design probes that would be predicted to perform as well as, or better than, previous manual designs, while reducing the time taken by the molecular biologist to yield a successful probe. The majority of the probes we tested experimentally in Southern blotting performed well, confirming our in silico prediction methodology and the usefulness of the software for automated genomic Southern blot probe design.
Availability and requirements
Project name: southern_blot
Project home page: http://www.genes2cognition.org/software/southern_blot and Additional file 1.
Operating system(s): UNIX and Linux variants
Programming language: Perl and SQL
Other requirements: BioPerl core 1.5.0 or higher, BioPerl run 1.4 or higher, Ensembl core 32 or higher, Config::IniFiles 2.38 or higher, DBI 1.32 or higher, GD 2.17 or higher, Exonerate 1.0.0, Primer3 1.0.0, LSF 5.1 or higher, MySQL 5.045 or higher
License: Artistic License 2.0
Any restrictions to use by non-academics: none
API: application programming interface
BLAST: basic local alignment search tool
LSF: load sharing facility
PCR: polymerase chain reaction
SQL: structured query language.
Southern EM: Detection of specific sequences among DNA fragments separated by gel electrophoresis. J Mol Biol. 1975, 98 (3): 503-517. 10.1016/S0022-2836(75)80083-0.
Grant SG, Jessee J, Bloom FR, Hanahan D: Differential plasmid rescue from transgenic mouse DNAs into Escherichia coli methylation-restriction mutants. Proc Natl Acad Sci USA. 1990, 87 (12): 4645-4649. 10.1073/pnas.87.12.4645.
Komiyama NH, Watabe AM, Carlisle HJ, Porter K, Charlesworth P, Monti J, Strathdee DJ, O'Carroll CM, Martin SJ, Morris RG: SynGAP regulates ERK/MAPK signaling, synaptic plasticity, and learning in the complex with postsynaptic density 95 and NMDA receptor. J Neurosci. 2002, 22 (22): 9721-9732.
Sambrook J, Russell DW: Molecular Cloning: A Laboratory Manual. 2001, New York: Cold Spring Harbor Laboratory Press, 1, 3rd edition.
Hubbard TJ, Aken BL, Ayling S, Ballester B, Beal K, Bragin E, Brent S, Chen Y, Clapham P, Clarke L: Ensembl 2009. Nucleic Acids Res. 2009, 37 (Database issue): D690-697. 10.1093/nar/gkn828.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215 (3): 403-410.
Li F, Stormo GD: Selection of optimal DNA oligos for gene expression arrays. Bioinformatics. 2001, 17 (11): 1067-1076. 10.1093/bioinformatics/17.11.1067.
Slater GS, Birney E: Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics. 2005, 6: 31. 10.1186/1471-2105-6-31.
RepeatMasker Open-3.0. [http://www.repeatmasker.org]
Morgulis A, Gertz EM, Schaffer AA, Agarwala R: A fast and symmetric DUST implementation to mask low-complexity DNA sequences. J Comput Biol. 2006, 13 (5): 1028-1040. 10.1089/cmb.2006.13.1028.
Stabenau A, McVicker G, Melsopp C, Proctor G, Clamp M, Birney E: The Ensembl core software libraries. Genome Res. 2004, 14 (5): 929-933. 10.1101/gr.1857204.
Rozen S, Skaletsky HJ: Primer3 on the WWW for general users and for biologist programmers. Bioinformatics Methods and Protocols: Methods in Molecular Biology. Edited by: Krawetz S, Misener S. 2000, Totowa, NJ: Humana Press, 365-386.
Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, Dagdigian C, Fuellen G, Gilbert JG, Korf I, Lapp H: The Bioperl toolkit: Perl modules for the life sciences. Genome Res. 2002, 12 (10): 1611-1618. 10.1101/gr.361602.
Grant SG: Systems biology in neuroscience: bridging genes to cognition. Curr Opin Neurobiol. 2003, 13 (5): 577-582. 10.1016/j.conb.2003.09.016.
Funding: Wellcome Trust
The authors wish to thank Drs Louie N. van de Lagemaat, René Frank and Rob Andrews for their critical reading of an earlier version of the manuscript.
MDRC conceived and implemented the method. DGF experimentally validated the resulting probes. NHK and SGNG directed the investigation. All authors read and approved the final manuscript.