Input
H-MAPD accepts one or more nucleotide sequences in FASTA format. Due to technical limitations, electrophoresis-based MLPA supports up to 45 assays per reaction (per fluorescent reporter), while bead-coupled MLPA currently supports up to 100 assays per reaction. The software sets maximum limits of 50 and 100 sequences for electrophoresis-based and bead-coupled MLPA, respectively. To prevent server overloading, the aggregate length of the input sequences is limited to 100,000 bases (equal to 100 sequences with an average length of 1000 nucleotides). Two platforms are provided, depending on the available technology: one for traditional electrophoresis-based MLPA and one for FlexMAP bead-coupled MLPA.
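The input limits described above can be illustrated with a minimal sketch. This is not the H-MAPD source code; the function names `parse_fasta` and `check_input` and the platform keys are hypothetical, and only the numeric limits are taken from the text.

```python
# Illustrative sketch only; names are hypothetical, limits are those stated in the text.
MAX_SEQUENCES = {"electrophoresis": 50, "bead": 100}
MAX_TOTAL_BASES = 100_000


def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks).upper()))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks).upper()))
    return records


def check_input(text, platform):
    """Parse the input and enforce the per-platform and aggregate limits."""
    records = parse_fasta(text)
    if not records:
        raise ValueError("no FASTA records found")
    if len(records) > MAX_SEQUENCES[platform]:
        raise ValueError("too many sequences for the %s platform" % platform)
    total = sum(len(seq) for _, seq in records)
    if total > MAX_TOTAL_BASES:
        raise ValueError("aggregate sequence length exceeds 100,000 bases")
    return records
```

In this sketch a submission of 51 sequences would be rejected for the electrophoresis-based platform but accepted for the bead-coupled platform, provided the aggregate length stays within the limit.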
Probe generation
For the electrophoresis-based MLPA platform, the probe length is increased by 4 bases for each successive input sequence, starting with the minimum length of ligation product specified by the user. If the user chooses to use stuffer sequences (see additional file 3: Stuffer sequences), the length increment is achieved by adding stuffer sequences between the primer and hybridizing sequences in both the left probe oligo (LPO) and the right probe oligo (RPO). Otherwise, the increment is achieved by extending the hybridizing sequences. For bead-coupled MLPA, all probes have the same length. The tag sequences included in the probe for each bead are provided in additional file 4: FlexMAP bead tag sequences (these are complementary to the anti-tag sequences that are provided as pre-synthesized bead-coupled oligos).
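As a small worked example of the length assignment for the electrophoresis-based platform, the following sketch lists the ligation product length for each input sequence. The function name and the `step` parameter are hypothetical; the 4-base increment is the value given above.

```python
def ligation_product_lengths(n_sequences, min_length, step=4):
    """Return one ligation product length per input sequence.

    The first sequence gets the user-specified minimum length and each
    following sequence is `step` bases longer than the previous one.
    """
    return [min_length + step * i for i in range(n_sequences)]
```

For three input sequences and a minimum ligation product length of 100 bases, this gives lengths of 100, 104 and 108 bases.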
For each input sequence, the length of the hybridizing sequence (left and right combined) is calculated by subtracting the combined length of the primers and any stuffer or bead tag sequences from the total probe length. A series of hybridizing sequences is generated by walking along the input sequence in 1-base steps and extracting fragments of the desired length. Each hybridizing sequence is split in the middle: the left hybridizing sequence (LHS) becomes the 3' end of the LPO and the right hybridizing sequence (RHS) becomes the 5' end of the RPO.
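The walking procedure can be sketched as follows. The function name and parameters are hypothetical, and the handling of odd hybridizing lengths (the extra base goes to the RHS) is an assumption, since the text only states that the sequence is split in the middle.

```python
def candidate_hybridizing_sequences(target, total_length, non_hybridizing_length):
    """Yield (start, LHS, RHS) for every window along the target sequence.

    non_hybridizing_length is the combined length of primers plus any
    stuffer or bead tag sequences, so the hybridizing length is the
    total probe length minus that value.
    """
    hyb_len = total_length - non_hybridizing_length
    if hyb_len < 2:
        raise ValueError("hybridizing sequence must be at least 2 bases long")
    half = hyb_len // 2  # assumption: for odd lengths the RHS is one base longer
    for start in range(len(target) - hyb_len + 1):  # 1-base steps
        fragment = target[start:start + hyb_len]
        yield start, fragment[:half], fragment[half:]
```

A target shorter than the hybridizing length yields no candidates, which corresponds to an input sequence that is too short for the requested probe length.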
Probe screening
MRC-Holland has the most experience in MLPA probe design, and H-MAPD adopts most of the criteria described in MRC-Holland's probe design guidelines [4]. The workflow of probe selection is outlined in Figure 1. It has been observed that the first nucleotide following the left primer sequence affects probe signal strength [1], and it is suggested that adenosine should be avoided at this position [4]. H-MAPD follows this suggestion and excludes adenosine at the first position following the left primer in the LPO. Empirically, the ideal GC content of a hybridizing sequence is around 50% [3, 4]; however, a GC content of 28% has been successfully applied in MLPA [3]. Our software allows users to specify different ranges of GC content for the LHS and RHS, but ranges close to 50% (for example, 40–60% or 35–65%) are recommended. The Tm is very important when considering how hybridizing sequences anneal to template DNA. Accurate Tm prediction for oligonucleotides is complicated, and different algorithms give different Tm values. MRC-Holland uses RAW for Tm prediction. Because RAW runs only on the Windows operating system, H-MAPD calculates Tm using the UNAFold software, which supports various operating systems [5]. Additional file 5 (Comparison of Tm calculated by RAW and UNAFold) compares melting temperatures calculated using the two programs for the reference sequences mentioned in the MRC-Holland MLPA probe design guidelines [4]. The Tm calculated by RAW is on average 9.1°C higher than that calculated by UNAFold, with a standard deviation of 2.8°C. MRC-Holland recommends a minimum Tm (calculated by RAW) of 8°C above the hybridization temperature. H-MAPD requires both the LHS and RHS to have a Tm (calculated by UNAFold) at least 2.5°C above the hybridization temperature. Secondary structure prediction is performed on the LPO and RPO, also using UNAFold. Both the LPO and RPO must pass a minimum threshold for ΔG. Although a ΔG ≥ 0 is preferred, many probes with a negative ΔG actually work [4]; the software therefore allows the user to choose a negative ΔG threshold. Finally, to ensure efficient ligation, it is recommended that the sequences immediately adjacent to the ligation site contain no more than three G and/or C [4]. H-MAPD favours low GC content at the ligation site by assigning a low ligation site score to high GC occurrence. The final score calculation for a probe set is shown in Figure 2.
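The sequence-level screens described above can be sketched as simple helper functions. This is an illustration under stated assumptions, not the H-MAPD implementation: all function names are hypothetical, the `flank` width around the ligation site is an assumed parameter (the text does not state how many bases count as "immediately adjacent"), and Tm values are taken as inputs computed elsewhere (for example by UNAFold) rather than calculated here. The scoring formula itself is given in Figure 2 and is not reproduced.

```python
def first_base_ok(after_left_primer):
    """Adenosine is excluded at the first position following the left primer."""
    return after_left_primer[0].upper() != "A"


def gc_fraction(seq):
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_in_range(seq, low=0.40, high=0.60):
    """Check GC content against a user-specified range (default 40-60%)."""
    return low <= gc_fraction(seq) <= high


def ligation_site_gc(lhs, rhs, flank=3):
    """Count G/C bases in the `flank` bases on each side of the ligation site.

    The flank width is an assumption made for this sketch.
    """
    window = (lhs[-flank:] + rhs[:flank]).upper()
    return window.count("G") + window.count("C")


def tm_ok(tm_lhs, tm_rhs, hyb_temp, margin=2.5):
    """Both halves need a Tm at least `margin` degrees C above hybridization."""
    return tm_lhs >= hyb_temp + margin and tm_rhs >= hyb_temp + margin
```

The GC count returned by `ligation_site_gc` is meant as an input to a score rather than as a hard filter, in line with the statement that high GC occurrence at the ligation site receives a low score.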
Probe sets that meet all physical-chemical criteria (with a final score > 0) are subject to uniqueness screening. A homology search is performed for the LPO and RPO using our local partial mirror of the UCSC BLAT server [6, 7]. Because the hybridizing sequences are derived from genomic DNA, both the LHS and RHS should have one and only one perfect match in the same region of the genomic DNA, in order to avoid interference from pseudogenes or closely related genes. However, it is conceivable that a user would try to design MLPA probes for a region that has multiple copies in the reference genome assembly. To allow for this possibility, users can specify the maximum number of perfect matches for the full hybridizing sequence (LHS + RHS) in the reference genome. The Tm is calculated for all other non-specific (undesirable) matches. If the Tm of any such match is above (hybridization temperature – 5.0)°C, the probe set is dropped. Because the LHS and RHS must each have a Tm at least 2.5°C above the hybridization temperature, this ensures that their Tm is at least 7.5°C above that of any non-specific match. Next, a SNP search is performed using the latest SNP database (snp128 at the time of writing) included in the UCSC BLAT server. In the Stuffer and Bead protocols, the LHS and RHS are usually short. In the No-Stuffer protocol, the LHS and RHS can often reach more than 100 nucleotides in length. For a short LHS/RHS (less than 40 nucleotides), the probe set is dropped if a SNP is detected anywhere in the LHS or RHS. For an LHS/RHS longer than 40 nucleotides, only the 40 nucleotides adjacent to the ligation site on each side are tested for SNP occurrence. SNPs located more than 40 nucleotides away from the ligation site should not affect annealing and ligation of the hybridizing sequences. Finally, repeat sequences are more prone to cause non-specific binding due to their abundance in the genome. Therefore, if either the LHS or RHS overlaps with regions defined in the UCSC genome browser RepeatMasker track, an extra criterion (Maximum repeat sequence match allowed) is applied.
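The two numeric rules in this step, the Tm cut-off for non-specific matches and the 40-nucleotide SNP window, can be sketched as follows. The function names are hypothetical, the Tm values and SNP positions are assumed to have been obtained beforehand (from UNAFold and the SNP database, respectively), and SNP positions are assumed to be 0-based offsets within the concatenated LHS + RHS.

```python
def off_targets_acceptable(off_target_tms, hyb_temp, margin=5.0):
    """Return False if any non-specific match has a Tm above
    (hybridization temperature - margin); such probe sets are dropped.
    """
    return all(tm <= hyb_temp - margin for tm in off_target_tms)


def snp_in_tested_region(lhs_len, rhs_len, snp_offsets, window=40):
    """Return True if a SNP lies within `window` bases of the ligation site.

    snp_offsets are 0-based positions in LHS + RHS (an assumption of this
    sketch). For an LHS or RHS shorter than `window`, the tested region
    covers that whole half, which matches the rule for short sequences.
    """
    ligation = lhs_len  # the RHS starts at this offset
    start = max(0, ligation - window)
    end = min(lhs_len + rhs_len, ligation + window)
    return any(start <= pos < end for pos in snp_offsets)
```

For example, with a 100-nucleotide LHS and RHS, a SNP at offset 10 lies outside the tested region and does not cause the probe set to be dropped, whereas a SNP at offset 60 does.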
Output
Probe sets passing all criteria will be sorted by their scores and returned online (a link will be sent via email to the user upon completion of the analysis). Depending on the size of the input sequence, results will be returned in minutes or hours.