Haplotype inference from unphased SNP data in heterozygous polyploids based on SAT
- Jost Neigenfind^{1, 2},
- Gabor Gyetvai^{3},
- Rico Basekow^{1, 2},
- Svenja Diehl^{2},
- Ute Achenbach^{3},
- Christiane Gebhardt^{3},
- Joachim Selbig^{4} and
- Birgit Kersten^{1, 2}
https://doi.org/10.1186/1471-2164-9-356
© Neigenfind et al; licensee BioMed Central Ltd. 2008
Received: 14 February 2008
Accepted: 30 July 2008
Published: 30 July 2008
Abstract
Background
Haplotype inference based on unphased SNP markers is an important task in population genetics. Although there are different approaches to the inference of haplotypes in diploid species, the existing software is not suitable for inferring haplotypes from unphased SNP data in polyploid species, such as the cultivated potato (Solanum tuberosum). Potato species are tetraploid and highly heterozygous.
Results
Here we present the software SATlotyper which is able to handle polyploid and polyallelic data. SATlotyper uses the Boolean satisfiability problem to formulate Haplotype Inference by Pure Parsimony. The software excludes existing haplotype inferences, thus allowing for calculation of alternative inferences. As it is not known which of the multiple haplotype inferences are best supported by the given unphased data set, we use a bootstrapping procedure that allows for scoring of alternative inferences. Finally, by means of the bootstrapping scores, it is possible to optimise the phased genotypes belonging to a given haplotype inference. The program is evaluated with simulated and experimental SNP data generated for heterozygous tetraploid populations of potato. We show that, instead of taking the first haplotype inference reported by the program, we can significantly improve the quality of the final result by applying additional methods that include scoring of the alternative haplotype inferences and genotype optimisation. For a sub-population of nineteen individuals, the predicted results computed by SATlotyper were directly compared with results obtained by experimental haplotype inference via sequencing of cloned amplicons. Prediction and experiment gave similar results regarding the inferred haplotypes and phased genotypes.
Conclusion
Our results suggest that Haplotype Inference by Pure Parsimony can be solved efficiently by the SAT approach, even for data sets of unphased SNP from heterozygous polyploids. SATlotyper is freeware and is distributed as a Java JAR file. The software can be downloaded from the webpage of the GABI Primary Database at http://www.gabipd.org/projects/satlotyper/. The application of SATlotyper will provide haplotype information, which can be used in haplotype association mapping studies of polyploid plants.
Background
In the case of homozygous genotypes, such as maize or many other inbreeding crop species, haplotypes can be directly drawn from comparison of the amplified genomic sequence at a given locus between different individuals [1]. Difficulties arise if homozygous genotypes are not available, for example, in non-inbred, tetraploid potato [2]. In such cases, it is necessary to determine the haplotype phase from unphased SNP (single nucleotide polymorphism) data. There are several approaches for inferring haplotypes, based on (i) statistical methods, such as the EM algorithm and Gibbs sampling or (ii) the parsimony principle [3]. These approaches have, however, been developed for biallelic and diploid species. There is currently no software available for haplotype identification in more complex polyploids [2, 4]. In the case of autotetraploids [4], one has to tackle more phase-unknown alleles than in diploids, which results in a combinatorial explosion of possible haplotypes.
In this study, we aimed at the development and evaluation of a generalised approach for calculating haplotypes in polyploid species using the parsimony principle. The goal of haplotype inference is to find a set of haplotypes explaining every genotype present in a given unphased population. The parsimony principle can be used to find the smallest set of haplotypes, such that each genotype in the population can be explained by a ploidy-specific number of haplotypes from the set of haplotypes. The objective of minimising the number of haplotypes explaining a SNP data set is called Haplotype Inference by Pure Parsimony (HIPP) [5] and was shown to be NP-hard [6]. Lynce and Marques-Silva recently formulated the problem as an instance of the Boolean satisfiability problem, called SAT [7, 8], that can be solved orders of magnitude faster than the existing ILP (integer linear programming) formulation [7, 9, 10]. Unfortunately, the SAT formulation is also restricted to unphased biallelic SNP data of diploid species. Here, we present a generalisation for polyploids of the SAT approach developed by Lynce and Marques-Silva [7, 8]. This generalisation resulted in the development of the SATlotyper software tool. We tested and evaluated SATlotyper with simulated and experimental data sets of unphased SNP sites from a specific potato locus. SNP data were obtained from different populations of tetraploid individuals. For a subset of individuals, we compared the computed haplotypes with experimental haplotypes identified by amplicon cloning and sequencing [2].
Implementation
First, basic terms are defined and the basic problem is formulated. Then, the SAT model for biallelic polyploids is presented. This is followed by the extension of the model to polyploid and polyallelic SNP sites. After that, constraints are given for breaking symmetries in haplotypes and genotypes. Constraints are also formulated for alternative most parsimonious sets of explaining haplotypes as well as for alternative inferences of genotypes. A bootstrapping procedure for scoring haplotypes and a method for optimising alternative genotype inferences based on these scores is presented. Afterwards, lower and upper bounds of the most parsimonious explanation are mentioned and the definition of a norm for comparing genotypes is given. Finally, the realisation of SATlotyper is considered.
Basic definitions
Single Nucleotide Polymorphism or SNP is a DNA sequence variation, occurring when a single nucleotide is altered [7]. Thus, a site in a population of a species is a SNP site if at least a second sort of nucleotide occurs at this site at least once.
An allele is a different form of some segment of a chromosome, such as a second sort of nucleotide at a SNP site. Here, we focus on SNP sites. A SNP site that contains two different alleles is called biallelic, a SNP site that contains three different alleles is called triallelic and a SNP site that contains four different alleles is called tetraallelic.
A haplotype is the genetic constitution of a sequence of nucleotides [7]. The underlying data that forms a haplotype can be the full DNA sequence in the region, or more commonly the SNP sites in that region [7]. Polyploid organisms contain two or more homologous haplotypes.
A genotype describes the conflated data of a set of homologous haplotypes. In other words, an explanation for a genotype is a ploidy-specific number of homologous haplotypes. An unphased genotype is a genotype for which no set of explaining haplotypes is defined. There are, however, many possible sets of haplotypes explaining one given unphased genotype. A phased genotype is a genotype for which at least one set of explaining haplotypes is defined. If for a given site all explaining haplotypes have the same value, then the genotype is said to be homozygous at that site. Otherwise the genotype is said to be heterozygous at that site.
Figure 1 illustrates how the different terms are related to each other.
Problem formulation
A SNP site of an individual is a string over the nucleotide alphabet Σ = {A, C, G, T} with size determined by the ploidy of the considered species. A sequence of such SNP sites defines a genotype, with the number of SNP sites as the length of the genotype. Let n denote the number of individuals in the sample, m be the number of SNP sites, and p be the ploidy of the considered species. Furthermore, a specific genotype is denoted by g_{i}, with 1 ≤ i ≤ n, and for a specific site j, with 1 ≤ j ≤ m, in g_{i} we use g_{i, j}. Finally, let ${g}_{i,j}^{l}$, with 1 ≤ l ≤ p, denote the l^{th} state at site j in genotype i. Given a set $\mathcal{G}$ of n genotypes, each of length m, the haplotype inference problem is that of finding a set $\mathcal{H}$ of not necessarily distinct haplotypes such that for each genotype g_{i} ∈ $\mathcal{G}$ there is at least one collection of p haplotypes {h^{1}, ..., h^{p}} drawn from $\mathcal{H}$ that explains g_{i}. The values of nucleotides are determined by the number of different alleles at the corresponding SNP site: for a biallelic SNP site the values are {0, 1}, for a triallelic SNP site the values are {0, 1, 2}, while for a tetraallelic SNP site the values are {0, 1, 2, 3}. Thus, a specific haplotype h_{k} is a string over the alphabet {0, 1, 2, 3}, with 1 ≤ k ≤ |$\mathcal{H}$|.
Possible encodings of SNP sites
Number of alleles at given site | Tetraallelic SNP site | Triallelic SNP site | Biallelic SNP site |
---|---|---|---|
Homozygous individual | (0, 0, 0, 0), (1, 1, 1, 1), (2, 2, 2, 2), (3, 3, 3, 3) | (0, 0, 0, 0), (1, 1, 1, 1), (2, 2, 2, 2) | (0, 0, 0, 0), (1, 1, 1, 1) |
Biallelic individual | (0, 1, 1, 1), (0, 0, 1, 1), (0, 0, 0, 1), (0, 2, 2, 2), (0, 0, 2, 2), (0, 0, 0, 2), (0, 3, 3, 3), (0, 0, 3, 3), (0, 0, 0, 3), (1, 2, 2, 2), (1, 1, 2, 2), (1, 1, 1, 2), (1, 3, 3, 3), (1, 1, 3, 3), (1, 1, 1, 3), (2, 3, 3, 3), (2, 2, 3, 3), (2, 2, 2, 3) | (0, 1, 1, 1), (0, 0, 1, 1), (0, 0, 0, 1), (0, 2, 2, 2), (0, 0, 2, 2), (0, 0, 0, 2), (1, 2, 2, 2), (1, 1, 2, 2), (1, 1, 1, 2) | (0, 1, 1, 1), (0, 0, 1, 1), (0, 0, 0, 1) |
Triallelic individual | (0, 1, 2, 2), (0, 1, 1, 2), (0, 0, 1, 2), (0, 1, 3, 3), (0, 1, 1, 3), (0, 0, 1, 3), (0, 2, 3, 3), (0, 2, 2, 3), (0, 0, 2, 3), (1, 2, 3, 3), (1, 2, 2, 3), (1, 1, 2, 3) | (0, 1, 2, 2), (0, 1, 1, 2), (0, 0, 1, 2) | |
Tetraallelic individual | (0, 1, 2, 3) | | |
One of the approaches to the haplotype inference problem is called Haplotype Inference by Pure Parsimony (HIPP) [5]. A solution to this problem minimises the total number of distinct haplotypes used. The SAT-based formulation of HIPP models whether there is a set $\mathcal{H}$ of r = |$\mathcal{H}$| distinct haplotypes such that each genotype g_{i} ∈ $\mathcal{G}$ is explained by p haplotypes in $\mathcal{H}$. The SAT-based algorithm considers increasing sizes for $\mathcal{H}$, from a lower bound lb to an upper bound ub [7]. Trivial lower and upper bounds are 1 and pn, respectively. The algorithm terminates at the first size r for which every genotype in $\mathcal{G}$ is explained by p haplotypes in $\mathcal{H}$. This smallest r is the size of a most parsimonious set of explaining haplotypes.
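The incremental search over the size r can be sketched as follows. This is a minimal illustration, not SATlotyper's code: the function name and the `is_satisfiable` callback (standing in for "build the SAT instance for r haplotypes and run a solver") are assumptions made for the example.

```python
from typing import Callable, Optional

def smallest_explaining_set_size(
    is_satisfiable: Callable[[int], bool],  # hypothetical oracle wrapping a SAT solver call
    lower_bound: int,
    upper_bound: int,
) -> Optional[int]:
    """Return the smallest r in [lower_bound, upper_bound] accepted by the oracle."""
    for r in range(lower_bound, upper_bound + 1):
        if is_satisfiable(r):
            return r  # first satisfiable size is the most parsimonious one
    return None

# Toy oracle: pretend the instance becomes satisfiable once 3 haplotypes are allowed.
n, p = 5, 4  # five tetraploid individuals, so the trivial upper bound is p * n
print(smallest_explaining_set_size(lambda r: r >= 3, 1, p * n))
```

Because the sizes are tried in increasing order, every size below the returned one has been proven unsatisfiable.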
All variables of the Boolean satisfiability problem are two-valued. Depending on the truth assignment, a Boolean formula is either true or false. SAT consists of determining whether an assignment exists under which a given Boolean formula in conjunctive normal form (CNF) evaluates to true, or proving that no such assignment exists. Solving SAT is NP-complete [6]. The Boolean satisfiability problem for HIPP, however, can efficiently be solved [7] by SAT solvers such as MiniSat [11, 12], MiraXT [13] or Sat4J [14]. This may be explained by an unknown hidden structure in the genotype data, which makes the problem easier to solve.
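To make the CNF decision problem concrete, the following sketch checks a tiny formula by exhaustive search. It is only an illustration with assumed names. Real solvers such as those cited above use far more sophisticated search, and the integer literal convention (k for a variable, -k for its negation) is borrowed from the common DIMACS style.

```python
from itertools import product

def brute_force_sat(num_vars, clauses):
    """Return a satisfying assignment as a dict, or None if the CNF is unsatisfiable.

    Each clause is a list of non-zero ints: k means variable k, -k its negation.
    """
    for bits in product([False, True], repeat=num_vars):
        assignment = {v + 1: bits[v] for v in range(num_vars)}
        # A CNF formula is true when every clause contains at least one true literal.
        if all(any(assignment[abs(lit)] == (lit > 0) for lit in clause)
               for clause in clauses):
            return assignment
    return None

# (x1 OR x2) AND (NOT x1 OR x2) AND (NOT x2 OR x3)
print(brute_force_sat(3, [[1, 2], [-1, 2], [-2, 3]]))
# (x1) AND (NOT x1) has no satisfying assignment
print(brute_force_sat(1, [[1], [-1]]))
```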
SAT model for biallelic polyploids
The first SAT formulation for HIPP was introduced in [7] and the presented constraints were implemented in the software SHIPs [15]. Unfortunately, this approach is restricted to diploid and biallelic species. Here, we extend the formulation of constraints from [7] to polyploid biallelic populations of genotypes. In a tetraploid, biallelic population of genotypes, the possible alleles are modeled by 0 or 1 respectively (e.g. SNP site j of individual i with g_{i, j}= (0, 1, 1, 1), ${g}_{i,j}^{1}$ = 0, ${g}_{i,j}^{2}$ = 1, ${g}_{i,j}^{3}$ = 1 and ${g}_{i,j}^{4}$ = 1). Furthermore, the haplotypes can be modeled such that h_{k, j}∈ {0, 1}, where h_{k, j}denotes the j^{ th }site of haplotype k. A haplotype h_{ k }can then be viewed as a binary word h_{k,1}... h_{k, m}of length m over the alphabet {0, 1}.
For a given value of r, the model considers r haplotypes and aims at finding p haplotypes (which can possibly represent the same haplotype) for each genotype g_{i}. As a result, for each genotype g_{i}, the model uses selector variables for selecting which haplotypes are used for explaining g_{i}. Since the genotype is to be explained by p haplotypes, the model uses p sets of r selector variables, ${s}_{k,i}^{l}$. Hence, genotype g_{i} is explained by haplotypes ${h}_{{k}_{1}},\mathrm{...},{h}_{{k}_{p}}$ if ${s}_{{k}_{1},i}^{1}=1,\mathrm{...},{s}_{{k}_{p},i}^{p}=1$.
where 1 ≤ k ≤ r and 1 ≤ l ≤ p. Hence, if haplotype k is selected for explaining genotype i, by at least one of the p representatives, then the value of haplotype k at site j must be 1.
Efficient method for obtaining the model for biallelic polyploids
For the case p > 2, it is straightforward to formulate the constraints for the ${g}_{i,j}^{l}$ variables from heterozygous SNP sites in DNF by enumerating all allele arrangements. Each formula in DNF can be transformed into an equivalent formula in CNF using Tseitin's transformation [16]. However, the enumeration of all arrangements is of exponential complexity. Our objective here is to find an equivalent representation of enumeration of arrangements. This representation is to be in CNF and to allow formulation in polynomial time. Combinatorial problems as described above can also be represented by sums. For instance, for an individual from a tetraploid species with two 0 and two 1 alleles at a biallelic SNP site, all six allele arrangements are determined if the sum of the elements of a binary vector that represents the allele composition is constrained to 2: (0, 0, 1, 1), (0, 1, 0, 1), (1, 0, 0, 1), (1, 0, 1, 0), (1, 1, 0, 0) and (0, 1, 1, 0).
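The equivalence between "all arrangements" and "a sum constraint" can be checked directly for the tetraploid example. The helper name below is an assumption for illustration.

```python
from itertools import product
from math import comb

def arrangements_with_sum(ploidy, ones):
    """All binary vectors of length `ploidy` whose entries sum to `ones`."""
    return [bits for bits in product((0, 1), repeat=ploidy) if sum(bits) == ones]

arrangements = arrangements_with_sum(4, 2)
print(arrangements)
print(len(arrangements) == comb(4, 2))
```

The number of arrangements is the binomial coefficient C(p, q), which is why explicit enumeration grows quickly with the ploidy and a sum-based encoding is preferable.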
A simplification is achieved if vector C^{l,1}, where 1 ≤ l ≤ p, (Figure 4) contains the possible allele arrangements [17] and the A vectors store the accumulation of the sum. In this situation, all B variables can be set to zero. The constraints of carry overs reduce to:
(A^{l, t}∧ C^{l, t}) ⇔ C^{l, t+1},
where 1 ≤ l ≤ p. Additionally, if ${S}_{full}^{l,t}$ variables are replaced by A^{l+1, t}variables, the sums reduce to:
((¬A^{l, t}∨ ¬C^{l, t}) ∧ (A^{l, t}∨ C^{l, t})) ⇔ A^{l+1, t}.
For each individual and SNP site in a biallelic population, variables corresponding to A and C need to be defined. Let variables ${a}_{i,j}^{l,t}$, with 1 ≤ l ≤ p + 1 and 1 ≤ t ≤ w, denote the accumulation of the sum.
Additionally, let variables ${c}_{i,j}^{l,t}$, with 1 ≤ l ≤ p and 1 ≤ t ≤ w, stand for the carry overs. For a SNP site j from genotype i, summing constraints can easily be obtained by replacing the ${c}_{i,j}^{l,1}$ variables with the ${g}_{i,j}^{l}$ variables [17], with 1 ≤ l ≤ p. Finally, the ${a}_{i,j}^{p+1,t}$ variables, with 1 ≤ t ≤ w, are constrained to the binary representation of the required sum.
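The accumulator and carry recurrences behave like a chain of half adders: the carry is the AND of accumulator bit and incoming carry, and the new accumulator bit is their exclusive OR. The simulation below is a sketch under two assumptions that are not stated in the text: the accumulator row for l = 1 starts at zero, and t = 1 is the least significant bit.

```python
from itertools import product

def count_ones_with_half_adders(bits, width):
    """Simulate the a/c recurrences for one SNP site; returns a^{p+1} (LSB first)."""
    acc = [False] * width              # assumed start: a^{1,t} = 0 for all t
    for g in bits:                     # one row l per allele copy
        carry = bool(g)                # c^{l,1} is replaced by the allele bit g^l
        nxt = []
        for t in range(width):
            nxt.append(acc[t] != carry)    # a^{l+1,t} = a^{l,t} XOR c^{l,t}
            carry = acc[t] and carry       # c^{l,t+1} = a^{l,t} AND c^{l,t}
        acc = nxt
    return acc

def to_int(acc):
    """Interpret an LSB-first bit list as an integer."""
    return sum(int(b) << t for t, b in enumerate(acc))

# For a tetraploid site, three bits are enough to hold sums from 0 to 4.
for bits in product((0, 1), repeat=4):
    assert to_int(count_ones_with_half_adders(bits, 3)) == sum(bits)
print(count_ones_with_half_adders((0, 1, 1, 0), 3))
```

Constraining the final accumulator row to the binary representation of the required sum therefore admits exactly the allele arrangements with that many 1 alleles.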
Extension to SAT model for polyallelic polyploids
Depending on the input, Function 16 defines whether the substituted variable is negated.
The ${g}_{i,j}^{l}$ variables are insufficient for describing arrangements of more than two alleles at a SNP site. The representation of, for instance, three different states needs at least two bits. We define o_{ j }as the number of different alleles from all individuals at SNP site j of a population of genotypes. If o_{ j }> 2, the representation of the SNP site is extended to w_{ j }= ⌈log_{2} o_{ j }⌉ binary columns. Thus, ${g}_{i,j}^{l}$ is split to ${g}_{i,j}^{l,1},\mathrm{...},{g}_{i,j}^{l,{w}_{j}}$. Each allele is encoded by its corresponding binary number. For instance, the four alleles 0, 1, 2 and 3 at a tetraallelic SNP site are encoded as 00, 01, 10 and 11, respectively. The haplotypes are then extended analogously, such that ${h}_{k,j}\in {\{0,1\}}^{{w}_{j}}$ denotes the j^{ th }site of haplotype k.
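The binary column encoding can be sketched as follows. The function names are illustrative assumptions. The width is computed with integer arithmetic, which gives the same values as ⌈log_{2} o_{j}⌉ for o_{j} ≥ 2.

```python
def bits_per_site(num_alleles):
    """Number of binary columns w_j needed to distinguish num_alleles alleles."""
    return max(1, (num_alleles - 1).bit_length())

def encode_allele(allele, width):
    """Binary representation of an allele index, most significant bit first."""
    return tuple((allele >> (width - 1 - t)) & 1 for t in range(width))

w = bits_per_site(4)  # tetraallelic site
print(w, [encode_allele(a, w) for a in range(4)])
```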
For generalisation to polyallelic SNP sites, the formulation of binary sums can be reused. Let ${z}_{i,j}^{{u}_{j}}$ be the number of allele u_{ j }at a specific SNP site j in unphased genotype i, where 1 ≤ u_{ j }≤ o_{ j }. The value of o_{ j }can be greater than p but ${\sum}_{{u}_{j}=1}^{{o}_{j}}{z}_{i,j}^{{u}_{j}}=p$. For a set of nucleotide sequences, it holds that o_{ j }≤ 4.
where 1 ≤ l ≤ p and 1 ≤ u_{j} ≤ o_{j}. It is necessary to formulate the sums ${\sum}_{l=1}^{p}{v}_{i,j}^{l,{u}_{j}}$, which must equal ${z}_{i,j}^{{u}_{j}}$, as described in the previous sections such that each allele at SNP site j occurs ${z}_{i,j}^{{u}_{j}}$ times in genotype i.
where 1 ≤ t_{ j }≤ w_{ j }, 1 ≤ k ≤ r and 1 ≤ l ≤ p.
Complexity of the model
Variable | Number of variables |
---|---|
h | r_{f} m log_{2} o_{max} |
g | n m p log_{2} o_{max} |
s | r_{f} n p |
a | n m p^{2} log_{2} p |
c | n m p^{2} log_{2} p |
v | n m p^{2} |
Constraints for breaking symmetries in haplotypes
It is important to note that the model proposed above is not practical for most existing problem instances, even with the most efficient SAT solvers [7]. This problem, however, can be solved by breaking symmetries to prune the search space. As described in [7, 8], symmetries in explaining haplotypes can be broken by sorting the haplotypes lexicographically. A strict lexicographic ordering can be achieved by the formulation of constraints that become true if h_{1} is strictly smaller than h_{2}, h_{2} is strictly smaller than h_{3}, and so on. If the ordering is not strict it is not guaranteed that all explaining haplotypes are pairwise distinct.
where 1 ≤ j ≤ m + 1. If an e_{k, j}becomes true because the constraint h_{k, j}<h_{k+1, j}is satisfied, it is not necessary to compare h_{k, j'}to h_{k+1, j'}, where j' <j. We must, however, ensure that h_{k, j'}≤ h_{k+1, j'}, where j' > j. Then, the model requires that the following is satisfied:
(¬h_{k, j}∨ h_{k+1, j}∨ e_{k, j}),
where 1 ≤ j ≤ m. If an assignment can be found such that all clauses in Formulas 19 – 20 are true, where 1 ≤ k ≤ r - 1, the haplotypes are in lexicographical order.
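The effect of a strict lexicographic ordering can be illustrated outside the SAT encoding. In this sketch (names are assumptions), of all orderings of the same haplotype set only one is strictly increasing, so the r! symmetric solutions collapse to a single representative, and duplicates are rejected.

```python
from itertools import permutations

def strictly_ordered(haplotypes):
    """True if each haplotype is strictly smaller than its successor (lexicographically)."""
    return all(a < b for a, b in zip(haplotypes, haplotypes[1:]))

haps = [(0, 1, 1), (0, 0, 1), (1, 0, 0)]
ordered = [perm for perm in permutations(haps) if strictly_ordered(perm)]
print(len(list(permutations(haps))), len(ordered), ordered[0])
```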
Constraints for breaking symmetries in genotypes
Haplotypes that infer a genotype can be lexicographically ordered in a way similar to the set of r explaining haplotypes [7, 8]. The l^{th} haplotype inferring an unphased genotype i is marked by a binary variable ${s}_{k,i}^{l}$. The sum ${\sum}_{k=1}^{r}{s}_{k,i}^{l}$ is constrained to equal 1 so that exactly one haplotype is selected for explaining the l^{th} row of g_{i}. In contrast to the most parsimonious set of explaining haplotypes, the selection variables have to be ordered non-strict lexicographically since homozygous genotypes cannot be explained by sets of pairwise distinct haplotypes.
where 1 ≤ k ≤ r. Because it is almost the formulation of a strict lexicographic order, except that the variable ${f}_{1,i}^{l}$ does not have to be true, it has to be relaxed to become a non-strict order. This can be done by formulating the constraints such that either vector ${s}_{i}^{l}$ is strictly smaller than vector ${s}_{i}^{l+1}$ or both vectors are equal.
where 1 ≤ k ≤ r. If an assignment can be found for which all clauses in Formulas 21 – 23 are true, where 1 ≤ l ≤ p - 1, the selection variables of genotype g_{ i }are in non-strict lexicographic order.
Constraints for alternative most parsimonious sets of haplotypes
A nice feature of constraining alternative haplotype inferences is that Formula 25 is automatically in CNF.
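A common way to exclude a previously found solution is a single blocking clause that negates its assignment, which is naturally in CNF. The sketch below illustrates this general technique with a toy brute-force solver. It is not claimed to reproduce Formula 25, and all names are assumptions.

```python
from itertools import product

def solve(num_vars, clauses):
    """Toy brute-force stand-in for a SAT solver (only sensible for tiny formulas)."""
    for bits in product([False, True], repeat=num_vars):
        model = {v + 1: bits[v] for v in range(num_vars)}
        if all(any(model[abs(lit)] == (lit > 0) for lit in c) for c in clauses):
            return model
    return None

def blocking_clause(model, variables):
    """One clause that is violated only by the given assignment of `variables`."""
    return [-v if model[v] else v for v in variables]

def enumerate_models(num_vars, clauses, variables, limit=250):
    """Collect up to `limit` solutions by excluding each one after it is found."""
    clauses = list(clauses)
    models = []
    while len(models) < limit:
        model = solve(num_vars, clauses)
        if model is None:
            break
        models.append(model)
        clauses.append(blocking_clause(model, variables))
    return models

# (x1 OR x2) has three satisfying assignments.
print(len(enumerate_models(2, [[1, 2]], [1, 2])))
```

In the haplotype setting, `variables` would be the haplotype variables, so that each new solution differs from every earlier one in at least one haplotype bit.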
Constraints for alternative genotype inferences
Formula 27 is in CNF and no reformulation is necessary. Such clauses have to be given for each previously found genotype inference j.
For constraining alternative genotype inferences, it is very important that genotype symmetries are broken as shown in Section "Constraints for breaking symmetries in genotypes". If symmetries are not broken, and if an assignment to the ${s}_{k,i}^{l}$ variables is excluded for a given unphased genotype g_{i}, the SAT solver can still report an assignment that represents a permutation of the excluded assignment. For instance, vector ${s}_{i}^{{l}_{1}}$ with length r is exchanged with vector ${s}_{i}^{{l}_{2}}$ with length r.
Note that the number of alternative genotype inferences is equal to or greater than the number of alternative most parsimonious sets of haplotypes, since each alternative set of haplotypes defines at least one inference of genotypes. As a result, we do not calculate complete alternative genotype inferences in this study. Instead, we introduce an optimisation method for genotypes, based on explaining haplotypes and bootstrapping (see Section "Bootstrapping" and Section "Optimisation of genotypes").
Bootstrapping
All possible minimal inferences are treated equally by the SAT approach. It is unlikely that the first haplotype inference found is the most probable one under the assumed model and given data. It is also unlikely that the first haplotype inference is the inference with fewest differences compared to the real data. The question which haplotype inference should be taken for further analysis remains. There must be one or more inferences which are supported better by the input data. To introduce a quality measurement of the haplotypes and alternative inferences which have been calculated, a bootstrapping procedure is introduced as follows. Bootstrapping is widely used (e.g. in phylogenetic reconstruction [18]) for estimating properties of an estimator. Those properties are measured when sampling from an approximate distribution. One standard choice for an approximate distribution is the empirical distribution of the observed data. To use the bootstrap to assess the uncertainty of estimates of the phased genotypes, the data should be a series of independently sampled points. Here, we assume that haplotypes are drawn independently from a most parsimonious set of explaining haplotypes which is the base of the population of genotypes. Thus, the independently drawn haplotypes satisfy the independence assumptions of the bootstrap method.
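A generic bootstrap scoring loop might look like the sketch below. The exact resampling scheme used by SATlotyper is not spelled out in this section, so the following is an assumption-laden illustration: individuals are resampled with replacement, a caller-supplied `infer` function (hypothetical, standing in for a full haplotype inference) is run on each replicate, and each haplotype is scored by the fraction of replicates in which it appears.

```python
import random
from collections import Counter

def bootstrap_scores(genotypes, infer, replicates=250, seed=0):
    """Score haplotypes by how often they are inferred across bootstrap replicates."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(replicates):
        # Resample the population with replacement, keeping its size.
        sample = [rng.choice(genotypes) for _ in genotypes]
        counts.update(set(infer(sample)))
    return {h: c / replicates for h, c in counts.items()}

# Toy stand-in: individuals are already phased, and "inference" just collects haplotypes.
population = [("00", "01"), ("00", "11"), ("00", "01")]
toy_infer = lambda sample: {h for individual in sample for h in individual}
print(bootstrap_scores(population, toy_infer))
```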
Optimisation of genotypes
The number of all alternative genotype inferences for a given most parsimonious set of haplotypes is the product of the number of alternative inferences of each genotype. Moreover, summing this number over all alternative most parsimonious sets of haplotypes gives the number of all valid most parsimonious HIPP solutions.
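For one fixed set of haplotypes, the count of alternative genotype inferences can be illustrated by enumerating the explanations of each genotype and multiplying. The representation (a genotype as a list of per-site allele tuples) and the function name are assumptions. A diploid toy case is used to keep the output small, but the ploidy parameter is general.

```python
from itertools import combinations_with_replacement
from math import prod

def explanations(genotype, haplotypes, ploidy):
    """All multisets of `ploidy` haplotypes matching the unphased genotype at every site."""
    target = [sorted(site) for site in genotype]
    found = []
    for combo in combinations_with_replacement(haplotypes, ploidy):
        if all(sorted(h[j] for h in combo) == target[j] for j in range(len(target))):
            found.append(combo)
    return found

haplotypes = [(0, 0), (0, 1), (1, 0), (1, 1)]
g1 = [(0, 1), (0, 1)]   # heterozygous at both sites
g2 = [(0, 0), (0, 1)]   # homozygous at site 1, heterozygous at site 2
per_genotype = [len(explanations(g, haplotypes, 2)) for g in (g1, g2)]
print(per_genotype, prod(per_genotype))
```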
Calculation of lower and upper bounds
In contrast to integer linear programming formulations of HIPP [9, 10], the SAT approach is not able to optimise a target function directly. Thus, each possible number of explaining haplotypes has to be tested incrementally starting with r = 1. Methods for the computation of lower and upper bounds [7, 19] can be applied to avoid the iteration until a most parsimonious solution is found. Furthermore, a lower bound can be used for reducing the size of the model [7]. Genotypes which only can be explained by distinct sets of haplotypes are called incompatible. Incompatible genotypes can be used for deriving a lower bound such that the size of the model can be reduced by eliminating s variables and corresponding clauses.
In the existing version of SATlotyper, the computation of lower and upper bounds is not implemented. It was found empirically that, if the approach is able to find a most parsimonious set of haplotypes in reasonable time, it is also able to prove the unsatisfiability of smaller sets of haplotypes in reasonable time. Nevertheless, it is not clear how large the increase in solvable instances would be if a calculation of lower and upper bounds were used in haplotype inference of polyploids. The computation of upper and lower bounds according to [7, 19] may be added to SATlotyper in future versions.
Comparing inferences with real data
To define a standard for the measurement of an inference of haplotypes and corresponding genotypes, we sum up the differences between genotypes from inference and corresponding genotypes from real data. The number of differences is defined as the distance d between both sets.
d = min (D(G, G')) (29)
The complexity of calculating d is exponential but for small p this is still possible.
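One plausible reading of this distance, assumed here because the definition of D is not reproduced in full, is a minimal Hamming distance: for each individual, the haplotypes of the inferred genotype are matched against those of the real genotype in every possible order, and the smallest number of differing sites is taken. The p! orderings explain why the computation is only practical for small p.

```python
from itertools import permutations

def genotype_distance(inferred, real):
    """Minimal number of differing sites over all pairings of the two haplotype lists."""
    return min(
        sum(a != b for h, k in zip(perm, real) for a, b in zip(h, k))
        for perm in permutations(inferred)
    )

def inference_distance(inferred_genotypes, real_genotypes):
    """Sum of per-individual minimal distances (individuals are assumed to be aligned)."""
    return sum(genotype_distance(g, r) for g, r in zip(inferred_genotypes, real_genotypes))

inferred = [[(0, 0, 1), (1, 1, 0)], [(0, 0, 0), (1, 1, 1)]]
real = [[(1, 1, 0), (0, 0, 1)], [(1, 1, 0), (0, 0, 0)]]
print(inference_distance(inferred, real))
```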
Software realisation
SATlotyper is implemented in Java and realises the constraints described above. Additionally, there are some obvious improvements included in the program, such as converting the ${v}_{i,j}^{l,{u}_{j}}$ vectors, with 1 ≤ l ≤ p, to the corresponding Boolean inverse if min(q, p - q) = p - q, where q is the number of 1s. Another improvement is the enumeration of constraints for a SNP site such that instead of o_{ j }only o_{ j }- 1 sums have to be given if o_{ j }≤ p.
The SAT approach that we generalised [7] can not optimise a target function directly (but there are efforts to combine ILP and SAT features [12, 17, 20, 21]). Therefore, the SAT formulation of an assumed number of explaining haplotypes has to be tested for satisfiability by the SAT solver. If it fails, the number of explaining haplotypes is incremented and then tested again. This is repeated until the SAT solver reports satisfiability. For unphased genotypes, given in CSV format, the program generates corresponding constraints and writes the resulting formula in CNF format to the file system. Next, the binary of the corresponding SAT solver is executed with the newly generated CNF file as input. After successful termination of the solver, SATlotyper reads, analyses and reports the output of the solver in XML format (Additional file 1).
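The CNF file handed to the solver typically follows the DIMACS format: a header line `p cnf <variables> <clauses>` followed by one clause per line, each terminated by 0. The helper below is an illustrative sketch of such a writer, not SATlotyper's own code.

```python
def to_dimacs(num_vars, clauses):
    """Serialise clauses (lists of signed ints) as DIMACS CNF text."""
    lines = [f"p cnf {num_vars} {len(clauses)}"]
    lines += [" ".join(map(str, clause)) + " 0" for clause in clauses]
    return "\n".join(lines) + "\n"

# (x1 OR NOT x2) AND (x2 OR x3)
print(to_dimacs(3, [[1, -2], [2, 3]]), end="")
```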
SATlotyper is able to execute different SAT solvers and was tested with MiniSat [11, 12], MiraXT [13] (a multithreaded SAT solver) and Sat4J [14] but can be easily adapted to other solvers accepting standard CNF file format. Access to single Boolean variables is realised by a hash which contains corresponding matrices. This object allows indexing by means of the corresponding keyword, for instance "haplotype", for a given type of variable.
Results
The following results were computed on a laptop with 2048 MB RAM and AMD Turion™ 64 X2 Mobile Technology TL-56 (2 × 1.80 GHz). The operating system was a Linux system (Debian 4.0 ("etch")), kernel version 2.6.18-5-amd64. MiniSat 2 (minisat2-070721.zip [11, 12]) was used for solving SAT.
Development of SATlotyper
The presented generalisation of the original SAT approach [7] led to the development of SATlotyper, which can infer polyploid and polyallelic input. The SATlotyper algorithm is able to handle incomplete data sets where SNP sites are partly missing, without bringing in unjustified assumptions. For instance, SNP sites are missing when genotypes are heterozygous for alleles with indels (insertions or deletions) that may result in an interruption of analysable sequence data. Unknown sites are marked "N". With the SAT approach, no assumptions are made for individuals that contain SNP sites with no information available, i.e. the formulation of constraints for the corresponding individual and SNP site is omitted. If the formulation of constraints for a site is omitted, the SAT solver uses a set of haplotypes inferred from other unphased genotypes, provided that these haplotypes are compatible with the known sites of the unphased genotype containing missing information. The choice of haplotypes for explaining such a genotype is independent of the alleles that the explaining haplotypes show at the site with missing information.
Testing SATlotyper on simulated data
In order to test SATlotyper's performance, we simulated haplotypes comprising six SNP sites for ten tetraploid, biallelic populations with 100 individuals each. For every population, six simulated haplotypes were used as a pool for further simulation. These six different haplotypes of one population were sampled uniformly to generate a population of tetraploid individuals. The alleles of these haplotypes were also sampled uniformly. The simulation resulted in ten data sets of 100 individuals each.
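A simulation along these lines can be sketched as follows. Details such as the random number generator, the seed and the way unphased genotypes are represented (sorted allele tuples per site) are assumptions for illustration, and the noise model is not covered.

```python
import random

def simulate_population(num_haplotypes=6, num_sites=6, ploidy=4,
                        num_individuals=100, seed=0):
    """Draw a pool of distinct biallelic haplotypes and sample individuals from it."""
    rng = random.Random(seed)
    pool = set()
    while len(pool) < num_haplotypes:      # alleles sampled uniformly from {0, 1}
        pool.add(tuple(rng.randint(0, 1) for _ in range(num_sites)))
    pool = sorted(pool)
    # Each individual receives `ploidy` haplotypes sampled uniformly from the pool.
    phased = [tuple(rng.choice(pool) for _ in range(ploidy))
              for _ in range(num_individuals)]
    # Unphased view: per site, only the multiset of alleles is kept.
    unphased = [tuple(tuple(sorted(h[j] for h in ind)) for j in range(num_sites))
                for ind in phased]
    return pool, phased, unphased

pool, phased, unphased = simulate_population()
print(len(pool), len(phased), unphased[0])
```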
Comparison of the different methods of SATlotyper
Method | Alt. expl. hap. | Bootstrapping | Optimisation |
---|---|---|---|
1 | No | No | No |
2 | Yes | Yes | No |
3 | Yes | Yes | Yes |
4 | No | No | Yes |
Method 1: for each data set exactly one haplotype inference was calculated.
Method 2: for each data set up to 250 alternative most parsimonious sets of explaining haplotypes and the corresponding haplotype inferences were calculated. Additionally, bootstrapping was performed based on the calculated haplotypes by generation of 250 bootstrapping replicates. Phased genotypes were then scored by the sum of the scores of their constituent haplotypes, and these values were summed up to score complete haplotype inferences (see Section "Implementation"). The best scored haplotype inference was selected without further optimisation with regard to genotype inference.
Method 3: the analysis described in Method 2 was further refined by an optimisation with regard to genotype inference performed for each alternative most parsimonious set of haplotypes (see Section "Implementation").
Method 4: the first haplotype inference was used to optimise the genotype inference. For this purpose, the haplotypes were scored by their frequency in all genotypes of the first haplotype inference. Next, optimisation of the genotypes was carried out as described.
Without noise, all methods gave predictions close to 100% correctness. With noise added, the correctness of the four analysis methods relative to the original data increased in the following order: Method 1 < Method 2 < Method 4 < Method 3. Thus Method 3, which combines bootstrapping and genotype optimisation, gave the best results for all noise levels. The comparison between Method 2 and Method 1 demonstrated that the application of bootstrapping in order to select the highest scored haplotype inference (Method 2) gives better results than the method without bootstrapping (Method 1). The distributions of nucleotide distances (minimal Hamming distance) from Method 1 and Method 3 for a given amount of noise were compared by the Kruskal-Wallis test with a significance level of 5%. All p-values except for the 0%-noise case were below 0.05, and consequently the null hypothesis of both distributions being the same was rejected. Although the distributions of nucleotide distances from Method 1 and Method 2 were not significantly different, the mean values of the distances of Method 2 were always smaller than those of Method 1.
Performance of SATlotyper with unphased SNP data from tetraploid potato genotypes
The performance of SATlotyper was tested using unphased SNP data from the locus BA213c14t7 of Solanum tuberosum. Locus BA213c14t7 corresponds to the sequenced T7-end of the BAC (bacterial artificial chromosome) clone BA213c14 and is located on potato chromosome V between the markers GP21 and GP179 near the R1 gene for resistance to late blight [22] (see Chromosome V in PoMaMo, The Potato Maps and More Database [23, 24]). This intergenic sequence region is characterised by high sequence variability. The BA213c14t7 sequence also includes SNP sites associated with resistance against the parasitic root cyst nematode Globodera pallida [25].
Comparison of SATlotyper results with experimentally determined haplotypes
In order to evaluate SATlotyper further, we compared computed haplotypes with experimentally determined haplotypes at the BA213c14t7 locus using a subset of nineteen heterozygous tetraploid individuals out of the two populations described above. We identified the haplotypes for twelve SNP sites both computationally and experimentally. The sequence of the BA213c14t7 locus and the SNP sites analysed are shown in Figure 9.
Computational haplotype inference
The unphased SNP data from the nineteen individuals were used as input for computational haplotype inference with SATlotyper (Method 2). Up to 250 alternative most parsimonious sets of haplotypes and the corresponding haplotype inferences were calculated. On the basis of the calculated haplotypes, bootstrapping was performed (250 samples) in order to score the alternative haplotype inferences. The haplotype inference with the highest score was selected. SATlotyper identified 114 alternative most parsimonious sets of haplotypes for this data set, each with a minimal number of twelve explaining haplotypes. Additional file 1 (XML output of SATlotyper) gives the input data, the bootstrapping results for all haplotypes and the different scored haplotype inferences, ordered by score. For each alternative haplotype inference the first corresponding genotype inference is given. In Figure 9, the twelve haplotypes obtained from the haplotype inference with the highest bootstrapping score are listed, together with the experimentally determined haplotypes. In addition, an optimisation with regard to genotype inference was performed for all alternative haplotype inferences (Method 3).
Experimental haplotype inference
The inference of haplotypes by SATlotyper from experimental SNP data requires the scoring of the SNP allele dosage (zero, one, two, three or four in a tetraploid individual) in PCR amplicons derived from partially heterozygous individuals. Preferential amplification of one allele versus the other may occur at heterozygous loci, resulting in erroneous scores of the allele dosage [25], which leads to the calculation of erroneous haplotypes by SATlotyper. Even with a low percentage of erroneous scores of allele dosage per single SNP site, the combination of errors from several SNP sites can lead to an inflated number of haplotypes that do not exist. To verify haplotype models computed by SATlotyper from experimental SNP data, which are not error free, we performed an independent experimental haplotyping. The number and dosage of haplotypes present at a specific locus in a given individual can be experimentally determined by cloning and sequencing a sufficient number of PCR fragments generated from genomic DNA of that individual at that specific locus. The number of different haplotypes is inferred from the number of consensus sequence variants found in the clone sample, and the haplotype dosage is inferred from the frequency of each consensus sequence variant in the clone sample.
Table 4. Experimental haplotypes H1 to H10
Haplotype | SNP139 | SNP143 | SNP152 | SNP157 | SNP178 | SNP214 | SNP218 | SNP236 | SNP244 | SNP253 | SNP273 | SNP274 | Frequency [%] |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
H1 | A | G | T | T | G | C | C | C | C | T | G | T | 26.3 |
H2 | A | G | T | T | G | C | A | C | T | C | T | A | 1.3 |
H3 | G | G | T | T | G | C | C | C | C | T | G | T | 2.6 |
H4 | A | A | T | G | A | C | C | C | T | T | G | A | 2.6 |
H5 | G | G | T | T | G | C | A | C | T | C | T | A | 19.7 |
H6 | G | G | T | T | G | T | A | C | T | C | T | A | 2.6 |
H7 | G | G | A | T | G | C | A | C | T | C | T | A | 1.3 |
H8 | G | G | A | T | G | C | C | T | C | T | G | A | 38.1 |
H9 | G | G | A | T | G | C | C | C | C | T | G | A | 1.3 |
H10 | G | G | T | T | G | C | C | C | C | T | G | A | 3.9 |
Table 5. Haplotypes found in nineteen tetraploid potato individuals and resulting genotype model
Individual | Clones per haplotype | Genotype model | χ^{2}-value | p-value |
---|---|---|---|---|
S9 | H1: 19, H2: 2, H5: 3 | 1/1/2/5 | 8.25 | 0.02 |
S25 | H4: 21, H8: 27 | 4/4/8/8 | 0.75 | 0.39 |
S50 | H1: 18, H3: 3, H6: 3 | 1/1/3/6 | 6.00 | 0.05 |
S73 | H1: 18, H8: 14 | 1/1/8/8 | 0.50 | 0.48 |
S83 | H1: 26, H5: 2 | 1/1/1/5 | 4.76 | 0.03 |
S89 | H5: 7, H7: 2, H8: 14 | 8/8/5/7 | 3.26 | 0.20 |
S93 | H8: 26 | 8/8/8/8 | 0 | 1 |
B1 | H5: 7, H8: 17 | 8/8/8/5 | 0.22 | 0.64 |
B4 | H5: 4, H8: 19 | 8/8/8/5 | 0.71 | 0.40 |
B26 | 24 | | 0 | 1 |
B30 | H5: 79 | 5/5/5/5 | 0 | 1 |
B37 | H1: 7, H8: 16 | 8/8/8/1 | 0.36 | 0.55 |
B43 | H1: 28, H3: 2 | 1/1/1/3 | 5.38 | 0.02 |
B52 | H1: 9, H8: 14 | 1/1/8/8 or 1/8/8/8 | 1.09 or 2.45 | 0.30 or 0.12 |
B63 | H5: 2, H8: 17, H9: 2, H10: 8 | 5/8/9/10 | 20.79 | 0.0001 |
B75 | H5: 2, H8: 9, H10: 13 | 10/10/5/8 | 4.25 | 0.12 |
B80 | H1: 10, H5: 12 | 1/1/5/5 | 0.18 | 0.67 |
B86 | H5: 5, H8: 19 | 8/8/8/5 | 0.22 | 0.64 |
B108 | H1: 20, H3: 3 | 1/1/1/5 | 1.75 | 0.19 |
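The χ² and p-values in the table of clone counts can be checked with a short goodness-of-fit calculation. The following is a minimal sketch, not the authors' code: the function names are hypothetical, the expected clone count for a haplotype is assumed to be the total number of clones times its dosage divided by the ploidy, and the degrees of freedom are assumed to be the number of distinct haplotypes in the genotype model minus one.

```python
import math
from collections import Counter


def chi_square_fit(clone_counts, genotype_model):
    """Compare observed clone counts with the dosages implied by a genotype model.

    clone_counts: dict mapping haplotype id -> number of clones observed
    genotype_model: list of haplotype ids, one per chromosome copy
    Returns (chi-square statistic, assumed degrees of freedom).
    """
    dosage = Counter(genotype_model)
    if set(dosage) != set(clone_counts):
        raise ValueError("model and observations must name the same haplotypes")
    total = sum(clone_counts.values())
    ploidy = len(genotype_model)
    chi2 = 0.0
    for hap, copies in dosage.items():
        expected = total * copies / ploidy  # e.g. 24 clones * 2/4 = 12
        chi2 += (clone_counts[hap] - expected) ** 2 / expected
    return chi2, len(dosage) - 1


def chi_square_p_value(chi2, df):
    """Upper-tail probability of the chi-square distribution for df = 0..3.

    Closed forms are used so that only the standard library is needed;
    a tetraploid carries at most four distinct haplotypes, hence df <= 3.
    """
    if df == 0:
        return 1.0  # homozygous model: nothing to test
    if df == 1:
        return math.erfc(math.sqrt(chi2 / 2))
    if df == 2:
        return math.exp(-chi2 / 2)
    if df == 3:
        return (math.erfc(math.sqrt(chi2 / 2))
                + math.sqrt(2 * chi2 / math.pi) * math.exp(-chi2 / 2))
    raise ValueError("only df = 0..3 are supported in this sketch")


# Individual S9: 19, 2 and 3 clones for the model 1/1/2/5
chi2, df = chi_square_fit({"H1": 19, "H2": 2, "H5": 3}, ["H1", "H1", "H2", "H5"])
print(f"chi2 = {chi2:.2f}, p = {chi_square_p_value(chi2, df):.2f}")
```

Under these assumptions the example for individual S9 gives χ² = 8.25 and p = 0.02, in agreement with the tabulated values; the same holds for the four-haplotype individual B63 (χ² = 20.79).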
Comparison of computed and experimental haplotype and genotype models
From the nineteen individuals analysed, nine haplotypes were identified by both methods (Figure 9). The only experimental haplotype not detected computationally was haplotype H2 (Table 4). H2 was identified only in individual S9, in only two out of twenty-four analysed clones (Table 5). Haplotype H2 did not occur in any of the 114 alternative computed inferences (Additional file 1), which leads to the conclusion that the unphased data do not support haplotype H2 in the most parsimonious set of explaining haplotypes. Three haplotypes were identified computationally but not experimentally. This may result from imperfect input data, for example from erroneous assignment of SNP allele dosage, which forces SATlotyper to create additional "non real" haplotypes in order to satisfy the input data. Alternatively, the experimental haplotype inference may have missed rare but real haplotypes owing to underrepresentation of the sequence in the cloned amplicons.
Formula 31 is motivated analogously to Formula 30, except that the amount of noise is not known. As a result, the denominator represents the total number of nucleotides. As shown in Figure 12, we obtained for 80% of the genotypes a correctness of at least 90% when compared with the experimental data. For two individuals (B43, B4) the predicted and experimental haplotypes were exactly the same. For twelve of the nineteen individuals at least three of the four predicted haplotypes were confirmed by the experimental data (Additional file 1, Table 5). The additional optimisation with regard to genotype inference according to Method 3 did not result in a further improvement of the Hamming distance and correctness between predicted and experimental phased genotypes for the nineteen individuals analysed.
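The comparison of a predicted with an experimental phased genotype can be sketched as follows. This is an illustrative reading, not a reproduction of Formula 31: the function names are hypothetical, the minimal Hamming distance is assumed to be the smallest total number of mismatching nucleotides over all pairings of predicted with reference haplotypes, and correctness is assumed to be one minus that distance divided by the total number of nucleotides.

```python
from itertools import permutations


def hamming(a, b):
    """Number of positions at which two equally long haplotypes differ."""
    if len(a) != len(b):
        raise ValueError("haplotypes must cover the same SNP sites")
    return sum(x != y for x, y in zip(a, b))


def minimal_hamming_distance(predicted, reference):
    """Smallest summed distance over all pairings of the haplotypes.

    A phased genotype is an unordered collection of haplotypes, so every
    ordering of the predicted haplotypes is tried against the reference.
    """
    if len(predicted) != len(reference):
        raise ValueError("genotypes must have the same ploidy")
    return min(
        sum(hamming(p, r) for p, r in zip(ordering, reference))
        for ordering in permutations(predicted)
    )


def correctness(predicted, reference):
    """Assumed form: 1 - distance / (ploidy * number of SNP sites)."""
    total_nucleotides = sum(len(h) for h in reference)
    return 1 - minimal_hamming_distance(predicted, reference) / total_nucleotides


# H1 and H3 from the table of experimental haplotypes differ only at SNP139.
H1 = "AGTTGCCCCTGT"
H3 = "GGTTGCCCCTGT"
reference = [H1, H1, H1, H3]
print(minimal_hamming_distance([H3, H1, H1, H1], reference))  # order is irrelevant
print(correctness([H1, H1, H1, H1], reference))               # hypothetical prediction
```

Enumerating all orderings is affordable for a tetraploid (24 permutations); for higher ploidy an assignment algorithm would be the more natural choice.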
Discussion
Existing approaches for inferring haplotypes from unphased SNP data are applicable only to biallelic data from diploid species. This study therefore aimed to develop an approach for calculating haplotypes in heterozygous polyploid species. Generalising the approach from [7], a Java-based program was developed which formulates HIPP as a Boolean satisfiability problem. Instead of giving the constraints for combinatorial subproblems explicitly, SATlotyper generates constraints for summing, such that the complexity decreases from exponential to polynomial for polyploid and polyallelic data sets. Other SAT-based methods for summing have been described [17]; these are possibly easier for the SAT solver to solve and could be used to optimise a future version of SATlotyper. SATlotyper is able to handle missing SNP information by omitting constraints for such sites, so that no unjustified assumptions about nucleotide frequencies have to be made.
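What it means for a set of haplotypes to explain an unphased polyploid genotype can be illustrated by brute-force enumeration. The sketch below is deliberately not the SAT encoding used by SATlotyper: it simply tries every combination of candidate haplotypes, which is exponential and only practical for toy inputs. The function names, the representation of a genotype as one sorted allele string per SNP site, and the use of `None` for a missing site are assumptions made for this example.

```python
from itertools import combinations_with_replacement


def unphase(haplotypes):
    """Collapse phased haplotypes into per-site allele dosages.

    Each site becomes a sorted string of alleles, e.g. 'AAGG' for a
    tetraploid carrying two A and two G alleles at that site.
    """
    return ["".join(sorted(site)) for site in zip(*haplotypes)]


def explaining_phasings(genotype, candidates, ploidy=4):
    """List every multiset of `ploidy` candidates that reproduces the genotype.

    Sites recorded as None are treated as missing and impose no constraint.
    """
    result = []
    for combo in combinations_with_replacement(sorted(set(candidates)), ploidy):
        observed = unphase(combo)
        if all(g is None or g == o for g, o in zip(genotype, observed)):
            result.append(combo)
    return result


# Toy example with three SNP sites and four hypothetical candidate haplotypes.
genotype = ["AAGG", "GGGG", "ATTT"]
candidates = ["AGT", "AGA", "GGT", "GGA"]
for phasing in explaining_phasings(genotype, candidates):
    print(phasing)
```

In this toy case two different phasings reproduce the same allele dosages, which is the ambiguity that makes alternative haplotype and genotype inferences possible in the first place.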
For a given data set of unphased genotypes, SATlotyper can calculate the first most parsimonious set of explaining haplotypes and the corresponding phased genotypes (Method 1). One drawback of the parsimony approach is the sparsity of statistical information. A bootstrapping procedure can therefore be used to score haplotype inferences when there is more than one possible haplotype inference (Method 2). Since unphased genotypes can also have alternative inferences, it is possible to optimise the phased genotypes with respect to a most parsimonious set of haplotypes and the corresponding bootstrapping scores (Method 3). It is also possible to score the haplotypes without bootstrapping, simply by their frequencies in the phased genotypes. These scores can be used for selecting the best haplotype inference in the case of alternative inferences and for performing an optimisation with regard to alternative genotype inferences (Method 4).
In this study, SATlotyper was tested and evaluated with simulated and experimental data sets of unphased SNP sites from tetraploid individuals. The different SATlotyper methods were compared with the simulated data (Figure 8). Prior to analysis, noise from 0% to 10% was added to the data, to account for erroneous SNP scores in the input data.
Without noise all methods were able to predict the correct set of haplotypes which were used in the simulation (Figure 8). Compared with the original simulation, the haplotype compositions of the phased genotypes were close to the composition of the simulated genotypes (> 99% correctness). With noise added, Method 3 using bootstrapping and optimisation gave the best results (Figure 8). It is likely that the relatively small difference between Method 2 with bootstrapping and Method 1 without bootstrapping (Figure 8) can be explained by the simulation. Because the haplotypes are uniformly distributed, it is very likely that – even in case of noise – all original haplotypes are present in the first found haplotype inference. Thus, an analysis of different distributions of haplotypes in populations is still missing. In the case of real data we would expect a larger difference between the applications of Method 1 and Method 2.
The results obtained when Method 4 (Figure 8) was applied suggest that for some purposes it could be sufficient simply to score the haplotypes according to their frequencies in the phased genotypes when optimising genotype inference. This suggests that, for data sets that are time-consuming to infer, Method 4 can be used for optimisation so that the time-consuming bootstrapping can be omitted.
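The idea of frequency-based scoring can be sketched in a few lines. This is only an illustration of the principle, not SATlotyper's implementation: the function names are hypothetical, and the score of a phased genotype is assumed here to be the sum of the population frequencies of its haplotypes, which the text does not specify.

```python
from collections import Counter


def haplotype_frequencies(phased_genotypes):
    """Relative frequency of each haplotype over all phased genotypes."""
    counts = Counter(h for genotype in phased_genotypes for h in genotype)
    total = sum(counts.values())
    return {h: c / total for h, c in counts.items()}


def best_phasing(alternatives, scores):
    """Pick the alternative whose haplotypes have the highest summed score.

    Haplotypes without a score contribute zero (an assumption of this sketch).
    """
    return max(alternatives, key=lambda g: sum(scores.get(h, 0.0) for h in g))


# Hypothetical phased genotypes of three tetraploid individuals.
population = [
    ("H1", "H1", "H8", "H8"),
    ("H5", "H8", "H8", "H8"),
    ("H1", "H1", "H5", "H5"),
]
scores = haplotype_frequencies(population)

# Two hypothetical alternative phasings for a fourth individual.
alternatives = [("H1", "H5", "H8", "H9"), ("H1", "H5", "H8", "H8")]
print(best_phasing(alternatives, scores))
```

Because the scores come directly from counting, no resampling is needed, which is where the time saving over bootstrapping would come from.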
SATlotyper was also applied to an experimental data set of twelve unphased SNP markers, which were scored by sequencing of the amplicons of a 500 bp region at potato locus BA213c14t7. As we have verified only one locus so far, it is not possible to draw a firm conclusion about how representative the data set of the BA213c14t7 locus is. Some variation is expected between different loci with respect to the quality of an amplicon [2] for direct sequencing and whether the amplicon is representative of the genotype at the amplified locus. The performance of the approach was much higher with the experimental data than with simulated data. Nevertheless, the running time increased exponentially with the linearly increasing number of SNP sites (Figure 10).
In addition to performance, the quality of the prediction was evaluated by comparison of predicted haplotypes with experimental haplotypes that were determined by amplicon cloning and sequencing [2]. Unfortunately, the experimental validation of haplotypes is time consuming and expensive. Thus, only a subset of nineteen heterozygous unphased individuals was available for the direct comparison. Furthermore, it has to be taken into account that the evaluation of predicted haplotypes based on comparison with experimentally determined haplotypes is slightly restricted by the fact that the experimental haplotypes are not error-free. In this study, new insights were gained in the experimental set-up for haplotype inference in autotetraploid species by molecular cloning and sequencing of amplicons. In several cases, the observed frequency of amplicon sequences deviated from the expected frequency (0.25, 0.50 or 0.75). One reason could be a difference in the G/C-content of the alleles resulting in altered performances of the PCR-reaction [25, 26]. Even slight differences in the initial PCR cycles are enhanced further on in the downstream reactions.
This first comparison of computed with experimental haplotypes gave promising results: nine of the ten experimental haplotypes were also identified by SATlotyper prediction out of the sub-population of nineteen individuals (Figure 9). With respect to the phased genotypes, the SATlotyper analysis achieved a correctness of at least 90% (for 80% of the individuals) compared with the experimental result (Figure 12). With the exception of Method 1, all SATlotyper methods gave similar results with this data set.
Conclusion
The study demonstrates that HIPP can efficiently be solved for data sets of unphased SNP sites from heterozygous polyploids by a generalisation of the SAT approach from [7]. Our results are encouraging for the future application and further development of SATlotyper. Existing or newly generated unphased SNP data can be analysed by SATlotyper to infer haplotypes. Haplotype information can be used instead of individual SNP sites in association mapping that exploits the biodiversity in existing cultivars and breeding lines [2]. Compared with methods based on individual SNP sites, the haplotype mapping method significantly improves the power and robustness of gene mapping techniques [27] as there are fewer haplotypes than SNP sites [2].
Availability and requirements
SATlotyper was developed within the scope of GABI (Genome analysis of the plant biological system) projects and can be downloaded from the SATlotyper project page [28] of GabiPD, The GABI Primary Database [29]. The software is distributed as a Java JAR file and requires Java Runtime Environment 1.5.0 or higher. For the user's convenience, the downloadable archive contains statically linked versions of MiniSat [11, 12] and MiraXT [13]. The software is run from the command line. On UNIX-like systems the program runs out of the box with MiniSat [11, 12], MiraXT [13] and the Sat4J solver [14]. Users of Microsoft^{®} Windows are restricted to running the Sat4J solver. SATlotyper is freeware for scientific use and is distributed under the SATlotyper licence, which is also included in the downloadable package.
Declarations
Acknowledgements
We are very grateful to the potato breeding companies Saka-Ragis, Windeby, Germany, and Böhm-Nordkartoffel Agrarproduktion, Ebstorf, Germany, who kindly provided the analysed potato individuals. We thank Prof. Knut Reinert, Axel Nagel, Zoran Nikoloski, Liam Childs and Stefanie Hartmann for helpful discussions. This study was supported by grants from the German Federal Ministry of Education and Research (BMBF grants 0313112, 0313114A, 0315046), by the Max-Planck-Society and the former RZPD.
References
- Ching A, Caldwell KS, Jung M, Dolan M, Smith OS, Tingey S, Morgante M, Rafalski AJ: SNP frequency, haplotype structure and linkage disequilibrium in elite maize inbred lines. BMC Genetics. 2002, 3: 19.
- Simko I: One potato, two potato: haplotype association mapping in autotetraploids. Trends in Plant Science. 2004, 9 (9): 441-448.
- Salem RM, Wessel J, Schork NJ: A comprehensive literature review of haplotyping software and methods for use with unrelated individuals. Human Genomics. 2005, 2: 39-66.
- Vaughan DA, Balazs E, Heslop-Harrison JS: From crop domestication to super-domestication. Annals of Botany. 2007, 100 (5): 893-901.
- Gusfield D: Haplotype inference by Pure Parsimony. Proceedings of the 14th annual Symposium on Combinatorial Pattern Matching. 2003, 144-155.
- Cook SA: The complexity of theorem-proving procedures. Proceedings of the third annual ACM symposium on Theory of computing. 1971, 151-158.
- Lynce I, Marques-Silva JP: Efficient haplotype inference with Boolean Satisfiability. National Conference on Artificial Intelligence (AAAI). 2006.
- Lynce I, Marques-Silva JP: Breaking symmetries in SAT matrix models. Theory and Applications of Satisfiability Testing – SAT, Volume 4501 of Lecture Notes in Computer Science. Edited by: Marques-Silva JP, Sakallah K. 2007, Springer Berlin/Heidelberg, 22-27.
- Brown DG, Harrower IM: A new Integer Programming formulation for the Pure Parsimony Problem in haplotype analysis. WABI. 2004, 254-265.
- Brown DG, Harrower IM: Integer Programming approaches to haplotypes inference by Pure Parsimony. IEEE/ACM Trans Comput Biol Bioinform. 2006, 3 (2): 141-154.
- Eén N, Sörensson N: An extensible SAT-solver. 2003, [http://minisat.se/Papers.html]
- The MiniSat Page. [http://minisat.se/]
- Lewis M, Schubert T, Becker B: Multithreaded SAT Solving. Proceedings of the 2007 conference on Asia South Pacific design automation. 2007, 926-931.
- Berre DL: SAT4J, the satisfiability library for java. 2004, [http://www.sat4j.org/]
- A SAT-based system for haplotype inference from genotype data, using the pure parsimony criterion. [http://users.ecs.soton.ac.uk/jpms/soft/]
- Tseitin GS: On the complexity of derivation in propositional calculus. Studies in Constructive Mathematics and Mathematical Logic, Part II. 1968.
- Eén N, Sörensson N: Translating Pseudo-Boolean constraints into SAT. Journal on Satisfiability, Boolean Modeling and Computation. 2006, 2: 1-25.
- Felsenstein J: Inferring Phylogenies. 2004, Sunderland, Massachusetts: Sinauer Associates, Inc.
- Marques-Silva JP, Lynce I, Graça A, Oliveira AL: Efficient and tight upper bounds for haplotype inference by Pure Parsimony using delayed haplotype selection. Progress in Artificial Intelligence, Volume 4874 of Lecture Notes in Computer Science, Springer Berlin/Heidelberg. 2007, 4874: 621-632.
- Graça A, Marques-Silva JP, Lynce I, Oliveira AL: Efficient Haplotype Inference with Pseudo-Boolean Optimization. Algebraic Biology, Second International Conference, AB 2007, Castle of Hagenberg, Austria, July 2–4, 2007, Proceedings, Volume 4545 of Lecture Notes in Computer Science. Edited by: Anai H, Horimoto K, Kutsia T. 2007, Springer Berlin/Heidelberg, 125-139.
- Graça A, Marques-Silva JP, Lynce I, Oliveira AL: Efficient Haplotype Inference with Combined CP and OR Techniques. Integration of AI and OR Techniques in Constraint Programming for Combinatorial Optimization Problems, Volume 5015 of Lecture Notes in Computer Science, Springer Berlin/Heidelberg. 2008, 308-312.
- Ballvora A, Jöcker A, Viehöver P, Ishihara H, Paal J, Meksem K, Bruggmann R, Schoof H, Weisshaar B, Gebhardt C: Comparative sequence analysis of Solanum and Arabidopsis in a hot spot for pathogen resistance on potato chromosome V reveals a patchwork of conserved and rapidly evolving genome segments. BMC Genomics. 2007, 8: 112.
- The Potato Maps and More Database PoMaMo. [http://www.gabipd.org/projects/Pomamo/]
- Meyer S, Nagel A, Gebhardt C: PoMaMo – a comprehensive database for potato genome data. Nucleic Acids Res. 2005, 33: D666-D670.
- Sattarzadeh A, Achenbach U, Lübeck J, Strahwald J, Tacke E, Hofferbert HR, Rothsteyn T, Gebhardt C: Single nucleotide polymorphism (SNP) genotyping as basis for developing a PCR-based marker highly diagnostic for potato varieties with high resistance to Globodera pallida pathotype Pa2/3. Molecular Breeding. 2006, 18 (4): 301-312.
- Kuang HH, Wei FS, Marano MR, Wirtz U, Wang XX, Liu J, Shum WP, Zaborsky J, Tallon LJ, Rensink W, Lobst S, Zhang PF, Tornqvist CE, Tek A, Bamberg J, Helgeson J, Fry W, You F, Luo MC, Jiang JM, Buell CR, Baker B: The R1 resistance gene cluster contains three groups of independently evolving, type I R1 homologues and shows substantial structural variation among haplotypes of Solanum demissum. Plant Journal. 2005, 44: 37-51.
- Akey J, Jin L, Xiong M: Haplotypes vs single marker linkage disequilibrium tests: what do we gain?. European Journal of Human Genetics. 2001, 9: 291-300.
- The GABI Primary Database GabiPD – SATlotyper. [http://www.gabipd.org/projects/satlotyper/]
- The GABI Primary Database GabiPD. [http://www.gabipd.org/]
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.