New algorithm improves fine structure of the barley consensus SNP map

Endelman, Jeffrey B

doi:10.1186/1471-2164-12-407

Methodology article
Open access
Published: 10 August 2011

New algorithm improves fine structure of the barley consensus SNP map

Jeffrey B Endelman¹

BMC Genomics volume 12, Article number: 407 (2011) Cite this article

4040 Accesses
12 Citations
6 Altmetric
Metrics details

Abstract

Background

The need to integrate information from multiple linkage maps is a long-standing problem in genetics. One way to visualize the complex ordinal relationships is with a directed graph, where each vertex in the graph is a bin of markers. When there are no ordering conflicts between the linkage maps, the result is a directed acyclic graph, or DAG, which can then be linearized to produce a consensus map.

Results

New algorithms for the simplification and linearization of consensus graphs have been implemented as a package for the R computing environment called DAGGER. The simplified consensus graphs produced by DAGGER exactly capture the ordinal relationships present in a series of linkage maps. Using either linear or quadratic programming, DAGGER generates a consensus map with minimum error relative to the linkage maps while remaining ordinally consistent with them. Both linearization methods produce consensus maps that are compressed relative to the mean of the linkage maps. After rescaling, however, the consensus maps had higher accuracy (and higher marker density) than the individual linkage maps in genetic simulations. When applied to four barley linkage maps genotyped at nearly 3000 SNP markers, DAGGER produced a consensus map with improved fine structure compared to the existing barley consensus SNP map. The root-mean-squared error between the linkage maps and the DAGGER map was 0.82 cM per marker interval compared to 2.28 cM for the existing consensus map. Examination of the barley hardness locus at the 5HS telomere, for which there is a physical map, confirmed that the DAGGER output was more accurate for fine structure analysis.

Conclusions

The R package DAGGER is an effective, freely available resource for integrating the information from a set of consistent linkage maps.

Background

The need to integrate information from multiple linkage maps into a consensus map is a long-standing problem in genetics [1, 2]. Consensus maps have been developed for many crops, including wheat [3], sorghum [4], and potato [5]. Within the barley research community alone, at least seven consensus maps have been published in the past six years [6–12].

Wenzl et al. [7] differentiated between two broad strategies for constructing consensus maps. In the traditional approach, the consensus map is determined directly from the genotypic data, using extensions of the maximum-likelihood methods developed for single populations [13]. While this approach was effective for many years, it has not always produced consistent and timely results as marker densities have continued to increase [7, 14]. These limitations have spurred innovation in a second ("synthetic" [7]) strategy, in which the consensus map is generated from the linkage maps without recourse to the original genotypic data.

It was Yap et al. [15] who first recognized that the complex ordinal relationships present in a series of linkage maps can be exactly represented as a directed graph (Figure 1). Each graph represents a single chromosome and each vertex in the graph represents a bin of markers. Provided there are no ordering conflicts between the linkage maps, the ordinal relationships in the directed graph are equivalent to those in the linkage maps. In other words, there is a path from vertex v to vertex w in the graph if and only if bin v comes before bin w in one of the linkage maps.

Wu et al. [14] have capitalized on this graph-theoretic formulation in their software MergeMap [16], a free resource for constructing consensus maps. MergeMap contains an efficient algorithm for resolving ordering conflicts between linkage maps, which appear as cycles in the consensus graph. The removal of these cycles produces a directed acyclic graph, or DAG, which MergeMap then simplifies and linearizes to produce a consensus map.

Close et al. [11] used MergeMap to integrate four barley linkage maps genotyped at nearly 3000 SNP markers. The resulting consensus SNP map has been used for association mapping [17–19] by members of the Barley Coordinated Agriculture Project (Barley CAP) [20]. In addition to the consensus map, Close et al. [11] published consensus graphs for each of the seven barley chromosomes. While examining the fine structure of the hardness locus on the 5HS telomere, which has been sequenced [21], several unphysical orderings were detected in the consensus graph that are not present in any of the linkage maps.

The discovery that the simplified consensus graphs produced by MergeMap are not ordinally equivalent to the linkage maps prompted the development of new algorithms for the simplification and linearization of consensus graphs. These algorithms have been implemented as a package for the R computing environment [22] called DAGGER [23]. To validate DAGGER, a new barley consensus SNP map was constructed and compared with the results of Close et al. [11]. The performance of DAGGER was also evaluated with simulated data.

Results and Discussion

Constructing the consensus graph

Given a list of linkage maps, DAGGER builds up the consensus graph sequentially. The first two linkage maps are merged to create a consensus graph, which is then merged with the next linkage map, and so on. The merging algorithm proceeds from the first to the last bin in the linkage map M that is to be integrated with graph G. Let S_v be the set of vertices in G that contain one or more markers in bin v of M. For each vertex w in S_v , if all of the markers in w are contained in v, then w remains intact. If only some of the markers in w are in v, then w is split into two vertices: w₁ contains the common markers between v and w, and w₂ contains the remaining markers in w. All of the edges directed in and out of w are replicated for w₁ and w₂. Vertex w₁ also receives new edges directed in and out of it according to the immediately proximal and distal bins in M. Any markers in v that were not present in G are added as a new vertex with appropriate edges. When completed, the consensus graph contains a directed edge for every map interval between adjacent bins in the linkage maps (Figure 1).

During construction of the consensus graph, DAGGER also keeps track of which markers were binned in the linkage maps. The map distance between binned markers is zero, and this information is needed to minimize the error between the consensus map and linkage maps (see below). Although depicted in the consensus graph (Figure 1), the zero-length edges are not part of its topology (i.e., they do not imply ordinal relationships).

DAGGER checks for ordering conflicts between the linkage maps by identifying the strongly connected components of the consensus graph. Two vertices v and w are strongly connected if there is a path from v to w and from w to v. In the context of markers, that would mean the markers in v mapped before those in w, which in turn mapped before those in v; this represents an inconsistency between the linkage maps.

Identifying the strongly connected components of a directed graph is a standard exercise in computer science [24]. DAGGER performs two depth-first searches, one on the reverse graph G^R (formed by reversing the edges in G), and then on G itself. The traversal of G^R generates a topologically sorted list of the vertices. By conducting the depth-first search of G in this order, the algorithm identifies the strongly connected components of G.

DAGGER will only proceed with graph simplification and linearization if the marker order between linkage maps is consistent. To facilitate manual curation of errors in the linkage maps, DAGGER will output the strongly connected components of G for visualization with the Graphviz dot software [25]. When the inconsistency is due to a small misplacement, this visual will indicate where to look in the genotypic data to resolve the ordering conflict. For more complex and long-distance inconsistencies, the conflict-resolution algorithm in MergeMap [14] is a valuable resource.

Simplifying the consensus graph

In the absence of ordering conflicts, DAGGER will output the consensus graph for visualization in one of two forms. The first option is to display the graph with all of the distance information used to generate the consensus map. The second option is to simplify the graph so that it conveys the ordering of the markers but not their distances, which potentially allows many edges to be removed (Figure 1).

The ordinal consensus graph produced by DAGGER contains only those edges that are needed to satisfy the equivalency property stated in the introduction, namely, that there exists a path from vertex v to vertex w if and only if bin v mapped before bin w in one of the linkage maps. The edge reduction proceeds by first using the topologically sorted vertex list from the depth-first search of G^R to determine, for each vertex v, the list of all vertices reachable from v. Next, the edges for each vertex are considered for elimination according to the topological sort order of the vertices to which they are directed. Let Q be a topologically sorted list of the vertices to which there are directed edges from vertex v. If vertex Q_k is reachable from vertex Q_j with j < k, then the edge to Q_k is removed.

Following the edge reduction, DAGGER performs one other simplification without loss of ordinal information. Some of the marker bins that were split during the construction of the consensus graph, due to different distance estimates, may have the same pattern of inward and outward edges after the edge reduction. These vertices are combined and their markers put back in one bin.

Linearizing the consensus graph

Whereas the ordinal consensus graph produced by DAGGER will not be invalidated by further research (provided the component linkage maps are correct), this is not true of the marker order in the consensus map. Because recombination distances are population-dependent, the relative order of markers that have not been mapped together in a single population cannot always be determined unambiguously.

DAGGER generates a consensus map by minimizing the "error" between the consensus map and the linkage maps. In mathematical terms, this error may be written as

(1)

where x(m₁) and x(m₂) are the positions of markers m₁ and m₂ in the consensus map x, and d_k (m₁,m₂) ≥ 0 is a distance interval from the k^th linkage map. When the exponent α = 1, DAGGER minimizes the L₁ norm; when α = 2, the L₂ norm is minimized. The outer sum in Equation 1 is over the linkage maps, and the inner sum is over all pairs of markers (m₁,m₂) in either the same or adjacent bins in the k^th linkage map.

Because this double summation exactly corresponds to the edges (both directed and zero-length) in the consensus graph, Equation 1 can be rewritten as a sum over edges with an appropriate weighting factor:

(2)

Analogous to Equation 1, x(v) and x(w) are the positions of marker bins v and w in the consensus map (every vertex in the consensus graph becomes a bin in the consensus map), and d(v, w) ≥ 0 is an edge length. For Equation 1 to equal Equation 2, the weighting factor q_v,w = (# markers in v) × (# markers in w).

In addition to having minimum error, it is desirable that the consensus map be entirely consistent with the consensus graph (and by extension the linkage maps) in terms of marker order. This means that for every directed edge t → u in the consensus graph, the inequality x(t) ≤ x(u) is required for the consensus map. When this set of linear inequalities is combined with minimizing Equation 2, the constrained optimization problem solved by DAGGER can be written as

(3)

It is straightforward to show that Equation 3 can be solved by linear (α = 1) or quadratic (α = 2) programming methods. Let A be the p × n adjacency matrix for the directed edges in the consensus graph (p edges, n vertices). This means that in correspondence with the k^th edge t → u, the k^th row of A contains -1 at the position for bin t and 1 at the position for bin u. The r × n adjacency matrix B for the zero-length edges is defined similarly, and the combined (p+r)× n adjacency matrix C is formed by stacking A on B. If the vector d contains the edge lengths d(v, w) corresponding to the rows of A (d > 0) and B (d = 0), then the error vector is Cx - d and the consistency constraints are Ax ≥ 0. For α = 2, Equation 3 is equivalent to the following quadratic program (QP):

(4)

where the matrix Q = diag(q), and q is the vector of weights q_{v, w} in Equation 3. By introducing p+r additional variables, denoted by z, Equation 3 with α = 1 is equivalent to the following linear program (LP) [26]:

(5)

Validation with simulated data

To test the performance of DAGGER, it was applied to simulated linkage maps generated from 1000 evenly spaced markers with a total map length of 100 cM. The results in Table 1 are for random parents with an average sequence identity of 70%, for which the expected number of markers per linkage map is 0.3 × 1000 = 300. Using more divergent parents led to maps with more markers, but the trends were the same.

Table 1 DAGGER performance on simulated data

Full size table

The simulated data illustrate the primary benefit of integrating multiple linkage maps, which is that higher marker densities are possible. Table 1 shows that the average number of markers in each linkage map was indistinguishable from the expected value of 300. Consensus maps based on two linkage maps contained an average of 514 markers, and with eight linkage maps an average marker density of 944 out of 1000 was achieved.

An unusual feature of both the LP and QP linearization methods is their tendency to compress map intervals, whereas most consensus maps are inflated relative to the original linkage maps (e.g., [7, 11]). As shown in Table 1, the amount of compression increased as more linkage maps were integrated. For two linkage maps, the LP and QP consensus maps were compressed by 17% and 12% respectively, while for eight linkage maps the compression was 44% (LP) and 38% (QP). The QP consensus maps had consistently less compression than the LP maps. For the simple example shown in Figure 1, where both linkage maps have a total length of 3.0 cM, one can verify analytically that the QP consensus map is compressed to 2.8 cM (the LP map is 2.5 cM). By default DAGGER rescales the consensus map so that its total length equals the mean length of the component linkage maps, but the user can also request the compressed map.

Table 1 shows that, in addition to increasing marker density, integrating more linkage maps tends to reduce the error between the consensus map and the simulated physical map. This was true regardless of whether error was measured with the L₁ norm (mean absolute error) or L₂ norm (root-mean-squared [RMS] error). With only one linkage map the average RMS error was 4.7 (SE 0.2) cM, while with eight linkage maps the average RMS error was 2.4 (SE 0.1) cM for the QP method. Using either norm, the QP maps had consistently less error than the LP maps (contrast: RMSE_QP - RMSE_LP = -0.3 cM < 0, p < 10^-4).

Validation with barley data

The original impetus for developing DAGGER came from analyzing the results of Close et al. [11], who used MergeMap [14] to integrate the information from four barley linkage maps genotyped at nearly 3000 SNP markers. These same linkage maps were used as input for DAGGER, and no ordering conflicts were detected. The ordinal consensus graphs for the seven barley chromosomes (Figures S1-S7, Additional Files 1, 2, 3, 4, 5, 6, and 7) and the QP consensus map (Table S1, Additional File 8) produced by DAGGER are available online.

Figure 2 compares the ordinal consensus graph from DAGGER and the simplified consensus graph from MergeMap at the 5HS telomere. The four linkage maps for this region are reproduced in Table 2. Nine of the markers are for genes at the barley hardness locus, for which there is a physical map [21]. The physical order of these genes is hinb/hina/gsp from the distal end.

Table 2 Barley linkage maps in the 5HS telomere region

Full size table

None of the ordinal information in the DAGGER graph violates this physical order. The only relationships among the hinb/hina/gsp markers in the DAGGER graph are that marker 3_0984(hinb) is distal to markers 3_0975(gsp), 3_0979(hina), and 3_0977(gsp). Table 2 shows that this information comes from the Oregon Wolfe Barley (OWB) linkage map, and that no other ordinal information for the hardness locus markers is present in the linkage maps.

MergeMap also captures this ordering, but in addition there are numerous relationships that are not present in the linkage maps. This is apparent from the unphysical ordering of the hinb/hina/gsp markers. MergeMap shows the markers 3_0976(gsp), 3_0978(hina), 3_0980(hina), and 2_0226(hina) as mapping distal to both hinb markers, and the marker 3_0975(gsp) is shown distal to 3_0979(hina). The order of these markers is indeterminate from the linkage maps, which is how they are portrayed by DAGGER. The unphysical relationships in the simplified graph from MergeMap arise because of the way markers are binned. MergeMap will potentially bin markers if they are binned in at least one of the linkage maps and if their order is indeterminate [14].

To appreciate the consequences of this rule, consider the vertex with six markers at the top of the MergeMap graph. Four of these markers are from the hardness locus and were only present in the Steptoe-Morex (SM) map: 3_0976(gsp), 3_0978(hina), 3_0980(hina), and 2_0226(hina). The other two markers in the vertex--1_0745 and 2_0894--are binned with the hardness locus markers in the SM linkage map (see Table 2). The markers 1_0745 and 2_0894 were also co-segregating in the Morex-Barke (MB) population, where they mapped distal to the hinb marker 3_0983. Because the aforementioned hina and gsp markers were binned with 1_0745 and 2_0894 in the SM linkage map, and because their relationship to 3_0983(hinb) is indeterminate from the linkage maps, MergeMap binned all six markers together. This creates the implication that the hina and gsp markers are distal to 3_0983(hinb), which is consistent with the linkage maps but not implied by them (and known to be false from the physical map [21]). As this example illustrates, the simplified graphs from MergeMap are not ordinally equivalent to the linkage maps and will potentially contain unphysical relationships even if the linkage maps do not. The simplified graphs from DAGGER are ordinally equivalent to the linkage maps and will not contain unphysical relationships provided the linkage maps are physically correct.

Table 3 shows the number of markers, number of marker bins, and amount of compression in the DAGGER consensus map (using the QP method). Because DAGGER linearizes the unsimplified consensus graph, whereas MergeMap linearizes a simplified graph, the consensus map from DAGGER had twice the number of marker bins (1886 vs. 942). Over the entire genome the DAGGER map was compressed by 13%, while the consensus map from MergeMap was inflated by 46%. The average correlation coefficient between the two maps was 0.95 on a scale of 10 cM, indicating a high degree of similarity.

Table 3 Barley consensus map statistics

Full size table

Nonetheless, there were several indications of improved fine structure with DAGGER. Figure 3 compares the two (rescaled) consensus maps at the 5HS telomere. On the physical map the hardness locus (hinb to gsp) is 110 kb, within which there was only 1 recombinant out of the 278 total individuals in the SM, OWB, and MB populations [11]. This is reflected in the linkage maps in Table 2, where all the hardness locus markers appear in one bin in both the SM and MB linkage maps and in two bins separated by 1.1 cM in the OWB linkage map. From these data the hardness locus should occupy less than 1 cM in the consensus map, which was true with DAGGER (0.8 cM interval) but not with MergeMap (4.2 cM interval). Figure 4 shows the RMS error between the consensus map and the linkage maps for each of the seven barley chromosomes. The DAGGER map had consistently less error, with an average of 0.82 cM per marker interval compared to 2.28 cM with MergeMap.

Conclusions

New algorithms for the simplification and linearization of consensus graphs have been implemented as a package for the R computing environment called DAGGER. The package offers both linear and quadratic programming options for linearizing the consensus graph. When these two methods were compared using simulated data, the consensus maps generated by quadratic programming had less compression and higher accuracy. When applied to four barley linkage maps genotyped at nearly 3000 SNP markers, in less than one minute DAGGER produced a consensus map with improved fine structure compared to the existing barley consensus SNP map. The RMS error between the linkage maps and the DAGGER map was 0.82 cM per marker interval compared to 2.28 cM for the existing consensus map. Examination of the barley hardness locus at the 5HS telomere confirmed that the DAGGER output was more accurate for fine structure analysis. DAGGER is an effective, freely available resource for integrating the information from a set of consistent linkage maps.