### Construction of simulated datasets

Details of the chosen reference sequences for all of the genomes studied are shown in Additional file 6. Any missing bases (N) in the reference were replaced with A/T or C/G based on the frequencies of these nucleotides in that genome's sequence. Lengths and quality scores for the simulated reads were extracted from actual *T. cacao* 454 sequencing data (B Scheffler, unpubl. data). Unpaired reads had a mean length of approximately 350 bp with a standard deviation of 150 bp. The paired reads we generated had a mean length of 170 bp with a standard deviation of 70 bp, based on estimates of 454 paired read lengths from *T. cacao* (K Mockaitis, pers. comm.).

Simulated reads were generated as follows. To model non-uniformity in read coverage, each BAC sequence was divided into windows of 100 bp. Each window was first assigned a generating density of max{N(5,1), 0}, where N denotes the normal distribution, after which the densities were scaled so that they sum to 1 over all windows. Each read was then assigned to a window with probability equal to that window's density, and the read's starting position within the window was chosen at random.
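The window-based placement can be sketched in Python as follows. This is a minimal illustration, not the code used in the study: the function names `window_densities` and `sample_read_start` are hypothetical, the offset within a window is assumed to be uniform, and reads that would run past the end of the sequence are not handled because the text does not specify that case.

```python
import random

WINDOW = 100  # window size in bp, as stated in the text


def window_densities(seq_len, rng):
    """Draw max(N(5,1), 0) for each window and scale the values to sum to 1."""
    n_windows = (seq_len + WINDOW - 1) // WINDOW  # last window may be shorter
    raw = [max(rng.gauss(5.0, 1.0), 0.0) for _ in range(n_windows)]
    total = sum(raw)
    return [d / total for d in raw]


def sample_read_start(densities, seq_len, rng):
    """Pick a window with probability equal to its density, then an offset in it."""
    w = rng.choices(range(len(densities)), weights=densities, k=1)[0]
    lo = w * WINDOW
    hi = min(lo + WINDOW, seq_len)
    # Assumption: the start position is uniform within the chosen window.
    return rng.randrange(lo, hi)
```

Passing a seeded `random.Random` instance as `rng` keeps a simulated dataset reproducible.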

Substitution errors were introduced with a probability of 0.1% per nucleotide, and insertions and deletions each with a probability of 0.5% per nucleotide, with runs of the same nucleotide (homopolymer runs) of length at least 3 being more prone to errors. These error rates correspond to published estimates of 454 error profiles [9]. In more detail, the error injection protocol is as follows. The read sequence is processed one homopolymer run at a time (a run can have length 1 or more). A run of length at least 3 is assigned a higher probability of containing errors than a run of length 1 or 2. We define *totalErrorRate* as the sum of the desired substitution, insertion, and deletion error rates. The per-nucleotide probability of an error in a homopolymer run of length 1 or 2 is *a* × *totalErrorRate*, while for runs of length at least 3 the probability is *b* × *totalErrorRate* (the factors *a* = 0.8 and *b* = 1.6 were chosen empirically to yield approximately *totalErrorRate* = 1.1% in the read collection). If a nucleotide is determined to have an error, the error type is chosen according to the relative fractions of substitution, insertion, and deletion errors. The only restriction on error type is that once an insertion has occurred in a homopolymer run, only insertions and substitutions, and no deletions, can occur in the remaining positions of that run; likewise, once a deletion has occurred, no insertions can follow in that run.
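A possible reading of this protocol is sketched below in Python. It is an illustration under stated assumptions rather than the study's implementation: the function name `inject_errors` is hypothetical, reads are assumed to contain only A, C, G and T, a substitution is assumed to pick one of the three other bases uniformly, and an insertion is assumed to add one uniformly chosen base after the current one (the text does not say which base is inserted).

```python
import random
from itertools import groupby

# Per-nucleotide rates from the text: 0.1% substitutions, 0.5% insertions, 0.5% deletions.
RATES = {"sub": 0.001, "ins": 0.005, "del": 0.005}


def inject_errors(read, rng, rates=RATES, a=0.8, b=1.6):
    """Return a copy of `read` with errors injected one homopolymer run at a time."""
    total = sum(rates.values())  # totalErrorRate
    out = []
    for base, group in groupby(read):
        run_len = len(list(group))
        # Runs of length >= 3 use factor b, shorter runs use factor a.
        p_err = (b if run_len >= 3 else a) * total
        allowed = dict(rates)  # error types still permitted in this run
        for _ in range(run_len):
            if rng.random() >= p_err:
                out.append(base)  # no error at this position
                continue
            kinds = list(allowed)
            kind = rng.choices(kinds, weights=[allowed[k] for k in kinds], k=1)[0]
            if kind == "sub":
                out.append(rng.choice([c for c in "ACGT" if c != base]))
            elif kind == "ins":
                # Assumption: keep the base and add one random base after it.
                out.append(base)
                out.append(rng.choice("ACGT"))
                allowed.pop("del", None)  # no deletions later in this run
            else:
                # Deletion: emit nothing for this position.
                allowed.pop("ins", None)  # no insertions later in this run
    return "".join(out)
```

Removing the excluded error type from `allowed` renormalizes the remaining fractions within the run; whether the original procedure renormalized in the same way is not stated in the text.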

Paired reads were generated by first randomly choosing locations for inserts according to the window density described above. The mean insert size was 3 kbp with a standard deviation of 20 bp. Short sequences from each end of the insert were extracted to generate two reads, and the two reads were assigned labels indicating the index of the insert they originated from. Finally, errors were injected into the reads according to the procedure described above.
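The extraction of a read pair from one insert might look like the Python sketch below. Several details are assumptions made only for illustration: the function name `simulate_pair`, the label format `insert<index>/1` and `insert<index>/2`, a single `read_len` shared by both reads, and the guard that keeps the insert at least two read lengths long. Read orientation is not specified in the text, so both reads are taken from the forward strand here, and the error injection step is omitted.

```python
import random


def simulate_pair(reference, start, read_len, index, rng,
                  mean_insert=3000, sd_insert=20):
    """Cut one insert at `start` and return two labelled reads from its ends."""
    # Insert length drawn from N(3000, 20), per the stated mean and standard deviation.
    insert_len = max(2 * read_len, round(rng.gauss(mean_insert, sd_insert)))
    insert = reference[start:start + insert_len]
    left = insert[:read_len]    # read from the left end of the insert
    right = insert[-read_len:]  # read from the right end of the insert
    # Both labels carry the insert index so that mates can be re-associated.
    return (f"insert{index}/1", left), (f"insert{index}/2", right)
```

In a full simulation, `start` would be drawn from the window densities and each read would then pass through the error injection procedure.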

The minimum tiling path (MTP) BAC structure for the 3 Mbp pool was extracted from an actual rice tiling path (MSU v6.1; [15]). The statistics are as follows: min. BAC size 47 kbp, max. BAC size 191 kbp, mean BAC size 145 kbp, min. overlap between BACs 0.6 kbp, max. overlap between BACs 113 kbp, mean overlap between BACs 36 kbp. The same MTP was concatenated multiple times to cover the larger pools. Additional BACs were generated randomly across the reference sequence, and simulated 600–700 bp Sanger sequencing reads were extracted from their ends. The frequency of the additional BAC ends was such that one would be expected to occur every 20 kbp. Substitutions were injected at a rate of 0.006% per nucleotide, and insertions and deletions were each introduced with a probability of 0.0002% per nucleotide, in agreement with the estimate of Sanger per-base accuracy being as high as 99.999% [18]. In practice, any error rate below 0.1% is expected to yield less than one erroneous base per simulated 600–700 bp BAC end sequence (BES), since 0.1% of 700 bp is 0.7 bases.

### Scoring functions

Five characteristics that reflect fundamental aspects of assembly quality were scored: relocation, RL; inversion, I; redundancy, RD; match, M; and coverage, C (defined below). For each score, a value of 1 is best and 0 is worst. We computed the values of these scores by comparing the assembled pseudomolecule against the known reference sequence using BLAST version 2.2.15 (default parameters; [21, 22]). Only BLAST hits that were at least 1 kbp long were used in the analysis described below.

Relevant information, as described below, was extracted from the BLAST matches and stored in a data table $T$ of size $n$, where $n$ is the reference sequence length. Given a reference $R$ of length $n$ and an assembly $A$ of length $m$, for each position $i = 1, \ldots, n$ the value of $T[i]$ is the coordinate of the *closest* position $j$ in the assembly, $T[i] = \arg\min_{j} |i - j|$, such that there is a match between $R[i]$ and $A[j]$. A match on the reverse strand of the assembly indicates an inversion, and the value of $T[i]$ becomes $-j$. If no match exists for position $i$, then $T[i] = 0$. The length $l$ is defined as the number of reference positions that have a match to the assembly being evaluated.
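To make the definition of T concrete, the Python sketch below fills the table from a list of alignments. The input format is a simplifying assumption and not the BLAST output format: each hit is a hypothetical tuple `(ref_start, asm_start, length, strand)` with 1-based coordinates and no gaps, where `asm_start` is the smallest assembly coordinate of the hit. The function name `build_table` is likewise hypothetical.

```python
def build_table(n, hits):
    """Return T as a list of length n; T[i - 1] holds the entry for reference position i."""
    T = [0] * n  # 0 means "no match"
    for ref_start, asm_start, length, strand in hits:
        for k in range(length):
            i = ref_start + k
            # On the reverse strand the assembly coordinate runs backwards.
            j = asm_start + k if strand == "+" else asm_start + length - 1 - k
            current = T[i - 1]
            # Keep the assembly position closest to i, i.e. the one minimising |i - j|.
            if current == 0 or abs(i - j) < abs(i - abs(current)):
                T[i - 1] = j if strand == "+" else -j  # negative marks an inversion
    return T
```

For the hits `(1, 101, 5, "+")`, `(3, 4, 3, "+")` and `(8, 8, 2, "-")` on a reference of length 10, this returns `[101, 102, 4, 5, 6, 0, 0, -9, -8, 0]`: the second hit replaces the first at positions 3 to 5 because it lies closer, and the reverse-strand hit is stored with negative coordinates.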

**Relocation score, RL**, accounts for pairs of points that are in an incorrect order in the assembly with respect to the reference sequence. Because performing pairwise comparisons on millions of locations is computationally infeasible, we identify large relocation errors using a sampling approach. Table $T$ is sampled at $x$ locations (in the experiments reported herein $x = 10{,}000$), that is, a point from $T$ is sampled every $n/x$ positions. When the order of two sampled points $a$ and $b$ disagrees with the order of their values, e.g. $T[a] > T[b]$ but $a < b$, the number of disagreements, $\#d$, is increased for both $a$ and $b$, so each discordant pair contributes 2 to $\#d$. As there are therefore $p = x^{2} - x$ possible disagreements, the relocation score is normalized to $[0,1]$ by defining $RL = 1 - \#d/p$.
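The sampling scheme can be sketched in Python as follows. This is a direct, quadratic-time illustration with a hypothetical function name, `relocation_score`. The text does not state how unmatched entries (0) or inverted entries (negative values) are treated during the comparison, so this sketch simply compares the stored values as they are.

```python
def relocation_score(T, x=10_000):
    """RL = 1 - #d / (x^2 - x), computed on x evenly spaced samples of T."""
    n = len(T)
    x = min(x, n)  # cannot sample more points than there are positions
    step = n / x
    sample = [T[int(k * step)] for k in range(x)]
    d = 0
    for a in range(x):
        for b in range(a + 1, x):
            if sample[a] > sample[b]:
                d += 2  # the disagreement is counted once for a and once for b
    p = x * x - x  # maximum possible value of d
    return 1 - d / p if p else 1.0
```

A perfectly ordered table such as `[1, 2, 3, 4, 5]` scores 1.0 and a fully reversed one such as `[5, 4, 3, 2, 1]` scores 0.0 when sampled with `x=5`.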

**Inversion score, I**, denotes the fraction of the matching assembly positions that match the same strand on the reference sequence; inverted positions decrease the score. The inversion score is defined as $I = 1 - \sum_{i} \mathbf{1}_{i} / l$, where $\mathbf{1}_{i} = 1$ when $T[i] < 0$ and $0$ otherwise.
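As a small illustration, the inversion score can be computed from T in a few lines of Python. The function name `inversion_score` is hypothetical, and raising an error when nothing matches is a choice made here, since the text does not define the score for that case.

```python
def inversion_score(T):
    """I = 1 - (number of entries with T[i] < 0) / l, with l the number of matched positions."""
    l = sum(1 for t in T if t != 0)
    if l == 0:
        raise ValueError("inversion score is undefined when no position matches")
    inverted = sum(1 for t in T if t < 0)
    return 1 - inverted / l
```

For `T = [101, 102, 4, 5, 6, 0, 0, -9, -8, 0]` there are seven matched positions, two of them inverted, so the score is 1 - 2/7.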

**Redundancy score, RD**, penalizes any unnecessary content in an assembly, such as assembled portions of sequence that map to locations already covered by other portions of the assembly, and assembled portions of sequence that do not match the reference at all. An additional data table, $U$, was used for the redundancy score computations. Instead of recording the closest assembly position for each reference position, we store the *closest reference position*, $U[j] = \arg\min_{i} |i - j|$, such that there is a match between $R[i]$ and $A[j]$, for each assembly position $j$. The number of reference locations that occur exactly once among the entries in $U$, denoted $\#u$, gives the fraction of unique and useful content of the assembly through $RD = \#u/m$.
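A minimal Python sketch of this computation is given below. It assumes that U is a list of length m in which 0 marks an assembly position without a match, mirroring the convention used for T; the text does not say whether reverse-strand matches are stored as negative values in U, so the sketch takes absolute values to be safe. The function name `redundancy_score` is hypothetical.

```python
from collections import Counter


def redundancy_score(U):
    """RD = #u / m, where #u counts reference positions that occur exactly once in U."""
    m = len(U)
    # Count how many assembly positions point at each reference position.
    counts = Counter(abs(u) for u in U if u != 0)
    unique = sum(1 for c in counts.values() if c == 1)
    return unique / m
```

For `U = [1, 2, 3, 3, 0, 4]`, reference positions 1, 2 and 4 occur once, position 3 occurs twice, and one assembly position has no match, so the score is 3/6 = 0.5.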

**Match score, M**, is designed to reward long contiguous matches and to penalize gaps. We first segmented the reference length, $n$, into mutually non-intersecting segments: alternating matching regions, $u$ (where $T[i] \neq 0$), and gap regions, $v$ (where $T[i] = 0$), so that $\sum_{s} |u_{s}| + \sum_{t} |v_{t}| = n$. We assigned a reward, weighted by factor $\alpha$, for each matching segment, and assigned a penalty, weighted by factor $\beta$, for each gap segment. We defined the match score as $M = \frac{1}{\alpha + \beta} \left( \alpha \sum_{s} (|u_{s}|/n)^{\eta_1} + \beta \sum_{t} (|v_{t}|/n)^{\eta_2} \right)$, and we used parameter values $\eta_1 = \eta_2 = 2$ in our experiments. More emphasis can be placed on either the matches or the gaps by changing the parameter values in this equation.
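The segmentation into matching and gap regions can be sketched in Python as below. The code transcribes the formula exactly as it is printed in the text, including the sign of the gap term. The values of α and β are not given in the text, so the defaults `alpha=1.0` and `beta=1.0` are placeholders, and the function name `match_score` is hypothetical.

```python
from itertools import groupby


def match_score(T, alpha=1.0, beta=1.0, eta1=2, eta2=2):
    """Evaluate the printed formula for M over maximal runs of matches and gaps in T."""
    n = len(T)
    match_term = gap_term = 0.0
    # groupby splits T into maximal runs of matched (T[i] != 0) or unmatched entries.
    for matched, run in groupby(T, key=lambda t: t != 0):
        frac = len(list(run)) / n  # segment length as a fraction of the reference
        if matched:
            match_term += frac ** eta1
        else:
            gap_term += frac ** eta2
    return (alpha * match_term + beta * gap_term) / (alpha + beta)
```

With exponents of 2, one long segment contributes more than several short segments of the same total length, which is how contiguity enters the score.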

**Coverage, C**, is the fraction of reference positions that have a match in the assembly: $C = \sum_{i} \mathbf{1}_{i} / n$, where $\mathbf{1}_{i} = 1$ when $T[i] \neq 0$ and $0$ otherwise.
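For completeness, coverage follows from T in one line of Python; the function name `coverage_score` is hypothetical.

```python
def coverage_score(T):
    """C = (number of reference positions with T[i] != 0) / n."""
    return sum(1 for t in T if t != 0) / len(T)
```

The numerator is the length l used in the inversion score, so C equals l / n.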