WGP enables the construction of robust physical maps in wheat
With current sequencing technologies, high quality draft sequences of genomes that contain a high proportion of TEs, such as maize and wheat, can only be achieved by a clone-by-clone approach. At 17 Gb, the bread wheat genome remains costly to assemble into a high quality reference sequence, even with NGS technologies. It is therefore essential that the minimal tiling path used for sequencing originates from a robust and accurate physical map. Because Whole Genome Profiling is based on the assembly of identical sequence tags, it is potentially more robust than any of the previous techniques used for physical mapping in wheat (e.g. SNaPshot [10, 12]). Our results on a subset of about 1/3 of the largest wheat chromosome (3B, 1 Gb) demonstrate that, combined with an adapted assembly methodology, WGP offers a promising approach to construct robust and accurate physical maps of the wheat genome. The assembly methodology developed in this study enabled the construction of a physical map of equivalent size to the one obtained with the SNaPshot technology, but with 30% fewer contigs and 3.5 times fewer mis-assembled BACs. The stepwise stringent assembly methodology produced more robust and accurate results for the large and complex genome of wheat than the single step methodology that was originally employed for developing WGP on Arabidopsis thaliana. This implies that the WGP method can be adapted to any species by adjusting the assembly methodology and parameters.
The quality of a physical map assembly depends on the availability of sufficient information to establish contigs as well as the capacity to minimize the number of chimerical contigs. BAC library coverage and the density of bands/tags per BAC used to assemble contigs are the key factors for ensuring that adequate information is available. To date, contig assembly is done with the FPC software, which relies on the Sulston score. This score corresponds to the probability of coincidence, i.e. the probability that two non-overlapping clones share by chance a given number of bands or, in the case of WGP, sequence tags. The critical parameters are the tolerance selected to consider two bands/tags as identical and the cut-off value. The latter corresponds to the threshold of the Sulston score below which two clones are considered to overlap. At a given cut-off, the higher the number of bands/tags per clone, the lower the percentage of overlap needed to merge two clones [7, 21]. Thus, in WGP, the tag density directly affects the capacity to generate long contigs. Increasing the tag density should therefore enhance the capacity to merge contigs, decrease the number of contigs, and consequently increase the physical map quality. In the present work, 327,282 tags were deconvoluted before tag and BAC filtering, resulting in a density of 23 tags per BAC, approximately 2.5 times less than the value theoretically expected if all tags could be deconvoluted (i.e. reflecting the EcoRI restriction enzyme recognition site frequency in the wheat genome). By comparison, the density was 40 tags per BAC in Arabidopsis thaliana, and 26, 33, 22, and 25 tags per BAC in melon, tomato, Brassica napus, and lettuce, respectively.
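The probability of coincidence described above can be illustrated with a minimal Python sketch. It uses the binomial form in which the Sulston score is usually written; the function names and the example tolerance and gel length are illustrative assumptions, not values from this study, and for WGP tags the per-band match probability would have to be derived differently (it is therefore passed in as a separate argument).

```python
from math import comb


def band_match_probability(n_low, tolerance, gel_length):
    """Chance that one band of a clone matches any of the n_low bands of
    the other clone by coincidence (band-size model; assumed parameters)."""
    return 1.0 - (1.0 - 2.0 * tolerance / gel_length) ** n_low


def sulston_score(n_high, shared, p_match):
    """Probability that at least `shared` of the n_high bands of one clone
    match the other clone by chance (upper tail of a binomial)."""
    return sum(
        comb(n_high, k) * p_match**k * (1.0 - p_match) ** (n_high - k)
        for k in range(shared, n_high + 1)
    )


def min_shared_for_cutoff(n_high, p_match, cutoff):
    """Smallest number of shared bands/tags whose score is at or below the
    cut-off, or None if even a full match does not reach it."""
    for shared in range(n_high + 1):
        if sulston_score(n_high, shared, p_match) <= cutoff:
            return shared
    return None
```

For example, `min_shared_for_cutoff(23, band_match_probability(23, 7, 3300), 1e-11)` returns the number of shared bands that two 23-band clones would need under these assumed settings before they are merged at a cut-off of 1e-11.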
The lower tag density observed in wheat compared with Arabidopsis (and, to a lesser extent, tomato) may therefore reduce the ability to fully exploit the advantages of WGP, even though this density is comparable to those obtained in melon, Brassica napus, and lettuce following digestion with EcoRI and MseI. Although the generated WGP map is of high quality, we suggest three possible ways to further improve quality metrics and reduce tag loss in WGP. The first improvement is directed at limiting the loss of information at the deconvolution steps through optimized design of the pooling strategy. The defined pooling strategy is a trade-off between the costs of sample preparation and sequencing using the Illumina GAII technology on the one hand and the size of the region covered by the BAC library (genome complexity) on the other. The percentage of the genome covered in each pooling set has an impact on the tag loss at the deconvolution step. Indeed, at high coverage of the genome "in deconvolution space", there is a high probability that two or more BACs originating from the same region are present in the same pooling set. The WGP tags shared by these BACs will be lost in the deconvolution process as they are present in four or more BAC pools in a 2-D pooling scheme and six or more pools in a 3-D pooling scheme. In this study, the pooling strategy consisted of pooling sets of individual plates that each cover about 23% of the target region with a 3-D pooling scheme. If we had used the whole wheat chromosome 3B BAC library (56,952 clones) with the same 3-D pooling scheme, it is likely that fewer tags would have been lost, as each pooling set would have covered only 5.3% of the total BAC library. Thus, our selection of BACs based on prior information likely decreased the "effective genome size" and thereby increased the loss of tags by deconvolution.
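The tag loss mechanism described above can be sketched in Python as follows. This is a simplified illustration, not the deconvolution software used in the study: it assumes that a pooling set is addressed by plate, row, and column pools, and that a tag is assigned to a BAC only when it is seen in exactly one pool per dimension. The function name and input layout are hypothetical.

```python
from collections import defaultdict

DIMENSIONS = ("plate", "row", "column")


def deconvolute_3d(tag_hits):
    """Assign tags to BAC coordinates in a 3-D pooling set.

    tag_hits: iterable of (tag, dimension, pool_index) observations, where
    dimension is one of DIMENSIONS.
    Returns {tag: (plate, row, column)} for tags found in exactly one pool
    per dimension. Tags found in more pools (e.g. shared by two overlapping
    BACs of the same pooling set) are ambiguous and are dropped.
    """
    pools = defaultdict(lambda: {d: set() for d in DIMENSIONS})
    for tag, dimension, pool_index in tag_hits:
        pools[tag][dimension].add(pool_index)

    assigned = {}
    for tag, seen in pools.items():
        if all(len(seen[d]) == 1 for d in DIMENSIONS):
            assigned[tag] = tuple(next(iter(seen[d])) for d in DIMENSIONS)
    return assigned
```

In this sketch, a tag observed in two plate pools, two row pools, and two column pools (six pools in total) cannot be traced back to a single BAC and is lost, which is the situation that becomes more frequent as the pooling set covers a larger fraction of the target region.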
Wheat chromosome sizes range from 600 Mb to 1 Gb, while chromosome arms, currently used to construct the wheat physical maps in the framework of the IWGSC (http://www.wheatgenome.org), range from 230 Mb to 580 Mb in size. Thus, if WGP is applied with a 3-D pooling scheme to the wheat chromosome arms, the pooling sets will cover between 9% and 23% of the genome/chromosome arm size. For the larger chromosome arms, fewer tags are expected to be lost at the deconvolution step. Another method for reducing tag loss due to a high coverage percentage of the pooling set would be to increase the number of dimensions of the pooling set to 4, 5, or 6. However, this would significantly increase the cost of the experiment and the amount of sequence information needed, which scales linearly with the number of dimensions (i.e. a 4-D scheme requires twice as much sequence information as a 2-D scheme). Thus, prior to each WGP project, the most cost-efficient pooling scheme needs to be determined on the basis of the effective genome size.
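The arithmetic behind these percentages can be made explicit with a short Python sketch. The pooling set size of about 53 Mb is an assumption: it is back-calculated from the percentages quoted in the text (5.3% of the 1 Gb chromosome 3B, 23% of a 230 Mb arm, 9% of a 580 Mb arm) and is not a figure reported in the study. The function names are hypothetical.

```python
# Assumed genome-equivalent size of one pooling set, in Mb (back-calculated
# from the percentages in the text, not a reported value).
POOLING_SET_MB = 53.0


def pooling_set_coverage(region_mb, pooling_set_mb=POOLING_SET_MB):
    """Fraction of a target region covered by one pooling set."""
    return pooling_set_mb / region_mb


def relative_sequencing(n_dimensions, reference_dimensions=2):
    """Sequence information needed relative to a reference scheme, assuming
    linear scaling with the number of pooling dimensions."""
    return n_dimensions / reference_dimensions


for region_mb in (230, 580, 1000):
    print(f"{region_mb} Mb region: {pooling_set_coverage(region_mb):.1%} per pooling set")
```

Under the same linear-scaling assumption, `relative_sequencing(4)` gives a factor of 2 relative to a 2-D scheme, which matches the example given in the text.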
Another improvement is to reduce the loss of information caused by an excessive number of identical tags in a pooling set by increasing the length of sequence reads. In Arabidopsis, the percentage of genome coverage for the pooling set was higher (40%) than in this study (23%), yet fewer tags were lost at the deconvolution step in Arabidopsis than in wheat. This suggests that a higher percentage of identical tags originating from different regions were found in the same pooling sets in wheat compared to Arabidopsis, thereby limiting the probability of deconvoluting them and, as a consequence, reducing the number of unique tags per BAC. This hypothesis is supported by the fact that the observed average number of BACs sharing the same tags was 2.9 instead of the expected 9.6 (i.e. BAC library coverage of the genome). The likely explanation for this observation is that, since the wheat genome is many times (~120 fold) larger than that of Arabidopsis and consists of a large amount of repeated sequences, the tag length may not be optimal to avoid these confounding effects. Van Oeveren et al., using a simulation on the maize genome, indicated that tag lengths of 26 to 31 nt should be sufficient for WGP, even for large genomes. For wheat, we suggest that 30 nt is sufficient to build robust physical maps but is not optimal to fully exploit WGP. To evaluate the potential of longer reads to decrease tag loss, we calculated an index of k-mer frequencies on a 1X coverage sequence of the wheat genome (http://www.cerealsdb.uk.net) with k-mer sizes between 15 and 70 nt (data not shown). This showed that 71.1% of the 30-mers (corresponding to the length of a tag in this study) are unique but that increasing the tag length to 70 nt would improve the tag uniqueness up to 81.3%.
Thus, ongoing improvements in the read length and quality of NGS technologies will likely provide opportunities to further minimize tag loss and thereby improve the potential of WGP in wheat in the near future.
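A k-mer uniqueness measure of the kind discussed above can be sketched in Python as follows. The exact index used in the study is not described here, so this sketch makes its own assumptions: uniqueness is defined as the fraction of k-mer occurrences whose k-mer appears exactly once in the sequence, only the forward strand is counted, and k-mers containing ambiguous bases are skipped. The function name is hypothetical.

```python
from collections import Counter


def unique_kmer_fraction(sequence, k):
    """Fraction of k-mer occurrences in `sequence` whose k-mer occurs once.

    Forward strand only; k-mers containing characters other than A, C, G,
    or T are ignored. Returns 0.0 if no valid k-mer exists.
    """
    seq = sequence.upper()
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= set("ACGT")
    )
    total = sum(counts.values())
    if total == 0:
        return 0.0
    unique = sum(1 for count in counts.values() if count == 1)
    return unique / total
```

Calling this function for a range of k (for instance 15 to 70) on the same sequence shows how uniqueness changes with tag length. On real data the sequence would have to be processed in a memory-efficient way, which this sketch does not attempt.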
Finally, a third possibility for increasing tag density would be to choose an endonuclease that recognizes more abundant restriction sites in the wheat genome. The EcoRI enzyme used in this study shows a frequency of 1 site every 4.7 kb in wheat (based on the analysis of the reference sequence set of 18 Mb). We recently observed in the same dataset that HindIII, also a 6 bp-cutter enzyme, shows a site frequency of 1 every 2.5 kb, with the difference mainly caused by the composition of the TE fraction (unpublished data). Selection of alternative restriction enzymes may therefore also be used to fine-tune the performance of WGP in wheat.
The second important parameter for ensuring a high quality physical map is to limit as much as possible the number of chimerical contigs. Here, we estimated that, at the final cut-off value of 1e-11, the number of chimerical contigs is 0.6 per 10 Mb of sequence. Van Oeveren et al. have developed a methodology and a tool to identify chimerical contigs on the basis of the fraction of BAC pairs within a contig sharing at least one tag (C1) and the average tag density in a contig (C2). The authors empirically determined a threshold for which the square of C1 divided by C2 provided a value that discriminated between chimerical and non-chimerical contigs. Problematic BACs can then be identified and discarded by iteratively removing each BAC of the contig and testing whether its removal breaks up the contig [23]. We tested this approach on the wheat chromosome 3B dataset but it did not detect any of the chimerical contigs identified by comparison with the reference sequences. Moreover, only two contigs were identified as chimerical in the whole dataset with this approach, whereas 14 were expected based on our estimate of the number of chimerical contigs per 10 Mb. The threshold used to discriminate between chimerical and contiguous contigs was defined from the WGP experiment on Arabidopsis, and it is likely that new parameters need to be established for wheat, reinforcing the idea that parameters in the WGP analysis need to be adapted to the complexity of the target genome. With our dataset, we did not have sufficient sequence information to estimate a robust threshold value for wheat. The access to the entire 3B sequence in the near future (C. Feuillet, pers. comm.) will help in this regard.
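The C1 and C2 quantities described above can be computed as in the following Python sketch. It is an illustration under stated assumptions, not the published tool: "average tag density" is interpreted here as the mean number of tags per BAC in the contig, the function name and input layout are hypothetical, and no threshold is hard-coded because the text does not give one.

```python
from itertools import combinations


def chimera_statistic(contig):
    """Compute (C1, C2, C1**2 / C2) for one contig.

    contig: dict mapping BAC name -> set of WGP tags assigned to that BAC.
    C1: fraction of BAC pairs in the contig sharing at least one tag.
    C2: mean number of tags per BAC (assumed meaning of tag density).
    """
    tag_sets = list(contig.values())
    pairs = list(combinations(tag_sets, 2))
    if not pairs:
        raise ValueError("a contig needs at least two BACs")
    c1 = sum(1 for a, b in pairs if a & b) / len(pairs)
    c2 = sum(len(tags) for tags in tag_sets) / len(tag_sets)
    if c2 == 0:
        raise ValueError("no tags assigned to any BAC in this contig")
    return c1, c2, c1**2 / c2
```

The resulting statistic would then be compared with an empirically determined threshold. As discussed above, a threshold calibrated on Arabidopsis did not transfer to the wheat 3B dataset, so any value used here would have to be re-estimated for the target genome.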
WGP tag integration improves low quality sequence assemblies and supports pooling strategies for achieving high quality sequence drafts
In addition to providing a robust physical map, WGP holds the potential of facilitating sequence assembly in whole genome or chromosome shotgun sequencing approaches. To date, with the current sequencing technologies, a whole genome or whole chromosome shotgun approach cannot be used to produce a high quality draft sequence of the wheat genome. Here, we wanted to investigate whether WGP can be used to further empower the BAC-by-BAC approach adopted by the IWGSC to obtain the reference sequence of each of the 21 individual chromosomes from the cv. Chinese Spring. Specifically, we wanted to see if WGP can reduce the costs of sequencing by providing a more robust physical framework and generating additional data that can support BAC contig sequence assemblies. The first BAC contig sequencing results obtained on chromosome 3B using the physical map established with the SNaPshot technology suggested that about 10% of the BACs were mis-assembled (unpublished data), and this was confirmed in our study (estimated mis-assembled BACs in the SNaPshot map: 9.5%). In a BAC-by-BAC approach, the mis-assembled BACs identified at the sequence assembly step need to be replaced by other BACs, thereby increasing sequencing cost. By providing a physical map with fewer than 3% mis-assembled BACs (2.7%), WGP would thus decrease the cost of sequencing by ~7% compared to a physical map constructed with the SNaPshot approach.
The assembly simulations indicated that with a low coverage/low cost sequencing approach (i.e. not based on the production of paired-end or mate-pair libraries) sequence assemblies can be improved significantly by integrating sequence tags produced by WGP. These assemblies contain, however, a significant amount of ambiguities in the contig order and a lack of knowledge about the percentage and distribution of gaps (at 25X and lower coverage, gaps represented about 50% of the assembled scaffolds). While such a sequence cannot be considered a reference sequence, it can be used to develop molecular markers and perform preliminary comparative analysis. Assemblies with paired-end reads produce scaffolds for which the percentage and distribution of gaps can be estimated. At high sequence coverage (≥ 25 fold), such paired-end reads produce reliable assemblies whose quality cannot be significantly improved by integrating WGP tags at the density produced in this study. In contrast, at low sequencing coverage (15X and 20X), WGP tag integration improves the assemblies by facilitating the construction of long superscaffolds. However, such superscaffolds can include up to 15% of mis-ordered scaffolds that contain gaps of up to 30%. This type of sequence is comparable in quality to the sequence obtained at high coverage levels with unpaired reads. Thus, the complexity of the wheat genome, with its large and numerous transposable elements with highly similar sequences, makes it impossible to produce a high quality reference sequence (≤ 20% of gaps in the sequence scaffolds and N90 > 30 kb) without both paired-end information and a minimum coverage of 25X. Currently, WGP sequence tags produced during physical mapping are therefore helpful to link sequence scaffolds, but they cannot be used to decrease the sequencing coverage necessary to obtain a high quality assembly in such a complex genome.
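The two quality criteria quoted above (gap percentage and N90) can be computed as in the following Python sketch. It assumes that gaps in scaffolds are encoded as runs of "N" and that the gap criterion applies to the scaffold set as a whole; both are assumptions made for illustration, and the function names are hypothetical.

```python
def n_stat(lengths, fraction=0.9):
    """Length L such that sequences of length >= L contain at least
    `fraction` of the total assembly length (N90 for fraction=0.9)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length
    raise ValueError("no sequence lengths given")


def gap_fraction(scaffolds):
    """Fraction of all scaffold bases that are gap characters ('N')."""
    total = sum(len(s) for s in scaffolds)
    gaps = sum(s.upper().count("N") for s in scaffolds)
    return gaps / total


def meets_criteria(scaffolds, max_gap=0.20, min_n90=30_000):
    """Check the thresholds quoted in the text: <= 20% gaps, N90 > 30 kb."""
    lengths = [len(s) for s in scaffolds]
    return gap_fraction(scaffolds) <= max_gap and n_stat(lengths, 0.9) > min_n90
```

The default thresholds are those stated in the text; they can be overridden to explore how a given assembly behaves under less stringent criteria.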
In our opinion, the greatest potential for WGP is the possibility of increasing the degree of pooling in sequencing projects based on BAC pools, a strategy that has been proposed to reduce sequencing costs in large genomes and is currently being utilized for sequencing chromosome 3B of bread wheat. In this case, the assembly of sequence reads from BAC pools of two or more unrelated physical contigs leads to sequence contigs (without paired-end) or scaffolds (with paired-end) that need to be reassigned to their respective physical contig of origin. WGP can provide the information needed for this assignment. In the 3B project, 52% of the 924 pools of the minimal tiling path used for 454 GS FLX Titanium sequencing contain two or more physical contigs (unpublished data). The 327,282 WGP tags generated in this pilot study will be helpful for assigning and ordering the sequence contigs or scaffolds produced from these pools.
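The assignment step described above can be sketched in Python as a simple tag-voting procedure. This is an illustrative assumption about how such an assignment could be done, not the method used in the 3B project: each sequence contig or scaffold is searched for exact matches to the WGP tags of each candidate physical contig on either strand, and it is assigned to the physical contig with the most matching tags. The function names are hypothetical.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def assign_to_physical_contig(sequence, tags_by_physical_contig):
    """Return the physical contig whose tags best match `sequence`.

    tags_by_physical_contig: dict mapping physical contig id -> set of tags.
    A tag counts as a vote if it, or its reverse complement, occurs in the
    sequence. Returns None when no tag matches or when the best two
    candidates are tied.
    """
    seq = sequence.upper()
    votes = Counter()
    for contig_id, tags in tags_by_physical_contig.items():
        votes[contig_id] = sum(
            1 for tag in tags
            if tag.upper() in seq or reverse_complement(tag.upper()) in seq
        )
    ranked = votes.most_common(2)
    if not ranked or ranked[0][1] == 0:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]
```

Exact substring matching is the simplest possible choice and ignores sequencing errors. A production approach would more likely index the tags and tolerate mismatches, and the positions of the matched tags could additionally be used to order the scaffolds along the physical contig.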