A genetically anchored physical framework for Theobroma cacao cv. Matina 1-6

Background The fermented dried seeds of Theobroma cacao (cacao tree) are the main ingredient in chocolate. World cocoa production was estimated to be 3 million tons in 2010 with an annual estimated average growth rate of 2.2%. The cacao bean production industry is currently under threat from a rise in fungal diseases including black pod, frosty pod, and witches' broom. In order to address these issues, genome-sequencing efforts have been initiated recently to facilitate identification of genetic markers and genes that could be utilized to accelerate the release of robust T. cacao cultivars. However, problems inherent with assembly and resolution of distal regions of complex eukaryotic genomes, such as gaps, chimeric joins, and unresolvable repeat-induced compressions, have been unavoidably encountered with the sequencing strategies selected. Results Here, we describe the construction of a BAC-based integrated genetic-physical map of the T. cacao cultivar Matina 1-6 which is designed to augment and enhance these sequencing efforts. Three BAC libraries, each comprised of 10× coverage, were constructed and fingerprinted. 230 genetic markers from a high-resolution genetic recombination map and 96 Arabidopsis-derived conserved ortholog set (COS) II markers were anchored using pooled overgo hybridization. A dense tile path consisting of 29,383 BACs was selected and end-sequenced. The physical map consists of 154 contigs and 4,268 singletons. Forty-nine contigs are genetically anchored and ordered to chromosomes for a total span of 307.2 Mbp. The unanchored contigs (105) span 67.4 Mbp and therefore the estimated genome size of T. cacao is 374.6 Mbp. A comparative analysis with A. thaliana, V. vinifera, and P. trichocarpa suggests that comparisons of the genome assemblies of these distantly related species could provide insights into genome structure, evolutionary history, conservation of functional sites, and improvements in physical map assembly. A comparison between the two T. cacao cultivars Matina 1-6 and Criollo indicates a high degree of collinearity in their genomes, yet rearrangements were also observed. Conclusions The results presented in this study are a stand-alone resource for functional exploitation and enhancement of Theobroma cacao but are also expected to complement and augment ongoing genome-sequencing efforts. This resource will serve as a template for refinement of the T. cacao genome through gap-filling, targeted re-sequencing, and resolution of repetitive DNA arrays.


Background
Theobroma cacao (cacao tree) is the source of the world's cocoa butter and cocoa powder, key ingredients in chocolate. T. cacao is a short, tropical tree that is grown in multiple countries including Côte d'Ivoire, Ghana, Indonesia, Nigeria, Brazil, Cameroon, Ecuador, Colombia, Mexico, and Papua New Guinea where cacao beans are an important cash crop. World cocoa production was estimated to be 3 million tons in 2010 and had an annual estimated average growth rate of 2.2% from 1998 to 2010 (http://www.worldcocoafoundation.org/ learn-about-cocoa/cocoa-facts-and-figures.html). Cacao bean production is currently under threat from several sources including a rise in the incidence of fungal diseases including black pod, frosty pod, and witches' broom [1,2]. In order to address these issues, multiple genetic and genomic efforts have been initiated in the last decade to identify genetic markers and genes that could be utilized to accelerate the release of robust T. cacao cultivars [3][4][5][6][7][8][9][10]. These efforts have recently culminated in whole-genome shotgun assemblies of the genomes of two T. cacao cultivars: Criollo [11] and . These genome sequences will greatly assist T. cacao breeding efforts [10] as well as contribute to our basic knowledge of tree and dicot biology through comparisons with a growing collection of genome sequences from trees and dicotyledonous plants.
There are primarily two approaches to sequence a genome: the BAC-by-BAC approach where libraries of clones with large inserts (e.g. BACs) are randomly sequenced and ordered relative to a minimum tile path (MTP) such as has been used to sequence the genomes of rice and maize [13,14], or the whole-genome shotgun (WGS) sequencing of genomic DNA, carried out without cloning, from genomic libraries with multiple insert sizes as implemented for Western poplar, grapevine, and sorghum genomes [15][16][17]. In the BAC-by-BAC approach, the MTP is obtained through the construction of a physical map via DNA fingerprinting: each BAC clone is digested with restriction enzymes [18][19][20] and contigs are generated based upon shared DNA signatures. DNA signatures, comprised of a BAC's repertoire of fragment sizes, are determined by algorithms incorporated into software such as Fingerprinted Contigs (FPC) [18,21]. FPC-based physical maps can be validated and improved by incorporating information obtained by hybridizing BAC clones with specific probes which can include probes that have been genetically mapped; the ordering of contigs on physical maps can thus be facilitated by using genetically derived mapping information [22]. Furthermore, BAC ends can be sequenced and serve as well-spaced markers for determining the accuracy of shotgun read assemblies. Using a strategy that integrates physical mapping and genetic mapping reduces the number of chimeric contigs and increases overall confidence in the final assembly (whether BAC-by-BAC or WGS). In any strategy, it is important to have a high quality reference assembly available for annotation and re-sequencing endeavours.
In this study, we describe the construction of a BACbased integrated genetic-physical map of the genome of T. cacao cv. Matina 1-6 (estimated genome size = 374 Mbp). We also demonstrate the utility of the map by comparing it with other plant genomes. This map serves as an important reference for T. cacao genomes being sequenced; it can be used to establish the accuracy of those genome sequences prior to their use for applications such as SNP discovery, RNAseq, ChiPseq, and other techniques.

Results
Physical map construction T. cacao genomic DNA was first cleaved into large fragments for cloning into vectors that could accommodate large inserts. Three different restriction endonucleases, HindIII, MboI, and EcoRI, were utilized to partially digest genomic DNA samples which were then used to construct three T. cacao BAC libraries (TCC_Ba, TCC_Bb, TCC_Bc, respectively) using methods that have been previously described [23]. These libraries were then used to construct the physical map. 36,864 clones were arrayed for each of the three complementary BAC libraries, which we estimate provides 10× genome coverage and, therefore, these libraries collectively represent approximately 30 T. cacao genome equivalents. The average insert size was 138 kb and the clones are arrayed in 288, 384-well microtiter plates as summarized in Table 1.
In order to determine BAC order and orientation, we characterized all BACs in the libraries using high resolution restriction band fingerprinting. Using previously published methods [24], a total of 108,288 BAC clones from all three BAC libraries were subjected to high information content fingerprinting (HICF) after addition of control clones in the E07 and H12 wells of the 96-well offset to maintain data uniformity. Data comprising fragment sizes were captured by capillary electrophoresis [24]. After removal of clones containing less than 20 fragments or greater than 220 fragments and empty vectors, 95,386 clones (88% of the original total), with an average of 114.2 fragments per clone, were successfully assigned a digitized fingerprint (Table 1) to use for assembly carried out using the FPC software [25].
To obtain a high-quality commensurate build, fingerprints from all three BAC libraries were combined for contig assembly using a stringent cutoff of 1 e -80 and a tolerance of 3. This base assembly resulted in 445 contigs harbouring 86,590 clones and 8,806 singletons. The DQer function of FPC was run to identify and break up contigs comprised of greater than 10% questionable clones (a sign of potential false joins). The Singles-to-Ends function and the Ends-to-Ends function were run at a final cutoff of 1 e-50 using automatic merges to join singletons to contig ends and contigs to contigs, respectively. Additionally, manual curation of the physical map was performed based on results collected from the integration of the genetic recombination map and synteny mapping (described below) at a cutoff as low as 1 e -25 . Assuming a cumulative average insert size of 138 kb and an estimated T. cacao genome size of 440 Mbp, the consensus band (CB) estimation equates to an average of 1,210 bp per band.
The final T. cacao physical map totaled 154 contigs containing 91,117 BACs (96%) and 4,268 singletons (4%) ( Table 2). Of the 154 contigs, 74 contigs are comprised of fewer than 5 BACs, 5 contigs are comprised of between 6 and 15 BACs, 5 others of between 18 and 92, and 69 are comprised of greater than 100 BACs per contig ( Figure 1). Size estimations of consensus bands were converted to base pairs in the summary of the T. cacao physical map assembly and for estimates of contig lengths (Tables 2 and 3).

Integration of genetic markers and sequence tagged sites (STS) onto the physical map
To assimilate genetic and physical maps and integrate conserved ortholog set (COS) sequences onto the T. cacao physical framework, overgo probes were designed from genetically mapped simple-sequence repeat markers (SSR) originating from EST sequences [5,26] and [6] COSII sequences derived from Arabidopsis thaliana [27], respectively. The remaining 95 COSII sequences were placed onto the physical map and their linkage groups inferred by contig placement. Overgo probes were anchored to the physical map using a 3-dimensional pooled hybridization approach following the method of Fang et al. [28]. Briefly, overgo probes were pooled using a pooling strategy in which 125 probes were hybridized to the three T. cacao BAC libraries. Based on single location integration, 96% of the markers were accurately placed on the physical map and these markers were used to anchor and orient 49 contigs as 10 pseudomolecules (Table 3). Where ordered contig ends fail to overlap, an arbitrary 250 kb addition was made and annotated as gaps between contigs and is visualized using the CMAP comparative map viewer ( Figure 2; [29]); this gap addition was not calculated as part of any physical map statistics. A framework file was created in FPC to anchor BAC contigs to chromosomes using results obtained from integration of 230 genetic recombination markers [30] (Additional file 1: Table S1, Additional file 2: Table S2). The framework function of FPC was used to anchor BAC contigs to chromosomes, as well as order and renumber them, which resulted in 49 contigs assigned, ordered, and oriented to the ten T. cacao linkage groups; 105 contigs remained unanchored, most of which contain a small number of BACs. Specific contigs (anchored and unanchored) are described in Additional file 3: Table S3 and Additional file 4: Table S4.
Dense minimum tile path (MTP) selection for BAC-end sequencing BAC-end sequences (BES) corresponding to a fingerprint in the physical map serve as long-range, paired sequence anchor points which can be used to facilitate alignments to other genomes and integration of draft sequence contigs. Ideally, a pair of BAC-end sequences assigned to each BAC in the physical map will provide a robust array of Sequence Tagged Sites (STS) for draft genome anchoring, genome exploitation, etc. However, the T. cacao physical map assembled into a very few, long contigs built from nearly 30× BAC coverage; therefore, we determined a BES for every clone was not necessary but rather an end sequence every 7-8 kb. For the T. cacao physical map, the median number of Consensus Band (CB) units per BAC was 120 CBs, the average BAC insert size was 138 kb, and the desired distance in bp between sequence anchor points was 8 kb. The resulting n-value was approximately 7 CB  A total of 58,766 sequencing reactions were performed that included both ends, 52,966 (90%) of which were successfully trimmed and a total of 49,984 of those were part of a successful pair. This moderately dense array of BACend sequences will serve as sequence anchor points for comparative genomics and long-range sequence pairs for draft genome sequence integration.
Alignment of the T. cacao physical map with V. vinifera, P. trichocarpa, and A. thaliana genomes As more genomes are sequenced, comparative genomics is becoming a more readily applicable approach to studying genomic architecture, gene function, and genome evolution across species. Recently, Argout et al. [11]   published findings into the paleohistory of T. cacao (cv. Criollo) by looking at orthologous genes between T. cacao (cv. Criollo) chromosomes and Vitis vinifera, Arabidopsis thaliana, Populus trichoptera, Glycine max, and Circa papaya. We used a similar synteny-based approach to gain insight into T. cacao (cv. Matina 1-6) genome structure and evolutionary history; we compared the genetically integrated T. cacao physical map to the V. vinifera, P. trichocarpa, and A. thaliana genomes using 52,966 BAC-end sequence anchor points tied to 49 genetically anchored BAC contigs, 230 genetically mapped framework markers, 6 mapped AtCOSII markers, and 95 AtCOSII anchored markers for each of the alignments below. The alignments and visualizations were performed with the Symap software [31,32]. A detailed review of the synteny computing algorithm used can be found in Soderlund et al. [31]; briefly, the BES and marker sequences were filtered for repeats aligned to the corresponding genomes. A total of 13,807 BES hits anchored to 60 fingerprint contigs, covering approximately 73% (321.5 Mbp) of the V. vinifera genome (Table 4; Additional file 5: Table S5), were aligned between the T. cacao physical map and the V. vinifera genome. The average percent identity (% identity) of the alignment of the BES-associated contigs ranged from 74% to 100%. A total of 101 synteny blocks were identified, 27 of which were less than 1 Mbp long, 43 were between 1 and 3 Mbp in length, and 31 were greater than 3 Mbp (Table 5). With P. trichocarpa, 10,731 BES hits and 57 marker sequences were anchored to the 63 contigs that could be aligned, covering approximately 78% (403.5 Mbp) of the P. trichocarpa genome (Table 4; Additional file 6: Table S6). The percent identity of the BES alignments with P. trichocarpa ranged from 75% to 100%. A total of 187 synteny blocks were identified, 87 of which were less than 1 Mbp, another 61 were between 1 and 3 Mbp, and 39 were greater than 3 Mbp (Table 5).
Approximately 44% (75.9 Mbp) of the A. thaliana genome was covered (Table 4; Additional file 7: Table S7;  Table 5) as a result of 5,332 BES hits and 59 marker sequences anchored to 48 contigs. It is important to note here that even though 96 COSII sequences were used as anchor points, only 59 of them were considered a homologous match in A. thaliana; the unmatched COSII sequences likely were flanked by non-homologous BES that did not meet the criteria set to be considered a syntenic region. The percent identity of the sequence matches between A. thaliana and T. cacao ranged from 75% to 95% and a total of 68 synteny blocks were identified, 50 of which were less than 1 Mbp, 14 ranged between 1 and 3 Mbp, and 4 were greater than 3 Mbp ( Table 5). Alignments of our physical map to these three genomes are consistent with T. cacao's closest relative being P. trichocarpa, followed by V. vinifera and then A. thaliana, the most distant relative of the three. Structural details are discussed below.
In order to visualize synteny relationships, whole-genome dot plots and circos plots [32] were created to visualize genome structure and collinearity between T. cacao and the three other genomes (Figures 3 and 4). As described above, the most syntenic blocks were identified between T. cacao and P. trichocarpa followed by those with V. vinifera, and then A. thaliana. The longest stretch of collinearity is between T. cacao and P. trichocarpa and spans 26 [11]. We observed evidence of chromosomal fusion events as well through comparisons of the T. cacao physical map to V. vinifera, P.trichoptera, and A. thaliana as shown in the circos plots ( Figure 4). For example, P. trichocarpachromosomes 2, 5, and 14 may have fused to form T. cacao (Matina 1-6) chromosome 1. V. vinifera chromosomes 4,
To determine if a physical map is sufficient for investigating ancestral paleo-polyploidy events, we looked at the detailed alignments between T. cacao and V. vinifera chromosomes. We observed that V. vinifera chromosomes 1, 14, and 17 align to T. cacao (Cv. Matina 1-6) chromosomes 2, 3, and 4; V. vinifera chromosomes 2, 12, 15, and 16 align to T. cacao (Cv. Matina 1-6) chromosomes 1, 3, and 5; V. vinifera chromosomes 3, 4, 7, and 18 align to T. cacao chromosomes 1, 2, and 8; V. vinifera chromosomes 4, 9, and 11 align to T. cacao 6, 8, and 9; V. vinifera chromosomes 5, 7, and 14 align to T. cacao 1, 4, and 5; V. vinifera chromosomes 6, 8, and 13 align to T. cacao 5, 9, and 10; V. vinifera chromosomes 10, 12, and 19 align to T. cacao 1, 6, and 7 ( Figure 4). These observations are nearly identical to those of Argout et al. [11] and also confirm evidence of ancestral triplicated chromosome groups reported for V. vinifera [15]. These results suggest that a BAC-based physical map with relatively evenly spaced BAC-end sequence anchor points can have immediate utility, depending on the amount of collinearity that exists between it and other genomes of interest, for interrogating agriculturally important genes and gene families and elucidating evolutionary origins prior to the availability of a high quality reference genome.
Alignment of the T. cacao cv. Matina 1-6 physical map with the T. cacao cv. Criollo genome assembly After the T. cacao cv. Matina 1-6 physical map was constructed the T. cacao cv. Criollo genome assembly became available [11]. Alignment of the T. cacao cv. Matina 1-6 physical map with the T. cacao cv. Criollo genome assembly [11] identified a total of 37,310 BES matches and 194 marker sequences anchored to 87 contigs covering approximately 65% (318.3 Mbp) of the Criollo genome (Table 4; Additional file 8: Table S8). The average percent identity of the sequence matches ranged from 83% to 99%. A total of 112 synteny blocks were identified, 56 of which were less than 1 Mbp in length, 25 ranged from 1 to 3 Mbp, and 31 were longer than 3 Mbp (Table 5). Alignment of the genome sequences of the Matina 1-6 cultivar with those of the Criollo cultivar revealed a very close  alignment, as expected, which validates the comparative genomic software and methodology. While alignment of the Matina 1-6 anchored physical map contigs to the Criollo genome sequence revealed long stretches of collinear sequence (Figure 4; Figure 5A), there were also many instances of rearrangements by either inversion or translocation and instances of duplication ( Figure 5). Segments of Criollo chromosome 1 are duplicated on Matina 1-6 chromosomes 4, 8, and 9, for example. There are also segments of Criollo chromosome 4 and chromosome 5 that are duplicated on Matina 1-6 chromosome 10 and chromosomes 2 and 10, respectively. There are segments on Criollo chromosome 7 that are duplicated on Matina 1-6 chromosome 2. There is a segment of Criollo chromosome 8 that is duplicated on Matina 1-6 chromosome 1 and 2. An example of an inverted genomic segment resides near the telomere on Criollo chromosome 1 ( Figure 5A) and another on chromosome 5 ( Figure 5A and 5B). Two potential sequence translocations are located on Criollo chromosomes 1, 5, and 7, reside on chromosomes 4, and 2 in Matina 1-6, respectively, but further support is necessary to confirm these due to low resolution in these areas. As an example of using a comparative approach to elucidate the structure of the Matina 1-6 genome, Figure 5C illustrates duplicated sequences from Matina 1-6 linkage group 10 that match sequences on Criollo chromosomes 4, 5, 8, and 10. Figure 5B shows a 2D comparison of Matina 1-6 chromosome 10 and Criollo chromosome 10. These chromosomes are apparently highly similar in sequence but quite rearranged. The longest conserved segment of contiguous collinearity between the sequences of the two cultivars occurs on chromosome 2 and spans approximately 15.3 Mbp. The most noncongruent chromosome between the two cultivars is chromosome 3 and the most rearranged is chromosome 9. Investigating these structural differences could reveal the underlying biological mechanisms that directed these events from an evolutionary standpoint. Assembly accuracy of both the Matina 1-6 physical map and the Criollo draft assembly should also be verified.

Integration of unanchored contigs using collinearity with other genomes
In an effort to improve the T. cacao cv. Matina 1-6 physical map, we examined structural similarities found in the genomes of related species that might suggest possible linkage group assignments for unanchored contigs in the map. Unincorporated Matina 1-6 FPC contigs were assessed for linkage group assignment based on anchoring to regions of the V. vinifera, P. trichocarpa, A. thaliana, and T. cacao cv. Criollo chromosomes where other T. cacao cv. Matina 1-6 anchored BAC contigs were aligned. These alignments resulted in linkage group predictions for 11 unanchored contigs (Additional file 9: Table S9). As more genomes are sequenced, a comparative approach to refining existing physical maps will become more effective.

Overview of the cacao physical map
We constructed three BAC libraries together represent-ing~30× coverage of the T. cacao genome (publicly available via http://www.genome.clemson.edu) and used them to create the first whole-genome physical map for T. cacao cv. Matina 1-6 (publicly available via http:// www.cacaogenomedb.org). This map is genetically anchored and enables cacao cultivar improvement through efforts such as positional cloning and regionspecific analysis through sub-genome sequencing (companion paper). The map also aids in assembly of reference genomes, gap-filling, and independent assessment of assembly accuracy. And, in addition to protein-coding regions, BACs harbor genomic segments such as untranslated regions (UTR), promotor/regulatory elements, and introns that are important in functional genomics studies. A BAC-based physical map is thus not just an interim solution or a step in the process of sequencing a genome but a viable resource for utilizing a complete genome sequence once ascertained. Physical maps have been reported recently for Gossypium raimondii [33], Aquilegia formosa [28], wheat chromosomes [34], maize [35,36], Brachypodium distachyon [37], Oncorhynchus mykiss [38], Prunus persica [39], and Glycine max [40,41]. Our T. cacao physical map assembly is highly ordered and tightly anchored to a moderately dense genetic recombination map, resides in 49 anchored contigs, and represents 82% of the T. cacao genome as computed by assuming a total genome size of 440 Mbp. However, estimated genome sizes for T. cacao range from 326 Mbp to 440 Mbp as determined using reassociation kinetics, flow cytometry [42,43], or genome assembly [11]. Once the true size of the genome of T. cacao cv. Matina 1-6 is known, our physical map statistics may require adjustment. Availability of a high-density genetic recombination map [5,26] was critical to our success in ordering and orienting BAC contigs and subsequently assigning contigs to chromosomes. We hybridized 230 genetically mapped SSR markers to the BAC contigs. Only 10 of these mapped markers hybridized to more than one contig signifying low copy-number, accuracy of the genetic map through additional evidence in the independent physical map build, and accuracy of the BAC assembly. Several of these markers flank QTL loci and therefore provide immediate templates for sequencing pools of high priority BACs to identify candidate genes for further investigation (companion paper). These mapped marker sequences also serve as sequence anchor points for use in comparative genome studies.
Using other genomes to assess and improve the T. cacao physical map and discover biological insights Each newly available genome assembly, especially those from species distantly related to model organisms, improves the resolution of genome biology and evolution. Exploration of evolving genome architecture through synteny analysis upon the release of a new genome has become a standard experiment in which a new window into a clade is opened. New genome assemblies are, however, incomplete. Gaps, chimeric joins, and unresolvable repeat-induced compressions are unavoidable. One way to identify these errors and improve an assembly is to use related genomes as a guide.
Comparative genomics has evolved rapidly over the last decade. More genomes acting as reference sequences and the advancement of computational algorithms and visualization software has facilitated a turnkey approach. At the same time, aligning physical maps to sequenced genomes has become increasingly useful. For example, Fang et al. constructed a physical map for Aquilegia formosa, a species in a unique clade of basal eudicots that is being utilized as a new model system for studying floral variation, adaptive radiation, and evolution, and used a comparative approach to gain insight into the evolutionary lineages between A. formosa and V. vinifera [28,[44][45][46]. Gu et al. created a physical map of Brachypodium distachyon and compared it to rice and wheat; they observed whole-genome duplication events in relation to rice that were caused by paleotetraploidy and a broad spectrum of other evolutionary events between the wheat and B. distachyon genomes [37]. In short, a physical map serves as a distinct line of non-sequence-based evidence as well as an adjunct to a draft genome sequence during the biological discovery process.
In our comparative genome analysis, we used the SyMAP software to align the T. cacao cv. Matina 1-6 physical map to A. thaliana, V. vinifera, and P. trichocarpa genome sequences to gain insight into the structure and evolutionary history of T. cacao as well as to improve the physical map assembly. Not surprisingly, we found more syntenic blocks between the ten chromosomes of T. cacao and the 19 chromosomes of P. trichocarpa than between the ten chromosomes of T. cacao and the 19 chromosomes of grape. The main difference was in the length of the syntenic blocks; 100 syntenic blocks between T. cacao and P. trichocarpa were greater than 1 Mbp in length, whereas there were only 74 syntenic blocks longer than 1 Mbp between T. cacao and V. vinifera. The number of short (< 1 Mbp) collinear regions differed between these comparisons as well (27 in the V. vinifera comparison and 87 in the comparison with P. trichocarpa). Alignment to the A. thaliana genome produced the fewest syntenic blocks as expected since A. thaliana is the most evolutionarily distant species of the three used for our comparisons.
Structural and evolutionary implications can also be derived from comparisons between physical maps and available genome sequences. Collinear genomic segments and duplications can be quickly identified and provide insight into selective pressures and identify regions of the genome for targeted detailed analysis, all without a reference genome. As a result of our use of this approach, we concur that the chromosome fusion events recently reported by Argout et al. [11] occurred in the ancient genome structure that led to the ten-chromosome structure we see today. Additionally, this comparative approach is sufficient to identify and confirm ancestral triplicated chromosomal groups recently reported for V. vinifera [15].
The utility of the cacao physical map with regard to cacao genome sequences Even as more and more genomes are sequenced de novo and sequencing strategies evolve, i.e. there is less reliance on Sanger-based sequencing and second-generation sequencing platforms shift in chemistry and read-lengths, there are still inherent problems with assembly and resolution of distal regions of complex eukaryotic genomes. Mate-pair sequencing libraries with long insert sizes such as BACs and fosmids provide the necessary linking information and long-range contiguity for resolving repetitive DNA and scaffolding draft contigs. As noted previously, a physical map is an important adjunct to de novo genome sequencing projects and serves as an independent genome assembly that is composed of a very different data type. BAC fingerprints assembled (at a conservative cutoff) with an integrated dense array of paired-end sequences and genetic markers can be used to check for errors and to corroborate the accuracy of a draft genome assembled using a whole-genome sequencing (WGS) strategy. The physical map can also serve as a template for gap-filling and targeted sub-genome re-sequencing of BAC pools [47].
The physical map we present here can be of great utility in advancing draft genome sequences of Theobroma cacao into high quality reference genomes. For example, comparison of the cv. Matina 1-6 physical map assemblies to the cv. Criollo pseudomolecules revealed a high degree of collinear genomic segments ( Figure 5A). However, there are still many regions of structural difference between these two draft sequences such as sequence inversions and translocations. There are also regions of discontiguous alignments between the physical map and draft sequence. We speculate that these differences may be the result of underlying biological differences between cultivars, misassemblies in either the Matina 1-6 physical map or the Criollo whole-genome draft assembly, gaps resulting from unresolved repetitive DNA, or a lack of dense BES in particular contigs in the physical map. For any of these possibilities, the Matina 1-6 physical map can serve as a template for confirming/resolving assembly differences and provide resolution in regions of low quality or underrepresentation in the draft genome assemblies through targeted PCR or sub-genome sequencing through BAC pools.

Conclusions
The three BAC libraries, BAC-end sequences, and the genetically integrated physical map for T. cacao cv. Matina 1-6 resulting from this study are important resources for functional exploitation and enhancement of Theobroma cacao that are expected to complement and augment genome sequencing efforts. The results obtained from the comparative analyses with A. thaliana, V. vinifera, and P. trichocarpa suggest that genome assemblies from distantly related species can be used to gain insights into genome structure and evolutionary history as well as conserved functional genomic sites and to improve a physical map assembly. These resources will also serve as the templates for refinement of T. cacao genome sequences through gap-filling, targeted re-sequencing, and resolution of repetitive DNA arrays through long-range contiguity.

BAC library construction
The Matina 1-6 tree clones used for DNA isolation were kept at greenhouse conditions and dark-treated for 12 hours prior to leaf harvesting. Approximately 100 g of young, F2 stage expanding leaf tissue with the mid-vein removed were harvested, washed two times with autoclaved ddH 2 0 and ground in liquid nitrogen with a mortar and pestle to a coarse powder. Nuclei were prepared following previously published methods [24] with the following modifications: addition of 1% (w/v) soluble PVP-40 (Sigma-Aldrich), 0.1% (w/v) L-ascorbic acid (Sigma-Aldrich), 0.13% (w/v) sodium diethyldithiocarbamate trihydrate (Sigma-Aldrich), and 0.4% beta-mercaptoethanol to nuclei isolation buffer (NIB) right before use. Purified nuclei were concentrated and embedded in agarose plugs. Protein digestion and plug washing was carried out exactly as previously described [24]. To prepare high molecular weight genomic DNA fragments, plugs were macerated with a single-edge razor blade and then partially digested (separately) with, HindIII, EcoRI, or MboI using standard procedures. DNA size selection, electro-elution, and ligation were carried out as previously described [24]. The BAC libraries were assigned a unique CUGI identifier according the enzyme used for partial digestion: TCC_Ba was made using HindIII, TCC_Bb with MboI, and TCC_Bc with EcoRI. Each BAC library was characterized for insert distribution by randomly selecting 384 clones and subjecting them to miniprep, digestion with Not-I (New England Biolabs), and resolution by pulsed-field gel electrophoresis [23].

HICF BAC fingerprinting
384-well plates containing BACs were decondensed to 96well format robotically with the Q-bot (Genetix, United Kingdom). Two pins were removed from the sub-plate inoculators to allow for manual insertion of control clones in the E07 and H12 positions; controls were used to assess data uniformity. DNA was isolated from a total of 108,288 clones from TCC_Ba, TCC_Bb, and TCC_Bc, the three BAC libraries, by following standard alkaline lysis miniprep methods [48], and then used as substrates for HICF carried out following previously published methods [24] with the following modifications. Approximately 0.5 ug of BAC DNA was digested with 2.0 units of HindIII, BamHI, XbaI, XhoI, and HaeIII at 37C for 2 hours. The DNA was treated with 0.25 U of label from the SNaP-shot kit (Applied Biosystems, Foster City, California) at 65 C for 1 hour and then precipitated with ethanol. The labelled DNA was reconstituted in 9 ul of Hi-Di formamide (Applied Biosystems) and 0.05 ul of LIZ600 (Applied Biosystems). BAC fingerprints were sized on an ABI3730 (Applied Biosystems) using a 36 cm array and POP7 (Applied Biosystems). The fingerprint profiles were processed using GeneMapper 3.7 (Applied Biosystems) for sizing quality and FPMiner (Bioinforsoft, Oregon) for digitized fingerprint assignment. For improved data quality, vector bands, clones without inserts, and restriction profiles with less than 20 or more than 200 bands were removed and the remaining profiles were uploaded to FPC v8.5.3 [25] for fingerprint contig assembly.

Physical map assembly
The initial build was done at a Sulston score cut-off of 1 e -80 and a tolerance of 3. To reduce false joins, the DQer function of FPC was used to break down all contigs comprised of more than 10% questionable (Q) clones. Further physical map refinement was performed using the Ends-to-Ends and Singles-to-Ends functions of FPC with stepwise reductions of the Sulston score cutoff values to a final score of 1 e -50 . Additional fingerprint contig merges were made with lower Ends-to-Ends overlap when there was additional agreement with the anchored genetic map [30]. Contigs merged in this fashion used Sulston score cutoffs as low as 1 e -25 . High quality marker sequence was processed through a CUGI pipeline consisting of Repeat Masker [49] with the RepBase database [50], cross_match [51], and Tandem Repeat Finder [52]. The remaining sequence was used for overgo design using the overgomaker (http://genome.wustl.edu/software/overgo_maker) software. Once a version of the whole-genome draft sequence of Matina 1-6 became available [12], potentially repetitive overgo sequences were aligned using BLAST [53] with cutoff of at least 85% percent identity. Any overgo sequence with more than 8 hits to the putative assembly was removed from the experiment, the repetitive sequence was masked, and a new overgo was selected from the original sequence. Manual filter hit-calling and deconvolution of the multi-dimensional pool hybridization data was accomplished using HybDecon, open source software available at http://www.genome.clemson.edu/ software/hybdecon. The filter hit-calling functionality is an improved version of Hybsweeper, a web-based Java tool first reported by Lazo et al. [54]. In addition, a Perl-based deconvolution script, written entirely at CUGI, accompanies the tool and is launched from within the graphical interface once manual calling of positive hits is complete. The source code, a test dataset, and installation manual are all available online. A user's manual is readily available with the click of the Help button from within the software. Additionally an online experimental design tool is available (http://www.genome.clemson.edu/software/hybdecon/ exp_setup) to assist with setup for multi-dimensional pooling hybridization.

Dense MTP selection and BAC-end sequencing
We selected a set of BACs from the physical map to provide sequenced BAC-ends at least every 8 kb along the genomic sequence. However, because the unit of measure in an FPC physical map is a Consensus Band (CB) unit, or restriction fragment unit, which does not correlate with distance in base pairs, we estimated the number of CB units that would approximate a distance of 20 kb using the following formula: where m is median number of CB units per BAC in the physical map, i is the average BAC insert size, k is the desired size in base pairs, and n is the number of CB units that estimates a distance of 20 kb. An in-house Perl script was created to iterate through the contigs of the physical map and select BACs which were closest to being 7 CB units apart. A list of BAC-ends that existed prior to this analysis was provided as input to the script and these BACs were used to avoid selection of a new BAC in regions where BAC-ends already existed. BACs were selected first by the CB location of their left-ends, but new BACs were not selected in regions where rightends from a previous selection existed.

Pseudomolecule construction and genome size estimation
Physical map coordinates for markers (CB position, contig ID) were obtained directly from FPC by right-clicking 'file' and selecting 'save ordered marker list'. Contig lengths and BAC positions were manually parsed from FPC using the following procedure: A) 'cat < fpc_filename > | grep 'Map' | sed s/Map//| sed s/Ends//| sed s/ ctg//| cut -d "O" -f1 | cut -d "." -f1 > BacsInContigs.txt'; B) BacsInContigs.txt was dropped into Excel and split at space delimiters and parsed; C) BAC start and stop positions in a contig and total BAC counts per contig was determined using a 'find duplicates' query in Access 2010 (Microsoft). Contigs that contained a genetic marker were considered anchored and assigned to a linkage group based upon that marker. Marker order within a contig was determined based upon physical position and not genetic position. Ordered and anchored contigs were then assigned physical map positions. Anchored contigs were sorted according to genetic position and pseudomolecules were constructed by adding an arbitrary gap of 0.25 Mbp between anchored contigs. Finally, a CMAP file was constructed for 10 anchored linkage group maps (Pseudo1-Pseudo10) with the following four feature types: contig, gap, marker, and STS. Genome size was estimated as the sum of anchored contigs (+ gaps) and unanchored contigs.