A genetically anchored physical framework for Theobroma cacao cv. Matina 1-6
© Saski et al; licensee BioMed Central Ltd. 2011
Received: 26 April 2011
Accepted: 16 August 2011
Published: 16 August 2011
The fermented dried seeds of Theobroma cacao (cacao tree) are the main ingredient in chocolate. World cocoa production was estimated to be 3 million tons in 2010 with an annual estimated average growth rate of 2.2%. The cacao bean production industry is currently under threat from a rise in fungal diseases including black pod, frosty pod, and witches' broom. In order to address these issues, genome-sequencing efforts have been initiated recently to facilitate identification of genetic markers and genes that could be utilized to accelerate the release of robust T. cacao cultivars. However, problems inherent with assembly and resolution of distal regions of complex eukaryotic genomes, such as gaps, chimeric joins, and unresolvable repeat-induced compressions, have been unavoidably encountered with the sequencing strategies selected.
Here, we describe the construction of a BAC-based integrated genetic-physical map of the T. cacao cultivar Matina 1-6 which is designed to augment and enhance these sequencing efforts. Three BAC libraries, each comprised of 10× coverage, were constructed and fingerprinted. 230 genetic markers from a high-resolution genetic recombination map and 96 Arabidopsis-derived conserved ortholog set (COS) II markers were anchored using pooled overgo hybridization. A dense tile path consisting of 29,383 BACs was selected and end-sequenced. The physical map consists of 154 contigs and 4,268 singletons. Forty-nine contigs are genetically anchored and ordered to chromosomes for a total span of 307.2 Mbp. The unanchored contigs (105) span 67.4 Mbp and therefore the estimated genome size of T. cacao is 374.6 Mbp. A comparative analysis with A. thaliana, V. vinifera, and P. trichocarpa suggests that comparisons of the genome assemblies of these distantly related species could provide insights into genome structure, evolutionary history, conservation of functional sites, and improvements in physical map assembly. A comparison between the two T. cacao cultivars Matina 1-6 and Criollo indicates a high degree of collinearity in their genomes, yet rearrangements were also observed.
The results presented in this study are a stand-alone resource for functional exploitation and enhancement of Theobroma cacao but are also expected to complement and augment ongoing genome-sequencing efforts. This resource will serve as a template for refinement of the T. cacao genome through gap-filling, targeted re-sequencing, and resolution of repetitive DNA arrays.
Theobroma cacao (cacao tree) is the source of the world's cocoa butter and cocoa powder, key ingredients in chocolate. T. cacao is a short, tropical tree that is grown in multiple countries including Côte d'Ivoire, Ghana, Indonesia, Nigeria, Brazil, Cameroon, Ecuador, Colombia, Mexico, and Papua New Guinea where cacao beans are an important cash crop. World cocoa production was estimated to be 3 million tons in 2010 and had an annual estimated average growth rate of 2.2% from 1998 to 2010 (http://www.worldcocoafoundation.org/learn-about-cocoa/cocoa-facts-and-figures.html). Cacao bean production is currently under threat from several sources including a rise in the incidence of fungal diseases including black pod, frosty pod, and witches' broom [1, 2]. In order to address these issues, multiple genetic and genomic efforts have been initiated in the last decade to identify genetic markers and genes that could be utilized to accelerate the release of robust T. cacao cultivars [3–10]. These efforts have recently culminated in whole-genome shotgun assemblies of the genomes of two T. cacao cultivars: Criollo  and Matina 1-6 . These genome sequences will greatly assist T. cacao breeding efforts  as well as contribute to our basic knowledge of tree and dicot biology through comparisons with a growing collection of genome sequences from trees and dicotyledonous plants.
There are primarily two approaches to sequence a genome: the BAC-by-BAC approach where libraries of clones with large inserts (e.g. BACs) are randomly sequenced and ordered relative to a minimum tile path (MTP) such as has been used to sequence the genomes of rice and maize [13, 14], or the whole- genome shotgun (WGS) sequencing of genomic DNA, carried out without cloning, from genomic libraries with multiple insert sizes as implemented for Western poplar, grapevine, and sorghum genomes [15–17]. In the BAC-by-BAC approach, the MTP is obtained through the construction of a physical map via DNA fingerprinting: each BAC clone is digested with restriction enzymes [18–20] and contigs are generated based upon shared DNA signatures. DNA signatures, comprised of a BAC's repertoire of fragment sizes, are determined by algorithms incorporated into software such as Fingerprinted Contigs (FPC) [18, 21]. FPC-based physical maps can be validated and improved by incorporating information obtained by hybridizing BAC clones with specific probes which can include probes that have been genetically mapped; the ordering of contigs on physical maps can thus be facilitated by using genetically derived mapping information . Furthermore, BAC ends can be sequenced and serve as well-spaced markers for determining the accuracy of shotgun read assemblies. Using a strategy that integrates physical mapping and genetic mapping reduces the number of chimeric contigs and increases overall confidence in the final assembly (whether BAC-by-BAC or WGS). In any strategy, it is important to have a high quality reference assembly available for annotation and re-sequencing endeavours.
In this study, we describe the construction of a BAC-based integrated genetic-physical map of the genome of T. cacao cv. Matina 1-6 (estimated genome size = 374 Mbp). We also demonstrate the utility of the map by comparing it with other plant genomes. This map serves as an important reference for T. cacao genomes being sequenced; it can be used to establish the accuracy of those genome sequences prior to their use for applications such as SNP discovery, RNAseq, ChiPseq, and other techniques.
Physical map construction
T. cacao BAC library overview
Average insert size (kb)
Insert range (kb)
HICF fragments (avg)
In order to determine BAC order and orientation, we characterized all BACs in the libraries using high resolution restriction band fingerprinting. Using previously published methods , a total of 108,288 BAC clones from all three BAC libraries were subjected to high information content fingerprinting (HICF) after addition of control clones in the E07 and H12 wells of the 96-well offset to maintain data uniformity. Data comprising fragment sizes were captured by capillary electrophoresis . After removal of clones containing less than 20 fragments or greater than 220 fragments and empty vectors, 95,386 clones (88% of the original total), with an average of 114.2 fragments per clone, were successfully assigned a digitized fingerprint (Table 1) to use for assembly carried out using the FPC software .
To obtain a high-quality commensurate build, fingerprints from all three BAC libraries were combined for contig assembly using a stringent cutoff of 1 e-80 and a tolerance of 3. This base assembly resulted in 445 contigs harbouring 86,590 clones and 8,806 singletons. The DQer function of FPC was run to identify and break up contigs comprised of greater than 10% questionable clones (a sign of potential false joins). The Singles-to-Ends function and the Ends-to-Ends function were run at a final cutoff of 1 e-50 using automatic merges to join singletons to contig ends and contigs to contigs, respectively. Additionally, manual curation of the physical map was performed based on results collected from the integration of the genetic recombination map and synteny mapping (described below) at a cutoff as low as 1 e-25. Assuming a cumulative average insert size of 138 kb and an estimated T. cacao genome size of 440 Mbp, the consensus band (CB) estimation equates to an average of 1,210 bp per band.
T. cacao physical map overview
No. BACs in FPC
No. BACs in contigs
No. BACs in Anchored Contigs
No. BACs in Unanchored Contigs
Contig Len, Anchored (Mbp)
Contig Len, Unanchored (Mbp)
FPC Genome Size (Mbp)
% Genome Anchored (bp)
Integration of genetic markers and sequence tagged sites (STS) onto the physical map
Dense minimum tile path (MTP) selection for BAC-end sequencing
BAC-end sequences (BES) corresponding to a fingerprint in the physical map serve as long-range, paired sequence anchor points which can be used to facilitate alignments to other genomes and integration of draft sequence contigs. Ideally, a pair of BAC-end sequences assigned to each BAC in the physical map will provide a robust array of Sequence Tagged Sites (STS) for draft genome anchoring, genome exploitation, etc. However, the T. cacao physical map assembled into a very few, long contigs built from nearly 30× BAC coverage; therefore, we determined a BES for every clone was not necessary but rather an end sequence every 7-8 kb. For the T. cacao physical map, the median number of Consensus Band (CB) units per BAC was 120 CBs, the average BAC insert size was 138 kb, and the desired distance in bp between sequence anchor points was 8 kb. The resulting n-value was approximately 7 CB units and 29,383 BACs with an end approximately every 7CB units apart were selected for BAC-end sequencing. A total of 58,766 sequencing reactions were performed that included both ends, 52,966 (90%) of which were successfully trimmed and a total of 49,984 of those were part of a successful pair. This moderately dense array of BAC-end sequences will serve as sequence anchor points for comparative genomics and long-range sequence pairs for draft genome sequence integration.
Alignment of the T. cacao physical map with V. vinifera, P. trichocarpa, and A. thaliana genomes
Synteny coverage between T. cacao (Matina 1-6) physical map and reference genome assemblies
T. cacao (Criollo)
T. cacao (Matina 1-6)
T. cacao (Criollo)
T. cacao (Criollo)
Synteny blocks between T. cacao (Matina 1-6) physical map and reference genome assemblies
Blocks (< 1 Mbp)
Blocks (1 Mbp - 3 Mbp)
Blocks (> 3 Mbp)
T. cacao (Criollo)
To determine if a physical map is sufficient for investigating ancestral paleo-polyploidy events, we looked at the detailed alignments between T. cacao and V. vinifera chromosomes. We observed that V. vinifera chromosomes 1, 14, and 17 align to T. cacao (Cv. Matina 1-6) chromosomes 2, 3, and 4; V. vinifera chromosomes 2, 12, 15, and 16 align to T. cacao (Cv. Matina 1-6) chromosomes 1, 3, and 5; V. vinifera chromosomes 3, 4, 7, and 18 align to T. cacao chromosomes 1, 2, and 8; V. vinifera chromosomes 4, 9, and 11 align to T. cacao 6, 8, and 9; V. vinifera chromosomes 5, 7, and 14 align to T. cacao 1, 4, and 5; V. vinifera chromosomes 6, 8, and 13 align to T. cacao 5, 9, and 10; V. vinifera chromosomes 10, 12, and 19 align to T. cacao 1, 6, and 7 (Figure 4). These observations are nearly identical to those of Argout et al.  and also confirm evidence of ancestral triplicated chromosome groups reported for V. vinifera. These results suggest that a BAC-based physical map with relatively evenly spaced BAC-end sequence anchor points can have immediate utility, depending on the amount of collinearity that exists between it and other genomes of interest, for interrogating agriculturally important genes and gene families and elucidating evolutionary origins prior to the availability of a high quality reference genome.
Alignment of the T. cacao cv. Matina 1-6 physical map with the T. cacao cv. Criollo genome assembly
After the T. cacao cv. Matina 1-6 physical map was constructed the T. cacao cv. Criollo genome assembly became available . Alignment of the T. cacao cv. Matina 1-6 physical map with the T. cacao cv. Criollo genome assembly  identified a total of 37,310 BES matches and 194 marker sequences anchored to 87 contigs covering approximately 65% (318.3 Mbp) of the Criollo genome (Table 4; Additional file 8: Table S8). The average percent identity of the sequence matches ranged from 83% to 99%. A total of 112 synteny blocks were identified, 56 of which were less than 1 Mbp in length, 25 ranged from 1 to 3 Mbp, and 31 were longer than 3 Mbp (Table 5). Alignment of the genome sequences of the Matina 1-6 cultivar with those of the Criollo cultivar revealed a very close alignment, as expected, which validates the comparative genomic software and methodology.
Integration of unanchored contigs using collinearity with other genomes
In an effort to improve the T. cacao cv. Matina 1-6 physical map, we examined structural similarities found in the genomes of related species that might suggest possible linkage group assignments for unanchored contigs in the map. Unincorporated Matina 1-6 FPC contigs were assessed for linkage group assignment based on anchoring to regions of the V. vinifera, P. trichocarpa, A. thaliana, and T. cacao cv. Criollo chromosomes where other T. cacao cv. Matina 1-6 anchored BAC contigs were aligned. These alignments resulted in linkage group predictions for 11 unanchored contigs (Additional file 9: Table S9). As more genomes are sequenced, a comparative approach to refining existing physical maps will become more effective.
Overview of the cacao physical map
We constructed three BAC libraries together representing ~30× coverage of the T. cacao genome (publicly available via http://www.genome.clemson.edu) and used them to create the first whole-genome physical map for T. cacao cv. Matina 1-6 (publicly available via http://www.cacaogenomedb.org). This map is genetically anchored and enables cacao cultivar improvement through efforts such as positional cloning and region-specific analysis through sub-genome sequencing (companion paper). The map also aids in assembly of reference genomes, gap-filling, and independent assessment of assembly accuracy. And, in addition to protein-coding regions, BACs harbor genomic segments such as untranslated regions (UTR), promotor/regulatory elements, and introns that are important in functional genomics studies. A BAC-based physical map is thus not just an interim solution or a step in the process of sequencing a genome but a viable resource for utilizing a complete genome sequence once ascertained. Physical maps have been reported recently for Gossypium raimondii, Aquilegia formosa, wheat chromosomes , maize [35, 36], Brachypodium distachyon, Oncorhynchus mykiss, Prunus persica, and Glycine max[40, 41].
Our T. cacao physical map assembly is highly ordered and tightly anchored to a moderately dense genetic recombination map, resides in 49 anchored contigs, and represents 82% of the T. cacao genome as computed by assuming a total genome size of 440 Mbp. However, estimated genome sizes for T. cacao range from 326 Mbp to 440 Mbp as determined using reassociation kinetics, flow cytometry [42, 43], or genome assembly . Once the true size of the genome of T. cacao cv. Matina 1-6 is known, our physical map statistics may require adjustment. Availability of a high-density genetic recombination map [5, 26] was critical to our success in ordering and orienting BAC contigs and subsequently assigning contigs to chromosomes. We hybridized 230 genetically mapped SSR markers to the BAC contigs. Only 10 of these mapped markers hybridized to more than one contig signifying low copy-number, accuracy of the genetic map through additional evidence in the independent physical map build, and accuracy of the BAC assembly. Several of these markers flank QTL loci and therefore provide immediate templates for sequencing pools of high priority BACs to identify candidate genes for further investigation (companion paper). These mapped marker sequences also serve as sequence anchor points for use in comparative genome studies.
Using other genomes to assess and improve the T. cacao physical map and discover biological insights
Each newly available genome assembly, especially those from species distantly related to model organisms, improves the resolution of genome biology and evolution. Exploration of evolving genome architecture through synteny analysis upon the release of a new genome has become a standard experiment in which a new window into a clade is opened. New genome assemblies are, however, incomplete. Gaps, chimeric joins, and unresolvable repeat-induced compressions are unavoidable. One way to identify these errors and improve an assembly is to use related genomes as a guide.
Comparative genomics has evolved rapidly over the last decade. More genomes acting as reference sequences and the advancement of computational algorithms and visualization software has facilitated a turnkey approach. At the same time, aligning physical maps to sequenced genomes has become increasingly useful. For example, Fang et al. constructed a physical map for Aquilegia formosa, a species in a unique clade of basal eudicots that is being utilized as a new model system for studying floral variation, adaptive radiation, and evolution, and used a comparative approach to gain insight into the evolutionary lineages between A. formosa and V. vinifera[28, 44–46]. Gu et al. created a physical map of Brachypodium distachyon and compared it to rice and wheat; they observed whole-genome duplication events in relation to rice that were caused by paleotetraploidy and a broad spectrum of other evolutionary events between the wheat and B. distachyon genomes . In short, a physical map serves as a distinct line of non-sequence-based evidence as well as an adjunct to a draft genome sequence during the biological discovery process.
In our comparative genome analysis, we used the SyMAP software to align the T. cacao cv. Matina 1-6 physical map to A. thaliana, V. vinifera, and P. trichocarpa genome sequences to gain insight into the structure and evolutionary history of T. cacao as well as to improve the physical map assembly. Not surprisingly, we found more syntenic blocks between the ten chromosomes of T. cacao and the 19 chromosomes of P. trichocarpa than between the ten chromosomes of T. cacao and the 19 chromosomes of grape. The main difference was in the length of the syntenic blocks; 100 syntenic blocks between T. cacao and P. trichocarpa were greater than 1 Mbp in length, whereas there were only 74 syntenic blocks longer than 1 Mbp between T. cacao and V. vinifera. The number of short (< 1 Mbp) collinear regions differed between these comparisons as well (27 in the V. vinifera comparison and 87 in the comparison with P. trichocarpa). Alignment to the A. thaliana genome produced the fewest syntenic blocks as expected since A. thaliana is the most evolutionarily distant species of the three used for our comparisons.
Structural and evolutionary implications can also be derived from comparisons between physical maps and available genome sequences. Collinear genomic segments and duplications can be quickly identified and provide insight into selective pressures and identify regions of the genome for targeted detailed analysis, all without a reference genome. As a result of our use of this approach, we concur that the chromosome fusion events recently reported by Argout et al.  occurred in the ancient genome structure that led to the ten-chromosome structure we see today. Additionally, this comparative approach is sufficient to identify and confirm ancestral triplicated chromosomal groups recently reported for V. vinifera.
The utility of the cacao physical map with regard to cacao genome sequences
Even as more and more genomes are sequenced de novo and sequencing strategies evolve, i.e. there is less reliance on Sanger-based sequencing and second-generation sequencing platforms shift in chemistry and read-lengths, there are still inherent problems with assembly and resolution of distal regions of complex eukaryotic genomes. Mate-pair sequencing libraries with long insert sizes such as BACs and fosmids provide the necessary linking information and long-range contiguity for resolving repetitive DNA and scaffolding draft contigs. As noted previously, a physical map is an important adjunct to de novo genome sequencing projects and serves as an independent genome assembly that is composed of a very different data type. BAC fingerprints assembled (at a conservative cutoff) with an integrated dense array of paired-end sequences and genetic markers can be used to check for errors and to corroborate the accuracy of a draft genome assembled using a whole-genome sequencing (WGS) strategy. The physical map can also serve as a template for gap-filling and targeted sub-genome re-sequencing of BAC pools .
The physical map we present here can be of great utility in advancing draft genome sequences of Theobroma cacao into high quality reference genomes. For example, comparison of the cv. Matina 1-6 physical map assemblies to the cv. Criollo pseudomolecules revealed a high degree of collinear genomic segments (Figure 5A). However, there are still many regions of structural difference between these two draft sequences such as sequence inversions and translocations. There are also regions of discontiguous alignments between the physical map and draft sequence. We speculate that these differences may be the result of underlying biological differences between cultivars, misassemblies in either the Matina 1-6 physical map or the Criollo whole-genome draft assembly, gaps resulting from unresolved repetitive DNA, or a lack of dense BES in particular contigs in the physical map. For any of these possibilities, the Matina 1-6 physical map can serve as a template for confirming/resolving assembly differences and provide resolution in regions of low quality or underrepresentation in the draft genome assemblies through targeted PCR or sub-genome sequencing through BAC pools.
The three BAC libraries, BAC-end sequences, and the genetically integrated physical map for T. cacao cv. Matina 1-6 resulting from this study are important resources for functional exploitation and enhancement of Theobroma cacao that are expected to complement and augment genome sequencing efforts. The results obtained from the comparative analyses with A. thaliana, V. vinifera, and P. trichocarpa suggest that genome assemblies from distantly related species can be used to gain insights into genome structure and evolutionary history as well as conserved functional genomic sites and to improve a physical map assembly. These resources will also serve as the templates for refinement of T. cacao genome sequences through gap-filling, targeted re-sequencing, and resolution of repetitive DNA arrays through long-range contiguity.
BAC library construction
The Matina 1-6 tree clones used for DNA isolation were kept at greenhouse conditions and dark-treated for 12 hours prior to leaf harvesting. Approximately 100 g of young, F2 stage expanding leaf tissue with the mid-vein removed were harvested, washed two times with autoclaved ddH20 and ground in liquid nitrogen with a mortar and pestle to a coarse powder. Nuclei were prepared following previously published methods  with the following modifications: addition of 1% (w/v) soluble PVP-40 (Sigma-Aldrich), 0.1% (w/v) L-ascorbic acid (Sigma-Aldrich), 0.13% (w/v) sodium diethyldithiocarbamate trihydrate (Sigma-Aldrich), and 0.4% beta-mercaptoethanol to nuclei isolation buffer (NIB) right before use. Purified nuclei were concentrated and embedded in agarose plugs. Protein digestion and plug washing was carried out exactly as previously described . To prepare high molecular weight genomic DNA fragments, plugs were macerated with a single-edge razor blade and then partially digested (separately) with, Hin dIII, Eco RI, or Mbo I using standard procedures. DNA size selection, electro-elution, and ligation were carried out as previously described . The BAC libraries were assigned a unique CUGI identifier according the enzyme used for partial digestion: TCC_Ba was made using Hin dIII, TCC_Bb with Mbo I, and TCC_Bc with Eco RI. Each BAC library was characterized for insert distribution by randomly selecting 384 clones and subjecting them to miniprep, digestion with Not-I (New England Biolabs), and resolution by pulsed-field gel electrophoresis .
HICF BAC fingerprinting
384-well plates containing BACs were decondensed to 96-well format robotically with the Q-bot (Genetix, United Kingdom). Two pins were removed from the sub-plate inoculators to allow for manual insertion of control clones in the E07 and H12 positions; controls were used to assess data uniformity. DNA was isolated from a total of 108,288 clones from TCC_Ba, TCC_Bb, and TCC_Bc, the three BAC libraries, by following standard alkaline lysis miniprep methods , and then used as substrates for HICF carried out following previously published methods  with the following modifications. Approximately 0.5 ug of BAC DNA was digested with 2.0 units of Hin dIII, Bam HI, Xba I, Xho I, and Hae III at 37C for 2 hours. The DNA was treated with 0.25 U of label from the SNaP-shot kit (Applied Biosystems, Foster City, California) at 65 C for 1 hour and then precipitated with ethanol. The labelled DNA was reconstituted in 9 ul of Hi-Di formamide (Applied Biosystems) and 0.05 ul of LIZ600 (Applied Biosystems). BAC fingerprints were sized on an ABI3730 (Applied Biosystems) using a 36 cm array and POP7 (Applied Biosystems). The fingerprint profiles were processed using GeneMapper 3.7 (Applied Biosystems) for sizing quality and FPMiner (Bioinforsoft, Oregon) for digitized fingerprint assignment. For improved data quality, vector bands, clones without inserts, and restriction profiles with less than 20 or more than 200 bands were removed and the remaining profiles were uploaded to FPC v8.5.3  for fingerprint contig assembly.
Physical map assembly
The initial build was done at a Sulston score cut-off of 1 e-80 and a tolerance of 3. To reduce false joins, the DQer function of FPC was used to break down all contigs comprised of more than 10% questionable (Q) clones. Further physical map refinement was performed using the Ends-to-Ends and Singles-to-Ends functions of FPC with stepwise reductions of the Sulston score cutoff values to a final score of 1 e-50. Additional fingerprint contig merges were made with lower Ends-to-Ends overlap when there was additional agreement with the anchored genetic map . Contigs merged in this fashion used Sulston score cutoffs as low as 1 e-25. High quality marker sequence was processed through a CUGI pipeline consisting of Repeat Masker  with the RepBase database , cross_match , and Tandem Repeat Finder . The remaining sequence was used for overgo design using the overgomaker (http://genome.wustl.edu/software/overgo_maker) software. Once a version of the whole-genome draft sequence of Matina 1-6 became available , potentially repetitive overgo sequences were aligned using BLAST  with cutoff of at least 85% percent identity. Any overgo sequence with more than 8 hits to the putative assembly was removed from the experiment, the repetitive sequence was masked, and a new overgo was selected from the original sequence. Manual filter hit-calling and deconvolution of the multi-dimensional pool hybridization data was accomplished using HybDecon, open source software available at http://www.genome.clemson.edu/software/hybdecon. The filter hit-calling functionality is an improved version of Hybsweeper, a web-based Java tool first reported by Lazo et al. . In addition, a Perl-based deconvolution script, written entirely at CUGI, accompanies the tool and is launched from within the graphical interface once manual calling of positive hits is complete. The source code, a test dataset, and installation manual are all available online. A user's manual is readily available with the click of the Help button from within the software. Additionally an online experimental design tool is available (http://www.genome.clemson.edu/software/hybdecon/exp_setup) to assist with setup for multi-dimensional pooling hybridization.
Dense MTP selection and BAC-end sequencing
where m is median number of CB units per BAC in the physical map, i is the average BAC insert size, k is the desired size in base pairs, and n is the number of CB units that estimates a distance of 20 kb. An in-house Perl script was created to iterate through the contigs of the physical map and select BACs which were closest to being 7 CB units apart. A list of BAC-ends that existed prior to this analysis was provided as input to the script and these BACs were used to avoid selection of a new BAC in regions where BAC-ends already existed. BACs were selected first by the CB location of their left-ends, but new BACs were not selected in regions where right-ends from a previous selection existed.
Pseudomolecule construction and genome size estimation
Physical map coordinates for markers (CB position, contig ID) were obtained directly from FPC by right-clicking 'file' and selecting 'save ordered marker list'. Contig lengths and BAC positions were manually parsed from FPC using the following procedure: A) 'cat < fpc_filename > | grep 'Map' | sed s/Map//| sed s/Ends//| sed s/ctg//| cut -d "O" -f1 | cut -d "." -f1 > BacsInContigs.txt'; B) BacsInContigs.txt was dropped into Excel and split at space delimiters and parsed; C) BAC start and stop positions in a contig and total BAC counts per contig was determined using a 'find duplicates' query in Access 2010 (Microsoft). Contigs that contained a genetic marker were considered anchored and assigned to a linkage group based upon that marker. Marker order within a contig was determined based upon physical position and not genetic position. Ordered and anchored contigs were then assigned physical map positions. Anchored contigs were sorted according to genetic position and pseudomolecules were constructed by adding an arbitrary gap of 0.25 Mbp between anchored contigs. Finally, a CMAP file was constructed for 10 anchored linkage group maps (Pseudo1-Pseudo10) with the following four feature types: contig, gap, marker, and STS. Genome size was estimated as the sum of anchored contigs (+ gaps) and unanchored contigs.
Synteny mapping and comparative genomics
Synteny mapping was carried out using the Symap software [31, 32]. Genomic assemblies and annotations for alignments were downloaded from the following locations: A. thaliana (http://www.arabidopsis.org/portals/genAnnotation/gene_structural_annotation/agicomplete.jsp), V. vinifera (http://www.plantgdb.org/VvGDB/), P. trichocarpa (http://www.phytozome.net/poplar.php), and T. cacao cv. Criollo (http://cocoagendb.cirad.fr). The T. cacao cv. Matina 1-6 input data for Symap was our 49 BAC contigs anchored to chromosomes by 230 genetic recombination markers and aligned to the respective genomes with 52,966 BAC-end sequences and the 230 genetic map sequences, respectively.
This project was funded by Mars Inc. The authors would also like to thank Jeanice Troutman, Xia Xiaoxia, Michael Atkins, and Maria Delgado of CUGI for BAC-end sequencing, colony arraying, BAC filter production, and BAC resource management and distribution. Also, a special thanks to Belinda Martineau for a critical review of the manuscript.
- Evans HC: Cacao diseases-the trilogy revisited. Phytopath. 2007, 97 (12): 1640-1643. 10.1094/PHYTO-97-12-1640.View Article
- Hebbar PK: Cacao diseases: a global perspective from an industry point of view. Phytopath. 2007, 97 (12): 1658-1663. 10.1094/PHYTO-97-12-1658.View Article
- Brown JS, Phillips-Mora W, Power EJ, Krol C, Cervantes-Martinez C, Motamayor JC, Schnell RJ: Mapping QTLs for Resistance to Frosty Pod and Black Pod Diseases and Horticultural Traits in Theobroma cacao. Crop Science. 2007, 47 (5): 1851-1858. 10.2135/cropsci2006.11.0753.View Article
- Clement D, Risterucci AM, Motamayor JC, N'Goran J, Lanaud C: Mapping QTL for yield components, vigor, and resistance to Phytophthora palmivora in Theobroma cacao L. Genome. 2003, 46 (2): 204-212. 10.1139/g02-125.PubMedView Article
- Crouzillat D, Lerceteau E, Pétiard V, Morera-Monge JA, Rodríguez H, Walker D, Phillips-Mora W, Ronning C, Schnell RJ, Osei J, et al: Theobroma cacao L.: A genetic linkage map and quantitative trait loci analysis. Theor Appl Genet. 1996, 93 (1-2): 205-214. 10.1007/BF00225747.PubMedView Article
- Faleiro F, Queiroz V, Lopes U, Guimarães C, Pires J, Yamada M, Araújo I, Pereira M, Schnell R, Filho G, et al: Mapping QTLs for Witches' Broom (Crinipellis Perniciosa) Resistance in Cacao (Theobroma Cacao L.). Euphytica. 1996, 149 (1-2): 227-235.View Article
- Lanaud C, Fouet O, Clément D, Boccara M, Risterucci AM, Surujdeo-Maharaj S, Legavre T, Argout X: A meta-QTL analysis of disease resistance traits of Theobroma cacao L. Molecular Breeding. 2009, 24 (4): 361-374. 10.1007/s11032-009-9297-4.View Article
- Queiroz VT, Guimarães CT, Anhert D, Schuster I, Daher RT, Pereira MG, Miranda VRM, Loguercio LL, Barros EG, Moreira MA, et al: Identification of a major QTL in cocoa (Theobroma cacao L.) associated with resistance to witches' broom disease. Plant Breeding. 2003, 122 (3): 268-272. 10.1046/j.1439-0523.2003.00809.x.View Article
- Risterucci AM, Paulin D, Ducamp M, N'Goran JA, Lanaud C: Identification of QTLs related to cocoa resistance to three species of Phytophthora. Theor Appl Genet. 2003, 108 (1): 168-174. 10.1007/s00122-003-1408-8.PubMedView Article
- Schnell RJ, Kuhn DN, Brown JS, Olano CT, Phillips-Mora W, Amores FM, Motamayor JC: Development of a marker assisted selection program for cacao. Phytopath. 2007, 97 (12): 1664-1669. 10.1094/PHYTO-97-12-1664.View Article
- Argout X, Salse J, Aury JM, Guiltinan MJ, Droc G, Gouzy J, Allegre M, Chaparro C, Legavre T, Maximova SN, et al: The genome of Theobroma cacao. Nature Genetics. 2011, 43 (2): 101-108. 10.1038/ng.736.PubMedView Article
- IRGSP: The map-based sequence of the rice genome. Nature. 2005, 436 (7052): 793-800. 10.1038/nature03895.View Article
- Pennisi E: Plant sciences. Corn genomics pops wide open. Science. 2008, 319 (5868): 1333-10.1126/science.319.5868.1333.PubMedView Article
- Jaillon O, Aury JM, Noel B, Policriti A, Clepet C, Casagrande A, Choisne N, Aubourg S, Vitulo N, Jubin C, et al: The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature. 2007, 449 (7161): 463-467. 10.1038/nature06148.PubMedView Article
- Paterson AH, Bowers JE, Bruggmann R, Dubchak I, Grimwood J, Gundlach H, Haberer G, Hellsten U, Mitros T, Poliakov A, et al: The Sorghum bicolor genome and the diversification of grasses. Nature. 2009, 457 (7229): 551-556. 10.1038/nature07723.PubMedView Article
- Tuskan GA, Difazio S, Jansson S, Bohlmann J, Grigoriev I, Hellsten U, Putnam N, Ralph S, Rombauts S, Salamov A, et al: The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science. 2006, 313 (5793): 1596-1604. 10.1126/science.1128691.PubMedView Article
- Nelson W, Soderlund C: Integrating sequence with FPC fingerprint maps. Nucleic Acids Res. 2009, 37 (5): e36-10.1093/nar/gkp034.PubMed CentralPubMedView Article
- van Oeveren J, de Ruiter M, Jesse T, van der Poel H, Tang J, Yalcin F, Janssen A, Volpin H, Stormo KE, Bogden R, et al: Sequence-based physical mapping of complex genomes by whole genome profiling. Genome Res. 2011
- Zhu H, Blackmon BP, Sasinowski M, Dean RA: Physical map and organization of chromosome 7 in the rice blast fungus, Magnaporthe grisea. Genome Res. 1999, 9 (8): 739-750.PubMed CentralPubMed
- Soderlund C, Longden I, Mott R: FPC: a system for building contigs from restriction fingerprinted clones. Comput Appl Biosci. 1997, 13 (5): 523-535.PubMed
- Xiong Z, Kim JS, Pires JC: Integration of genetic, physical, and cytogenetic maps for Brassica rapa chromosome A7. Cytogenet Genome Res. 2010, 129 (1-3): 190-198. 10.1159/000314640.PubMedView Article
- Luo M, Wing RA: An Improved Method for Plant BAC Library Construction. 2003, Totowa, NJ: Humana Press, Inc, 236: -
- Luo MC, Thomas C, You FM, Hsiao J, Ouyang S, Buell CR, Malandro M, McGuire PE, Anderson OD, Dvorak J: High-throughput fingerprinting of bacterial artificial chromosomes using the snapshot labeling kit and sizing of restriction fragments by capillary electrophoresis. Genomics. 2003, 82 (3): 378-389. 10.1016/S0888-7543(03)00128-9.PubMedView Article
- Soderlund C, Humphray S, Dunham A, French L: Contigs built with fingerprints, markers, and FPC V4.7. Genome Res. 2000, 10 (11): 1772-1787. 10.1101/gr.GR-1375R.PubMed CentralPubMedView Article
- Brown JS, Schnell RJ, Motamayor JC, Lopes U, Kuhn DN, Borrone JW: Resistance gene mapping for witches' broom disease in Theobroma cacao L. in an F2 population using SSR markers and candidate genes. J Amer Soc Hort Sci. 2005, 130 (3): 366-373.
- Wu F, Mueller LA, Crouzillat D, Petiard V, Tanksley SD: Combining bioinformatics and phylogenetics to identify large sets of single-copy orthologous genes (COSII) for comparative, evolutionary and systematic studies: a test case in the euasterid plant clade. Genetics. 2006, 174 (3): 1407-1420. 10.1534/genetics.106.062455.PubMed CentralPubMedView Article
- Fang GC, Blackmon BP, Henry DC, Staton ME, Saski CA, Hodges SA, Tomkins JP, Luo H: Genomic tools development for Aquilegia: construction of a BAC-based physical map. BMC Genomics. 2010, 11: 621-10.1186/1471-2164-11-621.PubMed CentralPubMedView Article
- Fang Z, Polacco M, Chen S, Schroeder S, Hancock D, Sanchez H, Coe E: cMap: the comparative genetic map viewer. Bioinformatics. 2003, 19 (3): 416-417. 10.1093/bioinformatics/btg012.PubMedView Article
- Brown JS, Sautter RT, Olano CT, Borrone JW, Kuhn DN, Motamayor JC, Schnell RJ: A Composite Linkage Map from Three Crosses Between Commercial Clones of Cacao, Theobroma cacao L. Tropical Plant Biol. 2008, 1 (2): 120-130. 10.1007/s12042-008-9011-4.View Article
- Soderlund C, Nelson W, Shoemaker A, Paterson A: SyMAP: A system for discovering and viewing syntenic regions of FPC maps. Genome Res. 2006, 16 (9): 1159-1168. 10.1101/gr.5396706.PubMed CentralPubMedView Article
- Soderlund C, Bomhoff M, Nelson WM: SyMAP v3.4: a turnkey synteny system with application to plant genomes. Nucleic Acids Res. 2011
- Lin L, Pierce GJ, Bowers JE, Estill JC, Compton RO, Rainville LK, Kim C, Lemke C, Rong J, Tang H, et al: A draft physical map of a D-genome cotton species (Gossypium raimondii). BMC Genomics. 2010, 11: 395-10.1186/1471-2164-11-395.PubMed CentralPubMedView Article
- Luo MC, Ma Y, You FM, Anderson OD, Kopecky D, Simkova H, Safar J, Dolezel J, Gill B, McGuire PE, et al: Feasibility of physical map construction from fingerprinted bacterial artificial chromosome libraries of polyploid plant species. BMC Genomics. 2010, 11: 122-10.1186/1471-2164-11-122.PubMed CentralPubMedView Article
- Schnable PS, Ware D, Fulton RS, Stein JC, Wei F, Pasternak S, Liang C, Zhang J, Fulton L, Graves TA, et al: The B73 maize genome: complexity, diversity, and dynamics. Science. 2009, 326 (5956): 1112-1115. 10.1126/science.1178534.PubMedView Article
- Wei F, Zhang J, Zhou S, He R, Schaeffer M, Collura K, Kudrna D, Faga BP, Wissotski M, Golser W, et al: The physical and genetic framework of the maize B73 genome. PLoS Genet. 2009, 5 (11): e1000715-10.1371/journal.pgen.1000715.PubMed CentralPubMedView Article
- Gu YQ, Ma Y, Huo N, Vogel JP, You FM, Lazo GR, Nelson WM, Soderlund C, Dvorak J, Anderson OD, et al: A BAC-based physical map of Brachypodium distachyon and its comparative analysis with rice and wheat. BMC Genomics. 2009, 10: 496-10.1186/1471-2164-10-496.PubMed CentralPubMedView Article
- Palti Y, Luo MC, Hu Y, Genet C, You FM, Vallejo RL, Thorgaard GH, Wheeler PA, Rexroad CE: A first generation BAC-based physical map of the rainbow trout genome. BMC Genomics. 2009, 10: 462-10.1186/1471-2164-10-462.PubMed CentralPubMedView Article
- Zhebentyayeva T, Swire-Clark G, Georgi L, Garay L, Jung S, Forrest S, Blenda A, Blackmon B, Mook J, Horn R, et al: A framework physical map for peach, a model Rosaceae species. 2008, 745-756. 4
- Schmutz J, Cannon SB, Schlueter J, Ma J, Mitros T, Nelson W, Hyten DL, Song Q, Thelen JJ, Cheng J, et al: Genome sequence of the palaeopolyploid soybean. Nature. 2010, 463 (7278): 178-183. 10.1038/nature08670.PubMedView Article
- Wu C, Sun S, Nimmakayala P, Santos FA, Meksem K, Springman R, Ding K, Lightfoot DA, Zhang HB: A BAC- and BIBAC-based physical map of the soybean genome. Genome Res. 2004, 14 (2): 319-326. 10.1101/gr.1405004.PubMed CentralPubMedView Article
- Figueira A, Janick J, Goldsbrough P: Genome Size and DNA Polymorphism in Theobroma-Cacao. J Amer Soc Hort Sci. 1992, 117 (4): 673-677.
- Couch JA, Zintel HA, Fritz PJ: The genome of the tropical tree Theobroma cacao L. Mol Gen Genet. 1993, 238: 123-128.
- Kramer EM: Aquilegia: A New Model for Plant Development, Ecology, and Evolution. Annual Review of Plant Biology. 2009, 60: 261-277. 10.1146/annurev.arplant.043008.092051.PubMedView Article
- Kramer EM, Hodges SA: Aquilegia as a model system for the evolution and ecology of petals. Phil Trans R Soc B. 2010, 365: 477-490. 10.1098/rstb.2009.0230.PubMed CentralPubMedView Article
- Hodges SA, Derieg NJ: Adaptive radiations: From field to genomic studies. Proc Natl Acad Sci USA. 2009, 106: 9947-9954. 10.1073/pnas.0901594106.PubMed CentralPubMedView Article
- Rounsley S, Marri PR, Yu Y, He R, Sisneros N, Goicoechea JL, Lee SJ, Angelova A, Kudrna D, Luo M, et al: De Novo Next Generation Sequencing of Plant Genomes. Rice. 2009, 2 (1): 1939-8425.View Article
- Sambrook J, Fitsch EF, Maniatis T: Molecular Cloning: A Laboratory Manual. 1989, Cold Spring Harbor, NY: Cold Spring Harbor Press
- RepeatMasker. [http://www.repeatmasker.org]
- Jurka J, Kapitonov VV, Pavlicek A, Klonowski P, Kohany O, Walichiewicz J: Repbase Update, a database of eukaryotic repetitive elements. Cytogenet Genome Res. 2005, 110 (1-4): 462-467. 10.1159/000084979.PubMedView Article
- Gordon D, Abajian C, Green P: Consed: a graphical tool for sequence finishing. Genome Res. 1998, 8 (3): 195-202.PubMedView Article
- Benson G: Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 1999, 27 (2): 573-580. 10.1093/nar/27.2.573.PubMed CentralPubMedView Article
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25 (17): 3389-3402. 10.1093/nar/25.17.3389.PubMed CentralPubMedView Article
- Lazo GR, Lui N, Gu YQ, Kong X, Coleman-Derr D, Anderson OD: Hybsweeper: a resource for detecting high-density plate gridding coordinates. Biotechniques. 2005, 39 (3): 320-10.2144/05393BM05. 322, 324PubMedView Article
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.