A clone-free, single molecule map of the domestic cow (Bos taurus) genome

Background The cattle (Bos taurus) genome was originally selected for sequencing due to its economic importance and unique biology as a model organism for understanding other ruminants, or mammals. Currently, there are two cattle genome sequence assemblies (UMD3.1 and Btau4.6) from groups using dissimilar assembly algorithms, which were complemented by genetic and physical map resources. However, past comparisons between these assemblies revealed substantial differences. Consequently, such discordances have engendered ambiguities when using reference sequence data, impacting genomic studies in cattle and motivating construction of a new optical map resource--BtOM1.0--to guide comparisons and improvements to the current sequence builds. Accordingly, our comprehensive comparisons of BtOM1.0 against the UMD3.1 and Btau4.6 sequence builds tabulate large-to-immediate scale discordances requiring mediation. Results The optical map, BtOM1.0, spanning the B. taurus genome (Hereford breed, L1 Dominette 01449) was assembled from an optical map dataset consisting of 2,973,315 (439 X; raw dataset size before assembly) single molecule optical maps (Rmaps; 1 Rmap = 1 restriction mapped DNA molecule) generated by the Optical Mapping System. The BamHI map spans 2,575.30 Mb and comprises 78 optical contigs assembled by a combination of iterative (using the reference sequence: UMD3.1) and de novo assembly techniques. BtOM1.0 is a high-resolution physical map featuring an average restriction fragment size of 8.91 Kb. Comparisons of BtOM1.0 vs. UMD3.1, or Btau4.6, revealed that Btau4.6 presented far more discordances (7,463) vs. UMD3.1 (4,754). Overall, we found that Btau4.6 presented almost double the number of discordances than UMD3.1 across most of the 6 categories of sequence vs. map discrepancies, which are: COMPLEX (misassembly), DELs (extraneous sequences), INSs (missing sequences), ITs (Inverted/Translocated sequences), ECs (extra restriction cuts) and MCs (missing restriction cuts). Conclusion Alignments of UMD3.1 and Btau4.6 to BtOM1.0 reveal discordances commensurate with previous reports, and affirm the NCBI’s current designation of UMD3.1 sequence assembly as the “reference assembly” and the Btau4.6 as the “alternate assembly.” The cattle genome optical map, BtOM1.0, when used as a comprehensive and largely independent guide, will greatly assist improvements to existing sequence builds, and later serve as an accurate physical scaffold for studies concerning the comparative genomics of cattle breeds. Electronic supplementary material The online version of this article (doi:10.1186/s12864-015-1823-7) contains supplementary material, which is available to authorized users.


Background
Cattle are the most common type of large domesticated animals and have consequently played an important role in recent history of humankind since their domestication 8,000 to 10,000 years ago [1]. Cattle have enhanced human civilizations through their varied uses as livestock for meat, milk, and draft power. Accordingly, there are~1.3 billion cattle in the world today providing a significant source of nutrition and livelihood to the human population. Domestic cattle comprise more than 800 breeds and are grouped taxonomically into two species-Bos taurus (taurine) and B. indicus (indicine)-which were evolved from the ancestral species of B. primigenius. Given this large and venerable resource of cattle breeds, cattle research efforts have also greatly contributed to our knowledge of genetics, endocrine function, fertilization, growth, lactation and mammalian biology. As such, there are still many unsolved questions regarding cattle adaptation to diverse terrestrial environments since domestication that center on how cattle convert low-grade forage to energy-rich fat, milk and meat, and, more fundamentally, how genetic underpinnings define economically important traits. The cattle genome was originally selected for sequencing due to its unique biology and economic importance, virtues that are also strengthened by its role as a model organism for understanding other ruminants, or mammals.
The Bovine Genome Sequencing and Analysis Consortium published the first draft sequence for the Bos taurus genome in 2009-a sizable effort costing $53 million and involving nearly 300 investigators from 25 countries [2,3]. The initial sequence assembly (Btau4.0) was constructed by the Baylor College of Medicine Human Genome Sequencing Center using~7.1-fold Sanger sequencing coverage of the genome. Their genome assembly approach combined a BAC (Bacterial Artificial Chromosome) clone-by-clone approach with whole genome shotgun (WGS) reads, and yielded an N50 contig size of 48.7 Kb and a N50 scaffold size of 1.9 Mb (Btau4.0; 135,743 contigs; 13,388 scaffolds; total mass: 2.77 Gb). 89 % of these assembled contigs and scaffolds were anchored onto the 29 bovine autosomes and the X chromosome based on the integrated FPC physical map [4], which combined a series of complementary mapping resources: 290,797 fingerprinted BACs, the humancattle comparative map, the genetic map, and the radiation hybrid (RH) map [2][3][4][5][6][7][8][9][10]. The Center for Bioinformatics and Computational Biology, University of Maryland, using a different strategy, constructed another bovine assembly in 2009 based on the same raw sequence and map data (UMD2; 44,433 contigs; total mass: 2.86 Gb; contig N50: 93.56 Kb). Their strategy leveraged paired-end BAC sequence information, mapping data and, most notably, syntenic relationships to the human genome that allowed 91 % of the UMD2 contigs to be anchored to bovine chromosomes, based on the integrated bovine genome map [4,11]. Comparisons between these two assemblies revealed substantial differences that appear as assembly errors, genome segmental inversions, chromosomal placements, sequence gap numbers, and discrepancies of the sequence coverage across the bovine genome [11][12][13]. Two updated bovine genome sequence assemblies (Btau4. 6 and UMD3.1) were released from these groups featuring additional BAC sequence data, corrected assembly errors and additional gap filling. Although comprehensive analyses of these recent releases have yet to be done, significant differences between these updated assemblies are generally expected to be encountered. Indeed, this article reports on notable disparities. Consequently, discrepancies between these assemblies engender ambiguities when using reference sequence data, which significantly impacts almost any type of genomic study in cattle.
The cattle genome, as discussed, enjoys a broad range of map resources that include: genetic linkage maps using microsatellite markers; markers comprising AFLP, EST, and BAC end sequences; a radiation hybrid map, and a BAC physical map [4,5,7,9,[14][15][16][17]. Despite this, these resources fall a bit short in several ways. The genetic linkage and radiation hybrid maps lack sufficient levels of unambiguous markers, but, more troubling, the linkage map is a composite constructed across many separate bovine populations and thus doesn't reflect a single bovine genome. The bovine BAC physical map is a composite map that was constructed from three different BAC libraries developed from three different cattle breeds (Hereford CHORI-240, Holstein RPCI-42, and Angus TAMBT) [4,7]. Understandably, such haplotype and/or breed-specific variability in these map resources could translate into ambiguities evidenced by sequence-map comparisons, which may have impacted the fine-scale assembly, or previous validations of the bovine reference sequence.
We constructed a comprehensive optical map spanning the bovine genome, using genomic DNA from just one animal (L1 Dominette 014490; the same Hereford animal that was originally sequenced) in order to circumvent this array of issues. This new resource will provide the bovine community with a highly accurate and comprehensive physical map that enables direct and independent comparisons amongst sequence builds, with goals pointed at sequence finishing and discovery of genomic differences. Briefly, Optical Mapping is a single-molecule system that constructs high-resolution physical scaffolds, covering entire genomes to guide many stages of genome sequence assembly and validation [18][19][20][21][22][23][24]. Since it assembles genome-wide ordered restriction maps from massive datasets comprising randomly sheared genomic DNA molecules (~400 kb), artefacts associated with cloning and amplification are completely obviated. Furthermore, very long DNA molecules span complex genomic regions that are rife with repeats that generally hinder accurate sequence assembly without Optical Mapping analysis. As such, our optical map offers an uniquely effective means for resolving and mediating the differences between the two different bovine genome sequence assemblies in several ways: 1) recruiting new orphan sequence contigs that fill sequence gaps; 2) providing an independent resource that potentiates finishing through sequence gap characterization, and 3) enabling independent validations of sequence assemblies.

Optical map dataset
Genomic DNA was prepared from L1 Dominette 014490 blood samples, after separation of white blood cells, and then BamHI restriction mapped using our Optical Mapping pipeline (Materials and Methods). This raw dataset holds 1,908,396 Rmaps (1 Rmap = 1 single molecule restriction map) ≥ 300 Kb, with an average size of 397.49 Kb (300-2,515.20 Kb) and a total mass of 758,574.97 Mb (~270 X coverage, before alignment, assuming a~2.8 Gb genome). One Rmap is the restriction map of a single genomic DNA molecule; it represents the most fundamental unit of map data in functional ways akin to a sequence read. A second map dataset was contributed by Prof. Juan F. Medrano and after size filtering (≥300 kb) it added another 1,064,919 Rmaps, bringing the total raw dataset up to 2,973,315 (439 X coverage, before alignment).
Initial evaluation of the genome builds UMD3.1 and Btau4.6 via pairwise alignments of the Rmap dataset The UMD3.1 and Btau4.6 references were first evaluated for large-scale errors by inspection of the pairwise alignments [25,26] of the entire Rmap dataset against BamH1 in silico restriction maps (constructed in the computer) of both sequence builds (Materials and Methods). These map vs. reference alignments produce files, similar to sequence SAM/BAM files, which note the location of each aligned Rmap (Additional file 1: Figure S1 and Additional file 2: Figure S2). Such alignments allow us to quickly filterout marginal Rmaps from the raw dataset and provide an initial assessment of the completeness of a given sequence build [27]. The average Rmap coverage after alignment varies considerably between the two builds, with 42 X for Btau4.6, while UMD3.1 boasts 70 X. Additional file 1: Figure S1 and Additional file 2: Figure S2 also show a specific example of disparate rates of Rmap coverage, focusing on a 3.3 Mb region on chromosome 8, highlighted by a green box, where 8 Rmaps (~1 X coverage) are aligned to Btau4.6, compared to 527 Rmaps (~64 X) aligned to UMD3.1. Given these vastly different overall alignment rates and patterns, we chose UMD3.1 to serve as our reference sequence build for assembling the optical map.

Optical map assembly
Our optical map assembly strategy used a two-pronged approach involving iterative assembly, requiring a sequence reference [26], and de novo assembly for dealing with largescale discordances (sequence vs. map) and gaps in the UMD3.1 build (Fig. 1). Many of these problematic regions are sparsely populated by Rmaps as evidenced from inspection of Additional file 1: Figure S1 (see regions highlighted by purple boxes). Accordingly, the workflow (Fig. 1) shows how iterative assembly selectively shunts uncontiged Rmaps, mostly originating from these problematic regions, into "Germinate and Grow" (G & G) for de novo assembly. Resulting optical map contigs from both sides of the workflow were then curated and combined for finishing the optical map. Details follow in the next two subsections.

Iterative assembly
We published a workflow in 2010 [26], termed "iterative assembly" (Fig. 1), which embedded genome assembly algorithms [28][29][30][31], originally designed to deal with small bacterial or fungal genomes, within a new pipeline. This pipeline supports the assembly and analysis of large mammalian and plant optical maps by distributing the computation into large numbers of independent jobs that can be executed on a high-throughput computing network. Briefly, iterative assembly uses an in silico restriction map of available genome sequence resources-contigs, scaffolds, pseudomolecules, etc.-as a reference for exhaustive pairwise alignment [25] of entire Rmap datasets. Both sequence data (UMD3.1) and actual genomic DNA molecules (Rmaps) are "cut" with the same restriction enzyme. Thusly placed Rmaps, termed "piles," covering an entire genome, are then divided into 1 Mb overlapping bins along each chromosome for assembly; each bin is independently assembled into contigs. Each optical contig bears a consensus map, which now becomes the updated, independent reference; sequence information is no longer used in the assembly process. Repeatedly iterating this workflow increases optical contig length, number and depth.
Eight iteration cycles were performed using a BamH1 in silico map constructed from the UMD3.1 sequence build as the initial reference and with a minimum depth of 20 Rmaps. 3,048 contigs emerged after the first iteration ranging 404-2,943 Kb in size; averaging 1,826 Kb. However, after 8 iterations the number of contigs increased to 3,321, and their average span was boosted to 3,545 Kb (421-6,456 Kb). Contigs presenting very long tandem repeats were removed from this process. These 3,321 optical contigs were then grouped by chromosome, using alignments to UMD3.1 and each grouping was independently assembled into a total of just 79 contigs spanning 96.71 % of the UMD3.1 build.

de novo assembly
We have previously reported on the Map Assembler, a de novo optical map assembler capable of assembling bacterial maps [32]. However, the Map Assembler algorithm has polynomial complexity (degree >2) and exceeds feasible memory and time constraints for genomes of size >10 Mb. To face the challenge of assembling larger genomes, we've implemented Germinate and Grow (G & G), a new de novo assembly algorithm that will be fully described elsewhere. The conceptual basis for G & G is an extension of the de Bruijn graph approach to sequence assembly [33,34]. Simply put, a whole genome optical map can be represented by the traversals of a certain graph, and the assembly problem is to discover those traversals from the input data set of Rmaps. Specifically, we use geometric k-mer hashing [35] to identify nodes in the de Bruijn graph that are very likely error-free and then traverse the "read" paths implied by the Rmaps containing instances of those nodes. This traversal allows us to localize the assembly; we then use the Map Assembler on the subsets of Rmaps that are near each other on this graph. We call the resulting consensus maps seed maps. The seed maps typically cover most of the genome and they reliably approximate highly confident paths in the graph.
The seed maps are then extended and refined using the iterative assembly engine (Fig. 1), producing another set of consensus maps. The error rate for these consensus maps is sufficiently low for resolving the corresponding Euler path and assembling all but the most repetitive regions of the genome. We then fill gaps in the assembly by repeating the process, generating another set of seed maps and extending and refining them. For this set of seed maps, we use a lower stringency (smaller value of k) and use only those Rmaps not already represented in the genome reference-based iterative assembly.
We used G & G to assemble just those Rmaps (2,448,748) that escaped assembly within the iterative assembly pipeline, which yielded 1,500 optical contigs, with most recapitulating those constructed by iterative assembly. As such, these de novo optical maps were largely used to augment and cross-validate optical map assemblies constructed by iterative assembly. The final  exhaustive pairwise alignments of the complete Rmap dataset (purple lettering) against the UMD3.1 sequence reference maps (red lettering) generated in silico produces "piles" of Rmaps. Such alignments are then divided into overlapping bins (1 Mb bins; 500 Kb overlap), which are then independently assembled into updated reference maps (optical contigs bearing consensus maps) that are used for 8 subsequent cycles (blue circular arrows) of alignment (entire Rmap dataset) and assembly, all performed without using the sequence reference. Right side depicts de novo map assembly, using those Rmaps from the entire dataset not recruited for optical map contig formation during iterative map assembly, which are used to construct a de Bruijn graph via k-mer hashing. "Seed" maps are then assembled from Rmaps in each each confident node within graph and used as an optical reference for pairwise alignments (piles) of Rmaps during iterative assembly (8 times; blue circular arrows). Bottom shows the merged assembly of updated optical contigs from each bin (iterative assembly) and contigs assembled from "seed" maps (de novo assembly) into the final optical map-BtOM1.0, which was used to tabulate map vs. sequence discordances bovine optical map-termed, "BtOM1.0"-comprises 78 contigs spanning of 2,575.30 Mb across the genome (alignments to UMD3.1 are found in Fig. 4), at an average depth of 77 Rmaps and an average contig size of 33.02 Mb (659.71 Kb-140.22 Mb; Table 1). The haploid bovine genome harbors 29 acrocentric, autosomal chromosomes, and one sex chromosome, or 60 telomeric ends. Accordingly, 20/78 optical contigs (BtOMcontig_6, 8,11,16,21,23,24,34,39,40,42,45,46,47,49, 50, 55, 57, 67, 69) present sharply demarcated ends (Figs. 2a and 3; Tables 1 and 2), which indicate that they've spanned into the repetitive sequences near telomeres. The remaining 40/60 chromosome ends are not, or, are partially spanned by optical maps because the short arms of these acrocentric chromosomes are densely populated by repeats, making them intractable to our analysis. Interestingly, we find that BtOMcontig_4 has~6 Kb tandem repeats at one end, which also shows alignment to chromosome 11. In addition, BtOMcontig_2 presents tandem repeats with a repeat unit consisting of multiple BamHI fragments with a total unit mass of~290 Kb and is anchored on bovine chromosome 6 ( Fig. 2b; Table 1) The optical contigs generated by iterative and de novo map assembly ( Fig. 1) were merged through assembly of their consensus maps into 78 final optical contigs. They were then ordered and oriented, through alignment against a BamHI in silico restriction map constructed from the UMD3.1 sequence build (Fig. 3). Chromosome-wide optical maps, BtOM1.0_chr1-29 and BtOM1.0_chrX, were constructed with 500 Kb gaps inserted between any two of optical map contigs anchored on the UMD3.1 sequence assembly (BtOM1.0 available at GitHub: https://github.com/schwartz-lab/BovineGenomeOMdata/). This workflow constructed 30 chromosome-wide optical maps that were aligned to both Btau4.6 and UMD3.1 sequences using local alignment (Fig. 4). A series of contiguous restriction fragments that align between BtOM1.0 and the in silico maps of a sequence build is called a "map segment." Tabulations describing these aligned map segments are listed in Table 2. In total, 135 map segments (over 78 optical contigs) present a total of aligned sequence segment mass of 2,297.08 Mb, with an average size of 17.02 Mb, or~86 % of UMD3.1 are covered by optical maps. Map coverage of UMD3.1 ranges from 73 to 95 %. 50/60 chromosome ends, 19.14 Mb in total, within UMD3.1, are extended by optical maps (Fig. 5; Tables 1 and 2). For Btau4.6, 188 map segments align to BtOM1.0, with a total mass of 2,054.74 Mb (~78 % of Btau4.6) and an aligned map segment size averaging 10.93 Mb. The optical map coverage of Btau4.6 by optical maps is less than that tabulated for UMD3.1 and it ranges from~65 to 82 % for all 30 bovine chromosomes. Lastly, 55/60 of the Btau4.6 chromosome ends are extended by optical maps with a mass totalling 38.35 Mb (Table 2).

Discordance calling between optical maps and sequence assemblies
Discordances between BtOM1.0 and the UMD3.1 were called based on the alignments between the consensus maps that were trimmed and stripped off from the last (8 th ) cycle of iterative assembly ( Fig. 1; Methods; [26]) and then manually curated. Complex discordances required directed alignment and assembly steps, complemented by additional curation, for their complete characterization. There are, in total, 4,754 discordances called between BtOM1.0 and the UMD3.1 based on only confident alignments and these discordances are tabulated as six categories (Additional file 3: Table S1; Table 3 Similarly, Btau4.6 discordances were tabulated as just described for UMD3.1, but relied on the same optical consensus maps created from UMD3.1. These efforts identified 7,463 discordances in the Btau4.6 sequence assembly (Additional file 4: Table S2; Table 3; Additional file 5: Figure S3). Such tabulations include 102 large segments of inverted/translocated (IT) (involving 61.65 Mb; Fig. 7a), 2,331 COMPLEX-complex events/misassembly/ inversion (involving 273.14 Mb; i.e., Fig. 7b and c)

Discussion and Conclusions
A whole genome optical map, BtOM1.0, of the B. taurus Hereford breed, L1 Dominette 01449 was constructed using the same animal employed for whole genome shotgun sequencing, which was also the daughter of the Hereford bull L1 Domino (registration number 41170496) Kb, and such "marker" density is far greater than the bovine genetic map (∼7,000 markers [14]) and the composite map (∼17,000 markers) which combines linkage and radiation hybrid resources [8,9]. The size of the B. taurus genome, estimated by Optical Mapping, is similar to those  Comparisons of Btau4.6 and UMD3.1 to BtOM1.0 also revealed large-scale difference between these sequence assemblies. Such issues became most apparent through our analysis of optical map alignments at telomeric regions, or ends of chromosomes. Table 2 shows that Btau4.6 is missing more sequence at chromosome ends (38.35 Mb) as compared to UMD3.1 (19.14 Mb). Also, the Btau4.6 assembly of the X chromosome excluded~60 Mb of sequence relative to BtOM1.0 (Fig. 8). Therefore, our comparative analysis results of UMD3.1 and Btau4.6 based on alignments to BtOM1.0 are in line with previous reports [11,12], and affirm the NCBI's current designation of UMD3.1 sequence assembly as the "reference" assembly and the Btau4.6.1 assembly as the "alternate" assembly [36].
There are numerous sequence gaps in the two B. taurus genome sequence assemblies (74,425 in UMD3.1 and 66,276 in Btau4.6). However, most of the sequence gaps are small, in that there are only 606 sequence gaps ≥ 2 kb in UMD3.1 and 5,450 in Btau4.6. Importantly, greater than 96 % of these gaps in UMD3.1 and Btau4.6 were successfully bridged by BtOM1.0 (584 and 5401, respectively). Accordingly, this analysis begs the question: Are the discordances called in UMD3.1 and  Fig. 4 Genome-wide alignment of BtOM1.0 to UMD3.1 and Btau4.6 sequence assemblies. Tracks show BtOM1.0 (center) alignments to UMD3.1 (top) and Btau4.6 (bottom) for each chromosome. Red highlights (center track) restriction fragments aligned to both in silico maps of UMD3.1 and Btau4.6; cyan (center track) highlights alignment of BtOM1.0 to UMD3.1, or Btau4.6; and white or black highlights no alignments to either sequence build. Inset (green box) shows a zoomed view of chr27 detailing a large-scale discordance with the optical map: an inverted sequence assembly within Btau4.6. Also note transposed sequence assemblies, flagged by black lines running between separate chromosomes Btau4.6, through alignments to BtOM1.0, largely due to sequence gaps inserted into the two sequence assemblies? We explored this question by intersecting the sequence gap and discordance coordinates from both sequence assemblies, and identified within UMD3. Thus, 27.6 % of the large sequence gaps in UMD3.1 contribute to only 3.4 % of the discordances called in UMD3.1, while 84.2 % of the large sequence gaps in Btau4.6 are responsible for 50.8 % of the called discordances in this assembly. As such, this simple analysis further substantiates the superior quality of UMD3.1 vs. Btau4.6, which in part, is reflected by the high rate of parsimoniously inserted sequence gaps.
Our systematic tabulation and curation of discordances found through comparison of BtOM1.0 vs. UMD3.1, or Btau4.6 will greatly facilitate future improvements of B. taurus genome sequence assemblies in order to build a more accurate and unified version of the reference sequence. Because BtOM1.0 was constructed from DNA derived from the very same animal that was sequenced, this physical map provides direct comparisons to these other resources that are not affected by genotype differences manifested by other breeds, or even animals of the same breed. Although there are many map resources available for the B. taurus genome, which include genetic linkage maps [5, 6, 14-17, 37, 38], radiation hybrid maps [8,9,39], BAC physical maps [4,7], cytogenetic maps [40,41] and comparative maps between cattle and human [10,42,43], the resolution of these maps can be modest. Consider that the B. taurus composite map of integrated linkage/radiation hybrid maps [9,39] and BAC physical maps [4,7] features the greatest number of markers (17,254 markers), but with a density of only~180 kb/marker. In comparison, BtOM1.0 boasts an average restriction site density of 8.91 Kb, which fostered resolution of difficult-to-discern errors in sequence assembly. For example, Fig. 9 shows a 79 Kb region that was inverted and misplaced based on alignment to BtOM1.0, which was also substantiated by new sequence data and PCR.
During the course of writing this manuscript, a reviewer questioned the extent of map errors and possible biases that may be introduced through our selective use of UMD3.1 as the reference genome for BtOM1.0. Although our previous publications report a high degree of accuracy and minimal biases stemming from the iterative assembly pipeline [22,23,26], we compared the iterative assembly of optical maps constructed from UMD3.1 vs. Btau4.6 using chromosomes 27 and 28 (Fig. 10). Using Btau4.6 as the reference sequence, eight iterations (Fig. 1) and merging of optical contigs produced three optical map contigs for chromosome 27 and a single optical map contig was derived for chromosome 28. Alignments show that these new maps are essentially identical to BtOM1.0 except for a few restriction site differences (2 extra cuts, 5 missing  Fig. 5 Examples of optical map coverage within telomeric, or sub-telomeric regions. a: BtOM1.0_chr11 aligned to chr11 of both UMD3.1 and Btau4.6. Lines show BamHI restriction sites with red highlighting those BtOM1.0_chr11 track (center) restriction fragments aligned to both UMD3.1_chr11 and Btau4.6_chr11. White highlights unaligned BtOM1.0_chr11 restriction fragments and extend 517 kb past both the UMMD3.1_chr11 and Btau4.6_chr11 sequence. b: BtOM1.0_chr22 aligned to chr22 of both UMD3.1 and Btau4.6. (Same color scheme as a.) The white region on BtOM1.0_chr22, is unaligned to both sequence builds and extends~625 kb past UMD3.1_chr22 cuts for chromosome 27; 4 missing cuts for chromosome 28). We attribute these minor differences to heterozygosity, since our calling of discordances uses a single representation of the physical map created by Optical Mapping. However, over an entire chromosome multi-Mb-scale differences are apparent. Fig. 10b shows three optical map contigs aligned to Btau4. In comparison, chromosome 28 shows a single optical map contig (43.20 Mb) generated from the Btau4.6 sequence as the starting reference, while two optical map contigs (contig1, 42.84 Mb; contig2, 3.01 Mb; Figs. 3 and 10) formed using the UMD3.1 sequence for chromosome 28 (Fig. 10a). The absence of the small con-tig2 (3.01 Mb) from the Btau4.6 derived optical contigs implies reduced coverage for chromosome 28. As such, our analysis shows that BtOM1.0 bears minimal local biases stemming from the choice of sequence build used for iterative assembly, but the overall optical map coverage varies. Fortunately, absent, or problematic genomic regions would then be covered, as required, by optical maps constructed by de novo techniques (Fig. 1). Consequently, the need for de novo assembly steps is minimized by judicious selection of a reference genome for iterative assembly of an optical map.
We conclude that BtOM1.0 will prove to be a valuable resource for advancing the state of current sequence assemblies, by serving as a largely independent physical scaffold, as shown in Figs. 5, 6, 7, 8, 9 and 10, but perhaps, more importantly, as a platform to support future comparative studies, focusing on structural variation amongst different cattle breeds, or within populations. Lastly, errors always accompany any ambitious effort pointed at comprehensive analysis of entire genomes. Accordingly, the true merits and accuracy of a new resource, such as BtOM1.0, will be comprehensively assessed over time by individual researchers in the bovine community.   [44] (1 million cells/ml) and lysed in modified NDSK (1 mg/ml proteinase K, 1 % lauroylsarcosine, 0.5 M EDTA, 1 M NaCl, pH 8.0) at 50°C for two overnights with one switch of fresh NDSK solution after the first overnight; HMW DNA was extracted from prepared inserts for optical mapping as previously described [20,24,45].

Optical mapping
Optical mapping surfaces were prepared as previously described [19,21,45,46]. Briefly, glass cover slips ( Fig. 7 Examples of large-scale map vs. sequence discordances (IT discordances). a: GPS (Methods) viewer shows the BtOM1.0_chr1 optical map (center track) alignments to UMD3.1_chr1 and Btau4.6_chr1 revealing 759 Kb translocated region (UMD3.1) and a 1,284 Kb inversion (Btau4.6). Color scheme is the same as Fig. 5, with cyan highlighting BtOM1.0_chr1 portions aligning to UMD3.1_chr1 or Btu4.6_chr1. b:~1 Mb region of chromosome 17: the in silico maps (blue track) of UMD3.1_chr17 perfectly align to the BtOM1.0_chr17, but Btau4.6_chr17/BtOM1.0_chr17 alignment reveals two misassembled regions in the Btau4.6 sequence build: an inversion (blue and gold arrows) and extraneous sequence. Restriction fragments (undulating boxes) are color keyed as: gold (agreement with blue track, or reference); red (extra cuts in optical map); cyan (missing cuts in optical map); and purple (compound events). c:~500 Kb region on chromosome 19: alignment of BtOM1.0_chr19 and Btau4.6_chr19 (blue track) suggest that the Btau4.6_chr19 sequence (blue arrow) was probably misplaced and should be inverted and placed nearby. BtOM1.0_chr19 and UMD3.1_chr19 (blue track) alignment suggests that sequence here also presents misassemblies similar to Btau4.6_chr19 and sonicated until the pH of the wash reached 6.0 within 30 min, and then washed with ethanol twice with sonication. Cleaned glass cover slips were derivatized using trimethyl silane: (N-trimethoxysilypropyl-N,N,N-trimethylam monium chloride; 130 μl) and vinyl silane: (vinyltris(trimethysiloxy)silane); 15 μl) in 250 ml in distilled water to confer a positive charge and provide chemical moieties for covalent bonding of the acrylamide overlay to the surface.

DNA mapping, image acquisition, and processing
Bovine genomic DNA molecules (~400-500 kb) were premixed with lambda DASH II bacteriophage DNA When only one end is mapped, or mapping shows wrong orientation, or revealing discordant distances between mapped read pairs, these events are termed broken pairs. Reads of a broken pair that map to a unique location against the reference are colored green or red, according to whether they mapped in the forward, or reverse, orientation respectively (Stratagene, La Jolla, CA) as an internal sizing standard and then deposited on optical mapping surfaces using a silicone microchannel device [24]. A fully automated image acquisition microscope workstation (GenomeZephyr) with Mightex LED illumination (San Francisco, CA) acquired image data that was automatically processed by machine vision, within a pipeline, which compiled large files comprising ordered restriction maps for each imaged molecule (Rmap) [24].

Optical map assembly
Previous work [22] had confirmed that iterative assembly, which relies on a sequence reference map, constructs unbiased optical maps that are essentially equivalent to those crafted by a de novo method using a "divide and conquer" approach [22,23,26,46]. Iterative assembly simply uses the reference sequence for anchoring Rmaps, which are then independently assembled in to optical contigs (Fig. 1). These newly assembled optical contigs become the updated reference for 8 cycles of alignment and assembly, which increase their breadth and depth. All accomplished without use of the sequence reference map. Accordingly, if a sequence reference suffers from many misassemblies, or gaps, de novo approaches are used to assemble across such regions. Because of sequence assembly issues, the B. taurus optical map incorporates these two assembly strategies for efficient and comprehensive  Fig. 10 Comparisons of optical maps assemblies seeded by UMD3.1, or Btau4.6. (Cyan tracks show in silico restriction maps generated from sequence; Orange tracks show optical maps. Striated black, or white-filled boxes flag unaligned map fragment(s). Black vertical lines demarcate restriction fragments, or map alignments.) a: Alignments of BtOM1.0_chrs 27, 28, against the in silico maps of UMD3.1-chrs 27 and 28 (also see Fig. 3). b: Iterative assembly results for chrs 27 and 28 using Btau4.6 as the reference sequence. Note large inversion (far right), not present in UMD3.1, or optical maps, revealed within the Btau4.6 chr27 sequence. c: Alignment of the optical maps shown in (a) and (b). Inset shows 3.5 Mb region; vertical lines locate BamHI restriction sites. Pink fragment (500.0 Kb) shows the gap between the two optical map contigs in BtOM1.0 chromosome 28 map assembly, which used our G & G algorithm for de novo assembly.
We first used reference-based iterative map assembly and then removed the Rmaps in these assembled optical map contigs from the whole Rmap dataset, the leftover or the uncontiged Rmaps were for de novo map assembly via G & G. The combined map assembly strategy (Fig. 1) ensured the completeness of the final optical maps by maximizing the recovery of optical contigs from genomic regions not covered, or from heavily misassembled sections in the reference maps (UMD3.1 and Btau4.6).

Construction of chromosome-wide maps
Eight cycles of iterative assembly, using the UMD3.1 in silico map as the reference, produced thousands of overlapping optical contigs. Consensus maps of these contigs were merged using the map assembler into large-scale optical maps (Fig. 3). These large-scale optical maps were then further augmented and refined through additional merging operations using optical consensus maps generated from de novo assembly. After alignment to the UMD3.1 in silico BamHI restriction map, they were manually joined into chromosomewide optical maps and viewed using GnomSpace -a map-centric genome viewer that facilitates inspection of alignments.
Calling discordances between the in silico maps of sequence assemblies and optical maps As previously described [26] the iterative assembly pipeline automatically calls discordances, or structural variants using a reference map (UMD3.1, or Btau4.6. 5) classified as: (1)  Very large scale, or complex discordances involving apparent translocations of sequence assemblies between chromosomes required manual intervention. These discordances were flagged as ITs (Inverted or Translocated sequences) and curated using map viewers developed in our group: GPS (unpublished work) and GnomSpace [26].

Genome viewer: GPS
Genome Polysemy and Synonymy (GPS; unpublished) is a visualization platform for the analysis of alignments between optical maps, optical contigs, and in silico restriction maps created from sequence data. The software takes an xml file consisting of several optical maps and their alignments, and converts them into an interactive graphical representation using Scalable Vector Graphics (SVG). The SVG engine within GPS enables users to zoom in/out, pan, arbitrarily position optical maps, or contigs, and highlight selected features in ways designed to greatly enhance visual analysis of alignments. Such advantages allow users to more fully understand compound events involving translocations, inversions, and frank aberrations, or discordances. GPS visualization capabilities are based on an Open Source SVG manipulating library called Apache Batik (http://xmlgraphics.apache.org/batik/), and the last version of Java (1.8). One of the most useful advantages of the software is its ability to efficiently process and render very large map alignments within sizable and complex genomes (~3 Gb). GPS source code is accessible here: https://github.com/schwartz-lab/genome-polysemyand-synonymy

DNA sequencing
One lane of 150 bp PE Illumina sequencing was performed from blood extracted genomic DNA from Dominette L1 014490 to generate 515 million reads (the SRA archive number in NCBI: SRP05124). Reads were mapped to assembly UMD3.1 using CLC Bio Genomics Workbench software (CLC Bio, Aarhus, Denmark; 85 % of the reads mapped to UMD3.1) using the following settings: mismatch cost = 2; linear gap cost for insertions and deletions = 3; length fraction = 0.6; similarity fraction = 0.9; auto detect pair distance and ignore non-specific matches.

Ethics statement
The bovine blood sample used is the property of the ARS USDA, therefore, no specific permits were required for the described studies.