Bread wheat is one of the most important food crops worldwide. However, present wheat production falls far short of the increased global demand expected in the near future [1, 2]. Development of higher-yielding varieties with improved adaptation to new climatic challenges is therefore important for global food security. A ‘tool’ with great potential to revolutionize wheat breeding and production is a publicly available reference genome sequence. Genome sequences enable cost-effective identification of genomic variation, which can subsequently be used to improve agricultural traits of interest through marker-assisted selection (MAS) and genomic selection programs. A rapidly increasing number of genomes from important food crops are becoming available: the potato and cacao genomes were published in 2011 [4, 5], the soybean genome in 2010, and the maize, sorghum and cucumber genomes in 2009 [7–9]. However, even though wheat is one of the top five food commodities in the world, a wheat genome sequence is not yet available.
The main reason why wheat genome sequencing is lagging behind is the technical challenge posed by the large size (17 Gb) and the complexity of the hexaploid wheat genome. Bread wheat is allohexaploid and carries three distinct but closely related homoeologous genomes (2n = 6x = 42, AABBDD) [10, 11]. Distinguishing between homoeologous sequences during post-sequencing processing of genomic sequence data is essentially impossible. Fortunately, the hexaploid wheat genome can be dissected into small parts by flow cytometric sorting of single chromosomes and chromosome arms [12, 13]. This technological breakthrough has enabled the production of wheat chromosome-specific BAC libraries and facilitated the construction of physical maps of hexaploid wheat chromosomes. For some genomic applications, such as shotgun sequencing, large amounts of DNA are required. To obtain sufficient DNA to sequence purified chromosome arms, millions of chromosomes must be sorted, a process that is highly labor intensive. Including an amplification step for flow-sorted DNA can significantly reduce the labor, and consequently the cost, of acquiring chromosome-specific DNA for sequencing. Multiple displacement amplification (MDA) is the most common method of genome amplification for sequencing purposes because it generates relatively long amplification products (the majority between 5 and 20 kb). However, MDA is known to give rise to chimeras, which can reduce the utility of the amplified DNA.
Shotgun sequencing of MDA-amplified DNA from flow-sorted chromosome arms, especially in combination with genetic maps and synteny information, has proven to be a highly cost-effective approach to gene discovery and construction of syntenic chromosome assemblies [19–21]. Unfortunately, the fragmentation level of the shotgun assemblies has been very high, which limits their information value. De novo assemblies of chromosome arms 7DS and 7BS from Illumina paired-end (PE) sequences at 30-34× chromosome arm coverage resulted in roughly 600,000-1,000,000 contigs per chromosome arm, an N50 of ~500-1200 bp, and maximum contig sizes of just over 30,000 bp [21, 22]. Consequently, many contigs do not contain complete gene sequences, and the relative order of genes can be determined only for the small subset of genes found on contigs containing multiple genes (i.e. multigene contigs).
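For readers less familiar with the N50 statistic quoted above, it can be computed as in the following minimal sketch. The function name `n50` and the toy contig lengths are illustrative assumptions and are not taken from the cited assemblies; the sketch uses the common definition in which N50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (in bp).

    Contigs are visited from longest to shortest; the N50 is the length
    of the contig at which the running total first reaches at least
    half of the summed length of all contigs.
    """
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy example (hypothetical lengths, not real assembly data).
toy_lengths = [100, 200, 300, 400, 500]
print(n50(toy_lengths))  # 400
```

An assembly dominated by many short contigs therefore has a low N50 even if a few long contigs are present, which is the pattern described for the 7DS and 7BS assemblies.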
High levels of DNA sequence assembly fragmentation are closely associated with the repeat content of the genome, and the wheat genome is extreme in this respect, having more than 80% repetitive DNA. One way of reducing assembly fragmentation is to include additional sequencing libraries with large insert sizes, referred to as mate pair (MP) libraries. MP insert sizes can vary between 1 and 20 kb, and the idea of these ‘long jump’ paired sequences is to span the repetitive regions that cause assembly fragmentation and thereby link multiple contigs into longer scaffolds. This improves the information value of an assembly by (1) improving assembly contiguity, (2) increasing the proportion of full-length genes contained in single sequences (i.e. linking exons from different contigs), and (3) increasing the number of linearly ordered genes.
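The linking principle can be illustrated with a deliberately simplified sketch. It assumes that each MP read pair has already been mapped and reduced to the pair of contig names its two reads hit, and it only groups contigs that share enough supporting pairs. Real scaffolders also use read orientation and insert size to order and orient contigs, which is omitted here. The function name `group_contigs`, the `min_links` threshold and the toy data are hypothetical.

```python
from collections import Counter


def group_contigs(contigs, mate_pairs, min_links=2):
    """Group contigs that are joined by at least `min_links` mate pairs.

    `mate_pairs` is an iterable of (contig_a, contig_b) tuples, one per
    read pair; every name must also appear in `contigs`. Returns a sorted
    list of sorted contig groups (unordered, unoriented proto-scaffolds).
    """
    # Count read pairs whose two reads map to different contigs.
    link_counts = Counter()
    for a, b in mate_pairs:
        if a != b:
            link_counts[tuple(sorted((a, b)))] += 1

    # Union-find structure: every contig starts as its own group.
    parent = {c: c for c in contigs}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path compression
            c = parent[c]
        return c

    # Merge groups only for sufficiently supported links.
    for (a, b), n_links in link_counts.items():
        if n_links >= min_links:
            parent[find(a)] = find(b)

    groups = {}
    for c in contigs:
        groups.setdefault(find(c), []).append(c)
    return sorted(sorted(g) for g in groups.values())


# Toy example: c1-c2 and c2-c3 each have two supporting pairs,
# c3-c4 has only one and is therefore not joined.
toy_contigs = ["c1", "c2", "c3", "c4"]
toy_pairs = [("c1", "c2"), ("c2", "c1"), ("c2", "c3"),
             ("c3", "c2"), ("c3", "c4"), ("c1", "c1")]
print(group_contigs(toy_contigs, toy_pairs))
# [['c1', 'c2', 'c3'], ['c4']]
```

Requiring more than one supporting pair per link is a common precaution against spurious joins, which is particularly relevant when chimeric fragments may be present in the library.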
A number of recent publications describe the effect of MP data on assemblies of plant genomes [4, 9, 25]. One example is the potato genome assembly, for which N50 increased on average by 37 kb for every 1 kb increase in MP insert size. Although the potato genome (1C = 865 Mbp) has a relatively high repeat content (total repeat content ≈ 62%, TE-derived repeats ≈ 32%), it does not compare to the hexaploid wheat genome (1C = 17,000 Mbp), which has >80% TE-derived repetitive DNA. It is thus not clear to what extent MP data may improve shotgun assemblies of genomes with extreme repeat content such as wheat. Additionally, the utility of MP data from MDA-amplified DNA from flow-sorted chromosomes is unknown. The aim of this paper is therefore to study the effects of MP data from MDA-amplified DNA on assembly contiguity and gene content in shotgun assemblies of a flow-sorted hexaploid wheat chromosome.