Identification of recombination events in outbred species with next-generation sequencing data

Background Meiotic recombination events include crossovers and non-crossovers or gene conversions. Although the rate of crossovers is often used for genetic mapping, the gene conversion events are not well studied especially in outbred species, which could produce distorted markers and thus affect the precision of genetic maps. Results We proposed a strategy for identifying gene conversion events in Populus with the next-generation sequencing (NGS) data from the two parents and their progeny in an F1 hybrid population. The strategy first involved phasing the heterozygous SNPs of the parents to obtain the parental haplotype blocks by NGS analytical tools, permitting to identify the parental gene conversion events with progeny genotypes. By incorporating available genetic linkage maps, longer haplotype blocks each corresponding to a chromosome can be created, not only allowing to detect crossover events but also possibly to locate a crossover in a small region. Our analysis revealed that gene conversions are more abundant than crossovers in Populus, with a higher probability to generate distorted markers in the regions involved than in the other regions on genome. The analytical procedures were implemented with Perl scripts as a freely available package, findGCO at https://github.com/tongchf/findGCO. Conclusions The novel strategy and the new developed Perl package permit to identify gene conversion events with the next-generation sequencing technology in a hybrid population of outbred species. The new method revealed that in a genetic mapping population some distorted genetic markers are possibly due to the gene conversion events. Electronic supplementary material The online version of this article (10.1186/s12864-018-4791-x) contains supplementary material, which is available to authorized users.


Background
Linkage mapping plays an important role in genetic analysis, especially in the context of quantitative trait locus (QTL) identification [1,2], comparative genomics [3,4], and genome scaffold sequence assembly [5,6]. Since the publication of the first genetic map of Drosophila melanogaster [7], linkage maps have been constructed in many animal and plant species. Generally, a linkage map displays the linear order and genetic distance of molecular markers on chromosomes through analyzing the parental recombination events occurring during meiosis and passing on to the offspring. Thus, the precision and accuracy of linkage maps, which are crucial for applications, are affected by several factors, such as the mapping population size, the number and quality of markers, and the approach for ordering markers within a linkage group. In theory, the markers are required to be Mendelian factors, which segregate in a fixed ratio in a whole population. However, in practice, some markers were found not to follow Mendelian segregation, while the biological mechanics is not well explained up to date.
The Populus is a model system for forest trees. It has tremendous economic and ecological importance, and is widely distributed in North Hemisphere [8,9]. A large number of linkage maps of different Populus species have been built in the past two decades using traditional molecular markers such as RAPD, RFLP, AFLP and SSR. In most of these studies, distorted markers that deviated from Mendelian segregation ratios were reported with frequencies varying from < 10% [10][11][12][13][14] to > 20% [15,16]. Different strategies were taken for distorted markers when performing linkage analysis. Some studies excluded seriously distorted markers at the 1% significant level (i.e. P < 0.01) because these markers could bias linkage analysis and thereafter QTL identification [10,13,17,18], while the others included all distorted markers as they were considered to be possibly associated with genes of interest [11,14,15]. The segregation distortion in Populus was generally believed to be related to many biological factors, such as genetic isolation, chromosome loss, genetic load, genome structural rearrangement, and linkage of markers to lethal genes [19,20]. However, recent studies showed that gene conversion (GC) events during meiosis could be one of the main reasons that can skew segregation rates, but were typically ignored in genetic mapping studies [21].
It is well known that the result of meiotic recombination includes crossovers (COs) and non-crossovers (NCOs or GCs). COs reciprocally exchange DNA sequences between homologous chromosomes at a megabase scale, whereas GCs copy shorter sequences (less than a few kilobases) from one homologous chromosome to the other, altering allelic frequency [22][23][24]. Ky et al. [25] inferred that some distorted traditional markers of AFLP or RFLP in coffee may be due to GC events, but there were no direct evidences at genome-sequence scale. However, with the next-generation sequencing technologies, recent studies revealed that there are abundant GC events occurred during meiosis in Arabidopsis. Lu et al. [23] detected that the number of GCs was almost the same as the number of COs in Arabidopsis, like in yeast. Subsequently, however, Yang et al. [21] presented that more number of GC events were identified by sequencing 40 Arabidopsis plants and their parents with high coverage, rejecting the former estimate of equal numbers of CO and GC events.
In the present study, we investigated CO and GC events in Populus in order to understand the implications in generating distorted markers in a mapping population. We performed high-throughput whole genome sequencing of 10 progeny and their two parents in an F 1 bybrid population of Populus deltoides and P. simonii. In a previous study [26], we have separately constructed the female P. deltoides and the male P. simonii linkage maps with thousands of single nucleotide polymorphisms (SNPs) generated by one of the next generation sequencing technologies. Here, a strategy was proposed to identify the CO and GC events in such a highly heterozygous tree species through the procedures: (1) phasing the parental haplotypes based on the reference genome of P. trichocarpa [9], (2) mapping the paired-end (PE) reads of each progeny to the reference sequences, (3) calling SNP genotypes and phasing for haplotypes for each individual, (4) identifying GCs by comparing the progeny haplotypes with the parents, (5) generating longer parental haplotypes with available linkage maps, and (6) finally forming progeny longer haplotypes with SNP genotypes and identifying COs in each progeny. The strategy was implemented with Perl scripts as a freely available package, findGCO, at the website of https://github.com/ tongchf/findGCO. Consequently, 34.8 COs from the female parent and 27.3 from the male were found on average in progeny, while the numbers of GCs were 4055.6 and 3564.0, respectively, over 100 times the number of COs. Furthermore, we investigated the relationship between GCs and distorted markers with SNP data from 299 progeny in the same population, revealing that the distorted SNPs more frequently occurred in the regions of GCs than in the other regions. The results facilitated to recognize the role of GCs in forming distorted molecular markers and provided essential information when dealing with those markers in genetic mapping.

Plant materials and whole genome sequencing
An F 1 full-sib family of P. deltoides × P. simonii was originally established as a mapping population. Approximately 500 progeny were planted in Xiashu Forest Farm of Nanjing Forest University, Jurong County, Jiangsu Province, China [26]. Ten progeny randomly chosen from the hybrid population as well as the two parents were considered as the materials for identifying recombination events in this study. Genomic DNA was extracted from fresh leaf tissue of each individual with the CTAB protocol [27]. Next, the qualified DNA was randomly sheared by sonication and Illumina adaptors with a unique multiplex identifier (MID) were added by ligation. A single library for the two parents with an insert size range of 300-500 bp was prepared and sequenced from both ends (paired-end, PE) with 101 bp read lengths in one lane of Illumina HiSeq 2000, while two libraries for the 10 progeny with the same insert size were constructed and sequenced (PE, 126 bp) in two lanes of Illumina HiSeq 2500. The whole-genome sequencing was performed at different times in Biomarker Technologies Co. Ltd., Beijing, China (BMK).

Quality control and aligning of PE reads
The raw sequence data generated from the Illumina sequencers were filtered to obtain high-quality (HQ) reads with procedures as described in Mousavi et al. [28]. Briefly, we first discarded those PE reads that satisfy any one of the following conditions: (1) containing primer/ adapter sequence, (2) having more than 10% uncalled bases (N), or (3) more than half of the bases in either of the reads having Phred quality score less than 5. The data generated from this step are called clean data. Secondly, the clean data were further filtered with NGS QC toolkit [29] to generate HQ reads such that the quality score is greater than or equal to 20 for ≥70% bases in either of PE reads.
In the process of haplotype phasing or SNP calling, PE reads of each sample including the two parents were required to align to the reference genome sequence of P. trichocarpa [9]. We used the command mem in the software of BWA [30] with default parameters to map the reads to the reference sequence, resulting in a SAM (sequence alignment/map) [31] file for each sample. To avoid to use those reads that are mapped to repeat regions in the reference, each SAM file was filtered such that each record in the file has an edit distance not more than 8% of the read length, with the best alignment score greater than or equal to 60 and the second-best alignment score less than the best alignment. After this step, the SAM file was converted to BAM format with SAMtools [31] for saving storage space and other subsequent analyses.
Removal of duplicate reads is a usual filtering step in processing NGS data for high quality. Duplicate reads are considered to be caused by multiple PCR products from the same DNA fragment, which may lead to false positive variant calls [32,33]. We used the program MarkDupicaties in Picard package (http://broadinstitute.github.io/picard) to remove the duplicate reads contained in each BAM file generated above. The final processed BAM files were used for haplotype phasing analyses in the next sections.

Parental haplotype construction
A haplotype is a linear set of bases from all SNPs in a given chromosome [34], and here we defined a haplotype block as a subset of bases in a haplotype. In diploid organisms, the recombination events at meiosis can be discovered by comparing the haplotypes of an individual and its parents. We used the command phase in SAMtools [31] with a minimum base quality score of 20 in heterozygote to obtain the information of the parental haplotypes using the corresponding BAM files created above. The records of this step were filtered to generate haplotype blocks for each parent. Each block must contain at least 5 SNPs with a coverage depth of at least 5 reads at each site and at least 3 reads for each allele. Furthermore, the genotype of each SNP in a parental block was required to be heterozygous in the current parent and homozygous in the other, and each genotype quality must have a Phred-scaled score of at least 60. In the following steps, for simplicity, the allele of homozygote for all SNPs in these haplotype blocks is denoted by 'a' and the other allele of heterozygote by 'b'.
In order to obtain longer haplotype blocks, we used the linkage phase information of SNPs on the two parent-specific linkage maps constructed in the previous study [26] to merge two adjacent haplotype blocks on genome. The merging procedures can be described as in Fig. 1. When two SNPs on a linkage group with a known linkage phase (Fig. 1a) are found in two different haplotype blocks (Fig. 1b), the two haplotype blocks can be merged into a longer one, with another one on the homologous chromosome (Fig. 1c). Finally, each linkage group of two parental maps corresponded to a long haplotype block.

Identification of recombination events
We called genotypes for each progeny at all SNPs contained in the haplotype blocks of the two parents. First, the command mpileup in SAMtools was used to generate BCF files, with each BAM file as input and the parameter of minimum base quality taking the value of 20. Second, a VCF file was produced with each BCF file using the command call of BCFtools (v1.1), which is accompanied with SAMtools, just skipping indels (insertions/deletions). Finally, each VCF file was filtered such that an SNP genotype has a sequence depth of at least 15 reads with the genotype mapping quality of greater than 60 and each allele coverage of at least 5 reads.
For each parental haplotype block, including the longer ones constructed with linkage information, we chose those SNPs at which the genotypes of the other parent are all homozygous ( Fig. 2a and b). Those SNPs have the characteristic of pseudo-testcross markers [35], which can be used to identify recombination events in the progeny haplotypes as performed by Yang et al. [21] ( Fig. 2c, d and e). At those pseudo-testcross SNPs, if the genotypes of a progeny are denoted by 'aa' or 'ab' , the two haplotype blocks can be inferred, one of which is inherited from one parent with alleles of 'a's and the other from the other parent with alleles of 'a's and 'b's ( Fig. 2c and d). Comparing the haplotype block containing 'a's and 'b's with the two in the heterozygous parent ( Fig. 2b), the recombination events can be identified at meiosis in this parent. If a DNA fragment is less than 2 kb but greater than 20 bases and replaces a homologous sequence, a GC is thought to occur during meiosis [24,36]; however, when two DNA fragments in lengths of more than 10 kb come from different homologous chromosomes and join together in a progeny haplotype block, a crossover is considered to exist at the junction [21,22] (Fig. 2d).

Reads quality control, mapping and duplicates removing
We sequenced the whole genomes using the platforms of Illumina HiSeq 2000 for the two parents, P. deltoides and P. simonii, and Ilumina HiSeq 2500 for the 10 progeny in BMK at different times. With the standard quality control (QC) pipeline at BMK, a total of 62.12 Gb clean data with a read length of 101 bp were obtained from the two parents and an average of 13.77 Gb with a read length of 126 bp from the progeny (Table 1). These clean data are available under accession numbers SRP071167 for the parents and SRP125267 and SRP125268 for the progeny at the NCBI Sequence Read Archive database (http://www.ncbi.nlm.nih.gov/Traces/sra). After a serious filtering process with NGS QC toolkit, the HQ reads data were obtained with a reducing range from 6 to 20%, in which over 96% bases have a Phred quality score of at least 20 for each individual (Table 1).
We mapped the HQ reads of each individual to the reference genome of P. trichocarpa. As a result, 96.15 and 95.67% of the HQ reads from the female and male parents were aligned to the reference, respectively, and the mean percentage for the progeny was 98.33 with a standard deviation of 0.77. We filtered these mapped reads such that the edit distance is at most 8% of the single read length with the best alignment score of at least 60 higher than the second-best alignment score. The remaining reads were considered to be almost uniquely mapped to the reference genome [26], occupying 55.55-60.15% of the mapped reads of each sample. After that, we further removed the duplicate reads from these almost uniquely mapped reads, leading to 3.92-33.99% of Fig. 1 Merger of two haplotype blocks with linkage information. Two different alleles are denoted by 'a' and 'b' for all SNPs in the haplotype blocks. a The two SNPs, namely C01_400706 and C01_666847 with a repulsion linkage phase, are found in (b) haplotype blocks 1 and 2. c According to the linkage phase, the two blocks can be merged into a longer one (dark brown), with another one on the homologous chromosome Fig. 2 Procedures for identifying recombination events with a haplotype block inherited from one parent. a The haplotype blocks at the same SNP sites for two parents are shown. At these sites, the genotypes are all heterozygous in the female P. deltoides, but homozygous in the male P. simonii. b The homozygous allele is denoted by 'a' and the other allele in a heterozygote by 'b' at each SNP. c One progeny is genotyped with notations of 'aa' and 'ab' at those SNPs. d The haplotype blocks of this progeny can be discriminated, one (blue) from the male parent and the other (yellow/red) from the female. The haplotype block from the female carries recombinant information, in which the first red fragment (< 2 kb) from the top is considered to be a product of gene conversion and the junction between the second yellow and red fragments (> 10 kb) a crossover. e The alleles on the haplotype blocks of the progeny are labelled with base notations as they were the reads discarded for each individual. Consequently, 37.43-52.50% of the HQ reads of each individual, which were expectedly mapped to unrepeated regions with a maximum edit distance of 8 for the parents and 10 for the progeny, were remained for inferring haplotype blocks and identifying recombination events in the next section ( Table 1).

Construction of parental haplotype blocks
With the remained reads generated above, we used the command phase in SAMtools to call haplotype blocks for each parent. After performing the filtering steps as described in Materials and Methods, we obtained 54,753 haplotype blocks containing 647,971 HQ SNPs in the female parent of P. deltoides, while in the male P. simonii the number of haplotype blocks was 35,458 with a total number of 427,863 HQ SNPs (in Additional file 1: Table S1). These female and male blocks have the average spanned lengths of 842 and 806 bp with the longest lengths of 26,133 and 20,230 bp, totally covering 10.62 and 6.58% of the reference genome, respectively. By incorporating the genetic linkage maps, 19 longer haplotype blocks were constructed for each parent. The female longer blocks contained 9942 SNPs, of which 1205 SNPs are included in the linkage map, whereas the male longer ones contained 8149 SNPs with 700 from the male linkage map (in Additional file 1: Table S2). On each of the female haplotype blocks, including the longer ones, all the SNP segregation types are ab×aa, i.e., the female parent genotype is a heterozygote ab, but the male is a homozygote aa at each SNP site. On the contrary, the segregation types of all the SNPs on each male haplotype block are aa×ab.

Identification of recombination events
We compared the parental haplotype blocks with the blocks of each progeny to identify recombination events. As a result, 9247 (16.9%) of the maternal haplotype blocks and 6841 (19.3%) of the paternal were found to have recombination events detected in at least one progeny. We categorized these haplotype blocks according to the number of individuals in which one or more recombination events were detected in the same haplotype block. Figure 3 presented the bar charts of these categories for both parents. It can be seen that over 60% of these blocks have recombination events detected in at least two individuals, with over 5% (488/508) having recombination events identified in all the 10 progeny. Table 2 presented the distribution of the number of recombination events identified in the progeny over fragment length. It is easily found that the over 95% of the recombination events belonged to GC (20 bp -2 kb), while less than 2% could result from crossover events (> 2 kb). Table 3 and in Additional file 1: Table S3 presented the numbers of gene conversion events distributed on the reference genome sequences, which were identified in each of the 10 progeny and occurred during meiosis in the female and male parents, respectively. On average, we found 4055.6 maternal GCs with an average length of 231.4 bp and 3564.0 paternal GCs with almost the same average length (231.5 bp) in the progeny (in Additional files 2 and 3: Excel Sheets S1 and S3). Furthermore, we discovered that there existed strong correlations among the parental GC numbers and the lengths of the reference chromosomes of P. trichocarpa, with coefficients of 0.9606 between the female and the male, 0.9377 between the female and the reference and 0.9072 between the male and the reference. With the long haplotype blocks constructed by the parental linkage maps, CO events were identified from the SNP genotypes of each progeny at the sites of those long haplotypes. The distribution of CO events on chromosome 1 per parental meiosis was shown in Fig. 4 for the two parents. For other chromosomes, the CO patterns were presented in detail in Additional file 1: Figure S1 and S2. The spans formed by the two parental COs were given for all chromosomes of each progeny in Excel Sheets CD-B35-2 to CD-3-18 in Additional file 4 and Excel Sheets CS-B35-2 to CS-3-18 in Additional file 5. It can be calculated that an average number of CO events was 34.8 and 27.3 per meiosis in the female and male parents, respectively. In contrast, these numbers of COs are less than 1% of the GC numbers per meiosis.

Implication of gene conversion for distorted SNPs
In order to investigate into the implication of GCs for distorted SNP markers, we analyzed two SNP datasets of different segregation types of ab×aa and aa×ab, which were generated from 299 individuals in the same F 1 hybrid population of P. deltoides × P. simonii in the previous study of ours [26]. We filtered those SNPs such that at least 100 individuals have been genotyped, and the filtered SNPs were then classified into two categories, within or outside the GC regions that were identified in the 10 progeny in this study. Consequently, 367 SNPs in the ab×aa dataset and 380 in the aa×ab dataset were found in the GC regions, and the ratios of seriously distorted SNPs (P < 0.01) in these GC regions were 69.75 and 74.74%, respectively (Table 4). If the GC regions were limited to those that each was identified in at least 5 different individuals, the ratios of the distorted SNPs Fig. 3 Bar charts of the number of haplotype blocks against the number of individuals in which one or more GC events were detected in the same haplotype block for the female (a) and male (b) parents increased up to 84.84 and 76.38%, respectively. On the other hand, 17,640 of the filtered SNPs in the ab×aa dataset were found outside the GC regions with a distorted ratio of 31.35%, while in the aa×ab dataset there were 10,915 filtered SNPs outside the GC regions with a distorted ratio of 35.10%. Overall, the ratio of distorted SNPs within the GC regions was about two times greater than that outside the GC regions.

Discussion
We proposed a strategy for identifying recombination events during the meioses of the two parents in an F 1 hybrid population of P. deltoides and P. simonii with the NGS technology. Unlike in the previous studies in Arabidopsis [21,23], where the SNP phases of parents were known due to the inbred lines available, this strategy first needed to phase the heterozygous SNPs of parents to form the parental haplotype blocks by the NGS analytical tools such as BWA and SAMtools [30,31]. Next, the specific parental haplotype blocks, in which all the SNP genotypes are heterozygous for one parent and homozygous for the other, were chosen and compared with the progeny haplotype blocks to identify the recombination events. Therefore, the method presented here permits us to explore especially the GC events in outbred species, which could affect segregation ratio of molecular markers involved and totally ignored in most previous genetic linkage analyses. Most importantly, we developed a Perl package to implement the complicated computing procedures, making it easy to identify recombination events using NGS data with just one keystroke.
The key step to implement the strategy is to phase the parental haplotype blocks of the outbred parents with NGS data and analytical tools. Here, as a primary work to investigate into the GC events in an outbred species, we directly applied the phase module in SAMtools to obtain the parental haplotypes, because this software contains many powerful functions and meanwhile, we also used it in this study for dealing with the reads mapping files in BAM format and calling SNPs, etc. Certainly, there are many other package tools available for haplotype phasing, such as fastPHASE [37], Beagle [38], SHAPEIT [39] and Eagle [40], which were mostly developed for medical and population genetics. A few of these phasing packages can handle NGS data with a reference sequence and could be utilized to improve accuracy of the parental phasing results in the current study. However, the phasing step for parental haplotypes would be skipped if we had an hybrid population produced by crossing an F 1 individual with one of its parents  (Backcross) or another F 1 individual (F 2 ). We noticed that in the previous study of ours [26], about 80% SNPs belong to the segregation pattern of aa×bb in the same F 1 hybrid population of P. deltoides and P. simonii. These SNPs will segregate in the backcross or F 2 hybrid population, having the same characteristics as in the usual experimental populations generated from inbred lines. Thus, if these SNPs are applied to the individuals from the backcross or F 2 population, it will be very easy to identify recombination events as conducted in Arabidopsis [21].
Although the numerous phased parental haplotypes were not so long that the average length was~800 bp and the longest one less than 30 kb, they indeed permitted us to explore the rich world of GC events for the first time in forest tree species. As performed by Yang et al. [21], we considered that the GC tracts are of 20 bp to 2 kb long and each must contains 2 SNPs, ignoring very short (< 20 bp) and long (> 2 kb) tracts, which amount to be less than 5% of the total (< 10 kb, Table 2). With the filtering procedures for mapped reads (see Methods), the identified Table 3 Distribution of the number of gene conversion events detected in each of the 10 progeny and inherited from the female parent based on the reference genome sequences Ref.
Progeny ID Aver.  GC events largely could be considered to be located in non-repeat regions on genome. Our analysis revealed that at least 99% of recombination events could be regarded as GCs, basically consistent with the findings in Arabidopsis [21]. Moreover, the GCs seemed not to randomly occur on the genome because the GC regions shared by all the 10 individuals have a higher frequency (> 5%) than those shared by 7, 8, or 9 individuals (< 5%, Fig. 4). This suggested that some genomic regions may have a preference for GCs, more possibly leading to the internal markers distortedly segregating in a mapping population. We validated it by analyzing the real SNP datasets from the previous study of ours (see Results). By incorporating the linkage maps of the two parents, the longer haplotypes for the two parents and their progeny can be obtained that each corresponded to a chromosome, permitting to determine the positions of CO events with higher resolutions. Although a CO event could be identified at a marker interval of a linkage map just through the genotypes of an individual at those markers involved [41], the longer haplotypes constructed here could shorten the CO region if the flanking markers are within parental haplotype blocks. Occasionally, we could find a crossover within a parental haplotype block. We presented such cases for the two parental COs in Additional files 5 and 6: Excel Sheets S6 and S8 and, each of which was identified within a haplotype block and in a marker interval of 13-5329 bp in length. These high-resolution findings for COs provided important clues for further validating for some special proposes by sequencing target regions with the traditional Sanger sequencing or the single-molecule real-time (SMRT) sequencing recently developed by PacBio [42].

Conclusions
The proposed novel strategy and the corresponding Perl package developed here permit to identify gene conversion events with the next-generation sequencing data in a hybrid population of outbred species. Our analysis revealed that in a genetic mapping population some distorted genetic markers are possibly due to the gene conversion events. More careful attention should be paid to the distorted markers when performing genetic linkage analysis.

Additional files
Additional file 1: Table S1. Summary of parental blocks from the intermediate files of 'parent1.abxaa.5snps.blocks', 'parent2.aaxab.5snps.blocks', 'parent1.long.haplotype' and 'parent2.long.haplotype', created by findGCO.  Table S3. Distribution of the number of gene conversion events detected in each of the 10 progeny and inherited from the male parent based on the reference genome sequences. Figure S1. CO patterns identified in each progeny on all chromosomes in the female parent P. deltoides. Figure S2. CO patterns identified in each progeny on all chromosomes in the male parent P. simonii. (DOCX 277 kb) Additional file 2: Excel Sheets RD-B35-2, RD-C25-3, RD-C3-2, RD-C32-2, RD-C5-3, RD-3-12, RD-3-14, RD-3-15, RD-3-16 and RD-3-18 Distribution of the number of recombination events over fragment length, which occurred in the meiosis of the female P. deltoides and were identified in each of the 10 progeny. Excel Sheet S1 Distribution of the average number of recombination events over fragment length, which occurred in the meiosis of the female P. deltoides and were identified in the 10 progeny. Excel Sheet S2 Summary of the number and the total length of haplotype blocks in which the maternal recombination events were identified in each progeny. (XLSX 29 kb) Additional file 3: Excel Sheets RS-B35-2, RS-C25-3, RS-C3-2, RS-C32-2, RS-C5-3, RS-3-12, RS-3-14, RS-3-15, RS-3-16 and RS-3-18 Distribution of the number of recombination events over fragment length, which occurred in the meiosis of the male P. simonii and were identified in each of the 10 progeny. Excel Sheet S3 Distribution of the average number of recombination events over fragment length, which occurred in the meiosis of the male P. simonii and were identified in the 10 progeny. Excel Sheet S4 Summary of the number and the total length of haplotype blocks in which the paternal recombination events were identified in each progeny. (XLSX 29 kb) Additional file 4: Excel Sheets CD-B35-2, CD-C25-3, CD-C3-2, CD-C32-2, CD-C5-3, CD-3-12, CD-3-14, CD-3-15, CD-3-16 and CD-3-18 Crossover tracts on each chromosome of the maternal P. deltoides that were identified in each progeny. Excel Sheet S5 Summary of the crossover numbers on the female chromosomes that were identified in each progeny. Excel Sheet S6 Summary of the crossover events on the female chromosomes that were identified within a short haplotype block region. (XLSX 72 kb) Additional file 5: Excel Sheets CS-B35-2, CS-C25-3, CS-C3-2, CS-C32-2, CS-C5-3, CS-3-12, CS-3-14, CS-3-15, CS-3-16 and CS-3-18 Crossover tracts on each chromosome of the maternal P. simonii that were identified in each progeny. Excel Sheet S7 Summary of the crossover numbers on the male chromosomes that were identified in each progeny. Excel Sheet S8 Summary of the crossover events on the male chromosomes that were identified within a short haplotype block region. (XLSX 58 kb)  Table 4 The ratio of distorted SNPs within or outside GC regions detected in the current study for the SNP datasets of two different segregation types generated in the previous study   The seriously distorted SNP has a p-value of less than 0.01 with the Chi-squared test