In this study, we present what we believe is the first example of constructing genome-wide high-resolution RH maps from SNP array data. Our main motivation for building these maps was to provide independent data with which to analyze and validate the pig genome sequence. Before proposing improvements to the pig genome, we carefully validated our results using segregation data from pig families. This made our inference more robust and allowed us to contribute significant improvements to the pig genome: we proposed modifications for the largest discrepancies, and the assembly was modified when doing so did not contradict the sequence data (e.g. by requiring a contig to be broken). We discuss here what we believe are the important aspects of the current study: genotyping an RH panel with a high-density SNP array, constructing high (ultimate) resolution genome-wide RH maps, and analyzing discrepancies between maps and assemblies.
Genotype calling from SNP array data
Key to the success of RH mapping in general, and of this study in particular, is the ability to confidently determine, for each marker, whether it is present or absent in each clone of the panel, thereby providing the retention profile used for constructing maps. To reduce the risk of false negative or false positive calls, which severely affect the subsequent linkage analysis, PCR-based genotyping is usually performed in duplicate, and discrepancies are scored as unknown. In contrast to the binary outcome of PCR, the raw intensities provided by the Illumina genotyping platform allowed us to devise a calling procedure based on a continuous measure. The full distribution of signals across SNPs and clones was used in an attempt to control the false positive/negative rate by scoring intermediate intensity values as unknown (see Methods). This can also be seen as using all other data points when calling a particular SNP genotype in a single clone. This procedure for calling genotypes from SNP array intensity data can certainly be improved in future studies. In particular, it would be interesting to try to separate the effects of clones, SNPs and arrays on the observed intensities in order to better predict the genotypes, possibly reducing missing data and genotyping error rates. This would require the development of new statistical methods and the genotyping of at least some clones on multiple arrays.
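The contrast between the two calling strategies can be illustrated with a minimal sketch. It is not the procedure described in Methods: the function names, the use of a single normalized intensity per SNP and clone, and the two thresholds `low` and `high` are assumptions made here for illustration only.

```python
from typing import List, Optional


def call_from_intensity(intensity: float, low: float, high: float) -> Optional[int]:
    """Call a marker absent (0), present (1) or unknown (None) in one clone.

    `low` and `high` are hypothetical thresholds; in practice they would be
    derived from the distribution of signals across SNPs and clones.
    """
    if low >= high:
        raise ValueError("low threshold must be below high threshold")
    if intensity <= low:
        return 0
    if intensity >= high:
        return 1
    return None  # intermediate intensity: scored as unknown


def call_vector(intensities: List[float], low: float, high: float) -> List[Optional[int]]:
    """Build the retention profile (RH vector) of one marker across clones."""
    return [call_from_intensity(x, low, high) for x in intensities]


def call_from_pcr_duplicates(first: int, second: int) -> Optional[int]:
    """PCR-style calling: keep the call only when both replicates agree."""
    return first if first == second else None


# Example with made-up intensities and thresholds.
example_vector = call_vector([0.05, 0.50, 0.90], low=0.2, high=0.7)
```

Widening the gap between the two thresholds trades more unknown calls for fewer false positive and false negative calls, which is the same trade-off that duplicate PCR addresses by discarding discordant replicates.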
On the resolution of RH maps
In RH mapping, the precision of a map or a mapping tool is generally characterized by its resolution, expressed as a kilobase-to-centiRay (Kb/cR) ratio. Based on our RH maps, we estimate the resolution of the IMpRH panel at 8.6 Kb/cR and that of the IMNpRH2 panel at 5.3 Kb/cR, whereas previous estimates were 35.4 Kb/cR and 12.5 Kb/cR, respectively. Two factors explain this difference. First, the number of markers in this study is much larger than in any previous analysis. It is well known that an increase in marker density causes map inflation, so a decrease in the Kb/cR ratio with increasing marker density is classical behavior for chromosomal maps. Second, we use a comparative mapping approach whose optimization criterion for constructing RH maps incorporates prior information on a reference order, given here by the genome assembly. As such, the maps obtained are not the most parsimonious in breakpoints, i.e. they are not the maps of smallest length (in cR). Again, this decreases the Kb/cR ratio.
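As a rough illustration of what these estimates imply, the sketch below computes the Kb/cR ratio and the factor by which map length would grow between the previous and current estimates. It assumes that both estimates refer to the same physical span, which the text does not state; the function and variable names are illustrative.

```python
def kb_per_cr(physical_length_kb: float, map_length_cr: float) -> float:
    """Resolution parameter: physical length divided by RH map length."""
    return physical_length_kb / map_length_cr


# Estimates quoted in the text (Kb/cR).
previous = {"IMpRH": 35.4, "IMNpRH2": 12.5}
current = {"IMpRH": 8.6, "IMNpRH2": 5.3}

# For a fixed physical length, map length in cR scales as 1 / (Kb/cR),
# so the implied map inflation is the ratio of the two estimates.
implied_inflation = {panel: previous[panel] / current[panel] for panel in previous}
```

Under that assumption, the maps built here would be roughly four times longer in cR for IMpRH, and a little more than twice as long for IMNpRH2, than those behind the earlier estimates.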
Resolution can, however, be understood in a broader sense than this simple Kb/cR ratio (see Additional file 7). When constructing high-resolution maps, a natural question arises: what is the maximum number of markers that we can expect to order? This depends on the design of the mapping experiment, and we address the question in the context of our study, where two radiation hybrid panels were used. Using the estimates of the resolution parameters above, we can compute the theoretical proportion of markers that can be assigned distinct positions in RH maps (Additional file 7) under three assumptions: (i) the RH order is the true order, (ii) markers are evenly spaced on the genome, and (iii) RH vectors have no missing data or genotyping errors. In our case, this theoretical proportion is 99.7%, and we observe a value of 94.2% (Additional file 7). We consider this difference reasonable given that none of the three assumptions strictly holds in real data.
Designing an RH mapping experiment requires defining the desired resolution, i.e. the typical physical distance (in kilobases) between the markers that are to be ordered. For example, when ordering the scaffolds of a genome assembly, the N50 or N90 scaffold size could be the relevant target resolution. The resolution of an RH panel depends on the panel size and on the resolution parameter expressed in kilobases per centiRay. This parameter is related to the radiation dose (expressed in Rads), but through a process too complex to be modeled, so its value can only be guessed from previous studies. However, estimates obtained from the literature must be treated with caution. First, they were most likely obtained in other species. Furthermore, as we have shown, these estimates depend on the number of markers and the mapping methods used. Overall, adjusting the resolution through the radiation dose is going to be imprecise. In previous RH panel construction experiments, the panel size was purposely limited to 90 clones because of PCR genotyping, in which a single marker is genotyped for all clones arrayed on a 96-well plate (with wells reserved for control samples). SNP array genotyping, in contrast, genotypes all SNPs on a single clone and therefore imposes no such design constraint, so panel sizes can be made larger to increase the resolution. Given a panel size and a resolution parameter, Additional file 7 provides the equations for deriving the expected number of markers that can be mapped to distinct positions. For example, we estimate that using the two pig panels, up to 250K markers could be mapped on the autosomes (see Additional file 7 for details). However, this would require obtaining RH vectors for about 1 million SNPs, because there is a trade-off between increasing the number of markers interrogated and decreasing the probability of separating adjacent markers.
Above 100K SNPs, there is a strongly diminishing return in the proportion of genotyped markers that can be mapped. The numbers of separable markers given above depend on the characteristics of the panels used here. They can be increased, in particular by using panels with more than 200 clones. However, this may be prohibitively expensive, and our general conclusion is that using arrays larger than 100K SNPs is not going to be cost-effective for producing high-density RH maps in most situations.
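The trade-off between marker number and separability can be sketched numerically. The code below does not reproduce the equations of Additional file 7. It assumes the classical RH model, in which two markers separated by a breakage probability θ show discordant retention in a given clone with probability 2r(1 − r)θ, where r is the retention fraction, together with evenly spaced markers and error-free RH vectors. The clone counts, retention fractions and autosomal length used in the example are placeholders, not values from this study; only the Kb/cR estimates are taken from the text.

```python
import math
from typing import List, Tuple

# One panel = (number of clones, retention fraction, resolution in Kb/cR).
Panel = Tuple[int, float, float]


def breakage_probability(distance_kb: float, kb_per_cr: float) -> float:
    """Breakage probability between two markers (distance converted to Rays)."""
    distance_rays = distance_kb / kb_per_cr / 100.0
    return 1.0 - math.exp(-distance_rays)


def prob_separated(distance_kb: float, panels: List[Panel]) -> float:
    """Probability that two markers differ in at least one clone of any panel."""
    p_identical = 1.0
    for n_clones, retention, kb_per_cr in panels:
        theta = breakage_probability(distance_kb, kb_per_cr)
        p_discordant = 2.0 * retention * (1.0 - retention) * theta
        p_identical *= (1.0 - p_discordant) ** n_clones
    return 1.0 - p_identical


def expected_distinct_positions(n_markers: int, genome_kb: float,
                                panels: List[Panel]) -> float:
    """Expected number of distinct map positions for evenly spaced markers."""
    spacing_kb = genome_kb / n_markers
    return 1.0 + (n_markers - 1) * prob_separated(spacing_kb, panels)


# Placeholder design: clone counts, retention and genome length are hypothetical.
panels = [(100, 0.3, 8.6), (100, 0.3, 5.3)]
genome_kb = 2.5e6
marker_counts = [10_000, 100_000, 1_000_000]
mapped_fraction = [
    expected_distinct_positions(m, genome_kb, panels) / m for m in marker_counts
]
```

With these placeholder values, the expected number of distinct positions keeps growing with the number of markers while the mapped fraction falls, which is the diminishing-return behavior described above. Increasing the clone counts in `panels` raises the separation probability at any given spacing.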
Discrepancies between maps and assembly
Resolving discrepancies directly raises the question of how reliable the order is, as defined by the genome map on one side and by the assembly on the other. The construction of robust maps was designed precisely to address the reliability of RH maps. On the assembly side, the process is clearly too complicated, involving different technologies such as sequence assembly and physical mapping resources, to allow a confidence measure to be developed for the organization of sequences in a particular region. A reasonable step is to distinguish between the components of the assembly: the contigs and scaffolds on the one hand, and their organization along chromosomes on the other. The modifications of the preliminary assembly (build9) proposed in this study only involved reordering scaffolds along chromosomes. Note, however, that our approach could potentially contribute to the identification of chimeric scaffolds. Another approach to resolving contradictory orders is to exploit additional, independent sources of information, such as the genetic data used in this study. Finally, some of the remaining inconsistencies could be biologically grounded, reflecting individual structural variation. The reference sequence and the RH panels were indeed constructed using DNA from different individuals and from different breeds (a Duroc for the reference sequence and Large White for the panels). Preliminary studies in pigs have demonstrated the existence of a considerable level of between-breed variation.
The particular case of the X chromosome
The X chromosome was not investigated in this study because it requires a specific analysis. First, both RH panels were constructed using male DNA, hence with a single X chromosome and therefore a reduced retention in comparison to other chromosomes, with the exception of the pseudo-autosomal region, which is believed to cover a small fraction of the X chromosome (∼5%). Second, the X chromosome harbors the HPRT gene used as the selection locus, which leads to a retention fraction in its neighborhood that requires specific attention when constructing maps. Finally, our validation procedure, which uses genetic data and is based essentially on the observation of male meioses, is not applicable here. For these reasons, we reserve the construction of an RH map for this chromosome, and the associated analysis, for future work.