This LD study is the first of its kind in the genus Coffea. We identified numerous different cases of LD evolution along the genome within the different DGs. We were also able to identify a cryptic structure within DG G. This very fine genetic structure had not been discovered in earlier studies either due to a lack of resolution or because too few markers were used.
What is the point of a genome-wide LD study?
As in all species, we found a decrease in the LD with distance. For a population in equilibrium between mutation and genetic drift, the LD (measured by r2) is expected to depend on both the effective size of the population and the recombination rate between the loci considered [23, 24]. The closer any two markers are, the longer it will take for the LD to dissipate. Therefore, at any given moment in the evolution of a population, LD values are expected to be greater between close markers than between distant markers.
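The dependence of r2 on effective population size and recombination rate can be illustrated with a minimal sketch. It assumes Sved's classical approximation, E[r2] ≈ 1 / (1 + 4·Ne·c), and Haldane's map function to convert map distance into a recombination fraction; neither choice is stated in this paper, and the function names and parameter values are purely illustrative.

```python
import math


def haldane_recombination_fraction(distance_cm):
    """Convert a map distance in cM to a recombination fraction (Haldane's map function)."""
    distance_morgans = distance_cm / 100.0
    return 0.5 * (1.0 - math.exp(-2.0 * distance_morgans))


def expected_r2(effective_size, recombination_fraction):
    """Expected r2 at drift-recombination equilibrium (Sved's approximation, assumed here)."""
    return 1.0 / (1.0 + 4.0 * effective_size * recombination_fraction)


if __name__ == "__main__":
    # Hypothetical effective size; not an estimate for any C. canephora population.
    ne = 100
    for d_cm in (0.1, 1.0, 5.0, 13.0, 25.0):
        c = haldane_recombination_fraction(d_cm)
        print(f"{d_cm:5.1f} cM  c = {c:.4f}  E[r2] = {expected_r2(ne, c):.4f}")
```

Under these assumptions, the expected r2 falls monotonically with distance, and it falls faster when the effective size is larger, which is consistent with the qualitative pattern described above.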
The approach we adopted, with one marker every 13 cM on average but with highly densified portions of the genome, seemed to be the best strategy for constructing an initial pan-genomic view of the LD properties in the species studied. Moreover, our approach enabled a comparison of the LD behavior over different LGs. This comparison should lead to a clearer understanding of the possibilities for association studies within the studied populations.
Is LD in Coffea canephora an insurmountable problem for association studies?
Our results demonstrate that it is possible to perform association studies by working specifically on each population, but not on the global diversity level, in the species. Depending on the populations, the needed marker density varies, but the prospects of using association studies to support breeding programs for our species are quite interesting.
Indeed, the graph showing the set of r2 and D’ values on the scale of the 356 genotypes, combined with the importance of the genetic structure found in the entire sample, clearly illustrates the importance of the structure effect on the detection of associations between unlinked markers. The Hardy-Weinberg disequilibrium could not be reduced from the whole sample scale to the DG level. However, disequilibrium still exists within natural populations of C. canephora, as shown by [12, 35]. This disequilibrium thus prevents the implementation of association studies for an entire set of genotypes using simple correlation models, which do not take into account structure and kinship effects. This result was confirmed by the large number of significant correlations between markers located on the different LGs for the 356 genotypes compared to those found on the DGs. Therefore, it seems necessary to work at the population level to more effectively study the LD dynamics in C. canephora, as the analyses showed that the most valuable results were obtained on DG GP, DG C and two Guinean subgroups (Gsub1 and Gsub2) corresponding more or less to natural populations.
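The effect of structure on associations between unlinked markers can be made concrete with a small numerical sketch. It assumes the simplest possible setting, which is not the one analyzed here: two biallelic loci and two subpopulations, each in linkage equilibrium, pooled into one sample. The function names and allele frequencies are hypothetical.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, D', r2) for two biallelic loci.

    p_ab is the frequency of the A-B haplotype; p_a and p_b are the
    frequencies of allele A at locus 1 and allele B at locus 2.
    """
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


def pooled_ld(m, pa1, pb1, pa2, pb2):
    """LD in a pooled sample: proportion m from subpopulation 1, 1 - m from subpopulation 2.

    Each subpopulation is assumed to be in linkage equilibrium, so its
    A-B haplotype frequency is the product of its allele frequencies.
    """
    p_a = m * pa1 + (1 - m) * pa2
    p_b = m * pb1 + (1 - m) * pb2
    p_ab = m * pa1 * pb1 + (1 - m) * pa2 * pb2
    return ld_stats(p_ab, p_a, p_b)


if __name__ == "__main__":
    # Hypothetical allele frequencies that differ strongly between subpopulations.
    d, d_prime, r2 = pooled_ld(0.5, 0.9, 0.9, 0.1, 0.1)
    print(f"pooled sample: D = {d:.3f}, D' = {d_prime:.3f}, r2 = {r2:.3f}")
```

Although the two loci are unlinked and independent within each subpopulation, the pooled sample shows D = m(1 - m)(pa1 - pa2)(pb1 - pb2), which is nonzero whenever allele frequencies differ between subpopulations at both loci. This is the kind of spurious association that a simple correlation model cannot distinguish from physical linkage.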
We were thus able to reveal a high variability in LD within the different DGs, with a large share of residual between-linkage group disequilibrium in DG SG2 and DG SG1. These results may potentially lead to the detection of false positives in association studies, even at low levels of genetic structure. The importance of this “genomic” LD (as opposed to local LD) was variable depending on the groups, and by taking into account structure and kinship in association studies, it will be possible to overcome this variability. The residual genomic LD values for the less-structured DGs may be explained by different kinship levels within the natural populations of our species. Nevertheless, for DG SG2, the low r2 values obtained suggest that the LD is significant at very short distances. Indeed, natural populations of coffee trees are usually small, isolated populations with a small number of mother trees and a few juveniles, involving major relations of kinship, despite the strict outcrossing of the studied species.
We used Bonferroni’s correction to consider only truly significant values. Nevertheless, this correction is very conservative and may lead to a substantial loss of power in association studies. Many other corrections have been proposed in the literature in recent years, but none seems to be satisfactory. Moreover, we have shown that, in our case, this correction mainly made it possible to eliminate a certain number of disequilibrium values between unlinked or very weak markers. Normally, in association studies, such a correction will not be necessary because the main source of error (genetic structure) will be controlled. These questions should be given due consideration along with updates in the proposed models. Models that take into account structure and kinship in association studies appear to be a major advance in these approaches, helping to increase both the power and the resolution of such studies.
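The conservativeness of Bonferroni’s correction follows directly from how quickly the number of pairwise tests grows with the number of markers. The sketch below is illustrative only: the marker names, p-values and the 5% family-wise level are assumptions, not values from this study.

```python
from math import comb


def n_marker_pairs(n_markers):
    """Number of pairwise LD tests among n_markers loci."""
    return comb(n_markers, 2)


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def significant_pairs(p_values, alpha=0.05):
    """Keep only the marker pairs whose p-value passes the Bonferroni threshold.

    p_values maps a (marker, marker) tuple to its uncorrected p-value.
    """
    threshold = bonferroni_threshold(alpha, len(p_values))
    return {pair: p for pair, p in p_values.items() if p < threshold}


if __name__ == "__main__":
    # Hypothetical marker counts, to show how the threshold shrinks.
    for n in (10, 50, 100):
        tests = n_marker_pairs(n)
        print(f"{n} markers -> {tests} pairs -> threshold {bonferroni_threshold(0.05, tests):.2e}")
```

With a few hundred markers the per-test threshold becomes very small, so weak but real disequilibrium between linked markers can be discarded together with the spurious values.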
Genetic structure of DG G
Our study enabled us to more effectively determine a fine genetic structure for DG G. Structure seems to exist in these populations, but there are indications of major gene exchanges between them. We found a structure in three subgroups (Gsub1, Gsub2 and Gsub3) with both model-based and distance-based analyses. This very fine genetic structure can only be studied with a large number of markers. DG G was initially described by Berthaud using isozyme markers. Berthaud concluded at the time that there was an absence of genetic structure within this group. However, Cubry et al. [12, 13] showed the existence of a Guinean population that was different from the others (DG GP). It will also be important to study kinships within natural populations, along with the gene flow existing between them, to understand the dynamics of those populations on the forest scale in Guinea and the Ivory Coast.
What models can be used for association studies in Coffea canephora?
Our results show, particularly for the Guineans, that a large number of “control” markers (i.e., markers that can be used to estimate structure and kinship independently from the association study) are needed to separate the fine structure into populations. Therefore, our case seems to be quite similar to the case of maize, where a set of eighty-nine microsatellite markers was used by Flint-Garcia et al. to study structure and kinship on 302 lines.
After correction of the p-values by the Bonferroni method, some large and significant values of the two LD measurements (D’ and r2) were found both between unlinked and linked markers, preventing any distinction between associations based on a physical link between markers and those created by the structure. Therefore, the genomic control approach (adaptation of the significance limit to the number of associations detected between unlinked markers) appears to be less efficient and may lead to a large number of false negatives. This observation is one of the main criticisms of this model raised by Yu et al. Moreover, this approach assumes that structure has the same effect at any point of the genome.
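The logic of genomic control, and the reason it implies a uniform structure effect, can be sketched as follows. The sketch assumes one common formulation (a single inflation factor estimated from the median test statistic of unlinked "null" markers, then applied to every test), which may differ in detail from the variant discussed here; all names and input values are hypothetical.

```python
from statistics import median

# Median of a chi-square distribution with 1 degree of freedom (approximate value).
CHI2_1DF_MEDIAN = 0.4549364


def inflation_factor(null_chi2):
    """Estimate the inflation factor from test statistics at unlinked control markers.

    The factor is bounded below by 1 so that statistics are never inflated.
    """
    return max(median(null_chi2) / CHI2_1DF_MEDIAN, 1.0)


def genomic_control(test_chi2, null_chi2):
    """Deflate every association statistic by the same genome-wide factor."""
    lam = inflation_factor(null_chi2)
    return [stat / lam for stat in test_chi2]


if __name__ == "__main__":
    # Hypothetical chi-square statistics (1 df) at control markers and at tested markers.
    null_stats = [0.2, 0.7, 0.9, 1.1, 3.5]
    tested = [4.0, 12.5, 0.8]
    print("inflation factor:", round(inflation_factor(null_stats), 3))
    print("corrected statistics:", [round(s, 3) for s in genomic_control(tested, null_stats)])
```

Because a single factor rescales all statistics, a strong structure signal at the control markers penalizes true associations everywhere in the genome, which is one way the false negatives mentioned above can arise.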
The structured association approach proposed by Pritchard et al. seems to be more efficient than the genomic control. Nevertheless, the degree of kinship in the populations studied, as shown by the diversity trees obtained (particularly for DG GP), indicates that a share of the confounding effect of genetic structuring is not taken into account in this model. Consequently, it seems that the model best adapted to the species and populations in our study is the mixed model proposed by Yu et al. This approach has shown its power and its superior control of false positives when compared to other methods using simulated data.
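For reference, the mixed model of Yu et al. is commonly written in the following form. The notation is an assumption based on the usual presentation of that model and is not taken from this paper.

```latex
\[
  y = X\beta + S\alpha + Qv + Zu + e,
  \qquad
  \operatorname{Var}(u) = 2K\sigma_g^2,
  \quad
  \operatorname{Var}(e) = R\sigma_e^2
\]
```

Here y is the vector of phenotypes, β the fixed effects other than marker and structure, α the effect of the tested marker, v the population (structure) effects with Q the matrix of subpopulation membership, u the polygenic background effects with K the kinship matrix, and e the residual. Structure enters through Q and kinship through K, which is why this model can absorb the confounding left uncorrected by structured association alone.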
These association study models are becoming increasingly efficient, and we seem to be arriving at a critical point in the development of these approaches. Even so, particular attention must be given to the choice of traits studied and their distribution within the sample on which association studies are performed. Indeed, by correcting the structure effect, there is a risk of not being able to detect traits that would have a distribution superimposed on the population structure.
Which target populations should be used for association studies?
The purpose of our work was to make an initial assessment of the LD at the pan-genomic level in C. canephora. We thus discovered considerable variation in the LD between populations. The DGs comprising natural populations, such as GP or C, appear to have a moderate to high LD, at approximately 5 to 25 cM. In these DGs, it seems feasible to carry out genome-wide scan type studies. Nevertheless, given the stochasticity of the LD between LGs and its sensitivity to low allele frequencies, we have certain reservations regarding this type of approach, notably when using highly polymorphic multi-allelic microsatellite markers.
The DGs comprising improved populations, such as SG2, seemed to have undergone substantial genetic mixing with greater diversity and a virtually undetectable LD on the scale at which we worked. Consequently, this type of population seems more suited to regional or candidate gene type approaches.
To conclude, the current association study models can be used to consider structure and kinship effects, enabling this type of approach for use even with composite and structured samples.
Which approach for association studies in Coffea canephora?
For most of the considered DGs, a high density of markers is required to perform association studies. Using SNPs in addition to SSRs should be of great value.
Considering ongoing work on SNP discovery and genotyping by sequencing in coffee, a high number of markers will likely be obtained in the short term. Then, genome-wide association studies (GWAS) can be easily applied to populations used in breeding processes, such as populations involved in the RRS in the Ivory Coast. In countries such as Uganda, GWAS should be applied to the entire germplasm used for breeding for tolerance to Coffee wilt disease (CWD). Therefore, the candidate gene approach should only be used in very low LD populations or for specific purposes.