Defining the core collection and field evaluation
A total of 541 Senegalese pearl millet landraces were collected between 1992 and 2014 [8, 9, 11] from farmers with their approval and respecting institutional, national, or international guidelines. A set of 12 single sequences repeat (SSR) microsatellites [8] were previously used to genotype germplasm consisting of 429 early-flowering morphotypes (Souna) and 112 late-flowering morphotypes (Sanio). In 2014 and 2015, we evaluated 392 landraces at different ISRA’s research stations. A total of 306 early-flowering Souna landraces were evaluated at Bambey (N14°32′12″ W16°36′41″) and at Nioro (N13°45′00″ W15°45′00″), while 86 late-flowering Sanio landraces were evaluated at Senthiou Maleme (N13°49′01″ W13°55′03″) and at Kolda (N12°53′02″ W14°57′05″). In each site, the experiment comprised a randomized complete block design with three replications. Each landrace was grown in a single row containing eight hills. The distance between the rows and between the plants in the row was 90 cm. In the different trials, the following phenotypes were measured: downy mildew incidence (DMI), 50% flowering time (FLO), nodal tillering (NTN), plant height (PHE), tillering (NPT), flag leaf length (FLL), flag leaf width (FLW), spike length (SLE), spike thickness (STH), 1000 seed weight (SWE), grain yield (GYI), panicle yield (PYI) [22]. To establish a core collection from this panel, an advanced maximization sampling technique, called heuristic, was performed based on phenotypic and genotypic data from these 392 landraces using PowerCore v 1.0 software [23]. This algorithm removes duplicates and retains only a limited number of landraces in multiple analyses [23]. From this heuristic approach, 91 landraces were retained, consisting of 60 early-flowering Souna and 31 late-flowering Sanio, that were field-evaluated at ISRA-Nioro in the 2016 and 2017 rainy seasons. The experimental design for each trial was a 7 × 13 alpha lattice with three repetitions. Each of the landraces tested was grown in a single row comprising eight hills in each repetition and the measurements were taken on three hills. The genetic variability of the panel was assessed using the significance of differences between the Nei genetic index of core collection and a Student’s t-test at α = 0.05 [24]. To assess whether the core collection captured the diversity of the whole dataset, we calculated the percentage of mean difference (%MD) and the percentage of variance difference (%VD), the coincidence rate (%CR), and variable rate (%VR) according to [23]. The core collection was considered to be representative of the total collection when no more than 20% of the traits had different means (significant at α = 0.05) in the defined core collection and in the total collected landraces, and the coincidence rate CR% retained by the core collection was no less than 80%. Analysis of variance was performed on the different phenotypic parameters using the Plant Breeding Tools v 1.3 software (https://sites.google.com/a/irri.org/bbi/products) with the formula:
$$ Y=\mu +G+Y+ GY+R+B+\varepsilon $$
where Y is the phenotype; μ, the mean; G the genetic effect; Y, the year effect; GY, the interaction between genotype and year; R the replication effect; B, the incomplete block and ε, the residual effect. The heritability of agro-morphological characters was calculated using a mixed linear model with random effects for individuals, using Plant Breeding Tools v 1.3 software (https://sites.google.com/a/irri.org/bbi/products), with the formula:
$$ {h}^2=\frac{\ {\sigma}_G^2}{\left({\sigma}_G^2+\frac{\sigma_{GxY}^2}{y}+\frac{\sigma_{\varepsilon}^2}{ry}\right)} $$
where \( {\sigma}_G^2 \) is the genotypic variance, \( {\sigma}_{GxY}^2 \) the genotype by (y) year variance and \( {\sigma}_{\varepsilon}^2 \), the residual variance for (r) replicates and (y) year.
For each trait, the adjusted mean of each individual in the 2016 and 2017 trials was calculated with fixed-effects for individuals, using Plant Breeding Tools v 1.3 software and was considered as the value of the individual for the trait concerned.
A principal component analysis was performed using the adegenet v 2.1.1 package [25], R v 3.5.1 [26]. A discriminant analysis (DA) between early- and late-flowering millets was then performed and correlation values between phenotypic traits and factual plans were extracted from the DA. Characters that were significant with a P-value < 0.001 and which presented a high correlation (r) ≥ 0.7, with the axis of differentiation, were identified as discriminating characters between early- and late-flowering millets. This analysis was performed using XLStat 2014 software (http://www.xlstat.com) and the distribution of discriminating agro-morphological characters was plotted using R software v. 3.5.1 [26]. The distribution of the 91 landraces was mapped using QGIS v 3.8 (https://www.qgis.org).
DNA extraction, library construction, and sequencing
Genotyping-by-sequencing (GBS) was performed on genomic DNA extracted as previously described [27] from a single plant sampled at the five-leaf stage of each of the 91 landraces grown in 2016 at Nioro. The DNA was checked using a NanoDrop 2000 (Thermo Scientific™) and showed 260/280 and 260/230 ratios between 1.8 and 2, respectively. Extracted DNA was stored in a solution of Tris-HCl and sent for sequencing at the Next Generation Sequencing Platform of the CHU Research Center, University of Laval, Quebec. The libraries were generated in two multiplexes of 45 and 46 samples. PstI-MspI double-digestion was applied, and adapters were linked to each sample followed by mixing and amplification. The libraries were sequenced using Illumina HiSeq2500.
SNP calling, filtering, and data analysis
The quality of the reads was evaluated using FastQC v 0.72 and MultiQC v 1.6, and the sequences were then cleaned with FastQ Trimmer v 1.0.0. Only sequences of average quality (Q) ≥ 30 (Sanger format) were retained and the first 7 bases (5′ side) of each read were removed. The sequences were aligned with the pearl millet reference genome (GenBank Accession number GCA_002174835.2) using BWA v 1.2.3, before realignment of the sequences for insertions and deletions using RealignerTargetCreator v 0.0.4 and IndelAligner v 0.0.6. The binary alignment map (BAM) format files from the above procedures were merged using MergeBAM v 1.2.0 and SNP calling was performed using UnifiedGenotyper v 0.0.6. A total of 545,834 variants were called including 502,382 SNPs. Filtering was first performed based on mapping quality (MQ) and depth, applying hard filtering using VariantFiltration v 0.0.5 (MQ ≥ 40 divided by the depth of unfiltered samples > 0.1). A second filtering was performed according to the minor allele frequency (MAF) > 0.05, and the allowed maximum proportion of missing data was 0.05 for markers and 0.1 for individuals, using Plink v 1.9. Multi-allelic markers were then removed using Tassel v 5.2.48. The output file finally contained 21,663 SNPs and 78 individuals. This dataset was used for all subsequent analyses. All bioinformatics analysis (sequences filtering, cleaning, mapping, and SNPs calling) was carried out on the Galaxy v 18.0.5 platform [28], in the Bio-Linux 8 operating system [29].
Genetic structure
Genetic structure was evaluated using the sparse mixed linear model (sNMF) algorithm, through the LEA v 2.2.0 package implemented in R. The sNMF algorithm detects genetic clusters of individuals in the population sample. For this analysis, we used several populations ranging from K = 1 to K = 10, with ten repetitions for each K value. Discriminant analysis of principal components (DAPC) was also used through the adegenet v 2.1.1 package [25]. The choice of the number of axes (PCs) retained for the DAPC was made using a cross-validation method, performed on the data set subdivided into two training sets of respectively 90 and 10% [30]. A test comprising 30 repetitions was performed to preselect a limited number of PCs. At each repetition, the validation and training sets were randomly allocated. A second test comprising 1000 repetitions was carried out on the preselected PCs to select the number of PCs that enabled the highest proportion of correct predictions with the lowest error rate. This analysis was carried out using the R software adegenet package [25].
GWAS
Association analyses were conducted with a mixed linear model (MLM) correcting for population structure and kinship using Tassel v 5.2.48. Q-Q and Manhattan plots illustrating the results of GWAS were produced using the qqman package v 0.1.4 in R [31]. The significance threshold (α) of the association of SNP markers with the different traits was calculated using Bonferroni correction [32]. SNPs significantly associated with agro-morphological traits were localized in the pearl millet genome intervals. Locating was performed with the valR package v 0.5.0 [33] in R. Pearl millet genome annotation was used to identify these genes.