Markers and genetic purity
Among the 955,120 SNPs used for genotyping the 265 inbred lines, only 23.1% (220,878 SNPs) were polymorphic in our germplasm, each with a minor allele frequency ranging from 0.05 to 0.50. The percentage of missing data per marker after imputation varied from 0 to 19.2% and the overall average was 9.2%. The number of SNPs per chromosome for dataset 1 varied from 15,324 SNPs on chromosome 10 to 35,002 SNPs on chromosome 1 (Table 1). Using 220,787 SNPs, genetic purity among the 265 inbred lines varied from 68.5 to 99.9% (Additional file 1), with an overall average of 86.9%. Genetic purity of inbred lines is an important quality control criterion in maize breeding and seed systems that directly affects both the quality of hybrid seed and the development of new inbred lines [2, 33]. Currently, most maize breeding programs consider S4 or later generations as fixed inbred lines for evaluation in hybrid combination. Inbred lines are considered pure or fixed when the proportion of heterozygous SNP loci does not exceed 5% [33]. Inbred lines with more than 5% heterogeneous SNP loci are considered either not fixed or likely to have been contaminated by pollen or seed from another source during maintenance. Overall, about 22% of the 265 inbred lines were considered fixed, while the remaining 27% and 51% of the inbred lines had a heterogeneity varying from 5.1 to 12.4% and from 12.5 to 31.5%, respectively (Fig. 1, Additional file 1). Approximately 7% of EIAR’s, 20% of IITA’s and 54% of CIMMYT’s inbred lines were considered fixed. Most inbred lines from EIAR (73%) showed heterogeneity values ranging from 12.5 to 31.5%, compared with only 21% from CIMMYT and 30% from IITA (Fig. 1, Additional file 1). The higher level of heterogeneity observed for most inbred lines from EIAR was due to the use of early generation inbred lines (<S4) as parents for hybrid formation. This approach was used to attain higher seed yield under the prevailing conditions of poor inputs and agronomic practices. This in turn lowers the cost of hybrid seed production, thereby decreasing the price of seed and increasing access to seed by small scale farmers [34]. In addition, the source germplasm available for new line development some decades ago (composites, pools and landraces) showed severe inbreeding depression upon continuous self-pollination. To cope with those challenges, maize breeders in Ethiopia at that time developed and released hybrids using early generation parental inbred lines [3]. Although this strategy favored cheaper seed production, the hybrids were less uniform than hybrids developed from fixed lines. Also, the genetic purity of some of the recently developed EIAR inbred lines was low, possibly due to pollen contamination and seed admixture during seed maintenance. The majority of the inbred lines originating from both CIMMYT and IITA were at S4 or later generations [33], and thus pollen contamination and seed admixture during inbred line maintenance are the most likely factors behind the higher level of heterogeneity in some of these lines. We therefore suggest an additional generation of selfing in order to fix these inbred lines (with the exception of those deliberately maintained at an early stage) to achieve a number of advantages from the use of pure lines, including ease of maintenance of parental lines, high heterosis in hybrids, and ease of quality control during hybrid seed production [2, 33]. We also suggest periodic restocking of inbred lines sourced from IITA and CIMMYT as well as maintenance of reference molecular fingerprints for ease of identity confirmation in the future and for internal quality control. Further, we recommend complementing marker-based homogeneity tests with phenotypic evaluation at regular intervals (e.g., every five years) to verify the genetic purity of inbred lines.
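The 5% threshold and the two heterogeneity ranges described above can be illustrated with a minimal Python sketch. The 0/1/2 genotype coding, the function names, and the class labels ("fixed", "intermediate", "high") are assumptions made for illustration only; they are not taken from the genotyping pipeline used in this study.

```python
from typing import Optional, Sequence

# Assumed (hypothetical) coding: 0 or 2 = homozygous, 1 = heterozygous,
# None = missing call.
Call = Optional[int]


def heterogeneity(calls: Sequence[Call]) -> float:
    """Proportion of scored SNP loci that are heterozygous in one line."""
    scored = [c for c in calls if c is not None]
    if not scored:
        raise ValueError("no scored loci")
    return sum(1 for c in scored if c == 1) / len(scored)


def purity_class(het: float) -> str:
    """Bin a heterogeneity value using the ranges reported in the text."""
    if het <= 0.05:      # at most 5% heterozygous loci: pure or fixed
        return "fixed"
    if het < 0.125:      # 5.1 to 12.4%
        return "intermediate"
    return "high"        # 12.5% and above


# Toy example: 10 scored loci, 1 of them heterozygous, 1 missing call.
example_calls = [0, 2, 2, 1, 0, None, 2, 0, 0, 2, 2]
het = heterogeneity(example_calls)
purity = 1.0 - het  # genetic purity taken as the complement of heterogeneity
print(f"heterogeneity={het:.1%}, purity={purity:.1%}, class={purity_class(het)}")
```

Treating purity as the complement of heterogeneity matches the reported ranges (68.5% purity corresponds to 31.5% heterogeneity), but the exact computation used for Additional file 1 may differ.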
One limitation of GBS markers is concern about the reliability of allele calls in heterogeneous and highly heterozygous germplasm compared with highly homozygous genotypes, which has been dealt with through intensive data correction after genotyping, including the implementation of reliable imputation methods [35, 36]. Using 191 SNPs from Kompetitive Allele Specific PCR (KASP) and different numbers of GBS markers, we recently compared the genetic purity of 80 maize samples (16 maize inbred lines, each represented by 2 to 9 seed sources). The KASP and GBS-based SNP markers showed some discrepancy in numerical values when heterogeneity exceeded 12.5%, but the overall conclusions reached in classifying lines as genetically pure or not were highly similar. The correlation between KASP and GBS markers for estimating genetic purity varied from 0.90 to 0.93 depending on the number of GBS markers used for analyses [2]. The KASP-based SNPs are preselected high quality SNPs for QC analysis, but they are far fewer than the GBS markers used for estimating genetic purity, which may be one of the reasons for the small differences observed between KASP and GBS.
Genetic distance and kinship
Pairwise genetic distances among the 265 inbred lines ranged from 0.011 to 0.346 (Additional file 2), with an average of 0.313. Only fourteen pairs (0.04%) of inbred lines showed genetic distance estimates less than 0.05 with most of these pairs originating from CIMMYT. All pairs of the inbred lines with a genetic distance <0.05 were sister inbred lines, with shared pedigree across most generations. The proportion of pairwise comparisons with a genetic distance estimate less than 0.200 was ≤1% for inbred lines originating from EIAR and CIMMYT and was 15% for inbred lines originating from IITA (Fig. 2a). The genetic distance between pairs of inbred lines in 89% of the entire set, 88% of EIAR and 81% of CIMMYT fell in the range of 0.301 to 0.346 (Fig. 2a). Most IITA inbred lines had a genetic distance estimate between 0.200 and 0.300 (55%), while only 28% of them had genetic distance estimate between 0.301 and 0.346. The result suggested relatively narrow genetic variation among the sampled inbred lines of IITA compared to those of EIAR and CIMMYT. This could be due to differences in sample sizes of the lines included in the present study from the different institutes and the heterotic patterns of the lines as defined in the various programs. Previous genetic distance estimates reported for tropical maize germplasm are highly variable. In one of the recent studies, approximately 59% of the pairwise distances among 417 doubled haploid maize lines genotyped with 97,190 GBS markers ranged between 0.301 and 0.500 [37]. In another study involving 450 inbred lines developed by CIMMYT breeders in Africa, 95% of the pairs of inbred lines showed genetic distance values ranging between 0.301 and 0.500 [12].
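As a rough illustration of how such pairwise values arise, the sketch below computes a simple identity-by-state style distance between lines and reproduces the pair-count arithmetic behind the 0.04% figure. The distance formula, the 0/1/2 coding, and all names are assumptions for illustration; the distance measure actually used for Additional file 2 may be defined differently.

```python
from itertools import combinations
from typing import Dict, Optional, Sequence, Tuple

# Assumed (hypothetical) coding: count of one allele per locus (0, 1 or 2),
# with None for a missing call.
Call = Optional[int]


def ibs_distance(a: Sequence[Call], b: Sequence[Call]) -> float:
    """Mean allele-sharing distance over loci scored in both lines (0 to 1)."""
    diffs = [abs(x - y) / 2
             for x, y in zip(a, b)
             if x is not None and y is not None]
    if not diffs:
        raise ValueError("no loci scored in both lines")
    return sum(diffs) / len(diffs)


def pairwise_distances(
    genotypes: Dict[str, Sequence[Call]]
) -> Dict[Tuple[str, str], float]:
    """Distance for every unordered pair of lines."""
    return {(p, q): ibs_distance(genotypes[p], genotypes[q])
            for p, q in combinations(sorted(genotypes), 2)}


# Toy data with made-up line names.
toy = {
    "line_A": [0, 2, 1, 0],
    "line_B": [0, 0, 1, None],
    "line_C": [2, 2, 0, 0],
}
distances = pairwise_distances(toy)

# Pair-count arithmetic for the 265 inbred lines.
n_lines = 265
n_pairs = n_lines * (n_lines - 1) // 2   # 34,980 unordered pairs
share_close = 100 * 14 / n_pairs         # fourteen pairs, about 0.04%
print(n_pairs, round(share_close, 2))
```

With 265 lines there are 34,980 unordered pairs, so fourteen pairs below 0.05 correspond to about 0.04% of all comparisons, as stated in the text.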
Selection of parents with good phenotypic performance and a wide genetic base is one of the most important steps in the development of new hybrid varieties. In general, progeny variance increases in crosses between genetically distant parents [38], providing opportunities to generate progenies with maximum segregation for the desired target traits. Breeders use different methods in selecting the best parents for making new crosses, including (a) pedigree relationships, (b) phenotypic performance for specific traits, (c) adaptability and yield stability, and (d) genetic distances estimated from phenotypic traits and molecular markers [39]. The relationship between genetic distance and progeny genetic variance has been found to be inconsistent across species and studies, with some showing strong relationships and others showing weak relationships [8, 9, 39, 40, 41]. Nevertheless, genetic distances estimated from high density molecular markers could provide useful additional information for selecting the best parental combinations to generate new crosses for developing improved maize inbred lines.
The pairwise relative kinship coefficients among the 265 inbred lines ranged between 0.00 and 1.778, where values close to zero indicate lack of relationship, while those close to 2 indicate complete relationship. Fifty-nine percent of the relative kinship values were close to zero, 40% varied between 0.050 and 0.500, and the remaining 1% fell between 0.500 and 1.778 (Fig. 2b, Additional file 3). As shown in Additional file 4, the kinship heatmap computed using the 220,878 SNPs reached a peak between zero and 0.050, which shows a lack of relatedness in most pairs of inbred lines used in this study. The proportion of close to zero pairwise kinship values observed in the present study (59%) was much higher than that reported for the 450 inbred lines originating from the CIMMYT Africa maize breeding program, which was only 5.1% [12], and was relatively higher than that of the 632 inbred lines reported for the global maize collection [18]. Similar proportions of close to zero pairwise relative kinship values were reported for 61% of pairs among 100 inbred lines from INERA and IITA [22], 60% of pairs among 359 inbred lines from CIMMYT and IITA [13], and 64% of pairs among 544 inbred lines from CIMMYT [14]. Considering relative kinships within each of the three germplasm sources, EIAR had the highest percentage (64%) of pairs of lines with kinship values close to zero, followed by CIMMYT (54%) and IITA (29%) (Fig. 2b). Assemblage of maize germplasm from diverse sources might have contributed to the observed low level of relatedness among EIAR’s inbred lines.
Genetic relationship and population structure
The population structure of the inbred lines was assessed using PCA, DAPC and the model-based STRUCTURE. All three methods revealed the presence of three distinct groups, with 94% agreement on group membership predicted by the different methods (Figs. 3-4 and Additional file 1). Using DAPC, the first group was composed of 175 quality protein maize (QPM) and non-QPM inbred lines that were mainly extracted from broad-based pools and populations, such as Pool 9A for non-QPM lines and Pop 62 and Pop 63 for QPM inbred lines. Pool 9A, which is considered a heterotic group A (HGA) population, was developed from a pool of Kitale synthetic II (HGA), Ecuador 573 (heterotic group B, HGB), Colombian, Guatemalan, Tuxpeño (HGA) and SR52 (HGA/HGB) [42, 43]. Pool 9A is adapted to the highland transition-zone growing conditions and characterized by late maturity, semi-dent texture, and white grain. Pop 62, on the other hand, was originally derived from pool 40 [44]. Like Ecuador 573, pool 40 was developed for both the intermediate temperate ranges and colder maize growing areas of the tropics and subtropics [45]. Other germplasm sources in this group include pop 43, INTB, DRB, ZM605, ZM609, EV7992 and TZM, which represent both HGA and HGB germplasm. Some popular CIMMYT HGA testers (CML312 and CML442) and HGB testers (CML444) were also clustered in group one, highlighting the discrepancy between marker-based and combining ability-based heterotic groupings.
Group two consisted of 47 members, including CML395, CML202 and several other inbred lines recycled from CML395 and/or CML202, whereas group three consisted of 43 inbred lines that were primarily derived from either Ecuador 573 or CML197 genetic backgrounds. CML395 and CML202 are popular CIMMYT HGB testers used in tropical mid-elevation adapted germplasm in SSA, as both carry resistance to maize streak virus (MSV). Ecuador 573, which is an OPV originally obtained from Ecuador and improved through reciprocal recurrent selection with Kitale synthetic II [3, 43], is a popular HGB population adapted to highland growing conditions and characterized by late maturity and flint kernel texture. Ecuador 573 and Kitale synthetic II, and inbred lines extracted from them, have been extensively used as parents and testers in developing improved germplasm adapted to upper mid-altitude sub-humid and transitional highland sub-humid maize growing areas of Ethiopia [3]. If we rely only on pedigree information, inbred lines in groups two and three should belong to the same heterotic group (HGB). However, the magnitude of genetic distance and heterosis supports the molecular marker grouping. The highest genetic distance between pairs of lines in the present study (0.346) was between CML395 from group two and 142–1-e (derived from Ecuador 573) from group three. Furthermore, BH661, a high yielding three-way cross hybrid released in Ethiopia in 2011 [11], is a cross between a single cross hybrid from group two (CML395/CML202) and 142–1-e. This clearly supports the population structure detected between groups two and three in the present study. Therefore, the population structure defined by the molecular data appears more plausible than the conventional heterotic grouping based on pedigree and combining ability studies, which clusters CML395 and Ecuador 573 into the same group.
As shown in Fig. 5, results from the present study showed partial agreement with the conventional method of heterotic group designation in tropical inbred lines, but the pattern was not very distinct. Most HGA inbred lines clustered into group 1, while HGB inbred lines were distributed across all three groups; most, however, were in the second and third groups. Our findings on the lack of a clear pattern of grouping based on germplasm origin, and on the inconsistency between molecular marker-based clustering and the conventional classification, are in agreement with previous studies on tropical maize germplasm [12, 14, 22] that reported inconsistencies between molecular marker-based and combining ability/pedigree-based classifications. As the assignment of inbred lines into heterotic groups in tropical maize germplasm is a relatively recent phenomenon, the lack of clear genetic divergence between HGA and HGB lines in the current study is not entirely unexpected. Until the early 1990s, most of the tropical maize germplasm improvement effort of CIMMYT, IITA and most national agricultural research systems (NARS) was based on the development of pools and populations with stacked traits, with the objective of deriving adapted OPVs without consideration of heterotic pattern. With the advent of the seed sector and maize hybrid adoption in the tropics, the focus of CIMMYT and IITA switched in the early 1990s to inbred line development and the subsequent assignment of germplasm into distinct heterotic groups. Most tropical germplasm was assigned to either HGA (Tuxpeño, flint background) or HGB (ETO (Estacion Tulio Ospina), dent), which were found to combine well with each other, while a small fraction of germplasm was assigned to heterotic group AB (HGAB) due to lack of a clear combining pattern. Due to the high levels of diversity in tropical maize germplasm, it is likely to take several decades before heterotic groups can reliably be identified by molecular markers, phenotype or combining ability.
The current assignment of heterotic groups to inbred lines is based on test cross performance with various representative testers. However, it remains challenging to divide tropical maize inbred lines into clear heterotic groups based on combining ability results per se, as many of them are derived from mixed pools, while selection within each heterotic group has not been carried out for long enough to achieve maximum heterotic response between groups [19]. Therefore, many generations of reciprocal recurrent selection may be needed before inbred lines from each heterotic group begin to be significantly divergent [21]. In addition, combining ability-based heterotic group assignment relies on yield performance evaluation of different sets of lines with different testers. The reliability of combining ability-based heterotic grouping depends on several factors, including (1) the genetic background of the inbred lines, (2) the type and number of testers used, (3) inconsistencies in the number of environments used for yield trials, (4) the involvement of different breeders from the same or different institutions and use of different testers, and (5) lack of common check hybrids across different yield trials that could be used for comparing results across institutions, breeders, and years. Given such limitations of combining ability-based heterotic grouping, the partial agreement between the phenotypic and molecular-based heterotic grouping is expected. Hence, our results, together with others [12, 14, 15], suggest the need for complementing combining ability-based assessment with molecular fingerprinting and pedigree history when determining heterotic groups.
The lack of clear grouping of the inbred lines in this study based on their origin (EIAR, CIMMYT or IITA) is partly attributed to EIAR’s continuous acquisition of germplasm from both CIMMYT and IITA, and also to germplasm exchange among maize breeders at CIMMYT and IITA. Maize breeders from the NARS in SSA often have limited source germplasm for their breeding programs and are mainly dependent on CIMMYT and/or IITA maize germplasm. CIMMYT provides free access to over 576 publicly available CMLs and many other advanced inbred lines to maize breeders worldwide. Adapted inbred lines received from CIMMYT and IITA are crossed to various locally developed maize inbred lines to derive new improved inbred lines and hybrids. The classification of CIMMYT/IITA inbred lines into three heterotic groups (A, B or AB) has indirectly influenced many of the NARS breeders in SSA to adopt a similar system of heterotic grouping, which is essential for establishing a line and hybrid development pipeline.