Novel genomic approaches unravel genetic architecture of complex traits in apple
© Kumar et al.; licensee BioMed Central Ltd. 2013
Received: 2 January 2013
Accepted: 7 June 2013
Published: 12 June 2013
Skip to main content
© Kumar et al.; licensee BioMed Central Ltd. 2013
Received: 2 January 2013
Accepted: 7 June 2013
Published: 12 June 2013
Understanding the genetic architecture of quantitative traits is important for developing genome-based crop improvement methods. Genome-wide association study (GWAS) is a powerful technique for mining novel functional variants. Using a family-based design involving 1,200 apple (Malus × domestica Borkh.) seedlings genotyped for an 8K SNP array, we report the first systematic evaluation of the relative contributions of different genomic regions to various traits related to eating quality and susceptibility to some physiological disorders. Single-SNP analyses models that accounted for population structure, or not, were compared with models fitting all markers simultaneously. The patterns of linkage disequilibrium (LD) were also investigated.
A high degree of LD even at longer distances between markers was observed, and the patterns of LD decay were similar across successive generations. Genomic regions were identified, some of which coincided with known candidate genes, with significant effects on various traits. Phenotypic variation explained by the loci identified through a whole-genome scan ranged from 3% to 25% across different traits, while fitting all markers simultaneously generally provided heritability estimates close to those from pedigree-based analysis. Results from ‘Q+K’ and ‘K’ models were very similar, suggesting that the SNP-based kinship matrix captures most of the underlying population structure. Correlations between allele substitution effects obtained from single-marker and all-marker analyses were about 0.90 for all traits. Use of SNP-derived realized relationships in linear mixed models provided a better goodness-of-fit than pedigree-based expected relationships. Genomic regions with probable pleiotropic effects were supported by the corresponding higher linkage group (LG) level estimated genetic correlations.
The accuracy of artificial selection in plants species can be increased by using more precise marker-derived estimates of realized coefficients of relationships. All-marker analyses that indirectly account for population- and pedigree structure will be a credible alternative to single-SNP analyses in GWAS. This study revealed large differences in the genetic architecture of apple fruit traits, and the marker-trait associations identified here will help develop genome-based breeding methods for apple cultivar development.
Until the end of the 20th century, the lack of high throughput genotyping techniques and the limited development of high-density SNP arrays have hindered the advancement of genome-based breeding strategies for crop improvement. During the last 10 years, the genome sequences of about 20 plant species, including some from the Rosaceae family, were made publicly available , which offers opportunities for transforming breeding strategies to improve the yield and quality of major crops. Genome-wide association studies (GWAS) and genomic selection (GS) are among some new breeding tools proposed for crop improvement [2, 3]. The underlying philosophy of both these strategies is to genotype enough markers across the genome so that at least one of the genotyped markers is likely in LD with the quantitative trait locus (QTL) alleles . Both GS and GWAS can be conducted using the same genotypic and phenotypic data, but their objectives are different . GS is used to predict phenotype from marker profiles alone, to reduce dramatically the length of the breeding cycle and the costs involved in phenotyping [5, 6]. The objective of GWAS is to identify novel functional variation that can be deployed in cultivar development through marker-assisted selection .
GWAS studies in humans have used two fundamentally different designs : family-based and population-based (that use unrelated individuals). The power of a GWAS of a quantitative phenotype using related individuals was shown to be slightly lower than that for a sample of unrelated individuals in a human study , but in crops and livestock controlled mating could make family designs more powerful than a population sample . Both population-based  and family-based [10–12] designs have been used for GWAS in crops. The advantages of using relatives are manifold, including greater quality control, the ability to perform within-family tests of association that are robust to population stratification, and joint linkage and association analysis. A nested association mapping (NAM) population [10, 11], which consists of multiple families derived from multiple inbred lines crossed to one or more reference inbred line, has been used for GWAS of different traits in maize. Multi-parent advanced generation inter-cross (MAGIC) population was first used in Arabidopsis. Other family-based designs, such as parent-offspring, multi-generational pedigrees and multi-parent crosses, have historically been used in quantitative genetic studies. Thus, for plant populations, it is reasonable to consider large number of progenies derived from controlled crosses in various mating schemes for GWAS [4, 11, 13].
Population stratification and cryptic relatedness among studied individuals, when not taken into account, could lead to spurious genotype-phenotype associations in GWAS. Population stratification refers to the inclusion of individuals from different populations, while cryptic (or familial) relatedness refers to the presence of varying degree of genetic relationships between individuals within the study sample. GWAS methods based on the unified mixed linear model (MLM) were developed by  to account for population-level membership (to correct for structure) and individual-level relationships (to correct for cryptic relatedness). A realized individual-level kinship matrix (G) calculated using molecular markers is more efficient than the pedigree-based kinship matrix (A) as it can account for Mendelian sampling and segregation distortion [4, 15]. As family sizes in plant populations are much larger than those in other species, implementation of MLM was computationally very intensive. Therefore, the efficient mixed model association expedited (EMMAX) algorithm was developed to reduce this computational burden by re-parameterizing the MLM likelihood function . Further, a computationally more efficient and powerful compressed MLM (CMLM) that uses a group kinship matrix calculated from clustered individuals was developed recently . Development of these methods has now made it much easier to analyze large amounts of data for GWAS. Unlike fitting each SNP in turn, which is a typical feature of GWAS, simultaneously fitting high density genome-wide SNPs could avoid the need to fit population and pedigree effects in MLM specifically .
In 2010, an international consortium published the first draft of the apple (Malus × domestica Borkh.) genome sequence using DNA from a popular apple variety ‘Golden Delicious’ , which led to re-sequencing of 27 apple cultivars that are the founders in global apple breeding programs. These efforts produced a huge reservoir of DNA markers, which helped the development of the first-generation apple Infinium SNP chip, comprising nearly 8,000 markers . In the present study, we used this 8K SNP chip for GWA analysis of various fruit quality traits in a family-based design. Traits considered in this study relate to eating quality: fruit firmness (FF) and titratable acidity (TA); visual quality: red-flesh coverage (defined as weighted cortical intensity (WCI); see Methods); and susceptibility to physiological disorders: internal flesh browning (IB), bitter pit (BP) and fruit splitting (also termed cracking) (CR). To elucidate the relative contributions of different genomic regions, we implemented single-SNP analysis models, with and without accounting for population structure, and compared these with a model fitting all markers simultaneously. The statistical power of detecting SNP-trait associations was calculated using an expression derived in this study. The relative advantage of using realized relationships compared with pedigree-based expected relationships was also investigated. To our knowledge, this is the first large SNP array-based GWAS study to unravel the genetic architecture of quantitative traits for any major fruit crop.
Goodness-of-fit of ‘Q+K’ (includes population structure and familial relationships) and ‘K’ (only familial relationships) models were compared to understand whether population structure could bias results. Different numbers of PCs of the SNP genotypes matrix constituted the Q matrix. The values of the ‘Q+K’ and ‘K’ models were identical for WCI, TA and BP, but were higher for ‘Q+K’ for the other three traits. Thus, the optimum number, as determined using the Bayesian information criterion (BIC), of PCs varied for different traits: 0 for WCI, TA and BP; 1 for IB; and 2 for FF and CR. However, results with or without incorporating Q in Equation 3 were not materially different, suggesting that accounting only for cryptic relatedness was sufficient to account for population stratification.
Single nucleotide polymorphism (SNP) with the largest effects (i.e. highest value) on various traits; FF: fruit firmness; WCI: weighted cortical intensity; IB: internal browning; TA: titratable acidity; CR: fruit splitting; BP: bitter pit
SNP (NCBI db)
Linkage group & position (bp)
Gene name & ID
Polygalacturonase (PG); MDP0000232611
RING finger and CHY zinc finger domain-containing protein; MDP0000294924
Leucoanthocyanidin Reductase (LAR1); MDP0000279135
Leucoanthocyanidin Reductase (LAR1); MDP0000279135
Estimates of variance explained by single nucleotide polymorphisms (SNPs) for various apple fruit traits; FF: fruit firmness; WCI: weighted cortical intensity; IB: internal browning; TA: titratable acidity; CR: fruit splitting; BP: bitter pit
The availability of genome sequence, the abundance of DNA markers, and high throughput genotyping platforms are providing a range of applications of molecular markers, including pedigree reconstruction, estimation of genetic parameters, and understanding relationships between genotype and phenotype in various species [5, 21–23]. Using a likelihood ratio based parameter , our study showed that using SNP-based realized relationships in MLM could provide a better goodness-of-fit than using pedigree-based expected relationships. Our results also showed that for all traits, except IB, fitting all markers simultaneously could explain most or all the trait heritability. Similar results have been reported in studies on humans  and animals [24, 25]. Also, approximate standard errors of G-based estimates of h 2 were considerably less than those for A-based because the former captured genetic relationships that are not accessible from pedigree records. Similar to studies on animals , our results suggest that the accuracy of artificial selection in plants species can also be increased by using more precise marker-derived estimates of genetic parameters.
With only a little loss of power, family-based designs in association studies offer various advantages compared with population-based designs [7, 8]. The average SNP-derived pair-wise coefficient of relationship in our study was about 0.50, but the loss of power to detect SNP-trait association was small (about 0.05) compared with that for an unrelated sample of the same size. Various factors such as sample size, high LD, minimal effect of population structure, and the presence of some large-effect QTL provide high confidence in the SNP-trait associations identified in this study. The SNP array used in this study was designed to encompass SNPs in the coding region of predicted gene models and some candidate genes such as MdMYB10, MdPG, and MdLAR. Peak association signals for FF, WCI and TA were located close to genomic regions that have been previously identified in bi-parental QTL mapping studies. The SNPs showing the largest effect on FF on LG10 (Figure 4) reside in the polygalacturonase (PG) gene, which depolymerizes cell wall pectin, and the involvement of this gene in the fruit softening process has been previously demonstrated in apple .
Red color in apple flesh results from a high concentration of anthocyanins. The role of the MdMYB10 gene on anthocyanin biosynthesis in red-fleshed apple was demonstrated using various approaches , and this gene has been mapped to LG9 . The SNP marker associated with weighted cortical intensity (WCI) in our experiment is located in the second exon of MdMYB10, which is physically close to the motif in the MdMYB10 promoter that causally regulates transcription of MdMYB10 itself and thereby anthocyanin synthesis in apple flesh . A cluster of SNPs at LG16 commonly associated with WCI and IB resides in the MdLAR1 candidate gene. LAR1 is a key enzyme in the flavonoid biosynthetic pathway, reducing leucoanthocyanidin into the flavanol compound catechin, a monomer of condensed tannins (also known as proanthocyanidins). Perhaps condensed tannins (CTs) act as co-pigments of cyanidin to create more intense red coloration in the fruit and hence the effect on WCI. Common genomic regions (especially MdMYB10 gene) found associated with WCI and IB are supported by results showing high genetic correlation (≈ 0.60) between these two traits . Also, cold-stored fruit from all MdMYB10 transgenic lines of cultivar ‘Royal Gala’ showed varying degrees of symptoms of IB , suggesting a pleiotropic effect of MdMYB10 on WCI and IB. In our study, the estimated genetic correlation differed in degree as well as direction across different LGs, suggesting some possibility of breaking this undesirable correlation by means of carefully designed breeding and selection strategies, but further work on elucidating the genetic architecture of WCI and IB is required first.
The distribution of SNP effects for TA (Figure 4) suggests one major QTL on LG8, supporting similar results from bi-parental QTL mapping studies [30, 31]. However, there is no published report of candidate genes for TA on apple LG8. Segregation analysis approaches showed that inheritance of TA in a large apple population was better described by a mixed genetic architecture (a major gene and polygenes rather than polygenic or Mendelian inheritance ). Our results appeared to be in agreement with  in the sense that the largest-effect SNP accounted for about half the genetic variation, while the other half was accounted for by small-effect genes.
CR, which is a pre-harvest physiological disorder of apple, can be a serious economic problem for some cultivars . However, little is known about the genetic architecture of this trait. Similarly to our results, estimate of h 2 from a previous study  indicated moderate genetic control of CR. While little is known on the physiological causes of genetic variability in CR susceptibility, it may be linked to the internal properties of the fruit during growth and differences in the elasticity of the peel when under particular stresses and strains caused by the developing parenchyma cells beneath it . To our knowledge, there is no published report on marker-trait association for CR. A cluster of SNPs with a large effect on CR resides within the MdLAR1 candidate gene on LG16, which was previously reported to influence some other fruit quality traits such as astringency  and polyphenolic compounds .
BP is also a serious physiological disorder whose expression is generally observed in fruit after storage, but symptoms can also be observed on the fruit surface at harvest (sometimes called lenticel blotch ), and in our study they were classified as the same disorder. Genetic predisposition, calcium nutrition of the fruit, and environmental factors influence incidence of BP [36–38]. Based on the segregation ratio of resistant to susceptible seedlings, it was hypothesized that resistance to BP is controlled by two major dominant genes, named Bp-1 and Bp-2. Different segregation ratios (e.g. 1:1; 2:1 and 7:1) of resistance to susceptible seedlings were observed in various families in our study, suggesting a complex nature to this trait, which is further supported by our results showing that GWA-significant SNPs explained only about half the observed genetic variation (Table 2). Interestingly, the same cluster of SNPs on LG16 showed association with expression of BP and CR, and the direction and magnitude of LG-level genetic correlations were similar. Molecular, physiological and biochemical pathways that commonly contribute to the expression of these two traits are poorly understood, but our study provides a genomic hotspot for further investigations.
MLM that concurrently fits all available SNPs has been adapted in recent GWAS in animals [24, 25] following an earlier study  that showed that provided high-density SNPs are fitted simultaneously, admixed populations can be used to obtain reliable SNP effects even if pedigree structure and population structure have not been explicitly modeled. High correlations between SNP ASEs obtained from the ‘Q+K’ model (Equation 3) and all-marker analysis (Equation 4) for all six traits in this study (Figure 7) reinforces the findings of .
In our study population, the average LD between SNPs separated by 500 kb was high (r 2 = 0.25) largely because of relatedness (e.g. full sibs and half sibs) among seedlings and small effective population size. It is not uncommon to find different LD structures in different types of plant populations within a species. For example, in maize, LD decays within 1 kb in land races, within 2 kb in diverse inbred lines, and can extend up to 500 kb in commercial elite inbred lines (reviewed by ). Preliminary results (not shown) from an unrelated set of 125 individuals from a diverse apple germplasm collection showed that for a given distance (say, 500 kb) between markers, the extent of LD was almost one-third (r 2 = 0.08) of that in this study. One practical implication of these results is that marker-trait associations identified in advanced-generation crosses may not be repeated in relatively less improved breeding material (e.g. diverse germplasm).
One of the key goals of GWAS is to identify large-effect marker-trait associations that can be deployed through marker-assisted selection (MAS) in subsequent generations of cultivar development populations. However, in order to conduct MAS using these SNPs in successive generations, strong LD between marker and QTL must persist across generations. Generally, only a selected set of individuals from the current generation are used as parents for the subsequent generation. Selection will lead to changes in allele frequencies at marker and trait loci, potentially reducing the LD between two loci, similar to that observed in the second generation of our study material (see Figure 3). As a result, the efficiency of GWAS-associated SNPs could be lower in the following generations. Nevertheless, except for WCI, the significant SNPs jointly explained less than 50% of the trait heritability in this study, which raises a question of how much variation in a quantitative trait needs to be accounted by a marker (or two) so that it would be worthwhile for MAS to be applied. Such a MAS scheme is generally viewed as cost effective compared with the genome selection, but this scheme does not bypass the phenotypic evaluation stage because there could be some quantitative traits for which no significant SNPs are identified. On the other hand however, a small SNP assay comprising GWA-significant SNPs for the key breeding traits could be used for pre-screening of seedlings before further field evaluation. Such an approach will not reduce the length of the breeding/selection cycle, but will shift the mean of the selection population. In order to keep the accuracy of MAS similar to that in the generation where SNPs were identified, periodical recalibration of SNP effects would be necessary .
The use of realized relationship matrix will provide higher accuracy of estimated genetic parameters, resulting in increased accuracy of artificial selection. There are apparently major differences in the genetic architecture of various traits in this study, i.e. for traits with similar heritability the distributions of SNPs effect were very different. The majority of SNPs individually explained only a small proportion of trait variation, but fitting all markers simultaneously captured most of the trait heritability for majority apple fruit traits. These findings suggest that genome-based methods could potentially replace the traditional apple cultivar breeding methods.
A set of four female parents and two pollen parents were crossed in a factorial (4 × 2) mating design. One of the crosses was unsuccessful, leaving seven full-sib families. Seedling numbers varied between families, ranging from 40 to 350, with a total population size of 1,200. Seedlings were planted into the orchard (Havelock North, New Zealand) in July 2008 using a randomized block design. Further details of this experiment such as, parents involved, orchard management, harvesting and fruit storage protocols, were reported earlier . Six traits were evaluated on the fruit samples using instrumental, sensory, or visual assessment methods. Phenotypic assessments for all traits were repeated for two years. Fruit splitting, observed as radial cracks in the stem end of the fruit (CR), and bitter pit (BP) were scored visually as presence or absence, with BP symptoms also noted if present within the fruit after cutting. Fruit from each seedling were cut in half across the equator and the proportion of the cortex area that was red (PRA) and the intensity of the red (RI) (=0 (none) to 9 (highest)) were scored. A weighted cortical intensity (WCI) was then calculated (PRA × RI) as an estimation of the amount of red pigment in the fruit. The proportion of the cortex area showing symptoms of cold-store-induced internal browning (IB) was recorded. Assessment protocols for fruit flesh firmness (FF) and titratable acidity (TA) were described in detail in an earlier study . Individual fruit measurements (FF, WCI and IB) were first averaged for each seedling, and the average performance of each seedling over two years was used for testing genotype-phenotype associations.
Details of genotyping protocols for our study population were reported earlier . Briefly, SNP genotypes were scored using the Genotyping Module (version 1.8.4) of the Illumina® GenomeStudio software (Illumina Inc.). The reliability of each genotype call was measured using the GenCall score, and SNPs were subsequently discarded using a sequence of criteria in the following order: GenCall score at the 50% rank (50% GC) < 0.40; cluster separation (ClusterSep) < 0.25; more than 5% missing calls; segregation discrepancy. Finally a high quality set of 2,500 SNPs was retained, and BEAGLE 3.1 software  was then used for imputing missing SNP genotypes.
Before looking at marker–trait associations, we calculated pairwise LD between SNPs, as a surrogate to LD between markers and QTLs, to evaluate the extent of LD in the study population (arbitrarily called the first generation) described above. These LD patterns were compared with those in an another population (second generation) comprised of 1600 seedlings derived from an incomplete factorial mating between six paternal parents (identified from the first generation) and four maternal parents (identified from previous selections). The degree of LD was quantified with the parameter r 2 estimated using GOLD software .
where logLM is the maximum log-likelihood from fitting Equation 1; logL0 is the maximum log-likelihood from fitting the intercept-only model. In addition to comparing values, we also compared estimates of heritability (h 2) of each trait obtained using the A or G matrices. Equation 1 was fitted using ASReml software .
where β is an unknown vector containing fixed effects, including a genetic marker, population structure (Q), and the intercept; X is the known design matrices relating to β. All other effects are same as in Equation 1. Equation 3 was fitted using GAPIT software , which uses computationally efficient and powerful methods, such as EMMAX  and CMLM . To avoid spurious associations that could arise from population structure, we included principal components (PCs) derived from the genotypic data matrix (n × m) as covariates (i.e. Q matrix). The optimal number of PCs was determined by forward model selection using the Bayesian information criterion as implemented in GAPIT. In Equation 3, each SNP was tested in turn using a t-test (H0: No additive association between the SNP and trait), and p-values were obtained. Uncorrected comparison-wise p-value of p < 5 × 10-7, which is generally accepted to represent very strong proof of genome-wide association [47, 48], was used to identify significant marker-trait associations for all traits. A quantile-quantile (Q-Q) plot, which is commonly used for scrutinizing the population stratification in GWAS, was used to assess how well the model used for marker-trait association (Equation 3) accounted for population structure and familial relatedness. In this plot, the negative logarithms of the p-values from Equation 3 were plotted against their expected value under the null hypothesis of no association with the trait. To compare the relative contributions of each SNP, we used values obtained by fitting Equation 3 with and without each SNP.
where p i is the frequency of the A allele at the i th SNP (assuming three possible genotypes at each SNP were scored as AA, AB and BB); q i = 1 - p i ; elements of the i th column of M are 2q i , q i - p i and - 2p i for AA, AB and BB genotypes at a SNP locus. Product–moment correlations between ASE of SNPs obtained from Equations 3 and 4 are reported in this study.
Common genomic regions showing a significant effect on a pair of traits were further investigated by comparing LG-level estimates of between-traits genetic correlations (r g ). For this purpose, correlation between the estimated BLUP-BV (from Equation 1) for two traits was termed as genome-level genetic correlation. BLUP-BVs of each seedling for each trait was then decomposed into LG-level BVs by using SNP ASEs (from Equation 4) and seedling’s SNP genotypes at each LG. These LG-level BVs were then used to estimate LG-level between-trait genetic correlations.
where R (≈ 1 - r 2 h 2(1 - h 2)) is the ratio of the approximate non-centrality parameter for genetically related individuals versus unrelated individuals, assuming that resemblance between individuals is due to additive genetic effects ; r is the coefficient of relationship and h 2 is the narrow-sense heritability; Γ(.) is a gamma function; b and σ b are the regression coefficient and its standard deviation respectively. For derivations of b and σ b , refer to .
We thank Charmaine Carlisle and Kim Green for technical assistance in leaf sampling and DNA extraction. Alex Lipka helped in discussions for analysis using GAPIT software. Michela Troggio is thanked for sharing linkage map positions of SNPs. Peter Visscher is thanked for helpful discussion and suggestions during the course of this study. Luis Gea provided constructive comments and suggestions on the manuscript. This research, and the contribution of SK, CW, DC and RV, was partly funded by Prevar (Contract no. 26015) and the New Zealand Foundation for Research Science and Technology (Contract no. C06X0812). The contribution of MB was carried out as part of the EU-FruitBreedomics project funded by the Commission of the European Communities (Contract FP7-KBBE-2010-265582).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.