 Research article
 Open Access
Ascertainment bias from imputation methods evaluation in wheat
BMC Genomics volume 17, Article number: 773 (2016)
Abstract
Background
Whole-genome genotyping techniques like Genotyping-by-sequencing (GBS) are being used for genetic studies such as Genome-Wide Association Studies (GWAS) and Genome-wide Selection (GS), where different strategies for imputation have been developed. Nevertheless, imputation error may lead to poor performance (i.e. smaller power or higher false positive rate) when complete data are not required, as is the case for GWAS, where each marker is tested one at a time. The aim of this study was to compare the performance of GWAS analysis for Quantitative Trait Loci (QTL) of major and minor effect using different imputation methods when no reference panel is available in a wheat GBS panel.
Results
In this study, we compared the power and false positive rate of dissecting quantitative traits for imputed and not-imputed marker score matrices in: (1) a complete molecular marker barley panel array, and (2) a GBS wheat panel with missing data. We found that there is an ascertainment bias in imputation method comparisons. Simulating over a complete matrix and creating missing data at random showed that imputation methods have a poorer performance. Furthermore, we found that when QTL were simulated with imputed data, the imputation methods performed better than the not-imputed ones. On the other hand, when QTL were simulated with not-imputed data, the not-imputed method and one of the imputation methods performed better for dissecting quantitative traits. Moreover, larger differences between imputation methods were detected for QTL of major effect than for QTL of minor effect. We also compared the different marker score matrices for GWAS analysis in a real wheat phenotype dataset, and we found minimal differences, indicating that imputation did not improve the GWAS performance when a reference panel was not available.
Conclusions
Poorer GWAS performance was found when an imputed marker score matrix was used and no reference panel was available in a wheat GBS panel.
Background
Genetic markers are nowadays an essential part of plant and animal breeding programs. Next-generation sequencing (NGS) techniques allow discovering, sequencing, and genotyping thousands of Single Nucleotide Polymorphisms (SNPs) covering the whole genome [1]. These SNPs are being used in analyses like transcriptome assembly [2], generation of high-quality draft genomes even for complex genomes [3], understanding plant growth [4], evaluating the effect of epigenetics in plant development [5], isolation of mutant genes [6], species evolution and economic insight [7], genetic diversity [8], GWAS [9], and GS [10]. The GBS technique is one of the most used NGS approaches [8–11]. It was developed originally for barley and maize, and later extended to other species with complex genomes like wheat [8–11]. GBS that relies on methylation-sensitive restriction enzymes is highly efficient [12]. However, GBS generates a large proportion of missing data, owing to the use of short reads and low sequencing depth [12]. Therefore, different strategies to impute missing data have been developed and used for genetic analyses [9]. Some imputation methods use reference panels and are based on Linkage Disequilibrium (LD), while other methods do not require reference panels. In the first group, the most common methods are known as MACH [13], IMPUTE [14], fastPHASE [15], PLINK [16], and Beagle [17]. All of them use haplotype segments from a densely genotyped reference panel to impute missing markers [18–20]. MACH uses a Markov Chain based algorithm to infer pairs of haplotypes for each individual’s genotypes [13]. IMPUTE considers the sequence of pairs of known haplotypes as hidden states, then models the sequence of hidden states based on a recombination map estimated from the reference data, and finally predicts unknown genotypes [14].
The fastPHASE algorithm is a haplotype clustering algorithm that samples missing genotypes based on allele frequencies estimated from reference haplotypes, and then uses an Expectation Maximization (EM) algorithm to estimate parameter values to infer missing genotypes [15]. PLINK predicts missing data by the local haplotypic background and by the haplotype formed by the two or more flanking SNPs [16]. Finally, Beagle is a haplotype clustering based algorithm that uses the localized haplotype cluster model to group haplotypes at each marker and then finds the most likely haplotype pairs based on the individual’s known genotypes [17]. Therefore, strong LD among markers and low minor allele frequency (MAF) are required for LD-based imputation methods to be effective [21]. Additionally, more markers with an even genome coverage (and therefore smaller distances among markers) and markers with larger subpopulation differentiation are also desirable to ensure imputation accuracy [22]. The second group of methods does not require a reference panel and includes imputation by the mean, the MVNEM algorithm, and random forests [10]. In mean imputation, the most common allele at a particular marker in the population is used to impute missing data. MVNEM, on the other hand, was designed for use with GBS and combines the realized additive relationship matrix between the lines with an EM approach, assuming that marker genotypes follow a multivariate normal distribution. Finally, random forest methods use an algorithm with multiple decision trees to determine a prediction value for each missing data point. For an overview of the imputation methods see [10].
Several studies found that imputation can improve QTL detection power [23, 24], but other studies found that larger power is accompanied by either larger false positive rates or an increase in the multiple-testing penalty [20, 25]. Unless a ‘one-hit’ procedure is used (i.e. the uncertainty of genotypic probability distributions due to the imputation is incorporated in the GWAS analysis), large imputation error can be generated [26]. Other studies found that imputation should be carefully evaluated because quality control of the data is an important source of loss of power [27]. To carry out GWAS analysis, where one marker at a time is tested, marker-trait associations can be estimated without marker imputation using the available information at each marker.
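The idea of testing one marker at a time with only the available information can be illustrated with a minimal sketch. It is not the mixed model used in this study: it assumes a plain linear regression of phenotype on marker score, scores coded -1/1 with `None` for a missing call, and a hypothetical function name `single_marker_t`. Lines with a missing score are simply dropped for that marker, so no imputed value enters the test.

```python
import math

def single_marker_t(scores, phenotypes):
    """t statistic for the slope of phenotype on marker score.

    Only lines with an observed score (not None) are used, so the
    number of lines can differ from marker to marker.
    Returns (t_statistic or None, number_of_lines_used).
    """
    pairs = [(x, y) for x, y in zip(scores, phenotypes) if x is not None]
    n = len(pairs)
    if n < 3:
        return None, n  # too few observed lines to estimate a slope and its error
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    if sxx == 0:
        return None, n  # marker is monomorphic among the observed lines
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    rss = sum((y - mean_y - slope * (x - mean_x)) ** 2 for x, y in pairs)
    if rss == 0:
        return math.copysign(math.inf, slope), n  # perfect fit
    standard_error = math.sqrt(rss / (n - 2) / sxx)
    return slope / standard_error, n

# Toy data (made up for illustration): the third line has a missing call.
scores = [1, -1, None, 1, -1, 1]
phenotypes = [2.0, 1.0, 5.0, 2.5, 0.5, 1.5]
t_stat, n_used = single_marker_t(scores, phenotypes)
```

In this toy example the third line contributes nothing to the test, whatever its phenotype, which is the behaviour the not-imputed analysis relies on.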
The aim of this study was to compare the performance of imputation methods for GWAS analysis when no reference panel is available in a wheat GBS panel. Specifically, our objectives were: (1) to evaluate the effect of imputation using a golden standard (i.e. simulation over a complete marker score matrix), to determine whether ascertainment bias is responsible for imputation success; (2) to evaluate whether the outcome of the imputation performance is affected by the marker score matrix used to simulate the QTL; and (3) to compare the effect of imputation in a real phenotype wheat panel using GBS data with different missing rates (25 %, 35 % and 50 %) and four phenotypic traits.
Results
The strategies we pursued are explained in the Methods section, and the general procedure is presented in Fig. 1. We used different numbers of QTL and heritabilities to simulate the QTL, along with different thresholds for calling the QTL. We summarized the results with power (PO) and false positive rate (FPR).
Ascertainment bias in imputation performance comparison (golden standard)
When we used a golden standard matrix of barley for simulating the QTL (i.e. a complete dataset, for the general approach see Fig. 1A.1), we found that for major QTL effects, larger power was obtained without imputing the genotypic matrix. Furthermore, for minor QTL effects, larger power was detected without imputing the genotypic matrix or imputing it with the MVNEM method (G_{ NImp }, G_{ MVNEM }; Fig. 2). The smallest false positive rate was obtained for the genotypic matrix imputed by the RF method (G_{ RF }), and the largest false positive rate was obtained with the MVNEM imputation method (G_{ MVNEM }). False positive rates were still very small (i.e. 0.015, Fig. 2). Power was also small in general (i.e. 0.3, Fig. 2). The same pattern was found when using different threshold levels for the dissection of quantitative traits (i.e. Bonferroni corrected by the effective number of independent markers, Fig. 2; Bonferroni correction, Additional file 1; and an arbitrary threshold set at α = 0.01, Additional file 2).
When we simulated QTL over an imputed matrix (for the general approach see Fig. 1A.2), we found that larger power was obtained with the imputed genotypic matrices (G_{ Mean }, G_{ MVNEM } or G_{ RF }), while the largest false positive rate was obtained with the MVNEM imputation method (G_{ MVNEM }) (Fig. 3). However, when QTL were simulated over a not-imputed matrix, the largest power was obtained when the not-imputed matrix or the matrix imputed by MVNEM was used (G_{ NImp } or G_{ MVNEM }). This pattern was consistent across numbers of QTL (i.e. 25 and 50, data not shown) and heritabilities (i.e. 0.2, 0.4, 0.6, 0.7, 0.9, Fig. 3). The same pattern was found when using different threshold levels for the dissection of quantitative traits (i.e. Bonferroni corrected by the effective number of independent markers, Fig. 3; Bonferroni correction, Additional file 3; and an arbitrary threshold set at α = 0.01, Additional file 4).
Imputation effect for real GBS data with 25 %, 35 % or 50 % missing information
By using naturally sparse genotypic matrices like GBS in wheat with 25 %, 35 % or 50 % missing data (for the general approach see Fig. 1b), we detected that larger power was obtained when the not-imputed matrix or the matrix imputed by MVNEM was used (Fig. 4, Additional files 5 and 6). However, when simulating over a matrix with imputed data, larger power was obtained by recovering QTL with an imputed matrix (Fig. 4, Additional files 5 and 6). This was true for the different numbers of QTL (i.e. 25 and 50, data not shown) and heritabilities (i.e. 0.2, 0.4, 0.6, 0.7, 0.9, Fig. 4, Additional files 5 and 6). Differences in power were more evident for major QTL, resulting in a reasonable increase of power for high heritabilities (Fig. 4). The largest values of false positive rate were found when simulating with the Y_{simNImp } and G_{ MVNEM } or the Y_{simRF } and G_{ MVNEM } (Fig. 4, Additional files 5 and 6). Additionally, the same pattern was found using different threshold levels (i.e. Bonferroni corrected by the effective number of independent markers, Fig. 4; Bonferroni correction, Additional file 7; and an arbitrary threshold set at α = 0.01, Additional file 8).
Imputation effect on GWAS for real phenotypes
We compared the QTL obtained for GWAS analysis using real phenotypic data from wheat, between the not-imputed matrix (G_{ NImp }) with different missing rates (25 %, 35 % and 50 % of missing data), and the genotypic data imputed with the mean, MVNEM or RF method (G_{ Mean }, G_{ MVNEM } or G_{ RF }). The performance of GWAS analysis was similar across imputation methods (Fig. 5, Additional files 9 and 10), but not all QTL were detected across methods. For the 4 traits, plant height (PH, cm), days to heading (DH, days), thousand kernel weight (TKW, g) and spikes per square meter (SPM, number, Fig. 6, Additional files 11 and 12), we detected different putative QTL when using imputed or not-imputed matrices. In general, the MVNEM imputation method performed similarly to no imputation, with some QTL being detected by both methods (Fig. 6, Additional files 11 and 12). However, each approach also found unique QTL (Fig. 6, Additional files 11 and 12).
Differences between methods for false positive rate
When we drew FPR boxplots across replications to assess whether the differences between methods were significant or due to random error (Additional files 13, 14, 15, 16, 17), we found that FPR was larger for: (i) the genotypic matrices imputed by the MVNEM method for the golden standard, (ii) the genotypic matrix imputed by the MVNEM method (G_{ MVNEM }) for the ascertainment bias, (iii) the genotypic matrices imputed by the MVNEM or RF methods (G_{ MVNEM }, G_{ RF }) for the GBS data with 35 % or 50 % missing data, and (iv) the genotypic matrices imputed by the RF method (G_{ RF }) for the GBS data with 25 % missing data.
Discussion
New whole-genome genotyping techniques are constantly being developed and used for genetic analyses like GWAS [9]. Although GBS is a powerful tool for genotyping hundreds of individuals with thousands of SNPs, it generates large amounts of missing information, and therefore researchers have applied several strategies to impute these missing data [14–17]. However, when we retained a considerable amount of missing information using GBS data in wheat or artificially removed genotypic data from complete panels in barley, we found that imputation does not improve the dissection of quantitative traits in several situations. Our results should be restricted to our panels, which have a specific LD (barley and wheat) and SNP quality, because the continuous improvement of sequencing technologies decreases costs and therefore increases sequencing depth and quality, leading to a lower missing rate.
Ascertainment bias in imputation performance comparison (golden standard)
When we used the “golden standard” marker score matrix, the not-imputed marker score matrix outperformed the imputation methods for all the combinations of parameters (Fig. 2, Additional files 1 and 2). The higher values of false positive rate found with the MVNEM matrix and the lower values of power found with the RF matrix for all thresholds could be a consequence of imputation error affecting the signal of the QTL.
The fact that the not-imputed marker score matrix also outperformed the imputation methods when comparing power and false positive rate simultaneously with real GBS data (i.e. data with missing points, Fig. 4) suggests that using an imputed matrix for GWAS analysis could introduce an ascertainment bias. This could arise when there is no reference panel and the uncertainty of genotypic probability distributions due to the imputation is not considered, as studies of LD-based methods have found that if some restrictions are taken into account (i.e. strong LD among markers, low MAF, short distances between not-imputed markers, and markers with higher subpopulation differentiation), the imputation accuracy and then the GWAS are improved [22, 28].
Although the low power found to detect QTL for the barley marker score matrix could theoretically be due to low LD between markers in the same LD blocks, we do not expect this to be the reason for the low power in our study. When there are unlinked QTL controlling a trait, the power is moderate even with large populations and high heritabilities [29]. However, we do not expect unlinked QTL within the LD blocks due to the cluster of markers within those blocks [30], and because the genome coverage of the markers was very high, with 50 % of the SNPs at a distance smaller than 0.625 cM (Table 1). The small population (122 lines) used for the barley dataset could be the reason for the low values of power detected, as power is a function of the population size [31]. However, this should not affect the imputation methods differently. Additionally, the great differences found in power and false positive rate between major and minor QTL could indicate that major QTL are the QTL mostly detected by any of the imputation methods. Other LD structures in different populations could make our results vary; therefore, these results are restricted to the populations used in this analysis. Further analyses considering different population structures should be conducted.
Imputation effect for real GBS data with 25 %, 35 % or 50 % missing information
The differences found when we simulated QTL on top of imputed or not-imputed marker score matrices (Fig. 4, Additional files 5 and 6) were probably due to the imputation method used and the simulation. Therefore, we found that not imputing was the best option for evaluating one marker at a time in GWAS analysis using GBS data with 25 %, 35 % or 50 % missing information, especially for detecting major QTL.
Imputation effect on GWAS for real phenotypes
As no significant differences were detected in the real wheat datasets in terms of type I error inflation from imputation (Fig. 5, Additional files 9 and 10), we consider that imputation did not improve the GWAS performance and therefore is not needed.
The traits evaluated in this paper were selected for having high heritability values and being related or a component of grain yield. The high heritability values may have reduced the differences between the QTL found with G_{ NImp } or G_{ MVNEM }.
We found QTL where previous QTL were reported. The QTL found for TKW (chromosome 1B, bin 224 and 242) with the G_{ NImp }, G_{ MVNEM } and G_{ Mean } matrices for 50 % missing data, and with the G_{ Mean } matrix for 25 % and 35 % missing data, are partially coincident with a QTL reported for green leaf area [32], a QTL reported for Normalized Difference Vegetation Index [33] and a QTL reported for yield, anthesis and plant height [34]. A QTL found for TKW (chromosome 1D, bin 205) with the G_{ NImp }, G_{ MVNEM } and G_{ Mean } matrices for 25 %, 35 % and 50 % missing data is coincident with a QTL reported for grain yield and plant height [34]. The QTL found for TKW (chromosome 2D, bin 167) with 3 marker score matrices (G_{ NImp }, G_{ MVNEM } and G_{ Mean }) for 25 %, 35 % and 50 % missing data, and for SPM (chromosome 2D, bin 167) with the G_{ NImp } matrix for 25 % and 50 % missing data and with 3 marker score matrices (G_{ NImp }, G_{ MVNEM } and G_{ Mean }) for 35 % missing data, are coincident with a QTL reported for kernel weight, Normalized Difference Vegetation Index and flag leaf [33]. A QTL found for DH (chromosome 3B, bin 282) with G_{ NImp } and G_{ MVNEM } for 50 % missing data is coincident with a QTL reported for grain filling duration [32]. A QTL found for SPM (chromosome 4A, bin 179) with the G_{ NImp } and G_{ MVNEM } matrices for 25 %, 35 % and 50 % missing data is coincident with a QTL reported for anthesis and plant height [34]. The QTL found for DH (chromosome 4B, bin 106) with the G_{ NImp } matrix for 50 % missing data is coincident with a QTL reported for yield and plant height [34]. A QTL found for DH (chromosome 6B, bin 116) with the G_{ NImp } matrix for 35 % and 50 % missing data, and with the G_{ NImp } and G_{ Mean } matrices for 25 % missing data, is coincident with a QTL for yield and plant height [28].
A QTL found for PH (chromosome 7A, bin 225) with the G_{ NImp } and G_{ Mean } matrices for 50 % missing data, is coincident with yield and anthesis [34]. These positions are based on bins and should be regarded as an approximation. These could be improved after the draft of the genome is available [35].
As we found that the QTL detected by the G_{ NImp } and G_{ MVNEM } matrices were similar, we believe that imputation does not improve GWAS analysis.
Conclusions
Imputation can introduce an ascertainment bias to GWAS analysis using GBS within crops when a reference panel is not available. Comparing the GWAS performance by the power and false positive rate with imputed or not-imputed marker score matrices, poorer performance was found when an imputed marker score matrix was used. Additionally, the power and false positive rate changed clearly between major and minor QTL, showing that differences among imputation methods were more evident for major QTL and that the detection of minor QTL is negligible. Our results are restricted to the wheat panel used, as they could vary with different LD and with different GBS data quality, which is affected by different SNP identification algorithms.
Methods
Dataset
We used two datasets: (1) a complete barley SNP array panel (i.e. 99 % coverage), and (2) a GBS wheat marker score matrix with an average of 25 %, 35 % or 50 % missing points and phenotypic data (for the general approach see Fig. 1).
The complete barley SNP marker score array dataset (Additional file 18) consisted of a panel of 122 barley advanced inbred lines from a population of 360 described in [36]. Briefly, 1,096 SNPs from the Barley Oligonucleotide Pool Assay 1 (BOPA 1, Additional file 19) were selected [37, 38]. A total of 122 lines were chosen to form 2 complete datasets without missing information (Table 1).
The wheat GBS dataset (Additional file 20) consisted of a panel of 384 advanced inbred lines from breeding programs: 186 genotypes from the National Wheat Breeding Program from Uruguay (INIA-Uruguay, National Institute of Agricultural Research), 55 genotypes from the National Wheat Breeding Program from Chile (INIA-Chile), and 143 genotypes from the International Breeding Center of Maize and Wheat (CIMMYT, International Maize and Wheat Improvement Center), published in [39]. The CIMMYT genotypes share common ancestors with the INIA-Chile genotypes (see [39] for more details). DNA was extracted with the DNeasy Plant Maxi Kit (QIAGEN). Library construction was conducted at Kansas State University (Manhattan, Kansas) using a PstI-MspI GBS protocol [10]. The sequencing was performed on an Illumina HiSeq 2000 at the DNA core facility at the University of Missouri, Columbia, Missouri, and the McGill University-Génome Quebec Innovation Centre (Montreal, Canada) for each set of libraries. SNPs were obtained using the Tassel-GBS Pipeline [40]. The base quality and distribution of sequences were studied with the Galaxy (https://galaxyproject.org/) software. SNPs with more than 25 %, 35 % or 50 % missing points and with minor allele frequency (MAF) smaller than 10 % were excluded. Sequences were blasted to the Synthetic x Opata map (synop) using the blastn function from the NCBI-BLAST+ package with the number of descriptions and the number of threads set to one. In this way, SNPs were placed into recombination bins [11] (Additional file 21). A final matrix set of 18,337 SNPs was obtained for 50 % missing data (Table 2), a final matrix set of 8,227 SNPs was obtained for 25 % missing data (Additional files 22 and 23), and a final matrix set of 11,858 SNPs was obtained for 35 % missing data (Additional files 24 and 25).
The phenotypic data for the wheat panel were obtained from an evaluation in a Mediterranean environment in Santa Rosa, Chile, in 2011 (36° 329’ S, 71° 559’ W; 217 m.a.s.l.). The field was irrigated with 50 mm m^{2} at each of four moments: tillering, flag leaf emergence, heading date, and grain filling (see [33] for further details). The experimental design was an alpha-lattice with 20 replications and 20 incomplete blocks. The traits evaluated were: plant height (PH, cm), evaluated from the base of the plant to the flower insertion (Additional file 26); days to heading (DH, days), recorded when 50 % of the culms showed emerged ears (Additional file 27); thousand kernel weight (TKW, g, Additional file 28); and spikes per square meter (SPM, number, Additional file 29). We obtained the best linear unbiased predictors (BLUPs) for each genotype using the following model for each trait: y_{ijk} = μ + a_{i} + β_{j} + δ_{k(j)} + ε_{ijk}, where y_{ijk} is the value for the phenotypic trait corresponding to the ith genotype, jth replication, and kth incomplete block, μ is the overall mean, a_{i} is the random effect of the ith genotype with a_{i} ~ N(0, σ^{2}_{g}), β_{j} is the effect of the jth replication, δ_{k(j)} is the random effect of the kth incomplete block within the jth replication with δ_{k(j)} ~ N(0, σ^{2}_{B}), and ε_{ijk} is the experimental error corresponding to the ith genotype, jth replication and kth incomplete block with ε_{ijk} ~ N(0, σ^{2}_{e}). The genotypic breeding values were estimated with the function lmer (lme4 package) in R statistical software [41]. Broad sense heritabilities were estimated in R statistical software [35] using the above model (Table 3).
Imputation methods
For the barley SNP array panel, we started with a genotype by marker score matrix with 122 genotypes (rows) and 1,096 markers (columns) without missing values. Markers were scored as {-1, 1}. Then, we randomly generated missing values in order to have the same coverage as the GBS panel (50 %). Finally, three methods were used to fill in those missing values: MVNEM, which considers the realized additive relationship matrix between the lines and an EM approach assuming that marker genotypes follow a multivariate normal distribution [10]; Random Forest (RF), which uses an algorithm with multiple decision trees to predict a value for each missing point; and the Mean, which uses the average score per marker (i.e. the expected allele value at the particular marker). Imputation was conducted in R statistical software [41] with the A.mat function (rrBLUP package) [42].
For the wheat GBS panel, we started with: (i) a genotype by marker score matrix with 384 genotypes (rows) and 18,337 markers (columns) with 50 % of missing values, (ii) a genotype by marker score matrix with 384 genotypes (rows) and 8,227 markers (columns) with 25 % of missing values, and (iii) a genotype by marker score matrix with 384 genotypes (rows) and 11,858 markers (columns) with 35 % of missing values. Markers were scored as the number of alleles {NA, -1, 1}. We used the same methods as in the previous section to impute by the MVNEM and the Mean.
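The two simplest steps described above, creating missing values at random in a complete matrix and filling them with the per-marker mean, can be sketched as follows. This is an illustrative Python sketch, not the R code used in the study: the function names are hypothetical, scores are assumed to be coded -1/1 with `None` for a missing point, and filling a marker that has no observed score with 0.0 is an assumption made only so the sketch always returns a complete matrix.

```python
import random

def mask_at_random(matrix, missing_rate, rng):
    """Copy `matrix`, setting each entry to None with probability `missing_rate`."""
    return [[None if rng.random() < missing_rate else score for score in row]
            for row in matrix]

def impute_by_mean(matrix):
    """Replace each None with the mean of the observed scores of the same marker (column)."""
    n_markers = len(matrix[0])
    imputed = [list(row) for row in matrix]
    for j in range(n_markers):
        observed = [row[j] for row in matrix if row[j] is not None]
        # Assumption: a marker with no observed score is filled with 0.0,
        # the midpoint of the -1/1 coding.
        column_mean = sum(observed) / len(observed) if observed else 0.0
        for row in imputed:
            if row[j] is None:
                row[j] = column_mean
    return imputed

# Toy example: 6 lines x 8 markers, about 50 % of the points removed.
rng = random.Random(1)
complete = [[rng.choice((-1, 1)) for _ in range(8)] for _ in range(6)]
masked = mask_at_random(complete, 0.5, rng)
filled = impute_by_mean(masked)
```

Note that the imputed values are usually fractional, so a mean-imputed matrix is no longer restricted to the original -1/1 scores.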
Simulation procedure
To evaluate the effect of imputation using a golden standard with the barley SNP array, we created phenotypic vectors simulating QTL on top of the complete barley marker score matrix (Y_{simNoNA }). The phenotypic vectors were the sum of the effects of genotypic and residual terms, Y_{sim} = g + e. The genotypic effect was calculated as the sum of the effects of the markers selected as QTL, and marker effects were obtained from a Beta(2, 6) distribution. Markers selected as QTL were obtained from the LD blocks defined from a single linkage agglomerative procedure [30] with Euclidean distances between markers and a minimum of 1.5 cM to consider independent groups. QTL with major effects were defined as the QTL with effects larger than 75 % of the maximum, and QTL with minor effect were defined as the remaining QTL. The residual term was obtained by sampling from a normal distribution, N(0, σ^{2}_{e}), where σ^{2}_{e} = (1 - h^{2})σ^{2}_{g}/h^{2} and σ^{2}_{g} was the variance of the realized g. One vector was created for each combination of number of QTL (i.e. 25 and 50) and heritability (i.e. 0.2, 0.4, 0.6, 0.7, 0.9), and for each one of 500 iterations. Then, we created missing data at random, imputed (i.e. G_{ NImp }, G_{ MVNEM }, G_{ Mean } and G_{ RF }) and pursued the GWAS analysis with each combination of genotypic matrix, evaluating power and false positive rate (for the general approach see Fig. 1A.1).
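The construction of one simulated phenotypic vector can be sketched as below. This is a hedged Python illustration rather than the R code used in the study. The function name is hypothetical, and two details are assumptions: each QTL effect is multiplied by the marker score (coded -1/1) before summing, and the variance of the realized g is taken as the sample variance.

```python
import random

def simulate_phenotype(genotypes, qtl_columns, h2, rng):
    """Simulate one vector Y_sim = g + e.

    genotypes   : list of rows (lines), each a list of marker scores coded -1/1
    qtl_columns : column indices of the markers selected as QTL
    h2          : target heritability, 0 < h2 < 1
    """
    # QTL effects drawn from a Beta(2, 6) distribution.
    effects = {j: rng.betavariate(2, 6) for j in qtl_columns}
    # Assumption: genotypic value = sum over QTL of effect * marker score.
    g = [sum(effects[j] * row[j] for j in qtl_columns) for row in genotypes]
    mean_g = sum(g) / len(g)
    # Assumption: sample variance (n - 1 denominator) of the realized g.
    var_g = sum((value - mean_g) ** 2 for value in g) / (len(g) - 1)
    # Residual variance chosen so that var_g / (var_g + var_e) equals h2.
    var_e = (1 - h2) * var_g / h2
    y = [value + rng.gauss(0.0, var_e ** 0.5) for value in g]
    # Major QTL: effect larger than 75 % of the largest simulated effect.
    cutoff = 0.75 * max(effects.values())
    major = sorted(j for j, effect in effects.items() if effect > cutoff)
    return y, effects, major, var_g, var_e

# Toy example: 30 lines, 20 markers, 4 QTL, heritability 0.4.
rng = random.Random(7)
genotypes = [[rng.choice((-1, 1)) for _ in range(20)] for _ in range(30)]
y, effects, major, var_g, var_e = simulate_phenotype(genotypes, [0, 5, 10, 15], 0.4, rng)
```

Because the residual variance is derived from the realized genotypic variance, the ratio var_g / (var_g + var_e) matches the target heritability exactly in every replicate, while the variance of the sampled residuals fluctuates around var_e.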
For the ascertainment bias evaluation, we first created the missing data and then simulated the QTL on top of each matrix: not-imputed marker score (Y_{simNImp }), imputed with MVNEM [10] marker score (Y_{simMVNEM }), imputed by the mean marker score (Y_{simMean }) and imputed with RF [10] marker score (Y_{simRF }). Finally, we performed the GWAS analysis with each genotypic marker score (i.e. G_{ NImp }, G_{ MVNEM }, G_{ Mean } and G_{ RF }) and for each phenotypic vector (i.e. Y_{simNImp }, Y_{simMVNEM }, Y_{simMean } and Y_{simRF }, for the general approach see Fig. 1A.2). We then compared the power and false positive rate.
For evaluating GWAS performance based on simulated phenotypes with the wheat GBS panel data (Fig. 1b), we first created vectors of phenotypic values (i.e. Y_{simNImp }, Y_{simMVNEM }, Y_{simMean } and Y_{simRF }). Each phenotypic vector was simulated for different numbers of QTL (i.e. 25 and 50) and different heritabilities (i.e. 0.2, 0.4, 0.6, 0.7, 0.9), as in the previous section. In order to avoid collinearity, LD blocks were defined as the bins in each chromosome, and a marker chosen at random within each LD block was considered a QTL. One vector for each combination of the parameters and for each one of 500 iterations was created. We performed the simulations in R statistical software [41].
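The selection of QTL positions with at most one QTL per LD block can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the function name and the bin labels are hypothetical, and drawing the blocks that carry a QTL at random (without replacement) is an assumption, since the text only specifies that the marker is chosen at random within a block.

```python
import random

def pick_qtl_per_block(marker_blocks, n_qtl, rng):
    """Choose `n_qtl` LD blocks at random, then one marker at random inside each.

    marker_blocks : list giving, for each marker (column), the label of its block/bin
    Returns a dict {block_label: index of the marker chosen as QTL}.
    """
    blocks = {}
    for marker_index, block_label in enumerate(marker_blocks):
        blocks.setdefault(block_label, []).append(marker_index)
    # Assumption: blocks carrying a QTL are drawn at random without replacement,
    # so no two QTL share a block (this is what avoids collinearity).
    chosen_blocks = rng.sample(sorted(blocks), n_qtl)
    return {label: rng.choice(blocks[label]) for label in chosen_blocks}

# Toy example with made-up bin labels for six markers.
marker_blocks = ["1A_bin1", "1A_bin1", "1A_bin2", "1B_bin1", "1B_bin1", "1B_bin1"]
qtl = pick_qtl_per_block(marker_blocks, 2, random.Random(0))
```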
GWAS analysis
For the GWAS analysis, the mixed model described by [43] was used: y = Xβ + Qv + Zu + e, where y is the phenotypic vector (n x 1) with n the total number of lines, X is a (n x m) SNP matrix with m the number of SNPs coded as described before {NA, -1, 1}, β is a (m x 1) vector of allelic effects to be estimated, Q is a (n x q) incidence matrix with q groups of origin, v is a (n x 1) populations fixed effect vector, Z is the genotypic incidence matrix, u is the vector of random background polygenic effects, u ~ N(0, Aσ^{2}_{g}), where A is the realized additive relationship matrix obtained with the A.mat function from package rrBLUP [36] in R statistical software [35], and e is the residual error, e ~ N(0, σ^{2}_{e}). For each Y_{sim}, we used the 4 genotypic marker score matrices to recover the QTL (i.e. G_{ NImp }, G_{ MVNEM }, G_{ Mean } and G_{ RF }). We performed the analysis for three different thresholds to define markers as significant: (1) Bonferroni correction, (2) Bonferroni correction using the effective number of markers, Li and Ji method [38], and (3) a liberal threshold of α = 0.01. GWAS analysis was accomplished with the GWAS function from the rrBLUP package [42] in R statistical software [41–45]. We defined as true positives (TP) the number of bins with a QTL and at least one significant marker; false positives (FP) the number of bins with no QTL and at least one significant marker; true negatives (TN) the number of bins with no QTL and no significant markers; and false negatives (FN) the number of bins with QTL and no significant markers. We evaluated power (PO = TP/(TP + FN)) and false positive rate (FPR = FP/(FP + TN)) [39] for QTL detection. We evaluated performance for QTL of major and minor effect.
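The bin-level definitions of TP, FP, TN and FN translate directly into set operations. The following Python sketch illustrates that bookkeeping; the function name is hypothetical, and returning NaN when a denominator is zero is an assumption made for the sketch.

```python
def power_and_fpr(all_bins, qtl_bins, significant_bins):
    """Compute power and false positive rate at the bin level.

    all_bins         : every bin in the map
    qtl_bins         : bins that contain a simulated QTL
    significant_bins : bins with at least one significant marker
    """
    all_bins = set(all_bins)
    qtl_bins = set(qtl_bins)
    significant_bins = set(significant_bins)
    tp = len(qtl_bins & significant_bins)             # QTL and significant marker
    fn = len(qtl_bins - significant_bins)             # QTL but no significant marker
    fp = len(significant_bins - qtl_bins)             # no QTL but significant marker
    tn = len(all_bins - qtl_bins - significant_bins)  # no QTL, no significant marker
    # Assumption: NaN is returned when a rate is undefined (zero denominator).
    power = tp / (tp + fn) if tp + fn else float("nan")
    fpr = fp / (fp + tn) if fp + tn else float("nan")
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn, "PO": power, "FPR": fpr}

# Toy example: 10 bins, QTL in bins 0-3, significant markers in bins 0, 1 and 7.
result = power_and_fpr(range(10), {0, 1, 2, 3}, {0, 1, 7})
```

Counting bins rather than markers means that several significant markers in the same bin are scored once, so dense bins do not inflate either rate.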
Abbreviations
BLUPs: Best linear unbiased predictors
BOPA: Barley oligonucleotide pool assay 1
CIMMYT: International Maize and Wheat Improvement Center
DH: Days to heading
FN: False negatives
FP: False positives
FPR: False positive rate
GBS: Genotyping-by-sequencing
GS: Genome-wide selection
GWAS: Genome-wide association study
ICARDA: International Center for Agricultural Research in the Dry Areas
INIA: National Institute of Agricultural Research
LD: Linkage disequilibrium
MAF: Minor allele frequency
MVNEM: Multivariate normal expectation maximization
NGS: Next-generation sequencing
PH: Plant height
PO: Power
QTL: Quantitative trait loci
RF: Random forest
SNPs: Single-nucleotide polymorphisms
SPM: Spikes per square meter
SR2011: Santa Rosa 2011
TKW: Thousand kernel weight
TN: True negatives
TP: True positives
USDA: United States Department of Agriculture
References
1. Davey JW, Hohenlohe PA, Etter PD, Boone JQ, Catchen JM, Blaxter ML. Genome-wide genetic marker discovery and genotyping using next-generation sequencing. Nat Rev Genet. 2011;12:499–510.
2. Martin JA, Wang Z. Next-generation transcriptome assembly. Nat Publ Gr. 2011;2:671–82.
3. Hirsch CN, Buell CR. Tapping the Promise of Genomics in Species with Complex, Nonmodel Genomes. Annu Rev Plant Biol. 2013;64:89–110.
4. Bedewitz MA, Góngora-Castillo E, Uebler JB, Gonzales-Vigil E, Wiegert-Rininger KE, Childs KL, Hamilton JP, Vaillancourt B, Yeo YS, Chappell J, DellaPenna D, Jones AD, Buell CR, Barry CS. A Root-Expressed L-Phenylalanine:4-Hydroxyphenylpyruvate Aminotransferase Is Required for Tropane Alkaloid Biosynthesis in Atropa belladonna. Plant Cell. 2014;9:3745–62.
5. Zhong S, Fei Z, Chen Y, Zheng Y, Huang M, Vrebalov J, Mcquinn R, Gapper N, Liu B, Xiang J, Shao Y, Giovannoni JJ. Single-base resolution methylomes of tomato fruit development reveal epigenome modifications associated with ripening. Nat Biotechnol. 2013;31:154–9.
6. Schneeberger K. Using next-generation sequencing to isolate mutant genes from forward genetic screens. Nat Publ Gr. 2014;15:662–76.
7. Zhang G, Liu X, Quan Z, Cheng S, Xu X, Pan S, Xie M, Zeng P, Yue Z, Wang W, Tao Y, Bian C, Han C, Xia Q, Peng X, Cao R, Yang X, Zhan D, Hu J, Zhang Y, Li H, Li H, Li N, Wang J, Wang C, Wang R, Guo T, Cai Y, Liu C, Xiang H, Shi Q, Huang P, Chen Q, Li Y, Wang J, Zhao Z, Wang J. Genome sequence of foxtail millet (Setaria italica) provides insights into grass evolution and biofuel potential. Nat Biotechnol. 2012;30:549–54.
8. Heslot N, Rutkoski J, Poland J, Jannink JL, Sorrells ME. Impact of marker ascertainment bias on genomic selection accuracy and estimates of genetic diversity. PLoS ONE. 2013;8:e74612.
9. Romay MC, Millard MJ, Glaubitz JC, Peiffer JA, Swarts KL, Casstevens TM, Elshire RJ, Acharya CB, Mitchell SE, Flint-Garcia SA, McMullen MD, Holland JB, Buckler ES, Gardner CA. Comprehensive genotyping of the USA national maize inbred seed bank. Genome Biol. 2013;14:R55.
10. Poland J, Endelman J, Dawson J, Rutkoski J, Wu S, Manes Y, Dreisigacker S, Crossa J, Sánchez-Villeda H, Sorrells M, Jannink JL. Genomic Selection in Wheat Breeding using Genotyping-by-Sequencing. Plant Genome J. 2012;5:103.
11. Poland JA, Brown PJ, Sorrells ME, Jannink JL. Development of high-density genetic maps for barley and wheat using a novel two-enzyme genotyping-by-sequencing approach. PLoS ONE. 2012;7:e32253.
12. Elshire RJ, Glaubitz JC, Sun Q, Poland JA, Kawamoto K, Buckler ES, Mitchell SE. A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. PLoS ONE. 2011;6:e19379.
13. Li Y, Willer CJ, Ding J, Scheet P, Abecasis GR. MaCH: Using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet Epidemiol. 2010;34:816–34.
14. Marchini J, Howie B, Myers S, McVean G, Donnelly P. A new multipoint method for genome-wide association studies by imputation of genotypes. Nat Genet. 2007;39:906–13.
15. Scheet P, Stephens M. A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am J Hum Genet. 2006;78:629–44.
16. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ, Sham PC. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81:559–75.
17. Browning SR, Browning BL. Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am J Hum Genet. 2007;81:1084–97.
18. Browning SR. Missing data imputation and haplotype phase inference for genome-wide association studies. Hum Genet. 2008;124:439–50.
19. Jannink JL, Iwata H, Bhat PR, Chao S, Wenzl P, Muehlbauer GJ. Marker Imputation in Barley Association Studies. Plant Genome J. 2009;2:11.
20. Hao K, Chudin E, McElwee J, Schadt EE. Accuracy of genome-wide imputation of untyped markers and impacts on statistical power for association studies. BMC Genet. 2009;10:27.
21. Pei YF, Li J, Zhang L, Papasian CJ, Deng HW. Analyses and comparison of accuracy of different genotype imputation methods. PLoS ONE. 2008;3:e3551.
22. Iwata H, Jannink JL. Marker genotype imputation in a low-marker-density panel with a high-marker-density reference panel. Accuracy evaluation in barley breeding lines. Crop Sci. 2010;50:1269.
23. Guan Y, Stephens M. Practical issues in imputation-based association mapping. PLoS Genet. 2008;4:e1000279.
24. Pasaniuc B, Rohland N, McLaren PJ, Garimella K, Zaitlen N, Li H, Gupta N, Neale BM, Daly MJ, Sklar P, Sullivan PF, Bergen S, Moran JL, Hultman CM, Lichtenstein P, Magnusson P, Purcell SM, Haas DW, Liang L, Sunyaev S, Patterson N, de Bakker PIW, Reich D, Price AL. Extremely low-coverage sequencing and imputation increases power for genome-wide association studies. Nat Genet. 2012;44:631–5.
25. Almeida MAA, Oliveira PSL, Pereira TV, Krieger JE, Pereira AC. An empirical evaluation of imputation accuracy for association statistics reveals increased type-I error rates in genome-wide associations. BMC Genet. 2011;12:10.
26. Aulchenko YS, Struchalin MV, van Duijn CM. ProbABEL package for genome-wide association analysis of imputed data. BMC Bioinformatics. 2010;11:134.
27. De Bakker PIW, Ferreira MAR, Jia X, Neale BM, Raychaudhuri S, Voight BF. Practical aspects of imputation-driven meta-analysis of genome-wide association studies. Hum Mol Genet. 2008;17:122–28.
28. Bernardo R. Breeding for quantitative traits in plants. 2nd ed. Minnesota: Stemma Press; 2010.
29. Sibson R. SLINK: an optimally efficient algorithm for the single-link cluster method. Comput J. 1973;30–34.
30. Lande R, Thompson R. Efficiency of marker-assisted selection in the improvement of quantitative traits. Genetics. 1990;124:743–56.
31. He S, Zhao Y, Mette MF, Bothe R, Ebmeyer E, Sharbel TF, Reif JC, Jiang Y. Prospects and limits of marker imputation in quantitative genetic studies in European elite wheat (Triticum aestivum L.). BMC Genomics. 2015;16:1–12.
32. Edae EA, Byrne PF, Haley SD, Lopes MS, Reynolds MP. Genome-wide association mapping of yield and yield components of spring wheat under contrasting moisture regimes. Theor Appl Genet. 2014;127:791–807.
33. Bennett D, Reynolds M, Mullan D, Izanloo A, Kuchel H, Langridge P, Schnurbusch T. Detection of two major grain yield QTL in bread wheat (Triticum aestivum L.) under heat, drought and high yield potential environments. Theor Appl Genet. 2012;125:1473–85.
34. Mathews KL, Malosetti M, Chapman S, McIntyre L, Reynolds M, Shorter R, van Eeuwijk F. Multi-environment QTL mixed models for drought stress adaptation in wheat. Theor Appl Genet. 2008;117:1077–91.
35. Mayer KFX, Rogers J, el Dole J, Pozniak C, Eversole K, Feuillet C, Gill B, Friebe B, Lukaszewski AJ, Sourdille P, Endo TR, Kubalakova M, Ihalikova J, Dubska Z, Vrana J, Perkova R, Imkova H, Febrer M, Clissold L, McLay K, Singh K, Chhuneja P, Singh NK, Khurana J, Akhunov E, Choulet F, Alberti A, Barbe V, Wincker P, Kanamori H, et al. A chromosome-based draft sequence of the hexaploid bread wheat (Triticum aestivum) genome. Science. 2014;345(6194):1251788.
36. Gutiérrez L, Germán S, Pereyra S, Hayes PM, Pérez CA, Capettini F, Locatelli A, Berberian NM, Falconi EE, Estrada R, Fros D, Gonza V, Altamirano H, Huerta-Espino J, Neyra E, Orjeda G, Sandoval-Islas S, Singh R, Turkington K, Castro AJ. Multi-environment multi-QTL association mapping identifies disease resistance QTL in barley germplasm from Latin America. Theor Appl Genet. 2015;128:501–16.
37. Close TJ, Bhat PR, Lonardi S, Wu Y, Rostoks N, Ramsay L, Druka A, Stein N, Svensson JT, Wanamaker S, Bozdag S, Roose ML, Moscou MJ, Chao S, Varshney RK, Sz P, Sato K, Hayes PM, Matthews DE, Kleinhofs A, Muehlbauer GJ, Deyoung J, Marshall DF, Madishetty K, Fenton RD, Condamine P, Graner A, Waugh R. Development and implementation of high-throughput SNP genotyping in barley. BMC Genomics. 2009;13:1–13.
38. Szűcs P, Blake VC, Bhat PR, Chao S, Close TJ, Cuesta-Marcos A, Muehlbauer GJ, Ramsay L, Waugh R, Hayes PM. An Integrated Resource for Barley Linkage Map and Malting Quality QTL Alignment. Plant Genome J. 2009;2:134.
39. Lado B, Matus I, Rodríguez A, Inostroza L, Poland JA, Belzile F, del Pozo A, Quincke M, Castro M, von Zitzewitz J. Increased genomic prediction accuracy in wheat breeding through spatial adjustment of field trial data. G3 (Bethesda). 2013;3:2105–14.
40. Glaubitz JC, Casstevens TN, Lu F, Harriman J, Elshire RJ, Sun Q, Buckler ES. TASSEL-GBS: a high capacity genotyping by sequencing analysis pipeline. PLoS ONE. 2014;9(2):e90346.
41. R Core Team. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2015. ISBN 3-900051-07-0. https://www.R-project.org/
42. Endelman JB. Ridge regression and other kernels for genomic selection with R package rrBLUP. Plant Genome J. 2011;4:250–5.
43. Yu J, Pressoir G, Briggs WH, Bi IV, Yamasaki M, Doebley JF, McMullen MD, Gaut BS, Nielsen DM, Holland JB, Kresovich S, Buckler ES. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nat Genet. 2006;38:203–8.
44. Li J, Ji L. Adjusting multiple testing in multilocus analyses using the eigenvalues of a correlation matrix. Heredity. 2005;95:221–27.
45. Chengsong Z, Jianming Y. Nonmetric multidimensional scaling corrects for population structure in association mapping with different sample types. Genetics. 2009;182:875–88.
Acknowledgements
The research that gives rise to the results presented in this publication was funded by the National Agency for Research and Innovation under the code POS_NAC_2014_1_102695 and FONTAGRO (Project FTG803808 and Project FTG 061706). The authors wish to express their appreciation for the effort of the technical personnel of all the involved institutions.
Availability of data and materials
All data is included as supporting files for the manuscript.
Authors’ contributions
AJC was in charge of the barley project. IM, MQ, JvZ, MC and AdP created the wheat project. IM and AdP conducted phenotyping in wheat. JvZ and BL conducted the genotyping in wheat. SB, AGR, MM, AAFG and LG designed the hypothesis, methodology and conducted statistical analysis. SB, AGR, BL and LG wrote the paper. All authors read and approved the final paper.
Competing interests
The authors declare that they have no competing interests.
Additional files
Additional file 1: Figure S1.
Power (PO) and false positive rate (FPR) for major and minor QTL with 25 QTL, for the golden standard from barley, with a Bonferroni threshold. Each parameter was calculated for the combinations of: heritabilities (h^{2}), a marker score matrix to simulate the QTL (i.e. Y_{simNoNA}), and marker score matrices to perform the GWAS analysis (i.e. G_{NImp}, G_{MVNEM} and G_{Mean}). (PDF 28 KB)
Additional file 2: Figure S2.
Power (PO) and false positive rate (FPR) for major and minor QTL with 25 QTL, for the golden standard from barley, with an α = 0.01 threshold. Each parameter was calculated for the combinations of: number of QTL (q), heritabilities (h^{2}), a marker score matrix to simulate the QTL (i.e. Y_{simNoNA}), and marker score matrices to perform the GWAS analysis (i.e. G_{NImp}, G_{MVNEM} and G_{Mean}). (PDF 28 KB)
Additional file 3: Figure S3.
Power (PO) and false positive rate (FPR) with 25 QTL, for major and minor QTL for ascertainment bias in imputation performance comparison in barley, with a Bonferroni threshold. Each parameter was calculated for the combinations of: heritabilities (h^{2}), marker score matrices to simulate the QTL (i.e. Y_{simNImp}, Y_{simMVNEM} and Y_{simMean}), and marker score matrices to perform the GWAS analysis (i.e. G_{NImp}, G_{MVNEM} and G_{Mean}). (PDF 34 KB)
Additional file 4: Figure S4.
Power (PO) and false positive rate (FPR) with 25 QTL, for major and minor QTL for ascertainment bias in imputation performance comparison in barley, with an α = 0.01 threshold. Each parameter was calculated for the combinations of: heritabilities (h^{2}), marker score matrices to simulate the QTL (i.e. Y_{simNImp}, Y_{simMVNEM} and Y_{simMean}), and marker score matrices to perform the GWAS analysis (i.e. G_{NImp}, G_{MVNEM} and G_{Mean}). (PDF 34 KB)
Additional file 5: Figure S5.
Power (PO) and false positive rate (FPR) with 25 QTL and 25 % missing rate, for major and minor QTL to evaluate the GWAS performance based on a simulated matrix with a Bonferroni threshold corrected by the effective number of independent markers. Each parameter was calculated for the combinations of: heritabilities (h^{2}), marker score matrices to simulate the QTL (i.e. Y_{simNImp}, Y_{simMVNEM}, Y_{simMean} and Y_{simRF}), and marker score matrices to perform the GWAS analysis (i.e. G_{NImp}, G_{MVNEM}, G_{Mean} and G_{RF}). (PDF 156 KB)
Additional file 6: Figure S6.
Power (PO) and false positive rate (FPR) with 25 QTL and 35 % missing rate, for major and minor QTL to evaluate the GWAS performance based on a simulated matrix with a Bonferroni threshold corrected by the effective number of independent markers. Each parameter was calculated for the combinations of: heritabilities (h^{2}), marker score matrices to simulate the QTL (i.e. Y_{simNImp}, Y_{simMVNEM}, Y_{simMean} and Y_{simRF}), and marker score matrices to perform the GWAS analysis (i.e. G_{NImp}, G_{MVNEM}, G_{Mean} and G_{RF}). (PDF 156 KB)
Additional file 7: Figure S7.
Power (PO) and false positive rate (FPR) with 25 QTL and 50 % missing rate, for major and minor QTL to evaluate the GWAS performance based on a simulated matrix with a Bonferroni threshold. Each parameter was calculated for the combinations of: heritabilities (h^{2}), marker score matrices to simulate the QTL (i.e. Y_{simNImp}, Y_{simMVNEM} and Y_{simMean}), and marker score matrices to perform the GWAS analysis (i.e. G_{NImp}, G_{MVNEM} and G_{Mean}). (PDF 35 KB)
Additional file 8: Figure S8.
Power (PO) and false positive rate (FPR) with 25 QTL and 50 % missing rate, for major and minor QTL to evaluate the GWAS performance based on a simulated matrix with an α = 0.01 threshold. Each parameter was calculated for the combinations of: heritabilities (h^{2}), marker score matrices to simulate the QTL (i.e. Y_{simNImp}, Y_{simMVNEM} and Y_{simMean}), and marker score matrices to perform the GWAS analysis (i.e. G_{NImp}, G_{MVNEM} and G_{Mean}). (PDF 36 KB)
Additional file 9: Figure S9.
QQ plots of the p-values from the GWAS analysis of real wheat phenotype data with 25 % missing rate and a Bonferroni threshold corrected by the effective number of independent markers. For each trait measured and each marker score matrix evaluated, a QQ plot of the p-values resulting from the GWAS analysis is presented. The marker score matrices were: NImp (not imputed) in turquoise, Mean (mean imputed) in green, MVNEM (Multivariate Normal Expectation Maximization method) in coral and RF (Random Forest method) in orchid. The phenotypic traits are: DH, Days to Heading; PH, Plant Height; SPM, Spikes Per Square Meter; TKW, Thousand Kernel Weight. (PDF 359 KB)
Additional file 10: Figure S10.
QQ plots of the p-values from the GWAS analysis of real wheat phenotype data with 35 % missing rate and a Bonferroni threshold corrected by the effective number of independent markers. For each trait measured and each marker score matrix evaluated, a QQ plot of the p-values resulting from the GWAS analysis is presented. The marker score matrices were: NImp (not imputed) in turquoise, Mean (mean imputed) in green, MVNEM (Multivariate Normal Expectation Maximization method) in coral and RF (Random Forest method) in orchid. The phenotypic traits are: DH, Days to Heading; PH, Plant Height; SPM, Spikes Per Square Meter; TKW, Thousand Kernel Weight. (PDF 418 KB)
Additional file 11: Figure S11.
Manhattan plots of the GWAS analysis for real wheat phenotype data with 25 % missing rate and a Bonferroni threshold corrected by the effective number of independent markers. For each trait measured and each marker score matrix evaluated, a Manhattan plot of the GWAS analysis is presented. The phenotypic traits are: DH, Days to Heading; PH, Plant Height; SPM, Spikes Per Square Meter; TKW, Thousand Kernel Weight. The marker score matrices were: NImp (not imputed), Mean (mean imputed), MVNEM (Multivariate Normal Expectation Maximization method) and RF (Random Forest method). QTL detected by the NImp matrix are in turquoise, QTL detected exclusively by the MVNEM matrix are in coral, QTL detected exclusively by the Mean matrix are in green, and QTL detected exclusively by the RF matrix are in orchid. (PDF 539 KB)
Additional file 12: Figure S12.
Manhattan plots of the GWAS analysis for real wheat phenotype data with 35 % missing rate and a Bonferroni threshold corrected by the effective number of independent markers. For each trait measured and each marker score matrix evaluated, a Manhattan plot of the GWAS analysis is presented. The phenotypic traits are: DH, Days to Heading; PH, Plant Height; SPM, Spikes Per Square Meter; TKW, Thousand Kernel Weight. The marker score matrices were: NImp (not imputed), Mean (mean imputed), MVNEM (Multivariate Normal Expectation Maximization method) and RF (Random Forest method). QTL detected by the NImp matrix are in turquoise, QTL detected exclusively by the MVNEM matrix are in coral, QTL detected exclusively by the Mean matrix are in green, and QTL detected exclusively by the RF matrix are in orchid. (PDF 745 KB)
Additional file 13: Figure S13.
Boxplots of false positive rate (FPR) for major and minor QTL with 25 QTL, for the golden standard from barley, with a Bonferroni threshold corrected by the effective number of independent markers. Each parameter was calculated for the combinations of: heritabilities (h^{2}), a marker score matrix to simulate the QTL (i.e. Y_{simNoNA}), and marker score matrices to perform the GWAS analysis (i.e. G_{NImp}, G_{MVNEM}, G_{Mean} and G_{RF}). (PDF 110 KB)
Additional file 14: Figure S14.
Boxplots of false positive rate (FPR) with 25 QTL, for major and minor QTL for ascertainment bias in imputation performance comparison in barley, with a Bonferroni threshold corrected by the effective number of independent markers. Each parameter was calculated for the combinations of: heritabilities (h^{2}), marker score matrices to simulate the QTL (i.e. Y_{simNImp}, Y_{simMVNEM}, Y_{simMean} and Y_{simRF}), and marker score matrices to perform the GWAS analysis (i.e. G_{NImp}, G_{MVNEM}, G_{Mean} and G_{RF}). (PDF 139 KB)
Additional file 15: Figure S15.
Boxplots of false positive rate (FPR) with 25 QTL and 50 % missing rate, for major and minor QTL to evaluate the GWAS performance based on a simulated matrix with a Bonferroni threshold corrected by the effective number of independent markers. Each parameter was calculated for the combinations of: heritabilities (h^{2}), marker score matrices to simulate the QTL (i.e. Y_{simNImp}, Y_{simMVNEM}, Y_{simMean} and Y_{simRF}), and marker score matrices to perform the GWAS analysis (i.e. G_{NImp}, G_{MVNEM}, G_{Mean} and G_{RF}). (PDF 144 KB)
Additional file 16: Figure S16.
Boxplots of false positive rate (FPR) with 25 QTL and 25 % missing rate, for major and minor QTL to evaluate the GWAS performance based on a simulated matrix with a Bonferroni threshold corrected by the effective number of independent markers. Each parameter was calculated for the combinations of: heritabilities (h^{2}), marker score matrices to simulate the QTL (i.e. Y_{simNImp}, Y_{simMVNEM}, Y_{simMean} and Y_{simRF}), and marker score matrices to perform the GWAS analysis (i.e. G_{NImp}, G_{MVNEM}, G_{Mean} and G_{RF}). (PDF 132 KB)
Additional file 17: Figure S17.
Boxplots of false positive rate (FPR) with 25 QTL and 35 % missing rate, for major and minor QTL to evaluate the GWAS performance based on a simulated matrix with a Bonferroni threshold corrected by the effective number of independent markers. Each parameter was calculated for the combinations of: heritabilities (h^{2}), marker score matrices to simulate the QTL (i.e. Y_{simNImp}, Y_{simMVNEM}, Y_{simMean} and Y_{simRF}), and marker score matrices to perform the GWAS analysis (i.e. G_{NImp}, G_{MVNEM}, G_{Mean} and G_{RF}). (PDF 143 KB)
Additional file 18:
Barley SNP marker score array. Genotypes are presented as rows and SNPs as columns. (txt 278 kb)
Additional file 19:
Barley SNP marker map. Markers, chromosome and position (cM) are presented as columns. (txt 18 kb)
Additional file 20:
Wheat GBS dataset with 50 % missing information. Genotypes are presented as rows and SNPs as columns. (txt 16.3 Mb)
Additional file 21:
Wheat GBS bin map for 50 % coverage. Markers, chromosome and position (bins) are presented as columns. (txt 287 kb)
Additional file 22:
Wheat GBS dataset with 25 % missing information. Genotypes are presented as rows and SNPs as columns. (txt 16.3 Mb)
Additional file 23:
Wheat GBS bin map for 25 % coverage. Markers, chromosome and position (bins) are presented as columns. (txt 287 kb)
Additional file 24:
Wheat GBS dataset with 35 % missing information. Genotypes are presented as rows and SNPs as columns. (txt 16.3 Mb)
Additional file 25:
Wheat GBS bin map for 35 % coverage. Markers, chromosome and position (bins) are presented as columns. (txt 287 kb)
Additional file 26:
Best linear unbiased predictors (BLUPs) for each genotype for plant height. (txt 9 kb)
Additional file 27:
Best linear unbiased predictors (BLUPs) for each genotype for days to heading. (txt 8 kb)
Additional file 28:
Best linear unbiased predictors (BLUPs) for each genotype for thousand kernel weight. (txt 9 kb)
Additional file 29:
Best linear unbiased predictors (BLUPs) for each genotype for spikes per square meter. (txt 9 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Brandariz, S.P., González Reymúndez, A., Lado, B. et al. Ascertainment bias from imputation methods evaluation in wheat. BMC Genomics 17, 773 (2016). https://doi.org/10.1186/s12864-016-3120-5
Keywords
 GBS
 QTL
 GWAS
 Power
 False positive