Aluminum tolerance association mapping in triticale

Background Crop production practices and industrialization processes result in increasing acidification of arable soils. At lower pH levels (below 5.0), aluminum (Al) remains in a cationic form that is toxic to plants, reducing growth and yield. The effect of aluminum on agronomic performance is particularly important in cereals like wheat, which has promoted the development of programs directed towards selection of tolerant forms. Even in intermediately tolerant cereals (i.e., triticale), the decrease in yield may be significant. In triticale, Al tolerance seems to be influenced by both wheat and rye genomes. However, little is known about the precise chromosomal location of tolerance-related genes, and whether wheat or rye genomes are crucial for the expression of that trait in the hybrid. Results A mapping population consisting of 232 advanced breeding triticale forms was developed and phenotyped for Al tolerance using physiological tests. AFLP, SSR and DArT marker platforms were applied to obtain a sufficiently large set of molecular markers (over 3000). Associations between the markers and the trait were tested using General (GLM) and Multiple (MLM) Linear Models, as well as the Statistical Machine Learning (SML) approach. The chromosomal locations of candidate markers were verified based on known assignments of SSRs and DArTs or by using genetic maps of rye and triticale. Two candidate markers on chromosome 3R and 9, 15 and 11 on chromosomes 4R, 6R and 7R, respectively, were identified. The r2 values were between 0.066 and 0.220 in most cases, indicating a good fit of the data, with better results obtained with the GML than the MLM approach. Several QTLs on rye chromosomes appeared to be involved in the phenotypic expression of the trait, suggesting that rye genome factors are predominantly responsible for Al tolerance in triticale. Conclusions The Diversity Arrays Technology was applied successfully to association mapping studies performed on triticale breeding forms. Statistical approaches allowed the identification of numerous markers associated with Al tolerance. Available rye and triticale genetic maps suggested the putative location of the markers and demonstrated that they formed several linked groups assigned to distinct chromosomes (3R, 4R, 6R and 7R). Markers associated with genomic regions under positive selection were identified and indirectly mapped in the vicinity of the Al-tolerant markers. The present findings were in agreement with prior reports.


Background
Hexaploid triticale (X Triticosecale Wittmack) is a hybrid of tetraploid wheat and diploid rye with genome composition AA, BB and RR. It is cultivated in Poland mainly as a fodder cereal, and its area of cultivation doubled during the last 10 years [1]. Triticale is frequently grown on acid soils in the presence of excessive toxic aluminum ions that inhibit root growth and seed yields [2]. The development of tolerant cultivars, including triticale, is an important breeding objective.
Tolerant plants can be identified by physiological tests [3][4][5][6][7][8][9]. This robust approach is inexpensive and generally provides a reliable measure of tolerance [4]. It relies on morphological traits that may not be directly related to the expression of Al-tolerant genes in response to environmental factors. More direct methods based on DNA markers [10][11][12][13] and saturated genetic maps or, preferentially, consensus maps of the species are needed to overcome these issues.
One of the most promising marker platforms for such studies is Diversity Arrays Technology (DArT) [14], which enables the identification of thousands of highly reliable markers in a single run [15][16][17]. DArT performs well in many species in which several markers can be assigned to individual chromosomes [15,18]. This system is also available for triticale and the first genetic map saturated with DArTs was recently published [19]. Moreover, AFLPs and SSR markers can be used for such studies [20][21][22].
There are at least two approaches to the identification of markers useful for marker-assisted selection (MAS). The first one is based on biparental mapping populations and it allows the identification of QTLs and linked markers [23,24], while the second method involves the study of "non-Mendelian" populations and is called association mapping [25,26]. The "Mendelian" approach enables only the identification of loci segregating in tested samples of a given cross. Thus, many different mapping populations may be needed to represent the allelic diversity of the genes contributing to the character under study in the tested species. Moreover, alleles present in some crosses may not always be expressed with major effects in other crosses within the species or may not be present there at all [27,28]. In contrast, the association mapping approach allows for the identification of many loci and alleles among all the individuals of the population [26]. This type of analysis was done in maize, lettuce, barley, wheat, oat and soybean [29][30][31][32][33][34][35][36]. However, in the absence of genetic maps, markers obtained through association mapping cannot be assigned to the respective chromosomes.
Additional sets of markers useful for MAS could be derived from markers representing genomic regions under putative selection pressure [37]. Genomic regions under positive selection pressure may be bound to adaptation traits in breeding materials [38,39]. Such markers would be of special value if mapped in the region of known QTLs.
The initial attempts to identify the chromosomal location of the Al-tolerant QTLs were performed on the wheat-triticale substitution lines [40][41][42]. The rye chromosomes number 3, 4, and 6 seem to contain major Al tolerance-related genes [40][41][42]. Aluminum tolerance also appeared to be controlled by certain wheat genes, but their location on chromosomes in triticale is unknown [43]. Little is known about the genes coding for this trait in triticale [41]. Nevertheless, studies performed in several octoploid triticale genotypes demonstrated that Al tolerance was associated with a high level of citrate exudation from roots [43], which is mediated by a transporter encoded by a Multidrug and Toxin Efflux (MATE) family gene mapped to 4BL in wheat [11]. In addition, studies in wheat suggested that Aluminum-Activated Malate Transporter (ALMT) family genes located on 4DL may participate in the trait [44]. A gene belonging to the same family was also identified in rye on the 7RS chromosome [45]. DNA markers based on the sequences of such genes may be used by breeders for MAS in triticale, although the causal link between these genes and Al resistance remains to be established.
The present study aimed to identify molecular markers associated with Al tolerance among plants randomly selected from advanced triticale breeding materials. The association mapping approach was applied, using General (GLM) and Multiple Linear (MLM) Models, as well as Statistical Machine Learning (SML). The chromosomal location of the markers was determined based on available genetic maps of triticale and rye.

Al tolerance test
Out of 232 individual plants representing 232 breeding forms, the roots of 76 plants were irreversibly damaged by aluminum and did not continue to grow after the test, and 35 showed little regrowth ability (transformed value of regrowth below 0.2 (Table 1)). Both types of individuals were classified as non-tolerant (N). The plants with transformed regrowth between 0.2 and 0.5 were called moderately tolerant (I), while those with regrowth greater than 0.5 were classified as tolerant (T). Among the individuals representing breeding forms of triticale, only three were more tolerant than rye cv. Dańkowskie Złote and just one exceeded cv. Strzekęcińskie, which were used as controls (not shown).

Genotyping
The marker platforms used allowed the identification of 3289 polymorphic markers: 3117 DArT, 145 AFLP and 27 SSR. After the removal of redundant markers, the number was reduced by nearly one half, as shown for certain chromosomes of rye in Table 2.

Agglomeration analysis
Clustering of the individuals in the 3R set resulted in the formation of two groups (100% of bootstrapping, not shown). Similar analyses of the 4R and 7R sets also revealed the presence of two groups with high bootstrap value. However, no structuring was present in the 6R marker set. In all cases, the groups did not correspond to the Al tolerance trait shown in the physiological test.

Structure identification
The ad hoc statistic ΔK revealed strong data structuring for the 3R, 4R and 7R marker sets with K equal to 2. The 6R set exhibited weak structuring with two putative groups of individuals. In contrast to the agglomeration analyses, Bayesian statistics grouped the sets according to aluminum tolerant phenotypes. The moderately tolerant forms were grouped with the tolerant ones. The number of individuals classified into a given group varied from chromosome to chromosome (Table 3), possibly due to the difference in the number of individuals present after merging procedures in each chromosomal set.

Association mapping
In the 3R chromosomal set, a single associated marker (wPt-3564) was identified by every approach (Table 3). Moreover, SML identified an additional significantly associated marker (rPt-401520). Analysis of the 4R set revealed seven associated markers identified by GLM and MLM analyses simultaneously. SML identified five associated markers, and three of them were common for all methods (Table 3). Among the 6R chromosome markers, eleven were associated with aluminum tolerance as indicated by GLM and MLM. The SML approach identified four additional markers that could be associated with the trait of interest. In the highly structured 7R set, eleven markers were associated with tolerance according to GML and MLM. Four of them were also detected by SML ( Table 3). All of these markers passed the Bonferroni test (Table 3) and showed a good fit with the data (see r 2 parameter, Table 3) in most cases, with better results for the General Linear Model (GML) than for the Mixed Linear Model (MLM) approach.
The redundant counterparts of several associated markers were excluded from association mapping for the simplicity of the analyses and computation efficiency. This information is provided in Table 3.

Positive and balancing selection
Among the 3R markers, six reflected genomic regions under putative positive selection pressure ( Table 3). The 4R set was represented by three markers associated with positive selection, while thirteen were associated with balancing selection. Positive selection was also identified by a single marker and balancing selection by 14 markers in the 6R set. Finally, two markers of the 7R set were assigned to loci under positive and ten under balancing selection.

Indirect mapping of Al-tolerant genes
All Al tolerance-associated DArT markers were assigned exclusively to rye chromosomes, and no association with the wheat genome was detected. A single marker that was highly associated with Al tolerance (wPt-3564: p-value E-09, r 2 = 0.22; see Table 3) and a less significant marker (rPt-401520, p = 0.0145) were mapped on 3R based on the triticale genetic map [19] as separated by about 130 cM ( Figure 1). Among 4R markers (Figure 2), some associated and redundant markers mapped in proximity to one another. Two associated and four redundant markers assigned to the 4R chromosome mapped within a 3.8 cM region of the triticale genetic map. Some of these tightly linked markers were also identified on the rye genetic map [15], and they were within the same chromosomal region based on markers common to the two genetic maps (rye and triticale). Those markers exhibited the highest association values (E-07, r 2 = 0.146, see Table 3). Several other associated markers (E-04-E-06) were randomly distributed along the rye map and were missing in the triticale map. Two markers downstream of the linkage group represented by rPt-507784 and rPt-410768 also exhibited high association values ( Table 3). Analysis of the 6R linkage maps showed the presence of three marker groups. One of them covered about 2.2 cM and consisted of three DArT markers with the highest p values (E-09 see Table 3) and r 2 about 0.17 (depending on the association method used) and was located on the triticale map, while two others (E-06 and E-05) were present on the rye map. Those two groups were separated by about 60 cM (Figure 3). One of the groups was approximately 10-20 cM from the group located on the rye map, while the other one was distal from it. Interestingly, the two linkage groups were not located in the same chromosomal region of the maps. DArT markers associated with Al tolerance and assigned to 7R formed a single linkage group on triticale [19] and another one on rye [15] genetic maps based on map alignment ( Figure 4). The triticale linkage group consisted of  Table 3 Arrangement of Al tolerance associated markers (AS), redundant markers and outliers under positive (PS) and balancing (BS) selection for separate chromosomes and they localization on available genetic maps of triticale [19] and rye [15] Chromosome    Table 3 Arrangement of Al tolerance associated markers (AS), redundant markers and outliers under positive (PS) and balancing (BS) selection for separate chromosomes and they localization on available genetic maps of triticale [19] and rye [15] Table 3 Arrangement of Al tolerance associated markers (AS), redundant markers and outliers under positive (PS) and balancing (BS) selection for separate chromosomes and they localization on available genetic maps of triticale [19] and rye [15] Table 3 Arrangement of Al tolerance associated markers (AS), redundant markers and outliers under positive (PS) and balancing (BS) selection for separate chromosomes and they localization on available genetic maps of triticale [19] and rye [15] (Continued)   rPt-402561  --------wPt-6978, wPt-2013, wPt-1320, rPt-506799, rPt-505233, rPt-400005, rPt-400819    Table 3 Arrangement of Al tolerance associated markers (AS), redundant markers and outliers under positive (PS) and balancing (BS) selection for separate chromosomes and they localization on available genetic maps of triticale [19] and rye [15] Bolded markers -markers associated according to both methods (TASSEL and SML). ΔK indicates the most probable structure of the data. In bracket the extreme of the ΔK is indicated. Significance of the associated markers is indicated for GLM, MLM and SML. The r 2 and Bonferroni test p-values are also indicated for the reference.
two highly associated markers (E-08, r 2 = 0.186, see Table  3), while the one identified on the rye map was formed with less important markers (E-06, r 2 = 0.144). Many associated markers revealed via association mapping were not assigned to any of the linkage groups in the two maps.

Mapping outliers
The majority of positive and balancing selection DArT markers were assigned to the chromosomes based on the known location of the markers. However, only some of them were located on the genetic maps of rye [15] and triticale [19]. A few markers related to positive selection were identified on the genetic maps. The rPt-508975 marker was present on the triticale, while rPt-390252 was detected on the rye genetic map [15] of the 3R chromosome ( Figure 1). The two markers were separated from each other by more than 70 cM and were not in the vicinity of the associated markers. The markers rPt-400317 and rPt-505352 were located on the rye and triticale maps [19] on the 4R chromosome ( Figure 2). These markers, as well as some balancing selection markers, were in close proximity to the cluster of associated markers (ca. 9 cM apart). The remaining outliers were slightly further apart from the associated marker. However, all of them formed a linked group that spread over approximately 20 cM, with outliers preferentially under positive selection located closer to the   associated markers than those under balancing selection. There were few outliers in the 6R and none in the 7R linkage groups (Figure 3 and 4) that could be assigned to those chromosomes by means of available rye and/or triticale genetic maps.

Discussion
Several factors need to be controlled when performing association mapping. First, the plant material must be properly selected and properly phenotyped. Then, after sample profiling with a carefully selected molecular marker system, redundant data and missing markers should be preferentially eliminated, genetic structure should be evaluated and statistical analyses can then be performed. The plant material itself is possibly the most important factor [46,47]. For association studies, the most diverse or elite inbred lines [32,48,49], cultivars [33,50,51], and land races in the case of rice [52,53] should be used. Prior studies were based on 57 [54] to 577 [52] plants with the most common number ranging from 70 to 150 plants [26]. In this context, our mapping population consisting of 232 advanced breeding forms of triticale was quite large and above the lower limits used by others. Considering that triticale is predominantly a self-pollinated species and some of the forms that originated from double haploids were developed via anther culture, the plant material used is appropriate for association mapping studies with dominant marker platforms [14,55].
Another factor of importance for association studies is the careful phenotyping of plant materials to eliminate false associations between trait and markers [56]. In our experiments, we used a well described and widely explored test for Al tolerance [7]. Although the test is based on root regrowth, which is an indirect approach to the identification of tolerant plants, it is usually assumed to correlate well with the trait [57]. Nevertheless, it is obvious that Al tolerance is multigenic and a physiological test is not necessarily the best choice for screening plant material [58]. Another important issue is the use of relative rather than direct measures of a trait to enable comparison between different experiments [59]. The ratio between the root regrowth of a given plant and the longest root regrowth measured within the analyzed population is a possible measure. The ratios underwent arcsine transformation [ Figure 3 Alignment of triticale (A) and rye (B) 6R genetic maps.
The associated markers and those belonging to the same redundant marker bin(s) (see also  aluminum tolerance values were rescaled to new values to fulfill the statistical requirements of quantitative traits. The DArT platform proved to be useful in association mapping performed on wheat [29,33], barley [34] and oat [31], and was a reasonable choice for our studies on advanced breeding forms of triticale, which identified several markers assigned to different chromosomes. However, it should be stressed that even using DArT markers it is practically impossible to avoid missing data that may appear at a level lower than 5% in the case of DArTs. Moreover, due to the high sensitivity of the approach, numerous rare markers with low frequencies were identified. Because such markers may influence linkage disequilibrium, they had to be removed from the analysis [60,61]. Similarly, identical markers or tightly linked ones (redundancy) may reduce the sensitivity of association mapping. Our approach nearly entirely eliminated or significantly reduced missing markers without involving mclust R-CRAN packages that are insensitive to such data [61]. Elimination of redundancy did not affect the information on markers that could be alternatives for the associated ones. Nevertheless, by using this approach, the advanced breeding forms used in each chromosomal set were reduced due to the merging of identical forms following the removal of missing and merging redundant markers. In some cases, the number of forms used dropped from 232 to 141. However, the chromosomal sets remained sufficiently large and exceeded the lower limits used by other studies [54].
The presence of population structure may result in spurious associations that could lead to numerous false positives [55,56]. To avoid such a problem, we used the agglomeration analysis implemented in the PAST software [62] and Bayesian statistics implemented in the Structure program [56]. Both methods indicated the presence of data structuring without separating winter and spring forms. Interestingly, while agglomeration and Bayesian approaches were capable of identifying data structuring, only the latter approach grouped individuals according to their known Al tolerance and was thus selected for routine use. Although calculations performed in the Structure software are time consuming due to the requirement for many burning periods, iterations and repetitions for each K value tested, they deliver information on the average genetic structure of the chromosomal sets required for association mapping in TASSEL [63], which is the most widely used software for association studies in plants [12,26,30,35]. It was used in similar studies on wheat and allowed for the identification of markers associated with traits such as response to stem rust, leaf rust, yellow rust, powdery mildew, grain field, heading date, flowering time, etc. [29,33].
Association mapping using both the GLM and MLM methods resulted in congruent results. However, the MLM approach usually provided higher r 2 values and a stronger association for the same markers than GLM. This confirms that the involvement of data structuring and relationships among analyzed forms improves the resolution of association mapping. An alternative approach based on Statistical Machine Learning (SML) to identify associated markers [64] was also applied. This approach has numerous advantages over the Bayesian method and it does not require time consuming analyses of population structure, as calculations are performed relatively fast [64]. It allowed identification of Al tolerance-associated markers that mostly, but not always, corresponded with those obtained in TASSEL. The discrepancies were possibly due to the fact that the whole data set rather than the chromosomal sets was used for calculations. Another possible explanation is the effect of structure on the results of association analysis. It will be interesting to compare the results with those of an SML algorithm that corrects for "structure", as this version has been developed recently (DArT PL, unpublished). In general, the smaller number of associated markers detected by SML is consistent with the more conservative (and likely more realistic) performance of this method when compared to several other techniques, as reported by Bedo et al. [64].
Based on known assignments of DArTs, Al-tolerant associated markers were localized to 3R, 4R, 6R and 7R chromosomes independently of the mapping approach used. No association was detected on the wheat genome. Our results are in agreement with several prior reports [40][41][42]65], indicating that Al-tolerant genes crucial for the expression of the trait in triticale are located on rye chromosomes rather than on wheat chromosomes. Considering that the strongest associations were in the 3R and 7R chromosomes, our results are congruent with those presented earlier [41,65,66] and our own results on several biparential triticale mapping populations (in preparation). Keeping in mind that most of our Al tolerance-associated markers have redundant counterparts, we succeeded in identifying as many as 52 candidate markers (46 via mapping in TASSEL and 14 following the SML approach, including eight markers identified via both methods).
Although association mapping may provide valuable information on associated markers and comparison with known DArTs may suggest their chromosomal assignments, their precise location is difficult to determine if saturated consensus genetic maps are not available. Unfortunately, such maps do not exist in triticale. However, a recently published report generated triticale and rye genetic maps using DArT markers [15,19]. Although rearrangements of rye chromosomes in the triticale genome in comparison to the rye genome may occur [67], changes in the distance between markers within several cM should not be a frequent occurrence. Thus, both maps could be used to verify whether associated markers assigned to the same chromosome fall within the same region or not. Such information proved to be valuable in estimating the putative location and the number of QTLs at least in the 4R and 6R chromosomes, where markers associated with the trait were located within a very short distance, which indicated the presence of two QTLs. Although few markers associated with Al tolerance were mapped to 3R, they were highly associated with the trait, indicating that a single QTL is present on the 3R chromosome. Similarly, at least two QTLs of different significance are present on 4R. The data for the 6R chromosome suggest that there might be as many as three QTLs, but only one seems to be highly significant. Finally, it is suggested that a single QTL represented by highly associated markers is also located on 7R, which is in agreement with previously published reports. Prior studies reported one QTL located on 3RS [41,42], 4RL [40,68] and 7RS [65]. Gallego and Benito [69] identified two isozyme loci linked to the rye Alt1 gene on chromosome 6R. This gene is probably the same as that located on chromosome arm 6RS (Alt1) by Anioł and Gustafson [40]. These results reported in the literature suggest that our QTLs may be located on the same arms of the chromosomes mentioned above. However, this localization is difficult to verify because prior studies used wheat-rye addition lines.
Genome regions under selection pressure for a given trait are likely to be involved in the expression of the trait [37]. Positive selection is considered to be responsible for adaptive traits [39]. The forms used in this study were selected for aluminum tolerance via many generative cycles; therefore, the identification of genomic regions under selection pressure (via markers called outliers), could be the method of choice to identify linked markers. Unfortunately, outliers reflecting genomic regions under positive selection located in the vicinity of markers highly associated with Al tolerance (ca 9 cM apart) were only present in the 4R linkage group. Moreover, some outliers under balancing selection were also within the same genomic region. Interestingly, all outliers indirectly mapped to the fourth chromosome covered the same region, extending over approximately 20 cM. In addition, the possibility that a single marker under positive selection could be close to the group of markers highly associated with the trait on 6R cannot be excluded. However, with the currently available rye and triticale maps (including possible discrepancies in synteny/colinearity between these genomes due to genome rearrangements), such a hypothesis is difficult to test. Similarly, another outlier under positive selection could be located in the proximity of the associated marker on 3R (ca 50 cM apart), while the other one did not appear to be linked to the second Al-associated marker. Our data confirm that outliers reflecting genomic regions under positive selection may be linked to the trait of interest, at least in the material used in this study. Nevertheless, it is evident that markers identified via analysis of outliers need independent confirmation of their value for MAS purposes.

Conclusions
The DArT approach was used to generate numerous polymorphic markers for association mapping and to support the chromosomal location of the markers. Association mapping using GML, MLM and SML resulted in comparable results, although data obtained by SML differed to some extent from those derived by GLM and MLM. Involvement of genetic maps of rye and triticale allowed the grouping of markers according to their chromosomal positions and the identification of specific genomic regions (possibly QTLs) that could be involved in the expression of the trait. Outliers related to positive selection could be useful as additional candidate markers linked to the trait of interest.

Plant materials
The 232 triticale breeding forms used in the study were originated from the Experimental Station (Małyszyn, Poland) and consisted of 193 winter and 39 spring inbreed forms. Three winter triticale lines and 15 spring lines were double haploids (DH). Each triticale form was represented by a single, randomly selected plant.

Al tolerance test
A standard Al tolerance physiological test was performed [7]. Triticale seeds were sterilized in 10% sodium hypochlorite for 10 minutes and then washed in water. After germination for 24 h at 10°C on moist filter paper in Petri dishes, they were transferred to a polyethylene net floated in a tray. The tray was filled with basic medium containing 2.0 mM CaCl 2 , 3.25 mM KNO 3 , 1.25 mM MgCl 2 , 0.5 mM (NH 4 ) 2 SO 4 and 0.2 mM NH 4 NO 3 (pH 4.5), and left for 3 days under controlled-environment growth cabinet (POL-EKO-APARATURA, ST500 B40 FOT10) conditions at 25°C, photoperiod 12/12 h day/night, light intensity 40 W m -2 and aeration. The plants were then transferred onto the same medium supplemented with AlCl 3 (0.59 mM (16 ppm)). After 24 h, roots were washed with water and seedlings were placed again in the basic medium for 48 h. To assess tolerance levels, roots were stained in 0.1% Eriochrome Cyanine R for 10 minutes. The continued growth ability of roots was a measure of Al tolerance/sensitivity. To evaluate the response of seedling roots, the length of regrowth in mm was measured. The highly tolerant rye cultivars Dańkowskie Złote (winter rye) and Strzekęcińskie (spring rye) were used as controls.

Phenotypic data transformation
The direct measures of root regrowth in mm were recalculated using the longest regrowth of all the seedlings as the denominator. Arcsine transformation was performed according to the formula arcsine square root (regrowth/ the longest regrowth), where regrowth was measured in mm.

DNA isolation
Total genomic DNA was isolated from fresh leaves of 14-day-old seedlings using the Plant DNeasy MiniKit 250 (Qiagen) following the manufacturer's instructions. DNA quantity was measured spectrophotometrically (NanoDrop ND-1000), and its integrity and purity was verified via electrophoresis on 1.2% agarose gels stained with EtBr (0.1 μg/ml) in TBE.

Genotyping
The protocol for the AFLP fingerprinting followed that described by Vos et al. [70] with minor modifications according to Bednarek et al. [71]. Samples of genomic DNA (0.5 μg) were digested with EcoRI/MseI, following ligation of adaptors and pre-selective amplification. The following rye SSRs were assigned to the 7R chromosome and used under the experimental conditions and thermal profiles suggested by the owners of microsatellite bases: SCM [73]). Amplified products (PTC-225 Peltier Thermal Cycler (MJ Research)) were denatured and separated on a 7% denaturing polyacrylamide gel following overnight exposure to X-ray films at -35°C.
DArT marker analysis was performed by Diversity Arrays Technology P/L, Canberra, Australia using methods described by Tinker et al. [60].

Data preparation for GLM and MLM
DArT molecular markers were transformed into binary (presence/absence) matrices and divided according to chromosome assignment. An additional matrix with unassigned markers (DArTs, SSRs and AFLPs) was also prepared.
When more than 30% of data were missing, individuals were removed from further analysis (seven forms out of 234). Each chromosomal marker set was checked for the presence of identical or similar plant forms using agglomeration analysis (UPGMA) and Dice genetic distance in PAST software [62]. The forms were assumed to be identical if the differences between them did not exceed 2% and if their molecular profiles, except when missing markers, were identical. The profiles of such individuals were merged, and missing markers were replaced by their counterparts from the other individual. The rationale for this was that even if two lines differed from each other (considering possible variation due to missing data), they were still significantly related and therefore representing them as a single entry would still be meaningful in association studies and should reduce redundancy. If a discrepancy in the Al tolerance of the individuals forming merged assemblies arose, then the highest value of the trait was assigned to the assembly.
Preliminary elimination of redundant markers was performed in the AFLPop ver. 1.1 excel add-in [74]. Due to numerous missing data, additional elimination steps were needed. Markers were clustered (UPGMA) using Dice genetic distance with PAST software, and those separated by a genetic distance lower than or equal to 2% (formed marker assembly) were merged. Missing data were completed using information from the redundant markers of the contiguous assembly. Only one representative of the given redundant marker assembly was retained, and information on the removed markers was saved for further analysis. Finally, low PIC markers, with a minor allele frequency of less than or equal to 5%, were also removed from analyses.

Data structure
UPGMA clustering using Dice genetic distance was applied on the basis of the data from all non-redundant individuals and non-redundant markers of each chromosomal data set with PAST [62]. The robustness of the branches was estimated using 1000 bootstrap replicates.
Genetic structure was studied with STRUCTURE 2.2.3 program [75] following a Bayesian approach and using no admixture model or independent allele frequencies.
Each simulation was run using burn-in and MCMC (Markov Chain Monte Carlo) lengths of 300 000. The range of possible Ks was tested from 1 to 10. Each simulation was run 10 times to quantify the amount of likely variation for each K. Estimation of the uppermost hierarchical level of the genetic structure was made using an ad hoc statistic ΔK and following the procedure described by Evanno et al. [76]. Computations were made using the BioPortal project [77].

Average genetic structure
The average genetic structure of each chromosomal set was estimated in CLUMPP [78] based on ten Q matrices obtained in STRUCTURE for the given K.

Linkage disequilibrium
For LD calculations, the correlation squared (r 2 ) was used because it is relatively insensitive to small sample sizes and low allele frequencies [79]. Moreover, (r 2 ) is adequate for mapping QTLs [26,79]. The General Linear Model (GML) and Multiple Linear Model (MLM) implemented in TASSEL software [63,80] were applied.
For the purpose of MLM analysis, kinship matrices adequate for dominant markers were evaluated in SPA-GeDi [81]. Kinship matrix data concerning averaged structures were calculated using CLUMPP software and based on Q matrices.

Statistical Machine Learning (SML)
Marker-trait associations were tested using SML technology as described by Bedo et al. [64]. The algorithms described in this paper were implemented as a "web service" by F. Detering (DArT PL, not published) on DArT PL's intranet. The software was run on the "non-redundant" marker set and a set of phenotypic data (see Phenotypic Data Transformation). For each marker in the dataset, the software calculates the PAVE value [64], which measures the contribution of this marker to the model by describing the phenotype as well as the probability (P) of this effect being observed by chance only. In addition, the software determines the complexity of the model (number of markers) contributing to phenotypic variation.

Indirect location of markers on genetic maps
The information available on triticale [19] and rye [15] genetic maps saturated with DArTs was used to locate the markers indirectly on triticale chromosomes.

Candidate genomic regions under selection pressure
Markers reflecting genomic regions under putative positive and balancing selection were identified by the Mcheza software [82]. The input data was organized based on the average structure of a chromosomal set obtained as described above. In an infinite allele model with 95000 simulations, "neutral" and "forced" mean F ST options were applied.