Plant material and phenotyping
The plant material for this association study represents a half-diallel mating scheme of of 21 maize inbred lines (7 Flint and 14 Dent) from the breeding program of the University of Hohenheim, Germany, with 98 hybrids resulting from the factorial mating scheme of Dent by Flint lines [24]. The set of Flint lines is composed by four lines with European Flint background (F037, F039, F043, F047) and three with Flint/Lancaster background (L028, L035, L043). The Dent lines include eight lines with an Iowa Stiff Stalk Synthetic (S028, S036, S044, S046, S049, S050, S058, S067) and six with an Iodent background (P033, P040, P046, P048, P063, P066). Field trials were conducted to collect the phenotypic data at five locations for the inbred lines in 2003 and 2004 and at six locations for the hybrids in 2002 [5, 25]. Field data for GY were measured in Mg ha− 1 with adjustment to 155 g kg− 1 grain moisture (Additional file 1: Table S1).
For transcriptome expression analysis and sRNA sequencing all 21 inbred lines were grown under controlled conditions (25 °C, 16 h day, 8 h night, 70% air humidity) for seven days, the whole seedlings including roots were flash-frozen in liquid nitrogen. Five biological individuals of the same genotype were pooled before RNA isolation.
SNP data
SNP data were generated with the Illumina MaizeSNP50 chip [26] in the study of Frascaroli et al. [4].
Microarray transcriptome expression data generation and analysis
Microarray gene expression data was generated on the 46 k maize oligo nucleotide array [27]. RNA-probe synthesis and microarray analysis are described in the study of Thiemann et al. [28]. All expression data has been deposited in the NCBI GEO under accession number GSE17754.
Small RNA isolation, sequencing and sequencing data processing
Small RNA isolation, sequencing experiments as well as sequencing data processing and normalization are described in the study of Seifert et al. [22]. All sequence data has been deposited in the NCBI GEO under accession number GSE51662. Details on the sRNA sequencing data are given in Additional file 2: Table S2.
Identification of discriminative markers
Polymorphic SNPs, where nucleotides of at least one line differed from the remaining lines, were considered as discriminative markers. For mRNA, differential expression is defined as described for microarray analysis for at least one inbred line combination. For sRNA, differential expression between inbred lines is defined as a minimum expression of the lower expressed parent of 0.5 rpmqn and a two-fold expression change towards the higher expressed parent. In case the expression of the lower parent is below 0.5 rpmqn, the higher parent needs to have at least an expression of 1 rpmqn to consider a sRNA as differentially expressed.
Correlation of genomic and mRNA/sRNA expression differences with hybrid performance
The euclidean distances De (1) were calculated for all three data types (SNP, mRNA, sRNA) as the sum of marker differences for all markers that are differential in at least one inbred line combination. The differences for the data types are calculated for the combination of the inbred lines i and j, with ds(i,j) being the expression difference for a specific sRNA, mRNA, or SNP.
$$ {D}_e\left(i,j\right)=\sqrt{\sum \limits_{s=1}^n{d}_s\left(i,j\right)} $$
(1)
The expression difference ds for sRNA and mRNA expression data cs between the lines i and j is calculated as follows:
$$ {d}_s\left(i,j\right)={\left({c}_s(i)-{c}_s(j)\right)}^2 $$
(2)
The difference ds for SNP data with cs being the actual sequence between the lines i and j is calculated as follows:
$$ {d}_s\left(i,j\right)=\left\{\begin{array}{l}0\kern0.5em \mathrm{if}\kern0.5em {c}_s(i)\kern0.5em \ne \kern0.5em {c}_s(j)\\ {}1\kern0.5em \mathrm{if}\kern0.5em {c}_s(i)\kern0.5em =\kern0.5em {c}_s(j)\end{array}\right. $$
(3)
Marker trait association
Associations of markers with the traits HP for GY were established analogous to Frisch et al. [5] by separating the hybrids into classes of low and high trait values (L, H) with equal size and binomial testing. For each individual marker (SNP, mRNA, sRNA) the number of hybrids with differential marker (sequence, expression) between the inbred parents was counted for both classes L and H as oL and oH respectively. With the null hypothesis that differential expression occurs with the same probability for both classes, the probability Ps (4) of a marker being associated to the trait was estimated via the binomial distribution probability function. This function depends on the number of hybrids whose inbred lines exhibit differential expression for the given sRNA in the classes L and H:
$$ {P}_s=\sum \limits_{k={k}_{min}}^n{Bin}_{n,p}(k)\kern1em with\kern1em n=\left({o}_H+{o}_L\right),p= 0.5 $$
(4)
with
$$ n=\left({o}_H+{o}_L\right) $$
(5)
and setting equal probability for association with L and H by p = 0.5. The parameter kmin depending on positive (6) or negative (7) association:
$$ {k}_{min}={o}_L\kern1em if\kern1em {o}_L>{o}_H $$
(6)
$$ {k}_{min}={o}_H\kern1em if\kern1em {o}_L<={o}_H $$
(7)
All markers with p-values lower than the probability threshold after adjustment for multiple testing via FDR correction at 0.05 [29] were considered as associated to the specific trait (HP for GY). The certainty of the association against random artifacts was tested by permutation analyses (100 runs) of the datasets by randomly re-assigned hybrid trait values to the hybrids.
Calculation of distances for trait associated markers
We calculated trait-associated marker binary distances for two inbred lines i and j and a defined set of n markers as follows:
$$ {D}_b\left(i,j\right)=\sqrt{\frac{1}{n}\sum \limits_{s= 1}^n{x}_s} $$
(8)
with xs being set to 1 for differential markers between the two inbred lines and 0 otherwise.
To integrate the opposing binary distances for positively and negatively associated markers in one distance measure, we developed the combined binary distance which integrates the binary distance for npos positively associated markers Db,pos and nneg binary distance for negatively associated markers Db,neg for the two inbred lines i and j as follows:
$$ {D}_{b, com}\left(i,j\right)=\frac{D_{b, pos}\left(i,j\right)\cdot {n}_{pos}+\left( 1\hbox{-} {D}_{b, neg}\left(i,j\right)\right)\cdot {n}_{neg}}{n_{pos}+{n}_{neg}} $$
(9)
Prediction of hybrid performance
The prediction of HP was performed after Frisch et al. [5] using a linear regression model. In contrast to Frisch et al. [5] both positively and negatively associated markers were integrated by using the combined binary distance (Formula 9) as follows:
$$ Y\left(i,j\right)={\beta}_0+{\beta}_0\ast {D}_b\left(i,j\right) $$
(10)
Additionally a HP prediction based using multivariate linear regression (MLR) was performed as follows:
$$ {Y}_m\left(i,j\right)={\beta}_0+{\beta}_1\cdot {D}_{b, pos}\left(i,j\right)+{\beta}_2\cdot {D}_{b, neg}\left(i,j\right)+{\beta}_3\cdot mf\left(i,j\right)+{\beta}_4\cdot {m}_d\left(i,j\right) $$
(11)
Four predictors were included in the MLR, including the binary distances of positively associated markers Db,pos, as well as negatively associated markers Db,neg. The third predictor mf represents the fraction of differential positively associated markers npos(i,j) of all differential associated markers, given by the sum of npos(i,j) and differential negatively associated markers nneg(i,j), between the two lines the two lines i and j:
$$ mf\left(i,j\right)=\frac{n_{pos}\left(i,j\right)}{n_{pos}\left(i,j\right)+{n}_{neg}\left(i,j\right)} $$
(12)
The fourth predictor md represents the dominance of positively or negatively differential associated markers between the lines i and j, given as npos(i,j) and nneg(i,j), to the difference of the positively and negatively associated markers (npos, nneg) defined as follows:
$$ {m}_d\left(i,j\right)=\frac{n_{pos}\left(i,j\right)-{n}_{neg}\left(i,j\right)}{n_{pos}-{n}_{neg}+0.1} $$
(13)
The denominator is incremented by 0.1 to avoid division by zero, if the sets of positively and negatively associated markers are equally large. This predictor includes information about the number of differential associated markers in relation to all associated markers.
We performed three different prediction scenarios as described by Schrag et al. [3]. In the scenario of type-0 prediction none of the two parents were used in test-crosses, whereas for type-1 prediction one of the two parents has been used, for type-2 prediction test-cross data for both parental inbred lines is available. For all prediction types (type-2, type-1, type-0) 3 Flint and 5 Dent lines were chosen randomly to form the estimation set. In the type-1 prediction one of the two heterotic groups was randomly selected to define the lines with known test-cross data. All lines not selected in the estimation set were used as validation set to assess the prediction accuracy as the Pearson correlation coefficient of observed and predicted values for HP for GY. The prediction accuracy was determined in 100 cross-validation runs.
sRNA differential expression analysis between heterotic groups
We used DESeq2 [30] to call differentially expressed sRNAs in support of the threshold-based differential sRNA expression results from 7 Flint × 14 Dents inbred lines without biological replication. We explored the differential expression between the Dent and Flint heterotic groups by setting three genetically most related lines within the groups, according to the genomic grouping of Frisch et al. [5], as replicates and analyzed two different sets (Set1: Flint: F039, F043, F047; Dent: S036, S050, S058 / Set2: Flint: L024, L035, L043; Dent: P033, P040, P066). The sRNAs of each set were filtered for sequences with a summed read count from all replicates of 10 or higher. The differential expression was tested using DESeq2 [30] individually for each set with lines assorted by the heterotic groups. All sRNAs with FDR < 5% were considered as differentially expressed.
We analyzed the differentially expressed sRNAs for known miRNA sequences as well as overlap with hybrid performance associated-sRNAs (hpa-sRNAs) that we identified by marker-trait association analysis. The fractions of differentially expressed hpa-sRNAs of the two sets were compared to the fractions of threshold-based differential hpa-sRNAs from the 9 corresponding inbred line combinations.
sRNA enrichment analyses
The p-values for enrichment and depletion of HP for GY associated sRNAs of specific sequence length were computed by bootstrap analysis as described in Seifert et al. [22].
Reference genome mapping of sRNAs and annotation analysis
The mapping of sRNAs to the B73 reference genome (AGPv4; July 2017) [31] as well as annotation analysis were perfomed as described in Seifert et al. [22].