Overview of data and analyses
This paper uses eight datasets to (1) conduct QTL mapping with BovineHD (high density) SNP in milk production and component traits, (2) conduct association studies with imputed sequence variants in target regions for milk production and component traits, and (3) conduct an eQTL analysis with sequence variants in target regions to identify likely causal variants. Many of the datasets represent exact data or expanded datasets from previously described analyses and Additional file 1: Table S5 shows the number of animal records used in this analysis for each data type and their references (where relevant). New data includes the 16 milk component traits and its collection was approved by the Department of Primary Industries Ethics Committee. This is the first analysis to consider the two completely independent data sources of sequence variants from a global initiative (i.e. the 1000 bull genomes dataset) and from a dataset generated in New Zealand by the Livestock Improvement Corporation. Further details on data and the analysis are given below.
Data collection for milk component traits
There were 728 cows whose combined morning and afternoon milk samples were measured for lactose, mineral (calcium, potassium, magnesium, sodium, phosphorus, sulphur, zinc) and protein (lactoperoxidase, lactoferrin, immunoglobulinG, alpha-lactalbum, beta-lactoglobulin, kappa-casein, alpha-S1-casein, beta-casein) concentrations. Traits were measured 1 or 2 times with a 6 week interval between samplings. Details for the number of records, trait means and measurement units are given in Additional file 1: Table S1. Minerals were assayed by microwave acid digestion of homogenised milk samples in a mixture of nitric acid and hydrogen peroxide and measuring the digestant using Inductively Coupled Plasma Emission Spectroscopy (all minerals except zinc) and atomic absorption spectrophotometry (zinc only). Major milk proteins (alpha-lactalbum, beta-lactoglobulin, the 3 casein types) were measured using capillary zone electrophoresis [16] with minor proteins (lactoperoxidase, lactoferrin, immunoglobulinG) quantified by HPLC.
Phenotype and genotype preparation for milk component traits
The model fitted to the data aimed to correct phenotypes for non-genetic effects. ASReml [17] was used to fit the following model to each trait: trait = mean + breedi + age4 + dim4 + HYSj + PEj + animj + ej,k; where i = breed code (8 levels, accounting for degrees of Holstein, Jersey and unknown ancestry); age4 and dim4 = covariates of cow age (age) and days-in-milk (dim) fitted as 4th order polynomials; PEj, animj and HYSj = random effects for permanent environment [PE ~ N(0,σ2
PE)], additive genetic [anim ~ N(0,σ2
A)] and herd-year-season (HYSj, HYS ~ N(0,σ2
HYS)] for cow j and ej,k is the residual for measurement k from cow j. Thus a phenotype for animal j was ∑n(PEj + animj + ej,k)/n, where n is the number of records for cow j. Only cows with 2 records were used in the final analysis (i.e. up to 444 animals). Animals had real and imputed Illumina BovineHD BeadChip genotypes for 632,003 SNP. Quality control procedures and imputation were carried out as part of the larger population of genotyped bulls and cows (see below) following [5]. Quality checks included pruning of SNP on the basis of their GenTrain score (Gen-Call > 0.6) and removal of SNP with less than 10 copies of the rare allele in the larger population. Imputation used Beagle v3 [18]. The cows included in this dataset had a strong family structure and most were from one of 8 sire families (Additional file 1: Figure S4).
Data for milk production traits
The milk production data is the Holstein reference of 8,478 cows and 3,049 bulls as described by Kemper et al. [6]. Briefly, these are animals evaluated under Australian conditions for 5 milk production traits; milk yield (L/lactation), fat yield (kg/lactation), protein yield (kg/lactation), fat percentage in milk (%) and protein percentage in milk (%). Traits were obtained from the Australian Dairy Herd Improvement Scheme as either trait-deviations (for cows) or daughter-yield deviations (for bulls) which are phenotypes pre-corrected for non-genetic effects. Some of these records are highly accurate as they are the culmination of up to 6 lactations or, in the case of bulls, many 1000’s of daughter records, potentially with multiple lactations contributing to each daughter record. The Jersey cow population used for validation of the variants in the latter stages of the association study is the reference dataset of 3,917 cows from Kemper et al. [6] where the phenotypes are trait-deviations for the traits as described above for the Holstein animals. All animals had real and imputed Illumina BovineHD BeadChip genotypes for 632,003 SNP which had passed quality control procedures [5].
Identification of QTL regions with HD SNP genotypes
QTL were identified in milk production and component traits using regions showing high variance in local genomic estimated breeding values (GEBV, i.e. genetic merit) [19]. Variance in local GEBV were obtained for milk production traits from Kemper et al. [6] using the Holstein-only reference population of 11,527 bulls and cows analysed with the weighted BayesR procedure. This analysis weighted bull and cow records to account for heterogeneous error variance of the data and was found to have moderate-to-high predictive value for overall genetic merit (accuracy = 0.58–0.88) [6]. Thus high variance in local GEBV aimed to identify genomic regions underlying variation in the predicted genetic merit. From Kemper et al. [6], variance in local GEBV are calculated as the variance in Wv, were W is a matrix of SNP genotypes for the reference population in a 250 kb region and v is the SNP effect estimated by BayesR. The local variance in GEBV has the advantage of accounting for the haplotype structure of the data and analysis of small regions (sliding windows of 250 kb) overcomes, in part, problems associated with simultaneous fitting of all variants (e.g. splitting of QTL effects between adjacent SNP in strong LD [20]). Windows of 250 kb were chosen to represent haplotypes segregating in the population prior to breed formation [21]. QTL were defined in milk production traits as the 2 % of the genome with the highest variance in local GEBV. The highest 2 % of windows represent about 90 % of the total cumulative window variance in each trait.
QTL mapping for component traits was also conducted using BayesR [5, 6] and variance in local GEBV. As the heritability of these traits was unknown (and could not be estimated accurately due to the strong half-sib structure in the data), we assumed SNP effects came from a mixture of normal distribution with variance equal to 0, 0.00005, 0.0005 and 0.005 of the phenotypic variance (σ2
P). Local GEBV were calculated as described above from the estimated SNP effects [6]. Milk composition traits showed a range of genetic architectures, with the largest QTL (defined as the 0.1 % of windows explaining the highest variance) explaining > 95 % of the cumulative variance for some simple traits (Grp I traits; Additional file 1: Figure S5) but < 25 % of the total in more complex traits (Grp II traits; Additional file 1: Figure S5). Only the largest QTL for each trait were explored further, where these QTL were investigated for co-location with QTL from milk production traits. Although more formal approaches for declaring QTL under Bayesian frameworks are available, e.g. the calculation of Bayes factors [22], the approach taken here could be applied directly to available data and formal testing used a chi-squared test for independence (see below).
We tested the hypothesis that QTL for milk production traits are independent of QTL for milk composition traits. Thus the expectation was that there should be no overlap between these two sets of QTL. That is, if we select 2 % of the genome with milk production QTL, 0.1 % of the genome with QTL for component traits and there are 10,015 independent windows, then we expect < 1 window overlapping between the two sets (0.02 × 0.001 × 10,015 windows < 1 window). Since the QTL analysis used sliding windows of 250 kb with 50 kb between adjacent windows, we performed the test on the average number of overlapping QTL from each set of non-overlapping windows. The chi-squared test with Bonferroni corrected P-value [0.05/(16 component traits x 5 milk production traits)] tested if the number of significant overlapping QTL regions was more than expected by chance (P < 0.05). As a negative control, we also tested the overlap between milk component QTL and a trait with good prediction accuracy but no known relationship to milk component traits. The trait selected was stature (accuracy = 0.54) [6] and, as expected, there was no significant overlap between stature QTL and the largest QTL identified for the 16 milk component traits. Chi-squared tests for all trait pairs with co-locating QTL are given in Additional file 2.
Imputation of sequence variants and association study in targeted regions
The two regions were chosen for association studies with imputed to full sequence variants. These regions were the most promising novel finding (BTA1:144.4Mbp) and an example of a region near the casein complex (BTA6:87.5Mbp) which has several genes encoding for the major milk proteins. Imputation used phased Holstein variant calls (n = 260) from run4 of the 1000 bull genomes project [9] and Minimac2 [23], where SNP used for imputation were quality checked for concordance with 800 K genotypes [24]. Imputed regions included a minimum of 4 Mb surrounding each QTL and focus on either 28,474 (Chr1:143-146Mbp) or 4,527 (Chr6:87–87.5Mbp) variants in the target regions. Sequence variants from the 1000 bull genomes includes bi-allelic SNP and and small bi-allelic indels. Variants with minor allele frequency > 0.001 were tested for association with the milk production and composition traits (from Table 1) using genotype probabilities in EMMAX [11] and an identity-by-state matrix constructed with 800 K genotypes. Association tests for milk production traits in Holstein bull and cow datasets were conducted separately (to minimise the effect of the heterogeneous error variance of these two data types), and then combined assuming a n degree-of-freedom for a chi-squared test statistic where the test statistic for each variant was given by ∑nt2 (where n is the number of t-statistics included in the test [11]). The multi-trait analysis only used the milk production traits identified as containing QTL (i.e. those identified Table 1, for each region). Analyses fitting SNP covariates used the same procedure as above and the covariate option in EMMAX. Validation of sequence variants using Jersey cattle used Jersey cow genotypes and phenotypes (n = 3917) as described by Kemper et al. [6] and consisted of SNP genotypes for 632,003 SNP. Sequence imputation used Minimac2 [22], as above, and phased Jersey animals (n = 61) from the 1000 bull genomes [9] as the imputation reference.
eQTL data collection and analysis
Expression QTL analysis was conducted using imputed genomic sequence in conjunction with a mammary RNA sequence dataset representing 406 lactating cows. These data comprised an expanded dataset to that described previously [25]. Briefly, samples were derived by mammary tissue biopsy and total RNA libraries prepared for 100 bp paired end sequencing on the Illumina HiSeq 2000 instrument. Library preparation and sequencing was performed by NZ Genomics Limited (NZGL; Auckland, New Zealand) or the Australian Genome Research Facility (AGRF; Melbourne, Australia). Sequence reads were mapped to the UMD3.1 genome using Tophat2 (version 2.0.12) [26], yielding an average of 88.9 million mapped read-pairs per sample. Expression phenotypes representing SLC37A1, CSN1S1, and CSN3 were quantified using v1.14.0 of DESeq [27], representing variance stabilised read counts corresponding to gene structures from Ensembl gene set release 77.
RNAseq animals were genotyped using the Illumina BovineHD BeadChip (N = 377), or Illumina SNP50k BeadChip (N = 29), with the latter cohort imputed to the BovineHD BeadChip prior to sequence imputation using v4 of Beagle [18]. These data were then merged with an additional variant set called directly from the RNAseq alignments, representing a high confidence, quality-filtered consensus set called using GATK HaplotypeCaller (v3.1) and Samtools (v1.2) [28, 29]. Whole-genome sequence imputation was performed using a sequence reference population of 556 animals described elsewhere [25]. Briefly, genome sequence variants were identified using GATK HaplotypeCaller (v3.1) and phased using Beagle (v4) [18, 28]. Variants with initial allelic R2 values > 0.95 in the reference population were retained and imputed into the target population using Beagle (v4) [18]. Any variants in the target population with imputation R2 values < 0.70, and minor allele frequency < 0.001 and Hardy-Weinberg thresholds of P < 1 × 10−10 were removed from further analysis. Plink (v1.90) [12] was used to test the association between sequence variants in the QTL regions and the normalised expression phenotypes described above. BovineHD BeadChip genotypes in conjunction with the identity by state and multidimensional scaling procedure implemented in Plink (v1.90) [12] to calculate population structure covariates for inclusion in the SNP association models. Ten covariates were fitted in these models, representing a practicable number of covariates which together explained > 50 % of the genotypic variation. Models also included a single fixed effect to account for differences in cohorts/sequencing facilities. The sequence intervals comprised 22,263 variants for analysis of SLC37A1 (Chr1:143-146Mbp), and 3,169 variants for analysis of CSN3 and CSN1S1 (Chr6:87–87.5Mbp). The eQTL results presented correspond to the 371 animals that passed all quality-filtering criteria, consisting of removal of genome-wide expression outliers based on principal component analysis [30], nominal genotype call rate (<0.95), and other quality metrics.