Variant Calling and Genotyping
Whole-genome sequencing was performed as previously described [51] for 1300 animals, to form a reference population for sequence imputation. Briefly, animals with a mixture of Holstein-Friesian (HF; N = 306), Jersey (J; N = 219), HF × J (N = 717), or other breeds and cross breeds (N = 58) were sequenced on Illumina HiSeq 2000 instruments targeting 100 bp paired-end reads. Genome sequence data were aligned to the ARS-UCD1.2 genome assembly [52] using BWA-MEM (version 0.7.17, [53]). Variant calling was conducted using GATK HaplotypeCaller (version 4.0.6.0, [54]) with variant quality score recalibration applied. After filtering, this variant set (21,005,869 variants) was phased using Beagle (version 5.0, [55]) to create an imputation reference panel.
A separate, non-overlapping population of 411 animals was used for RNA sequencing (RNA-seq) of which 99 were also used for Chromatin Immunoprecipitation followed by sequencing (ChIP-seq). The majority of these animals had previously been genotyped using the Illumina BovineHD SNP-chip. The remaining subset of 27 cows had been genotyped on a lower density panel (Illumina Bovine SNP50 BeadChip). Imputation to WGS resolution was performed as part of a larger study [51] using the same reference population as described above, and resulted in a variant set of 16,640,294 variants following post-imputation filtering to remove variants with minor allele frequencies less than 0.01 in the 99 animal subset.
For the allele specific analysis, SNP alleles were phased as maternal or paternal. Homozygous genotypes remained unchanged but for heterozygous genotypes, alleles were defined as maternal or paternal based on the sire genotypes. If the sire was homozygous for an allele, that allele was designated as paternal, if the sire was heterozygous the allele was defined based on the phasing with the previous SNP.
Masked Genome
To prevent bias to the reference alleles when mapping reads to the reference genome, a masked genome was created with a neutral allele for SNPs that were heterozygous in the sequenced animals. Imputed genetic variants described above were filtered for 1% minor allele frequency in this sample set. Data for the 99 ChIP-seq animals were then extracted and filtered again at 1% minor allele frequency. This set of variants (N = 14,536,882) was used to create a masked genome where a non-variant allele was placed at that location in the reference genome of ARS-UCD1.2 [52].
ChIP-seq and RNA-seq
Mammary biopsies and RNA sequencing were performed as reported previously [23, 24]. Briefly, high-depth mammary RNA-seq was conducted on tissue from 411 cows, sampled in three batches at different points in time. Following library preparation, samples were sequenced using the Illumina HiSeq 2000 instrument to produce 100 bp paired-end reads, multiplexed at two samples per lane.
Prior to mapping, reads were processed using Trimmomatic (version 0.39, [56]) in paired-end mode, with settings LEADING:20 TRAILING:20 SLIDINGWINDOW:3:15 MINLEN:50. Processed reads were mapped against the masked genome described above using STAR (version 2.7.0, [57]) in two stages. In the first stage, exon and junction information from the RefSeq database (annotation release 106 [52]) of protein-coding genes was used to produce an initial mapping, which in turn was used to identify additional novel exons and splice junctions for remapping in the second stage. This resulted in a median of 39 million uniquely mapped read-pairs per cow.
ChIP-seq was performed on a subset of 99 animals from the 411 RNA-seq animals, utilising duplicate biopsies obtained at the same time as samples used for RNA extraction and gene expression analyses. Whole frozen tissue samples (weighing between 6 and 37 mg) were fixed for 10 minutes with 10% formaldehyde and chromatin prepared using the Magnify Chromatin Immunoprecipitation kit (ThermoFisher) as per the manufacturer’s instructions. Fixed chromatin was sheared to 200-500 bp using the Covaris S2 (Covaris) for 3 min, duty cycle five, % intensity four and 200 cycles per burst.
Chromatin immunoprecipitation was performed using the Magnify Chromatin immunoprecipitation kit (ThermoFisher) with some modifications. 30ul of sheared chromatin was immunoprecipitated with 0.25μg of antibody. Depending on the amount of sample each reaction was performed 1,2 or 3 times and the samples were combined after de crosslinking using the MinElute PCR purification kit (QIAGEN). Sequencing libraries were prepared for each ChIP sample and a control for each chromatin preparation (input sample) using the NEBNext Ultra II DNA Library Prep Kit for Illumina (New England Biolabs) as per the manufacturer’s instructions and run on the HiSeq 3000 (Illumina) in a 150-cycle paired end run.
Each library was sequenced to between 20 and 200 million reads (median 58 million). Raw sequence reads were trimmed of adapters and poor-quality ends using Trimmomatic (version 0.38, [56]). Bases of quality less than 20 were removed from the 3′ and 5′ ends of the sequence and trimmed reads with length less than 50 were removed. Trimmed reads were mapped to the masked ARS-UCD1.2 genome [52] using BWA-MEM (version 0.7.17-r1188, [53]) with default settings. Poor-quality reads with q < 15 were removed using Samtools (version 1.9, [58]), and duplicate reads were also removed. MACS2 (version 2.1.1, [59]) with default settings was used to call peaks from mapped ChIP-seq reads with input reads as control. The quality of peaks was checked with deepTools plotFingerprint (version 2.5.4, [60]) and SPP (version 1.0, [61]).
To generate a consensus set of peaks for each mark across all samples, equal numbers of reads from each bam file were randomly sampled and merged using Samtools (version 1.9, [58]). Peaks were called from the merged bam file using MACS2 (version 2.1.1, [59]) with default settings as described above.
Allele Counts
For the allele specific analysis, maternal and paternal read counts were calculated by counting maternal/paternal alleles for all SNPs under a ChIP peak (peak SNP or pSNP) or within an exon (transcript SNP or tSNP) from mapped ChIP/RNA-seq reads using GATK tools (version 4.1.2, [54]. First, a gVCF file was created at base pair resolution using GATK HaplotypeCaller. Then, allele counts were calculated at each SNP under a peak or in an exon using GenotypeGVCF, applying the option “depth per allele by sample”. Individuals genotyped as homozygous at the pSNP/tSNP were excluded from the analysis because they were not informative for allele-specific analyses. Individuals genotyped as heterozygous in the genomic data but were monoallelic at the pSNP/tSNP in the RNA/ChIP data were also excluded to remove potential imputation errors.
To test whether these allele count phenotypes from the same peak/exon shared paternal or maternal allelic bias, all phenotypes under a peak/exon were tested for similarity using a G-test.
For s pSNPs under a peak (or tSNPs in an exon):
Let
nij = allele count for pSNP i where i = 1 to s and j = maternal or paternal.
ni. = total number of counts for SNP i.
n.j = total number of maternal or paternal alleles over all s SNPs.
n.. = total of all counts.
These make an sX2 contingency table. To test the null hypothesis that the ratio of maternal to paternal alleles is the same for s SNPs.
G = 2( ∑ [nij ∙ ln(nij)] + ∑ [n.. ∙ ln(n..)] − ∑ [n.j ∙ ln(n.j)] − ∑ [ni. ∙ ln(ni.)]).
Read counts for peaks/exons with multiple p/tSNPs were calculated as the sum of maternal and paternal allele counts for all p/tSNP in the peak/exon.
Allele Specific QTL analysis
All SNPs within one megabase (1 Mb) of the midpoint of the peak/exon were tested for association with the phenotype (maternal and paternal allele counts at the exon or peak). We called these SNPs the driver SNPs or dSNPs. Two statistical tests were used to assess the significance of the relationship. The first test was used to filter for cases where sample size was small.
For the first test, where:
0 = reference allele and 1 = alternate allele.
For each individual i that had dSNP genotype 1|0, let Mi = number of counts of the maternal allele at the phenotype.
For each individual i that had dSNP genotype 1|0, let Pi = number of counts of the paternal allele at the phenotype.
For each individual j that had dSNP genotype 0|1, let Nj = number of counts of the maternal allele at the phenotype.
For each individual j that had dSNP genotype 0|1, let Qj = number of counts of the paternal allele at the phenotype.
And T = ∑ Mi + ∑ Pi + ∑ Nj + ∑ Qj and A = ∑ Mi + ∑ Qj.
Then:
\(Z=\left(\frac{A}{T}-0.5\right)\times \sqrt{4T}\) would be approximately normally distributed.
P values for the normal distribution were calculated in R (version 3.6.1, [62]). dSNPs which were significant at p < 0.001 were taken through to the second test.
For the second test:
We calculated a linear model: Yi = a + bXi
Where, for each individual i,
$${Y}_i= lnln\ \left(\frac{maternal\ allele\ count+10}{paternal\ allele\ count+10}\right)$$
When individual i had dSNP genotype 0|0 or 1|1
When individual i had dSNP genotype 1|0
When individual i had dSNP genotype 0|1
P values were calculated in R [62].
Histone QTL
Read counts for both the ChIP and Input BAM files were counted for each of the corresponding ChIP-seq consensus peaks. As an initial quality control filter, peaks were removed where the peak read count was below the 1% quantile across all peaks. To remove peaks that were potentially caused by artefacts in the reference genome, additional peaks were removed where the input read depth was more than five times the average across all peaks. This yielded a data set comprising peak and input read counts for 503,921, 293,903, and 387,770 peaks (for H3K27ac, H3K4Me1, and H3K4Me3 respectively).
Next, phenotypes suitable for mixed linear model analyses were generated. First, each Peak (Pij) and Input (Iij) read count was normalised by dividing by the mean read count per animal (across all peaks) to yield normalised counts (PNij and INij) across all peaks i and animals j. Ordinary least squares (OLS) was then applied to remove the effect of the Input read depth, in the following manner: for each peak i, let yi = lnln (PNi, ∙ + 1) and xi = lnln (INi, ∙ + 1) , then fit the model yi = α + βxi + εi, ∙. The vector of residuals εi, ∙ was then used as the phenotype for histone QTL (hQTL) discovery.
Prior to association analysis, further filtering was applied to remove outlier samples. Here, individuals were removed using principal components analysis (PCA) criteria, in an approach similar to that employed by Ellis, Gupta [63], those animals with PCA values more than four standard deviations from the mean in any of the first seven components were excluded. This filter yielded a data set containing 34, 96, and 97 cows for H3K27ac, H3K4Me1, and H3K4Me3 respectively. Of these, 33, 94, and 95 respectively had imputed sequence-resolution genotypes available for subsequent analyses.
Histone QTL discovery was performed using GCTA (version 1.93, [64]), applying mixed linear model association testing using the ‘leave one chromosome out’ approach to avoid double fitting of variants of interest (MLMA-LOCO). The genomic relationship matrix (GRM) was created using GCTA with IlluminaHD genotypes. The MLMA-LOCO analysis was run using the subset of the imputed whole genome sequence genotypes that mapped within 1 Mb either side of the peak, incorporating one covariate for ChIP-seq sequencing batch.
Expression QTL
Reads mapping to exons in the RefSeq protein-coding gene database (AR 106 [52]) were counted for all 411 cows using the featureCounts function of the Subread software package (version 1.5.3, [65]). Genes with a median read count of less than five were excluded. The remaining expression data were aggregated by gene and processed using the Bioconductor (version 3.10, [66]) package DESeq2 (version 1.26.0, [67]), transforming the read counts using the variance-stabilising transformation (VST), to yield phenotypes suitable for mixed-model analysis. Next, outlier samples were detected and excluded using PCA on the VST-transformed phenotypes as described for ChIP data, yielding a population of 392 cows. To facilitate the discovery of exon-eQTL (eeQTL), reads were recounted on an individual exon basis, using featureCounts as described above, for a population of 371 animals that comprised the subset of the 392 cows that also had imputed genotypes available (see “Variant Calling and Genotyping” section above). Exons with median read counts of less than five were excluded. Lastly, individual exon expression phenotypes were produced by transforming the read counts using VST and adjusted to remove the effect of sequencing batch, then eeQTL were identified using the MLMA-LOCO approach implemented in GCTA, as described for ChIP-seq above, with imputed sequence genotypes extracted from within 1 Mb of the gene.
Enrichment
Enrichment of QTL under peaks was determined using the formula outlined below [68]:
Enrichment = (C/A)/(B/D) where: A is the number of positions under peaks, B is the number of positions that were QTL, C is the number of QTL under a peak and D is the number of positions in the genome. Values greater than 1 indicate enrichment and less than 1 depletion.
Putative causal variants
Significant dSNPs were filtered to obtain a list of putative causal variants which met the following criteria:
-
p < 0.0001 in hQTL and eeQTL analysis and the first test of asb and ase QTL analysis.
-
Located under the histone peak
-
Had the same direction of effect in all 4 analyses
P values for each putative causal variant were calculated by identifying the Chi-square value for each individual p-value (from the hQTL, eeQTL, asbQTL and aseQTL analyses) and combining them. The combined p-value was calculated from the chi-squared distribution with 4 degrees of freedom. Only the lowest p-value SNP for each peak-exon pair was included. Putative causal variants described here were compared to independent publicly reported gene expression QTL from the cattle Genotype-Tissue Expression (cGTEx [25]) project. For each variant the direction of effect on each exon in our data was compared to the direction of effect for the corresponding gene for eQTL found in mammary (n = 175) and blood (n = 698). cGTEx gene eQTL data was downloaded from the cGTex website (https://cgtex.roslin.ed.ac.uk/).
Identification of putative binding motifs
Sequence motifs were identified for asbQTL dSNPs where the variants were (a) located in the peak for which they were associated (with the exception of H3K27ac, for which very few such sites were identified), and (b) were significant at p-value thresholds of 1 × 10− 8, 1 × 10− 7, or 1 × 10− 6, for H3K4Me1, H3K4Me3, and H3K27ac respectively (based on Bonferroni, but with lower stringency where low numbers of sites were selected). Additionally, dSNPs were selected for aseQTL that were (a) within 10Kb of the TSS of their associated gene (to reduce the number of variants while enriching for cis-acting sites), and (b) were significant at a threshold of 1 × 10− 15. For all selected dSNPs, 21 bp (i.e., the SNP site ±10 bp either side) of DNA sequence was extracted from the ARS-UCD1.2 reference genome (24), as well as the corresponding reverse complement sequence, and these sequences were subsequently clustered using complete linkage with Levenshtein distances calculated between sequences measured over only their central nine bases. The resulting tree was then cut at height 3, corresponding to a maximum edit distance of three nucleotides between the central 9 bp of any pair of sequences. The resulting groups of similar sequences (excluding those with too few member sequences) were then used to produce position frequency matrices (PFMs [69];) that represent candidate transcription-factor (TF) binding motifs. Minimum set sizes to identify a motif were set to ten sequences per cluster.
The JASPAR2018 database [70] was subsequently used to identify transcription factors potentially targeting these motifs. PFMs produced by clustering were compared to those annotated for TF binding sites using the TFBSTools package (version 1.24.0 [71];) in R (version 3.6.2) using the “PFMSimilarity” function and candidate factors selected with relScore > 90%. Both the CORE and POLII collections from JASPAR2018 were used, with the CORE database limited to vertebrate taxa. Clustering and motif identification were similarly applied to the list of variants highlighted as putative causative variants (described above), with the exception that these variants were not required to be within 10Kb of the corresponding gene.
Skewedness of allele frequency towards the reference allele and the positive effect allele was investigated for each cluster. The sequences from which each cluster was produced were identified, and a list produced of the reference and positive effect direction alleles of each of the variants around which those sequences were extracted. Within each list of alleles, the numbers of each base present were compared against a null distribution of equal representation (i.e., 25% each) using the multinomial log-likelihood test, implemented in the R package XNomial (v1.04 [72]).
Enrichment of sequence motifs was explored by comparing observed and expected numbers of motif consensus sequences. Consensus sequences were produced using the following method. First, entropy scores were calculated for each base in the PFM matrices, and bases trimmed from the left and right of each matrix where the entropy (in bits) fell below 0.2. For the remaining positions (columns) in the matrix, base frequencies were sorted and summed until exceeding a threshold of 0.85, when the summed bases were incorporated into a regular expression. For example, for a column with base frequencies A = 0.4, C = 0.3, G = 0.2, T = 0.1, 0.4 + 0.3 + 0.2 > 0.85, so the text “[ACG]” would be appended. Observed genomic motif counts were then produced using Perl (v5.26.1) by matching the resulting regular expression against the sequence of each autosome, plus chromosome X, and the results summed. Expected counts were produced by first counting the number of each nucleotide in the reference sequence for the same set of chromosomes, to produce base frequencies. These were then summed for each position in the consensus sequence (e.g., for the example above, f(A) + f(C) + f(G)), then multiplied together and by the total length of sequence. Enrichment scores were calculated as the observed counts divided by the expected counts.
Analysis of key lactation genes
We manually evaluated the gene lists presented through the ‘Putative causal variants’ analysis described above to identify candidates of prior interest. This analysis was performed in an ad hoc manner, where the aim was not to conduct a systematic survey of the literature but rather identify candidate genes that we ( [24, 27, 28, 31, 32, 35]) and others ( [26, 29, 30, 33, 34]) have recurrently highlighted through previous lactation trait GWAS and QTL studies.
For the more in-depth analyses of the CSF2RB gene, data was leveraged from a previous, detailed investigation of that locus [31]. Here, association statistics were recomputed for presentation in the current study, analysing milk yield phenotypic records from 29,350 cows presented in the original Lopdell, Tiplady [31] paper. This analysis tested all imputed sequence variants within an interval that was ±50Kb of the segment containing the CSF2RB gene and ChIP-peaks of interest (chr5:75,228,399-75,422,505), comprising 1446 variants. Association testing was performed as previously described [31], using GCTA (version 1.91.3beta [64]) and fitting mixed linear models that omitted the CSF2RB host chromosome (i.e. chromosome 5) from the GRM. This GRM was identical to that previously described [31], being constructed from variants from the Illumina BovineHD SNPchip. Since the Lopdell, Tiplady [31] analysis was based on data mapped to the UMD3.1 genome (i.e. the reference assembly preceding that used for all other analyses in the paper), genotype data were positionally ‘lifted over’ to the ARS-UCD1.2 reference genome (24) using a custom script, with these transposed data displayed in the current paper.