Animals
Samples from 82 Angus bulls (45 with muscle and 37 with liver samples) from lines of cattle divergently selected for residual feed intake (RFI) were taken from the Agricultural Research Centre, Trangie, NSW, Australia [21] and 20 first lactating Holstein cows (each of them with both WBC and liver samples) were taken from the DEDTR Ellinbank Research Farm, VIC, Australia [22].
For the GWA studies, 20 complex traits were used from 3296 Bos taurus cattle, which were a subset of 10,191 Bos taurus, Bos indicus and Bos taurus×Bos indicus crosses used in a multi-trait test [23]. The names and abbreviations of the traits are shown in Additional file 2: Table S10. The phenotypes and genetic parameters estimated using these data are described in full by Bolormaa et al. (2014) [23]. In another GWA study for RFI, 5614 Bos taurus cattle described by Khansefid et al. (2014) were used [24].
Genotypes
The Angus bulls that had RNA-Seq data, also had Illumina BovineSNP50 (liver samples) or Illumina BovineHD SNP genotypes (muscle samples). The low density SNP were imputed to high density (HD) genotypes using BEAGLE [25]. The Holsteins had BovineHD SNP genotypes, described by Pryce et al. (2012) [26]. Additionally, 45 Angus bulls with muscle samples also had whole genome sequence (WGS) data with an average coverage of 6.7 fold [27].
The animals in GWA studies had 729,068 SNP genotypes, genotyped using either HD chip or a lower density SNP chip and then imputed to HD using BEAGLE [25] as described by Bolormaa et al. (2014) [23].
The HD SNP genotypes of animals with RNA-Seq data and 5614 animals used in GWA study for RFI were imputed and phased to WGS of 28,899,038 SNPs using FImpute [28]. The WGS genotypes of 45 Angus bulls were also phased by FImpute. The reference genomes used for the imputation were WGS data in Run 4.0, 1000 bull genomes project [27], consisting of 27 breeds and 1147 sequenced animals, including 138 Angus (Black and Red) and 311 Holstein cattle (288 black and white and 23 red and white).
RNA-Seq data
We included here the tissue sampling procedures and RNA-Seq protocol as reported by Khansefid et al., 2017 [29] for completeness in this manuscript. All animals were monitored for one week post procedure for complications before being returned to the main herd at their respective farms.
The beef cattle liver tissue sampling procedure was described in detail by Chen et al., 2011 [30]. Briefly, each animal was sedated with intramuscular 2% xylazine (Xylazil-20, Ilium) and complete local anaesthesia was obtained with Adrenaline by under skin infiltration. On the right side, being about halfway down the curvature of the ribs, the cannula was pushed to penetrate the liver and then rotated and advanced to a depth of 3–5 cm to cut out a biopsy. The procedure was carried out by a professional veterinarian and cut sites were consistent among animals. Liver biopsy samples were expelled into 2 ml tubes of RNAlater™ solution (Ambion, Applied Biosystems). The semitendinosus muscle samples were taken from the growing bulls by a purpose built 12 V powered motorised biopsy drill. The hairs were clipped from a 10 cm × 10 cm area around the point of incision (located 15 and 25 cm below the anus, and over the “poverty groove” of the hind leg) and then that area was wiped with gauze soaked in 70% ethanol solution. Local anaesthetic (5 ml Lignocaine®) is infiltrated under the skin, and a single 1 cm incision is made through the skin using a sterile scalpel. The biopsy needle (approx. 5 cm long and 3 mm inside diameter), attached to the drill was then passed into the muscle with a simultaneous vacuum applied to hold the sample in the biopsy needle and then transferred into 2 ml RNAlater™ solution (Ambion, Applied Biosystems). Total RNA was extracted from liver and muscle tissue by TRI Reagent® (Ambion) and Qiagen RNeasy MinElute kit (Qiagen) using a modified protocol. Approximately 30 mg of liver tissue samples (100 mg of muscle tissue), were finely minced and then mixed with 500 μL TRIzol® Reagent (Life Technologies). The mix were immediately homogenized for approximately 45–50 s and incubated at room temperature for 5 min. The resulting lysate was mixed with 100 μL BCP (1-Bromo-3-chloropropane) incubated at room temperature for 10 min, followed by centrifugation at 10000 g at 4 °C for 15 min. The top aqueous layer was transferred to a new microfuge tube and mixed with an equal volume of 75% ethanol. The resulting lysate was then loaded onto the Qiagen RNeasy MinElute column and RNA was purified per protocol with on column DNA digestion. The extracted RNA quality was assessed using an Agilent Bioanalyzer 2100 (Agilent Technologies) to have integrity number (RIN) of more than 7.0. Ten μg of the extracted RNA from the sampled tissues were enriched with Dynabeads® mRNA Purification Kit (Invitrogen). The cDNA molecules were prepared, barcoded with 24 unique adaptors, enriched by PCR with Phusion® High-Fidelity DNA Polymeras (New England Biolabs Ltd), purified with Agencourt AMPure XP (Beckman Coulter) and selected for target size of 200 bp. Libraries were individually barcoded, pooled and run on a HiSeq™ 2000 (Illumina Inc.) in a 101 bp paired-end run.
For the Holsteins, the blood sampling and liver tissue biopsy, RNA extraction and sequencing were described by Chamberlain et al., 2015 [11] and Khansefid et al., 2017 [29]. In brief, approximately 30 ml of whole blood was collected by venipuncture of the coccygeal vein using BD Vacutainer® Blood Collection Tubes and then centrifuged at 2000 g for 15 min. The white blood cells (about 1500 μl for each animal) were separated and stored in 1.2 ml RNAlater® RNA Stabilisation Solution (Ambion, Applied Biosystems). Liver biopsies were collected by restraining cows in a crush and giving them 10 ml of lignocaine hydrochloride 2 % into the subcutaneous, inter-costal and peritoneal tissues at the site of the insertion of the biopsy punch. A small incision was made with a scalpel before a biopsy punch was inserted into the liver to collect approximately 2–3 g of tissue. Following removal of the biopsy punch, betadine cream was placed in the incision site. Cows were given intramuscular antibiotics (Excenel RTU 2 ml/100 kg) and anti-inflammatory drugs (Ketoprofen 2 ml/100 kg) before being released from the crush. Immediately following collection samples were frozen in liquid nitrogen and then stored at − 80 °C. RNA was extracted from WBC using the RiboPure™ Blood Kit (Ambion, Applied Biosystems) and from liver samples using the RiboPure™ Kit (Ambion, Applied Biosystems) according to the manufacturer’s instructions. All samples were assessed to have RIN greater than 8.0. Sequencing libraries were prepared using the TruSeq™ RNA Sample Preparation Kit v2 Set A (Illumina, Inc.) and selected for size of 200 bp. All libraries were uniquely barcoded, pooled and sequenced on a HiSeq™ 2000 (Illumina Inc.) in a 105 bp paired-end run.
After mRNA sequencing, the raw reads were passed through quality control (QC) filters. Reads were trimmed of adaptor sequence and bases with qscore < 15, where 3 consecutive bases had a qscore of less than 15, the rest of sequence was removed. Reads with minimum average qscore < 20 or read length < 50 after trimming were removed and only the paired reads used in alignment [31].
Alignment of RNA reads
To improve the mapability of reads, 2 customized reference genomes (paternal and maternal) were made for each animal using 28,899,038 phased SNP genotypes as described by Chamberlain et al. (2015) [11]. This strategy reduced bias in the counts towards the alleles represented in the reference genome [32, 33]. For each animal, the RNA-Seq reads were aligned to its maternal and paternal customised reference genomes using TopHat2 [34] with default input parameters and annotation release 75 GFF file (General Feature Format) of bovine genome assembly UMD3.1.
Abundance of alleles and genes
Counting alleles with Samtools mpileup
The SNP genotypes in WGS data (real or imputed) were used to find heterozygote SNPs in each animal and confirm that each heterozygote SNP observed in RNA-Seq reads was not a sequencing error or the result of mismapped reads. However, the maternal and paternal reference genomes contain errors as well, because the DNA sequence depth for the animals with WGS was relatively low and the remaining animals had imputed WGS genotypes. To minimize errors in detecting ASE because of sequencing and imputation errors, heterozygote SNP in RNA-Seq data had to have at least 1 count of each allele, thereby excluding cases of strict mono allelic expression, and a coverage > 10. We used Samtools mpileup [35] command to measure the number of alleles expressed in RNA-Seq data for heterozygote SNPs found in WGS genotypes of each animal. As we mapped the RNA-Seq reads to 2 reference genomes for each animal, we had two counts for each allele. If the counts of alleles from maternal and paternal alignments were different due to mapping bias, the average of the two counts was reported.
Counting genes with HTSeq
The total number of reads mapped to each gene in the reference genome was counted using the HTSeq python package [36]. We assumed some reads were not mapped to the reference genome due to the number of mismatches. When the reads could map to either the maternal or paternal reference genome slightly better, we used the superior one as the reference genome to measure the gene abundance.
Statistical analysis of ASE
To detect ASE, we require a SNP in the transcript (tSNP), however the causative variant responsible for differences in expression may be a SNP close to the gene, referred to here as a driver SNP (dSNP, Fig. 1a). Allelic imbalance can be the result of PO-ASE (Fig. 1b) or ASE (Fig. 1c) [8]. In order to distinguish ASE from PO-ASE in animals heterozygous for the tSNP, we used the log10 of the ratio of maternal to paternal allele counts and included the total abundance of alleles (T) as the weight (the reciprocal of the error variance in eq. 1). A separate analysis was performed for each combination of a tSNP and a dSNP within 50 kb of the tSNP.
$$ {\boldsymbol{y}}_{\boldsymbol{SNP}}=\mathbf{1}\upalpha +{\boldsymbol{X}}_{\mathbf{1}}{\boldsymbol{b}}_{\mathbf{1}}+\boldsymbol{e} $$
(1)
ySNP is an N × 1 vector, where ySNP i = log10(maternal allele count / paternal allele count) for animal i at the tSNP and N is the number of heterozygous animals at the tSNP. 1 is an N × 1 vector of 1’s and ɑ is a scalar intercept that indicates a parent of origin effect. X1 is an N × 1 vector coding the genotype of each animal at a dSNP which may be driving the expression of the tSNP such that homozygotes are coded 0, heterozygote with the reference allele (UMD3.1) inherited from the sire is coded − 1 and a heterozygote with reference allele from the dam is coded + 1. b1 is a scalar measuring ASE due to the allele carried at the dSNP. e is the vector of random residual effects e~N(0, T− 1Iσ2e).
We restricted the data analysed to tSNPs where there were at least two heterozygous animals who had received opposite alleles from their sire so that we could distinguish PO-ASE from other cases of ASE. The variants within 50 kb of tSNP were used as dSNP to see if these variants regulate the expression of the tSNP. So, if the parent of origin of the allele controls the allelic imbalance (Fig. 1a), individuals with − 1, 0 and + 1 genotypes show the same allelic expression frequency and the intercept (ɑ) would be significantly different from 0. On the other hand, if the actual allele of the dSNP regulates the expression (Fig. 1c), homozygote individuals show no allelic imbalance and individuals with − 1 and 1 genotypes have reversed allelic imbalance which influences the slope of the regression line (b1).
The number of significant effects of ɑ and b1 were calculated (p < 0.001). The FDR was calculated with eq. 2 [12].
$$ FDR=\frac{P\left(1-\frac{A}{T}\right)}{\left(\frac{A}{T}\right)\left(1-P\right)} $$
(2)
where P is the p-value of the test, A is the number of SNP that were associated with gene expression at P and T is the total number of SNP tested.
Local eQTL mapping
In local eQTL mapping, the effect of the dSNP on the overall expression of a gene was measured (Fig. 1d). The genes expressed in more than 25% of animals in each group were used to find eQTL separately. The gene counts were normalized with a weighted trimmed mean (TMM) of the log expression ratios and the library size for each animal was adjusted to have equivalent counts per million (cpm) using edgeR [37]. Normalizing the counts minimizes the compositional difference between the libraries and calculating cpm removes the effect of sequencing depth on the variation of gene counts in different samples. Finally, log10 of the counts were used to calculate the association between the abundance of a gene and the dSNPs with ASReml [38] for each group independently using eq. 3.
$$ {\boldsymbol{y}}_{\boldsymbol{gene}}=\mathbf{1}\upalpha +{\boldsymbol{X}}_{\mathbf{2}}{b}_2+\boldsymbol{Zu}+\boldsymbol{e} $$
(3)
ygene is an N × 1 vector where ygene i = log10(normalized gene count) for animal i, 1 is an N × 1 vector of 1’s and ɑ is a scalar intercept. X2 is an N × 1 vector containing the coded genotypes at the dSNP such that x2i = 0, 1 or 2 if animal i has 2, 1 or 0 copy of the reference genome (UMD3.1), respectively and b2 is a scalar measuring the effect of the dSNP on gene expression. Z is an N × M matrix allocating records to animals where M = number of animals in the pedigree and u is an M × 1 vector with ui = the genetic effect of animal i on gene expression ~ N(0, Aσ2u) while pedigree information was used to make the relationship matrix (A). e is the vector of random residual effects e~N(0, Iσ2e). The number tests showing significant association between SNP and gene expression (p < 0.001) and the corresponding FDR were calculated.
Comparing ASE, PO-ASE and local eQTL mapping within and across RNA-Seq data
Within each RNA-Seq dataset, the ASE, PO-ASE and local eQTL mapping results were compared to see if the same SNPs were significantly associated with ASE and gene expression in the same or different datasets. Firstly, a 2 × 2 contingency table was constructed by classifying a SNP as significant or not (p < 0.0001) in two datasets. Then a chi-squared test was used to test if SNPs were significant in both datasets more often than expected by chance according to the proportion of significant SNPs in each dataset. The degree of overlap was calculated as the observed number of SNPs significant in both datasets divided by the number expected if overlap was due to chance and referred to as ‘fold enrichment’. Secondly, for SNPs significant in both datasets, the proportion in which the allele increasing expression was the same in both datasets was calculated. The ASE, PO-ASE and local eQTL results were compared and fold enrichment was calculated across the 4 RNA-Seq datasets to see if mutations that were detected in one dataset can be found in the other ones. For ASE and PO-ASE comparison for each gene, we compared results based on the same tSNP and same dSNP. However, we also made some comparisons for the same dSNP but any tSNP within that gene. The comparisons involving local eQTL are only matched on dSNP and gene since no tSNP is used in eQTL analysis.
Validating ASE and local eQTL mapping in a dataset consisting of 18 tissues
The ASE and eQTL mapping results in the 4 RNA-Seq datasets analysed here were compared with ASE measurements in another study of 18 tissues of a lactating Holstein cow [11] to investigate whether the same SNPs were associated with ASE and gene expression in other tissues. To test the significance of the overlap a chi-squared test, similar to that used to compare datasets, was performed and the proportion of SNPs in which the same allele was associated with increased expression was calculated. This comparison was only carried out using the same tSNP and treating it as the driver SNP as well. Furthermore, the average similarity between ASE and eQTL mapping results of Angus and Holstein cattle liver samples in our study with the validation dataset was tested using a paired t-test. We expected to see more similarity between our Holstein RNA-Seq dataset and the validation data because they were the same breed and all were lactating cows.
Combining ASE and eQTL within and across RNA-Seq data
ASE and local eQTL analysis both detect cis eQTL, therefore combining the results from both analyses should improve the power of detecting variants influencing the expression of genes. So, the ASE and local eQTL mapping results were combined in a meta-analysis within each dataset. In addition, ASE, PO-ASE and eQTL mapping results across all datasets were also combined. Additional file 2: Figure S1 depicts all within and across RNA-Seq datasets analyses.
In the meta-analyses for any of the within or across RNA-Seq datasets, if we had multiple estimates (n) of the effect (Ui) of a SNP on expression of a gene, we calculated a weighted average of these estimates as follows: Each Ui was converted to a signed t-value (=Ui/si where si = standard error of Ui). Then the p-value of this t statistic was calculated (pi) using the appropriate degrees of freedom and the equivalent z value for this p was determined by z = Ф− 1(p) and Ф is the cumulative standard normal distribution. The z scores were given the same sign as the original effect Ui and finally z̅* was calculated with eq. 4 where z̅* is approximately distributed N(0,1).
$$ {\overline{z}}^{\ast }=\sqrt{\frac{{\left(\sum \limits_{i=1}^n{z}_i^{\ast}\right)}^2}{n}} $$
(4)
In pooled analyses we merged the tests in the combined datasets and kept the original effects of SNPs, so some SNPs could have multiple solutions.
GWAS
For the GWAS of 20 traits for Bos taurus cattle with BovineHD SNP genotypes we used ASReml and the same model as Bolormaa et al. (2014). In addition, the multi-trait GWAS results were obtained from the study of Bolormaa et al. (2014) [23]. The genotypes for 729,068 SNP passed the quality control filters as described by Bolormaa et al. (2013) [12]. However, the number of SNP with minor allele frequency (MAF) greater than 0.005 were not equal for all traits because not all cattle were measured for all traits.
An additional GWA study for RFI in Bos taurus cattle was performed on imputed genome sequence (24,041,262 SNPs with MAF > 0.001) with ASReml using the model and cattle described by Khansefid et al. (2014) [24].
Comparing QTL and eQTL
We compared GWAS with expression results to find whether SNP associated with conventional phenotypes (QTL) are also associated with gene expression in the local eQTL or ASE analysis. The hypothesis, that SNP significantly associated with the traits (p < 0.001) were more likely to be associated with gene expression (p < 0.001) than expected by chance, was tested using a chi-squared test (H0: QTL and eQTL are independent). However, the assumption of the chi-squared test is that the SNPs are independent. This is not warranted because nearby SNPs are likely to be in linkage disequilibrium. Consequently, significant SNPs for both traits and gene expression tend to be clumped together on the genome. To overcome the limitation of the chi-squared test 3 other tests were performed:
1) To derive a valid distribution of the Χ2df = 1 statistic under the null hypothesis, that eQTL and QTL are independent, a permutation test was used. The vector of GWAS results was shifted by 10 to 90% of number of SNPs, and the overlap between the QTLs in the new vector and the eQTL was calculated as a chi-square statistic. This was repeated 10,000 times to generate the distribution under the null hypothesis. Each replicate of the permutation tests can show either more overlap between QTL and eQTL than expected by chance (enrichment) or less overlap than expected by chance (impoverishment), so the replicates were divided into these two groups and separate distributions calculated for the Χ2 statistics according to whether the replicate showed enrichment or impoverishment. For example, the X2 statistics from the permutation test comparing the meta analysis eQTL (combined ASE and eQTL mapping results in all tissues) with QTL found in a multi-trait GWAS test, is shown in Fig. 4, where the permutations with impoverished overlapping were given negative signs. To find the appropriate thresholds from the 10,000 permutations, the Χ2 values for each replicate was calculated and, after changing the sign of impoverished replicates to negative, the Χ2 values were sorted from smallest to the largest. The thresholds that separate the top 5% and bottom 5% (i.e. Χ2 with rank 500 and 9500, respectively) are marked on Fig. 4 as blue dotted lines and were used to test the experimental X2 statistic and declare them as significant or not at p < 0.05 for a one-tail test since only enrichment is of interest. In Fig. 4 these thresholds are Χ2 = − 13.38 and 6.64 for significance level p < 0.05 in the tests for impoverishment and enrichment respectively. In the permutations experiments, the replicates with less than expected overlapping QTL and eQTL (impoverishment) were more frequent than replicates with enrichment because the GWAS results include a few clusters of QTL and shifting the position of the eQTL is likely to place them between clusters causing impoverishment. So, in the majority of permutation experiments, the calculated threshold for a significant impoverishment was a larger value than for significant enrichment. For comparison, the theoretical Χ(1)2 distribution is shown in red and the thresholds for 5% (− 3.84 and 3.84) for tests with impoverishment and enriched overlapping are annotated in red vertical dotted lines for significance level P < 0.05.
2) For each gene individually, SNPs were classified as significant or not for both gene expression and the trait leading to a 2 × 2 contingency table. The significance of the overlap between SNPs significant as QTL and as eQTL was tested by a Χ2 test.
3) To combine this within-gene information from (2) across all genes, a linear model was used to test the association between SNPs affecting the trait and SNPs affecting gene expression across the whole genome but on a within gene basis using model 5.
$$ \boldsymbol{y}=\mathbf{1}\upalpha +{\boldsymbol{X}}_{\mathbf{3}}{b}_3+{\boldsymbol{X}}_{\mathbf{4}}{b}_4+\boldsymbol{e} $$
(5)
y is an N × 1 vector, where y i = 1 if the SNP was significantly associated with gene expression (p < 0.001) and 0 otherwise, and N is the number of SNPs. 1 is an N × 1 vector of 1’s and ɑ is a scalar intercept. X3 is an N × M design matrix where M = number of genes, allocating SNPs to their corresponding gene, and b3 is an M × 1 vector with b3 i = the effect of gene i on gene expression. X4 is an N × 1 vector with X4i = 1 if the SNP had a significant effect (p < 0.001) on the complex trait and 0 otherwise and b4 is a scalar measuring the association between GWAS and gene expression results. e is the vector of random residual effects ~ N(0, Iσ2e).
A positive correlation between gene expression and GWAS results (b4 significantly positive) indicates that the QTL are more likely to be eQTL and its p-value shows how reliable the association is.