Mapping mammalian topological association domains to the bovine genome
Topological association domain (TAD) genomic coordinates from human and mouse embryonic stem cells (hESCs and mESCs) [14], human IMR90 fibroblasts (IMR90) [14] and mouse cortex [44] were obtained from Yue Lab (http://promoter.bx.psu.edu/hi-c/download.html). TAD genomic coordinates from the fresh or frozen liver of mouse and dog were provided as supplementary Additional file 1: Table S1 by Rudan et al. [16]. Genomic coordinate conversion files, mammalian reference genomes and mammalian genome annotations were all loaded through Bioconductor (version 3.2) [45].
All genomic coordinates of mammalian TADs were converted to the bovine reference genome Bos taurus UMD3.1 (UCSC Genome Browser assembly ID: bostau6) using UCSC Batch Coordinate Conversion program (liftOver) [46] (default settings), which was run inside R (version 3.2.4) using Bioconductor (version 3.2) [45] (AnnotationHub [47], rtracklayer [48] and GenomicRanges [49] packages). Genomic coordinate conversion files were required as an input to liftOver, but were absent for the conversion of some versions of reference genome assemblies. In those cases, an intermediate genome was used to convert the source TADs to bostau6. As a result, the hESC and IMR90 TADs were mapped from hg18 to bostau6 in two steps through an hg19 intermediate. The mESC and mouse cortex TADs were mapped directly to bostau6. The mouse liver TADs were mapped from mm10 to bostau6 through bostau7 (NCBI ID: Btau4.6.1.). The dog liver TADs were mapped to bostau6 from canfam3 through canfam2 (Additional file 9: Figure S2).
Converting large genomic segments inevitably created a large number of genomic fragments. We employed a recovery procedure, a filtering procedure, and a local refinement procedure to finalise the putative bovine TADs. Throughout this study, we specified “input” as the data that was input to the first step of the genomic conversion, “query” as the data that was input to each further step of the genomic conversion, and “target” as the output from each step of the genomic conversion. The recovery procedure was employed in each step of the genomic conversion as follows:
For each input TAD set, w was denoted as the minimum width of input TADs.
-
(1)
If TAD fragments from the same query sequence were no more than w away from each other in the target genome, the TAD fragments were merged into a single TAD in the target genome.
-
(2)
If TAD fragments from different query sequences overlapped in the target genome, the TAD fragments were kept as separated fragments in the target genome.
Genomic conversion in combination with the recovery procedure resulted in 6 sets of putative bovine TADs (stage 1) that were respectively from the 6 sets of input TADs. A filtering procedure was then applied to the putative bovine TADs (stage 1), in order to reduce TAD fragments of low confidence in each TAD set. The filtering procedure was performed as follows:
-
(1)
For each genomic position i in the bovine genome, the number of TAD sets that agreed i was in a TAD was calculated and denoted as ai.
-
(2)
If a putative bovine TAD (stage 1) was no less than 200 Kb wide, and contained at least one position i where ai was no less than 4, this TAD (stage 1) was kept in the TAD set (stage 2). The 200-Kb threshold was selected because it was able to effectively filter out the small homological genomic fragments which were incapable of forming a TAD in the bovine genome.
The filtering procedure resulted in 6 sets of putative bovine TADs (stage 2) that were respectively from the 6 sets of putative bovine TADs (stage 1). A local refinement procedure was then applied to merge any putative bovine TADs (stage 2) that overlapped more than 1 bp. The 1 bp threshold (exclusively) was used, because TADs were calculated from frequencies of interactions between genomic bins [14, 16]. This resulted in the end genomic coordinates of some TADs being the same as the start genomic coordinates of their downstream TAD. These ‘TAD1.TAD2’ structures were shown in the input TAD sets, and if no large genomic rearrangements occurred, the homologous counterparts of these ‘TAD1.TAD2’ structure were kept in the final bovine TAD set. However, when large genomic re-arrangements occurred, source TADs from the same TAD set could overlap to a great extent in the bovine genome. Each of these overlapping TAD fragments was > 200 Kb with high frequency of internal interactions, indicating that they could form into a larger interaction domain in the bovine genome, and therefore were merged into one putative bovine TAD. The local refinement procedure resulted in 6 sets of final bovine TADs that were respectively from the 6 sets of putative bovine TADs (stage 2). A schematic workflow of the TAD mapping procedure is provided as Additional file 9: Figure S2.
Scanning for putative bovine liver CTCF binding motifs
CTCF binding genomic coordinates from the liver tissue of human, mouse, dog and macaque were downloaded from E-MTAB-437 [25]. Mammalian CTCF binding sequences were extracted from masked reference genome hg19, mm9, canfam2 and rhemac2 respectively. Masked reference genome is reference genome where interspersed repeats and low complexity DNA regions are detected and concealed, and is required for motif discovery.
All mammalian CTCF binding sequences were provided as one input to MEME-ChIP [50] in order to identify evolutionarily conserved CTCF binding motifs. Parameter settings for MEME-ChIP were based on the MEME program settings in Vietri Rudan et al. [16], who also used the same CTCF ChIP-Seq data from Schmidt et al. [25]. We looked for 0 or 1 DNA motif per ChIP-Seq peak, and the maximum input dataset size was adjusted to 70 Mb to account for our input file size.
The latest version of JASPAR CORE motif databases (2016), JASPAR CNE motif database (2008), JASPAR POLII database (2008), JASPAR PHYLOFACTS (2008), UniPROBE [51], and Homo sapiens comprehensive model collection database (HOCOMOCO; version 10) were obtained from MEME Suite [27] (http://meme-suite.org/). These TF binding motifs that were made available to public databases were used to validate our CTCF binding motif profiles from MEME-ChIP outputs.
Since only the more conserved CTCF binding motifs are reported to TF binding motifs databases, and an increasing number of novel motifs are discovered as more CTCF ChIP-Seq data are made available, all motifs from MEME-ChIP were input to FIMO [52] (default settings) to scan for the occurrence of putative CTCF binding motifs in the bovine genome. We created two sets of putative bovine CTCF binding motifs from FIMO outputs which differed by P-value stringencies. A less stringent set of putative bovine CTCF binding motifs was selected by a motif P-value no larger than 10−5. A more stringent set of putative bovine CTCF binding motifs was selected by a motif score no less than 80 and a motif P-value no larger than 10−8 and where overlapping regions were merged. Both sets of putative bovine CTCF binding motifs were used for all downstream analyses in this paper unless otherwise noted.
Enrichment of biological hallmarks at topological association domain boundaries
Bovine transfer RNA (tRNA) genes and short interspersed element (SINE) annotations were all loaded through Bioconductor (version 3.2) [45]. Human housekeeping gene names [53] were downloaded as per paper and were used as the list of bovine housekeeping genes. Bovine genomic regions that were enriched for tri-methylation of lysine 4 on histone H3 (H3K4me3) and acetylated lysine 27 on histone H3 (H3K27ac) signals from bovine liver tissue were downloaded from E-MTAB-2633 [54]. Bovine genomic regions that were enriched for H3K4me3 signal from the tender and tough longissimus dorsi of Aberdeen Angus steers were downloaded from GSE61936 [55]. Putative bovine enhancers in homology with human and mouse enhancer databases from VISTA, FANTOM5 and dbSUPER were as described by Wang et al. [56].
Putative bovine TAD boundaries were defined as those genomic intervals, no larger than 400 Kb, between adjacent putative bovine TADs from the same TAD set. For those putative bovine TADs where the end position of the previous TAD was the same as the start position of the following (i.e. a knot), 400-Kb genomic intervals centring on the “knot” were selected as TAD boundaries. A schematic graphical representation of the TAD, TAD boundaries and ‘knot’ is provided as Additional file 10: Figure S3. The level of enrichment of the following biological hallmarks was tested at TAD boundaries from each set of putative bovine TADs: human housekeeping genes, bovine tRNA genes, bovine SINEs, putative bovine CTCF binding motifs, putative bovine enhancer regions in homology with those in VISTA, FANTOM5 and dbSUPER databases, bovine liver H3K4me3 regions, bovine liver H3K27ac regions, bovine tender and tough muscle H3K4me3 regions.
The enrichment analysis was run as a permutation test with 10,000 random repeats to test whether the biological signal overlapped significantly more with the putative bovine TAD boundaries than the rest of the genome. We denoted n as the number of base pairs of a biological signal that overlapped with the putative bovine TAD boundaries in the original dataset. In the permutation test, for each chromosome, we slid the TAD boundaries along the chromosome by M positions, where M was a number randomly selected between 1 and the length of the chromosome. After sliding, if a TAD boundary exceeded the range of the chromosome, we recycled the exceeding subset of the TAD boundaries to the start of the chromosome. Then we counted the number of base pairs of a biological signal overlapped with the slid putative bovine TAD boundaries and denoted this number as m. The fold change of enrichment was defined as the ratio of n to the mean of all m values in the 10,000 permutations. The ranking position of n within the distribution of all m values over all random samples, denoted as R, was determined, and a P-value to test the significance of the ranking was computed. For the largest n among all m values, the P-value was set to <0.0001 and otherwise it was \( \frac{R}{10001} \). We declared a biological signal “highly enriched” in a set of putative bovine TAD boundaries if n was larger than 95% of all m values. Our permutation tests resulted in 66 independent analyses (6 sets of finalised putative bovine TADs × 11 biological signals).
Testing allele-specific expression within regulatory units
Chamberlain et al. [31] observed runs of genes expressed in favour of the same parental chromosome. We argued that these runs of allele-specifically expressed genes could be located within the same TADs and or between the same CTCF binding motifs. To investigate this, we obtained allele-specific expression (ASE) data from Chamberlain et al. [31]. The ASE data was from 1 lactating cow’s 18 tissues. In each tissue, the ASE data was the number of RNA transcripts aligned to each parental allele at all heterozygous exonic positions. The heterozygous exonic positions were determined from the cow’s whole genome sequence variants [31], and mapping bias was taken into account following methods that were described by Degner et al. [57]. The ASE data was used to calculate ASE scores as follows:
$$ {Y}_i=\mathit{\log}10\ \left(\frac{P_i+1}{M_i+1}\right) $$
(1)
where Yi was the ASE score from a tissue at position i; Pi was the number of RNA reads from a tissue aligning to position i that had the paternal allele; and Mi was the number of RNA reads from a tissue aligning to position i that had the maternal allele. One was added to all read counts in order to obtain valid ASE scores even for the mono-allelic expressed SNPs, i.e. where only one of the two alleles were expressed. A positive ASE score meant that more parental than maternal alleles were detected among RNA reads at position i. A negative ASE score meant that more maternal than parental alleles were detected among RNA reads at position i. An ASE score of zero meant that equalled numbers of parental and maternal alleles were detected among RNA reads at position i.
We found that most heterozygous exonic SNPs were distributed within regulatory units. If most heterozygous exonic SNPs in the same regulatory units were from the same parental chromosome, the mean ASE scores of those SNPs would be significantly deviated from 0; alternatively, if heterozygous exonic SNPs were randomly expressed from parental chromosomes, the mean ASE scores of those SNPs would be close to 0. Analyses of variance (ANOVA) models were used to quantify this ASE variation (measured by mean ASE scores) from regulatory unit to regulatory unit. Three independent ANOVA models were fitted, where the ASE scores in a tissue were fitted as the response in all three ANOVA models, but the regulatory units in each model were different:
-
Model (1) was a TAD only model that defined a regulatory unit only by the putative bovine TADs from a TAD set.
-
Model (2) was a CTCF only model that defined a regulatory unit only by the CTCF gaps, which were the genomic intervals between the set of 2 CTCF binding motifs (motif score ≥ 80 and motif P-value ≤ 10−8).
-
Model (3) was a TAD+CTCF model that defined a regulatory unit by both TAD and CTCF gaps. The CTCF gaps were embedded within the putative bovine TADs from a TAD set. Model (3) tested if TAD or CTCF gaps accounted for more variation observed in the ASE data.
In each ANOVA model, all innermost genomic regions were required to contain no less than 5 SNPs that had valid ASE scores. We found majority of the valid ASE SNPs (> 94% in TAD only model, > 82% in CTCF only model, and > 73% in TAD+CTCF model) were distributed among different genes within the same regulatory unit. There were 108 independent ANOVA tests performed in model (1) which were from the combination of 6 sets of finalised putative bovine TADs and 18 bovine tissues, 18 independent ANOVA tests performed in model (2) from the combination of 1 set of CTCF gaps and 18 tissues, and 108 independent ANOVA tests in model (3) from the combination of 6 sets of finalised putative bovine TADs and 18 tissues. The significance of an ANOVA test was declared at P-value ≤ 10−8.
The permutation tests, with 10,000 repeats, were performed to test whether the observed ANOVA result in each model was random. In each permutation test, the ASE scores were shuffled across the whole genome and then the model was refitted with the permuted dataset. The R-squared value from the original dataset, R, was compared with the 10,000 null R-squared values from the random shuffles, R′. A significant ANOVA cohort was declared if R was larger than all R′ values. In a significant ANOVA test, a regulatory unit was declared to display significant ASE effects if the absolute value of the averaged ASE scores in the region, |n|, was larger than 99% of the absolute value of the averaged ASE scores in the permutations ∣m∣, i.e. false discovery rate (FDR) < 0.01, and the p-value for n was no larger than 10−6. There were 108 independent permutation tests performed in model (1), 18 permutation tests performed in model (2) and 108 permutation tests performed in model (3).
Enrichment of significant aseQTLs and eQTLs within putative bovine CTCF binding motifs
The heterozygous quantitative trait loci (QTLs) that were associated with allelic-specific expression (aseQTLs) and expression variation (eQTLs) were obtained from Chamberlain et al. [32, 33]. The aseQTL and eQTL mappings were performed using mRNA transcripts from 141 lactating cows’ white blood and milk cells, and mapping bias was taken into account following the methods that were described by Chamberlain et al. [32]. The aseQTL and eQTL data was a table of effect and P-value between an aseQTL/eQTL and its eGene target. The RNA sequence data is available from NCBI Sequence Read Archive (Bioproject accession PRJNA305942).
We hypothesized that significant cis-expressed QTLs were enriched at putative bovine CTCF binding motifs. We tested this hypothesis using aseQTLs and eQTLs from white blood and milk cells, both of which indicate cis-expression QTLs. The significant cis-QTLs were defined by a P-value threshold p, which was tested at 10−5 to 10−8. In each significant threshold, we selected a type of cis-QTLs (e.g. aseQTLs from white blood cells) whose P-value was no larger than p. In the case where multiple eGenes were found to be significantly associated with the same cis-QTL, the cis-QTL was only counted once. Then we calculated the number of significant cis-QTLs within putative bovine CTCF binding motifs, and denoted this number as n.
To test the frequency of significant cis-QTLs occurring in the rest of the genome, for each chromosome, we slid the significant cis-QTLs by M positions, where M was a number randomly selected between 1 and the length of the chromosome. After sliding, if the position of a cis-QTL exceeded the range of the chromosome, we redefined the position of the cis-QTL as the slid position subtracting the chromosome length. We denoted m as the number of those slid cis-QTLs within putative bovine CTCF binding motifs. The slide and recalculation were repeated 10,000 times. The fold change of enrichment was defined as the ratio of n to the mean of all m values from the 10,000 permutations. The ranking position of n within the distribution of all m values, denoted as R, was determined, and a P-value to test the significance of the ranking was computed. For the largest n among all m values, the P-value was set to <0.0001 and otherwise it was \( \frac{R}{10001} \). We declared significant cis-QTLs highly enriched at putative bovine CTCF binding motifs if n was larger than 99% of all m values (FDR < 0.01). Our tests resulted in 32 independent analyses (2 types of cis-QTLs × 2 cell types × 4 significant thresholds × 2 sets of CTCF binding motifs).
Testing frequency of each eGene and its most significant cis-QTL locating within the same TAD
We hypothesized that the most significant cis-expressed QTLs for each eGene fell more often within the same TAD as the eGene than not. We tested this hypothesis using aseQTLs and eQTLs, both of which indicate cis-expression QTLs. Our cis-QTLs were detected using mRNA transcripts from 141 lactating cows’ white blood and milk cells [32]. We selected the most-significant cis-QTL for each eGene within a TAD. The most-significant cis-QTL for each eGene was required to pass a significance threshold which was tested from 10−5 to 10−8. In the case where multiple cis-QTLs were ranked with the most significance level towards the same eGene, we broke the tie by selecting the cis-QTL that was linearly the furthest away from the eGene. The linearly furthermost cis-QTL was chosen in order to avoid bias favouring the most significant cis-QTL to be within the same TAD as the eGene. The result of this procedure was a number of cis-QTL and eGene targets, which occurred in the same TAD out of all possible pairs. The expected number of significant aseQTL/eQTL and eGene within the same TAD was defined as follows:
$$ E=N\times \overline{M}\times p $$
(2)
where E is the expected number of significant cis-QTL and eGene within the same TAD, N is the total number of eGenes overlapping a TAD, and \( \overline{M} \) is the maximum number of eSNPs in a chromosome that were used for the aseQTL/eQTL mapping, and p was the P-value threshold for the association between the eSNP and the eGene.
A chi-square test was performed to test whether the observed number of furthermost and most-significant cis-QTL and eGene within the same TAD was statistically significant or not. The chi-square test was defined as follows:
$$ {\chi}^2=\frac{{\left(E-O\right)}^2}{E} $$
(3)
where χ2 is the chi-squared value, E is defined as above, and O is the observed number of significant cis-QTL and eGene within the same TAD. A significant chi-squared test was defined as a corresponding chi-square test P-value <0.001. There were 96 independent chi-squared tests (2 types of cis-QTLs × 2 cell types × 4 significant thresholds × 6 sets of finalised putative bovine TADs).