Investigation of ancestral alleles in the Bovinae subfamily

Background In evolutionary theory, divergence and speciation can arise from long periods of reproductive isolation, genetic mutation, selection and environmental adaptation. After divergence, alleles can either persist in their initial state (ancestral allele - AA), co-exist or be replaced by a mutated state (derived alleles -DA). In this study, we aligned whole genome sequences of individuals from the Bovinae subfamily to the cattle reference genome (ARS.UCD-1.2) for defining ancestral alleles necessary for selection signatures study. Results Accommodating independent divergent of each lineage from the initial ancestral state, AA were defined based on fixed alleles on at least two groups of yak, bison and gayal-gaur-banteng resulting in ~ 32.4 million variants. Using non-overlapping scanning windows of 10 Kb, we counted the AA observed within taurine and zebu cattle. We focused on the extreme points, regions with top 0. 1% (high count) and regions without any occurrence of AA (null count). High count regions preserved gene functions from ancestral states that are still beneficial in the current condition, while null counts regions were linked to mutated ones. For both cattle, high count regions were associated with basal lipid metabolism, essential for survival of various environmental pressures. Mutated regions were associated to productive traits in taurine, i.e. higher metabolism, cell development and behaviors and in immune response domain for zebu. Conclusions Our findings suggest that retaining and losing AA in some regions are varied and made it species-specific with possibility of overlapping as it depends on the selective pressure they had to experience. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-021-07412-9.


Background
Divergence and speciation result from long periods of adaptation, selection, and genetic drift after separation of subpopulations. Separation forces individuals to adapt within the current isolated environment and gradually differ from the initial population. Various methodologies and theories have been proposed in efforts for deciphering this process since 19th century [1].
Recently, the availability of whole genome sequences (WGS) has become of increasing importance in genetic studies [2]. In cattle studies for example, WGS data of various breeds have been used for inference of demographic history, identi cation of production traits, calculation of effective population size, estimation of genetic relationships, and population structure analysis [3][4][5].
In closely related species, most of protein-coding genes are strongly conserved. Orthologous genes derive from a common ancestor and may present variation between species due to divergence [6]. Further down the speciation event, these genes can appear in different sites of the karyotype of each respective species. For example, the orthologues of ULK4, which is located on human chromosome 3, are found on chromosome 9 of rat (Mus musculus), chromosome 2 of Rhesus monkey (Macaca mulatta), and chromosome 8 of Norway rat (Rattus norvegicus) [7].
Alleles having diverged through mutation are called derived alleles (DA), while alleles that persist in their initial state are termed ancestral alleles (AA) [8]. A reasonable method to assess AA is by comparing shared polymorphic sites of closely related species. Alleles that are still intact and shared by all the related species are most likely the ancestral allele [9]. Another method consists of verifying the allelic state of the last common ancestor (LCA) or the allele within current populations that least differs from the LCA [10].
In a study of autosomal single nucleotide polymorphisms (SNP) in pig, ancestral and derived allelic states of SNP were inferred using four Sus species (Sus celebensis, Sus barbatus, Sus cebifrons, and Sus verrucosus) and one outgroup species of African warthog for focal species of Sus scrofa [11]. In human studies, the out-group species for inferring AA are primates, namely orangutan (Pongo sp.), macaques (Macaca sp.), gorilla (Gorilla sp.), and bonobos (Pan paniscus) [12]. In a cattle study of Utsunomiya et al. (2013) using HD-SNP, Gaur (Bos gaurus), water buffalo (Bubalus bubalis) and Yak (Bos grunniens) were utilized as focal species for cattle.
De ning the ancestral and derived states at polymorphic nucleotide sites is required to test proposed hypotheses regarding molecular evolution processes, such as estimation of allele ages, formation of linkage disequilibrium (LD) patterns and genomic signatures as a result of selection pressures [13,14].
Human WGS studies bene t from AA database for population analysis, but such a database is lacking in cattle. Consequently, each study repeatedly generates its own putative AA list [5,12,15]. Therefore, the goal of this study is to ll this gap and to determine a xed set of AA in cattle by using outgroup species in the Bovinae subfamily, namely gaur, yak, bison, wisent, banteng, and gayal sequences. In addition, we scanned the list of AA for physical regions linked to conserved and mutated traits in taurine and zebu cattle.

Read alignments and Principal Component Analysis
We evaluated alignment results of different species within the Bovinae subfamily against the latest cattle reference sequence ARS-UCD1.2 [16]. On average, the genome was covered by ~ 5x for banteng, taurine cattle, European bison, gayal, and yak, ~4x for American bison and zebu cattle, and ~ 3x for aurochs.
Principle component analysis (PCA) formed clusters and separation of individuals among these nine groups (Fig. 1). Four principal components (PC) explained 36.7%, 24.9%, 20.5%, and 17.7% of the variance for rst, second, third, and fourth PC, respectively. Projected by the PC1 and PC2, these Bovinae individuals are clustered together with its closest relatives evidencing genetic relatedness within its subspecies. PC1 explains divergence of cattle (aurochs, zebu, and taurine), from the rest. PC2 gives divergence between cluster containing gagaba from clusters containing yak and bison. Thus, we can group these individuals into four, namely cattle-aurochs cluster, gagaba cluster, bison cluster, and yak cluster. Outlier individuals, i.e. two gayals and the American bison, may indicate individuals carrying introgression from cattle.

Phylogenetic trees
Maximum Likelihood phylogenetic trees were constructed for each chromosome [see Additional le 1]. Inferred trees were all similar with Fig. 2 below displaying the tree from chromosome one. In concordance with the principal component analysis, 13 yak individuals are situated together in the top clade of the tree. European bison and American bison have the same node of ancestor, with American bison perceived to be more ancestral. This is in line with a previous study where sister relationships were indicated between American bison and European bison and also between bison clade and yak [17]. Banteng-gaurgayal share a clade together, however, variations in the order within these three species exist in trees inferred from different chromosomes [see Additional le 1]. Zebu cattle reside on the same upper node with the taurine cattle group. Each breed of taurine cattle is well clustered together except for several Holstein individuals. Based on all trees, we de ned yak as the most distant relative as it is positioned on the furthest node from cattle.

Inferring ancestral allelic states
The main output of this paper is a list of de ned ancestral alleles for cattle [see in Additional le 2]. This list is necessary for several tools used for studying selection signature such as iSAFE, iHS, xp-EHH, EHHST, and hapFLK [18][19][20][21][22][23] which were built for human population genetics study. We provide this dataset as a foundation for future comparisons of selection signatures in various cattle breeds. It is stored in a simple format of .txt and comprised of 6 columns of chromosome, position, number of alleles, de ned ancestral allele, frequency, and which groups agree on the de ned ancestral allele. AA were determined as alleles that are xed in two of three outgroup lineages. Using allele frequency over all individuals in outgroup, we de ned ~ 32.4 million variants that are xed across 29 chromosomes as AA corresponding to 1.2% of the total genome. A full list of these alleles with their positions is provided [see Additional le 2]. As in Fig. 1, 3.75 million alleles were de ned as ancestral from all three lineages of bison, yak, and gayal-gaur-banteng (gagaba). GC content percentage of ancestral alleles is 58 percent, which is higher than the GC content of the reference genome (~ 42%). Yet, it is worth noting that 22% of these AA are within active transcript regions.
Windows with high ancestral allele counts in taurine and zebu cattle We counted AA by non-overlapping windows of 10 Kb in taurine and zebu cattle separately. Ancestral counts for the top 0.1% are beyond the mean plus three standard deviations. For taurine cattle, the lowest chromosome speci c threshold for ancestral count was 122 on chromosome 25 while the highest was 302 on chromosome 14, while for zebu cattle, it was 102 in chromosome 1 while the highest 200 on chromosome 12. The trends for both groups were similar as shown in Fig. 6. Taurine cattle has mostly higher thresholds implying there are more windows with higher counts of AA compared to zebu cattle.
Windows without the occurrence of ancestral alleles We found 3306 windows without AA in taurine and 2189 windows in zebu. The highest ratio of windows with null AA counts to total windows was 2.9% on chromosome 29 in taurine and the lowest is 0.14% in chromosome 25 of zebu cattle (Fig. 7). Overall, taurine has more windows without AA except for chromosome 1, 8, 10, and 27. Windows without AA could be explained by a lack of de ned AA from outgroups, meaning, there were no xed alleles that can be found in at least two lineages. Another reason could be that derived alleles are now the major alleles on polymorphic sites, therefore we could not nd AA within these windows. In taurine cattle, 65% of windows without AA are due to the latter reason, while in zebu it is 46%.
Annotation of scanning windows with high number of ancestral alleles We annotated each scanning window passing the respective threshold of top 0.1%, corresponding to 255 regions in taurine and 258 regions in zebu across 29 chromosomes. These regions contained 20 genes in taurine and 40 genes in Zebu. Both groups retained genes functioning in arachidonic acid secretion (GO:0050482), phospholipid metabolic process (GO:0006644), and lipid catabolic process (GO:0016042) indicated by LOC100125947 and PLAG2A, as shown in Table 1. These three terms are mainly functioning in primary metabolic process of lipid. Function of defense response to bacterium (GO:0042742) was exclusive to taurine. DEFB genes family in GO:004742 were secreted by leukocytes and epithelial tissues.
It is known for its function similar to antimicrobial defense by penetration to microbial's cell membrane and cause microbial death [24]. While calcium ion imports (GO:0070509), represented by SLC8A1 and CACNA1D, was exclusive to zebu de ned as function of maintaining and transporting cellular entity in a speci c location. GO terms de ned for taurine and 7 GO terms for zebu. Among those, three terms were found in both, i.e. two antigen processing terms (GO: GO:0002474 and GO:0019882) and negative regulation of endopeptidase activity (GO:0010951).
In taurine cattle, apart from terms related to immune system process and cellular function, there are GO terms exclusive to taurine cattle that are related to production traits. For example, GO:0008654, GO:0043410, GO:0045725, GO:0060048, GO:0008016, are related to metabolic process of phospholipid, protein, glycogen, and regulation of muscle and heart contraction. GO:0007613 and GO:0035176 are related to mental information processing systems and is part of learning or memory abilities which can affect cognition and behavior as indicated by CRTC1, TH, ITPR3, DBH, SORCS3 genes. ITPR3 is known as well for process of sensory perception of taste. CRTC1 gene in human has highest transcript expression in brain compared to other tissues and is known for affecting eating behavior [25]. Regions without AA in zebu were mainly related to 5 GO terms in domain of immune response and one term related to cellular process of transmembrane transport. Figure 8 represented distribution of terms found in regions without AA. It is dominated by metabolism terms in taurine and immune response in zebu.

Discussion
We forced mapping short read sequences of different species within Bovinae subfamily into the latest cattle RefSeq ARS-UCD1.2 irrespective of their actual genome structure. Phylogenetic trees were built based on the SNP variants in autosomes. We used subsets of all variants per chromosome to comply with maximum 50000 markers/sequences per output of the analysis as directed by the software [26].
Despite an unequal number of individuals representing each group, we could infer relationships based on variant similarity and de ned four lineages of yak, bison, gagaba and cattle. Even though still related, none of outgroups were in ancestor-descendant relationships apparently.
De ning AA by only single lineage was not an option since any of the current lineages could have undergone independent evolutionary events and might have diverged from the initial ancestral state. Alleles were set to be ancestral strictly if it is xed and shared by at least two lineages of yak, bison and gagaba complying to other similar studies [9,15].
Scanning windows of 10Kb was chosen after a preliminary comparison between 1Kb, 10Kb, and 50 Kb windows and considering the average gap between high density markers of 4Kb in identifying different types of selection in a previous study [27]. Ancestral allele counts within scanning windows in taurine and zebu cattle varied in the genome. We took two extreme ends of the occurrence distribution; one is windows with the top 0.1% highest count and second is windows without ancestral allele count. Based on the knowledge that mutation occurs across autosomes with different rates on different scales [28], we expected ancestral allele frequency to be changing as the mutations emerge. Thus, we assumed windows with highest count of AA are the conserved ones while windows without AA are the ones containing relevant mutations, considering important traits or genes that were retained along evolutionary process [6,8].
Regions with high ancestral counts have GO terms related to primary metabolic process of lipid in both cattle. Genes within these GO terms are likely retained in ancestral states because their basic function are still bene cial. Despite different environments, both cattle need to store energy e ciently in form of lipids. Although cattle diet usually contains two to four percent lipid, it contributes up to 50% of fat in milk and the most concentrated source of energy. In contrast to human, where liver is the primary site, fatty acid synthesis occurs at adipose tissue in ruminants [29,30]. Adipose tissue acts as reservoir for e cient energy storage in allowing cattle and mammals in general for surviving adversities such as food shortages during severe winter for taurine or drought for zebu [31]. Defense response to bacteria (GO:0042742) was detected from regions with high ancestral counts in taurine, but found in regions without AA in zebu. In taurine high count regions, DEFB7 and DEFB3 are within this term, while regions without AA in zebu are DEFB6, LOC781146, DEFB1, DEFB3.
For regions without AA where expected mutation occurs, GO terms may have correlated and not necessarily independent from each other as pointed by its function. For grouping, we used the prevalent terms within ancestor charts in quickGO. In taurine, terms are related to behavior, cellular functions, tissue development, immune system, metabolism, and stimulus response. These are in line with suggestion from previous study for likelihood of genes function without AA and positive selection [32]. Within this scope, more GO terms found in taurine cattle compared to zebu possibly due to more intensive selection for production traits. Aiming for higher growth rate, carcass quality, feed e ciency, calving interval, milk production and body conformity has directed animals to be more e cient with higher metabolism rates [33][34][35]. These selection events might not only be affecting a narrow-region of genome. Instead, it altered several regions simultaneously as production traits are complex involving many QTLs or regions across chromosome with small contribution by each for the expression [36,37].
In zebu, mutated regions were mainly linked to GO terms of immune response and little to cellular functions and metabolism. Concordance to suggested previously where zebu has been bred to adapt with more marginal production environments compared to taurine [38,39]. Evidences showed different in relative importance on innate and adaptive immune response towards cattle tick Rhipicephalus microplus infestation between zebu and taurine. Skin in ammatory response by high secretion of granulocytes and T-lymphocytes in taurine is not necessary could cease tick invasion. But, an earlier in ammatory response and secretion of an alternate non-volatile T-cell in zebu were more e cient in repel this tick invasion [40,41].
Nevertheless, not all genes within previously mentioned GO terms can be linked directly to positive selection. As mentioned in previous study, BOLA gene families, which we found also in regions without AA, are a result of balancing selection aiming for preserving genetic diversity as heterozygous animals have more advantage than the homozygous ones [27]. Similarly, we cannot con rm whether genes here are main targets of selection or as hitchhiking effect from genes of interests. For example, genes within GO:0007613, related to behavior memory and taste preferences, could be intended for selection because breeder preferences of tame, good mothering ability and non-picky animals in terms of feed and housing. Alternatively, it could be indirectly selected because animals have to cope with commercial environment as suggested that behavioral patterns were altered for animals in pasture and con nement cases [42,43].
Our ndings suggest that retaining and losing AA in some genes or regions are varied and made it species-speci c with possibility of overlapping as it depends on the selective pressure they had to experience. Future work in nding overlapped domains detected by different tools for selection signatures would con rm speci c regions/functions peculiar for each various cattle breeds.

Conclusions
We inferred ancestral alleles by combining xed alleles in three lineages of cattle outgroups. Regions conserving more primitive functions indicated by high count ancestral alleles were linked to lipid metabolism in taurine and zebu. Meanwhile, regions undergone mutation indicated by no preserved ancestral alleles were found more on taurine than zebu. These regions were linked to production traits in taurine and robustness traits in zebu.

Methods
Dataset WGS of different (sub)species were obtained from NCBI database as in raw format of fastq le ( Table 2). Taurine cattle group was represented by several commercial breeds, i.e. Holstein, Angus, Jersey, and Simmental. Work ow of the ancestral analysis pipeline is shown in Fig. 9. Alignment and Variant calling Following Best Practice procedure by Genome Analysis Tool Kit [49][50][51], single interleaved data sets of FASTQ from each individual were not trimmed based on phred score, because GATK tool takes care of these low quality reads on later step during recalibration process. Datasets were mapped against the cattle reference sequence ARS.UCD-1.2 [16] using BWA-MEM [52] with default parameters. The raw mapped reads were sorted by chromosome position using SortSAM function. Sorted BAM les then underwent duplicates marking using Picard MarkDuplicates. Base Quality Score Recalibration (BQSR) was carried out to adjust the base scores towards various possibly systematic errors. BQSR required supporting les, such as known variant sites in vcf format [44], index and dict les of reference sequence created by using Samtools [53]. Report le in table form was needed for the next step of ApplyBQSR with an output of analysis ready BAM les. Analysis ready BAM les were individually called for variants using HaplotypeCaller with GVCF mode for preparation in cohort analysis work ow. Individual VCFs then combined using CombineGVCFs and went through joint-call cohort for GenotypeGVCFs. SplitVCFs tool was used to split SNPs and Indel variants from cohort VCF le. SNP variants were ltered out for parameter of mapping quality less than 40, QUAL less than 30 and quality by depth less than 30. Header editing of vcf les and splitting by each chromosome were done using bcftools and vcftools.
Principal Component Analysis Multisample VCF le was converted to binary plink format using VCFtools. The indep algorithm in PLINK [54] was used with default parameters of 50 variants window size units shifting for every 5 variants with pairwise r 2 threshold of 0.7. This step selected a set of independent variants for reducing redundancy. Then, we set four components to reduce dimension of the whole independent variants and plotted the species based on the rst two components.

Phylogenetic trees
We constructed phylogenetic trees from autosomes of our species similar to other studies, so called phylogenomes [55,56]. SNPhylo [26] processed original multisample VCF les of chromosome 1 to 29 separately to reduce redundant variants based on LD. Parameters were set to 0.1 Low Coverage Samples (PCLS), depth coverage of two, 0.9 LD threshold, 0.1 minor allele frequency and 0.1 missing rate. These parameters were set to meet the maximum variants output by the program and roughly reduce the variants to 10 percent in output fasta. MEGA X built initial tree using Maximum Parsimony method and inferred nal phylogenetic trees for each chromosome by using Maximum Likelihood method and Jukes-Cantor model with 200 bootstraps [57,58].

Inferring ancestral allelic states
VCFtool was used to call allele frequency spectrum from un-prunned VCF les. Considering branches in phylogenetic trees and clusters of PCA, we de ned three lineages of cattle outgroup, i.e. Yak, Bison(American bison and European bison), and Gagaba (Gayal-Gaur-Banteng). For each site, frequency of two alleles of A and a represented by p(A) and q(a) frequency. If p(A) frequency of 1 and found in at least two lineages, we de ned "A" allele as ancestral for that site.
We used R [59] to create list of these de ned AA for all autosomes. Following packages in R were used to support data analysis and visualization: dplyr [60], ggplot2 [61], and stringr [62]. R functions for calling the ancestral allele in this study are provided in [63].
Comparison to cattle groups A custom script was used to compute summary statistics of allele frequencies and to compare which AA are still intact in zebu and taurine cattle. Notation 1 below, de ning how we calculated , the changing frequency of ancestral allele compared to cattle group: Annotation region of interest Physical regions indicated by previous step were taken as input for ANNOVAR [64]. We then excluded regions that are fall in the intergenic, downstream and upstream of known genes, leaving only regions that overlapping with functional genes. We ltered out genes de ned by highest count regions if were found also in regions without ancestral counts. We used this list of genes for GO analysis using DAVID 6.8 [65,66]. We report GO of biological process with the Bonferroni corrected P-values. De nition and supporting information related to GO were retrieved from database of European Bioinformatics Institute [67]. Availability of data and materials Accession numbers for raw sequencing reads can be found in Additional le 5. Custom R functions are available through repository [63]. Protocol of analysis is provided [68].

Competing interests
The authors declare that they have no competing interest.
Authors' contributions GM conceived and designed the study. BDR and MMN coordinated the input dataset. MMN run the analysis and drafted the manuscript. YTU, BDR, JS, and GM interpreted the analysis results and critically revised the manuscript. All authors reviewed and approved the nal manuscript.