A genome-wide scan of copy number variants in three Iranian indigenous river buffaloes

Background In Iran, river buffalo is of great importance. It plays an important role in the economy of the Country, because its adaptation to harsh climate conditions and long productive lifespan permitting its farming across the Country and to convert low-quality feed into valuable milk. The genetic variability in Iranian buffalo breeds have been recently studied using SNPs genotyping data, but a whole genome Copy Number Variants (CNVs) mapping was not available. The aim of this study was to perform a genome wide CNV scan in 361 buffaloes of the three Iranian river breeds (Azeri, Khuzestani and Mazandarani) through the analysis of data obtained using the Axiom® Buffalo Genotyping Array 90 K. Results CNVs detection resulted in a total of 9550 CNVs and 302 CNVRs identified in at least 5% of samples within breed, covering around 1.97% of the buffalo genome. and A total of 22 CNVRs were identified in all breeds and a different proportion of regions were in common among the three populations. Within the more represented CNVRs (n = 302) mapped a total of 409 buffalo genes, some of which resulted associated with morphological, healthy, milk, meat and reproductive traits, according to Animal Genome Cattle database. Conclusions This work provides a step forward in the interpretation of genomic variation within and among the buffalo populations, releasing a first map of CNVs and providing insights about their recent selection and adaptation to environment. The presence of the set of genes and QTL traits harbored in the CNVRs could be possibly linked with the buffalo’s natural adaptive history together to a recent selection for milk used as primary food source from this species. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-021-07604-3.


Background
The Asian water buffalo (Bubalus bubalis) and the African wild buffalo or Cape buffalo (Syncerus caffer) are the two main species of buffalo in the world [1]. The domestication of the Asian water buffalo with two subspecies i.e. the river (Bubalus bubalis bubalis 2n = 50) and the swamp (Bubalus bubalis carabanensis 2n = 48) buffalo, occurred about 3000-6000 years before present in the Indo-Pakistani area and in the vicinity of borders of China, respectively [2,3]. River buffalo is common in India, Egypt, Southwest Asia and Europe, and swamp buffalo is common in China and Southeast Asia [3,4].
According to the FAOSTAT data (http://www.fao.org/ faostat/en/?#data/QA accessed 2020/10/08) the proportion of buffalo population in Iran respect to cattle species was 2.5%. A very similar data is reported by Beldman et al. [5] who analyzed the dairy farming sector in Iran indicating a proportion of 3% of buffalo heads over the total population of cattle.
In the developing countries including Iran, river buffalo breeding is nowadays recognized of great importance because of (i) the ability of buffalo to convert lowquality feed to valuable milk, (ii) adaptation to harsh climate conditions and resistance to local parasites, (iii) long productive lifespan [6], and (iv) their potential in milk and meat production performances [7]. As described by Safari et al. [7] the main breeding activities in Iranian Buffalos are carried by the Animal Breeding Center of Iran and envisage milk recording and genetic evaluation of reproducers. In Iran, there are three main buffalo breeds including Azeri (from the north-west and north), Khuzestani from west and south-west, and Mazandarani (from north) [4,6]. The recently released buffalo SNP genotyping array has been reported as a suitable tool for studying genetic diversity of river buffalo breeds as well as a potential starting point for genome-wide association and genomic selection programs [1,8]. The genetic variability in these three breeds have been recently studied using SNPs markers. Davoudi et al. [9] investigated the haplotypic structure and genetic diversity in Khuzestani river buffalo, while Mokhber et al. [4] evaluated the genetic structure of the Azeri and Khuzestani breeds to identify genomic regions associated to different environmental conditions and production goals. Additionally, Ghoreishifar et al. [6] identified ROH in the Azeri and Khuzestani breeds.
Copy number variants (CNVs) are a source of structural variability that have been utilized to identify genetic variability among breeds in several species. Redon et al. [10] defined a CNV as a DNA segment of one kilobase (kb) or larger that is present at a variable copy number in comparison with a reference genome. Several studies have been performed in different species, finding that CNVs are also related to phenotypic variability [11][12][13][14][15][16] as well as disease susceptibility [17,18] describing up to 30% of the genetic variation in gene expression. Studies regarding the CNVs detection based on SNP chip data in buffalos were not available, consequently, the aim of this research was to perform a genome wide CNV mapping in samples of the three Iranian river buffalo breeds through the analysis of data obtained using the Axiom® Buffalo Genotyping Array 90 K.

Results
The PCA and the F ST results based on the SNPs genotypes confirms that the three populations are clearly differentiated (Fig. S1).
A total of 9550 CNVs (5154-53.97%deletions and 4396-46.03%gains) on all 24 autosomes were detected in the 361 samples (Supplementary Table S1). Table 1 reports the descriptive statistics for CNVs for each of the three buffalo populations, as well as the loss/gain ratio calculated considering the number of loss on number of gains. The largest loss/gain ratio was found in the MAZ breed (1.32) while the KHU showed a value close to the unit.
The graphical representations of CNV statistics are shown for each breed in Fig. 1. In details, the relationship between CNV count and the averaged total length of CNV for each individual is shown in Fig. 1a The graphical distribution allowed to identify a similar pattern of distribution for the samples belonging to the three buffalo breeds with few individuals counting a low number of CNVs with a high average length. Even when CNVs are classified according to classes of length (5 classes as in Fig. 1b legend), the three populations showed similar structure in CNVs. An exception was observed for the proportion of CNVs for the longer class of length (> 500 Kb), resulted higher in MAZ respect those identified for AZE and KHU breeds.
Overlapping CNVs across samples within population were summarized into 1678, 1060, and 257 copy number variable regions (CNVRs) for AZE, KHU and MAZ, respectively ( Table 2). The Supplementary Table S2 reports the list of CNVRs found in each population, together with the samples CNV count and states.
The total number of regions identified, as reported in Table 1, is possibly linked to the sample's population size, especially for singleton_CNVRs that in the MAZ breed result to be a very large proportion (73%) of identified CNVRs: it is more likely, in fact, to identify singleton_CNVRs in small populations as they are those not in common among individuals.
A total of 101, 133 and 68 CNVRs (n. 302 in total, 203 non redundant regions) are the regions defined by CNVRs mapped in at least 5% of samples (n. 12 -AZE, n. 5 -KHU, and n. 2 -MAZ). Non redundant regions include pop_CNVRs (i.e. non singleton regions within population) plus the common_CNVRs (i.e. common CNVRs across populations), counting these latter only once when found in more than one population. All the following statistics and graphical representation have been obtained using the pop_CNVRs, covering 28.40 Mb (1.14%), 29.91 Mb (1.21%), and 11,44 Mb (0.46%) of the buffalo autosomes total length, respectively. For each population, a graphical representation of CNVRs frequencies on autosome is shown, together with the mean CNVRs coverage length (Fig. 2). The number of detected CNVRs among chromosomes is uneven, and no correlation between chromosomes length and mean CNVRs length resulted in these populations. The mean CNVRs length is not uniform along all chromosomes, mainly for MAZ breed. Two peaks (mean length) are evident for all breeds: on chromosomes 13 the mean CNVRs lengths were about five (AZE), and three (KHU and MAZ) times higher respect to the total mean length calculated for all autosomes. The second peak, found on chromosome 11, is lower respect to the previous one, but observable in all breeds.
Within the 203 non redundant CNVRs a different proportion of common_CNVRs were observed among the three populations (Fig. 3 Table 3).
The pop_CNVRs represented about half of total CNVRs number, except for AZE for which the com-mon_CNVRs are more than twice that the pop_CNVRs.
A total of 234, 365, and 158 genes were annotated within AZE, KHU and MAZ CNVRs, respectively (corresponding to 409 non redundant buffalo genes) (Supplementary Table S3) and their functional classification according to DAVID database is reported in Supplementary Table S4 and in Table 4 (Nominal Pvalue < 0.10).
The Animal Genome Cattle Database was accessed to reveal the potential relationships between CNVRs and QTL. As shows in Fig. 3, in Table 3, and in Table 5, 59 genes resulted associated with a total of 80 different "Trait Name", grouped in 23 "Trait Type" (Anatomy, Blood parameters, Chemistry, Conformation, Disease, Fatness, Feed intake, Fertility, General reproduction parameters, Growth, Life history traits, Lifetime production, Limb traits, Mastitis, Milk compositionfat, Milk compositionprotein, Milk processing trait, Milk compositionother, Milk yield, Organ disorders, Semen  Results of PCA showed a spatial distribution of samples ( Fig. 4) due to the difference in CNV across the three pop_CNVRs. This is reflecting the overlapping among common_CNVRs and proprietary CNVRs or each breed as shown in Fig. 3 (A) in the Venn diagram.

Discussions
In Iran, breeding of river buffaloes plays an important role in the economy of the country, also impacting on social and cultural activities [7]. In this country the most common local buffalo breeds are the Azeri and Khuzestani. The Mazandrani breed is also farmed in Iran, but its numeric consistency of about 4000 individuals is lower respect to those reported for AZE and KHU breeds (i.e. 119,000 and 81,000 buffaloes, respectively) [4]. All these three breeds, different in morphology, have been undergone a different selection processes and are well adapted to different environments proper of the geographical areas of the country in which they are reared: Azarithe north-west-north of Iran (70%), Khuzestaniwest and south-west (22%) of the Country while the Mazandarani is farmed in the north (8%) region (http://www.fao.org/3/ah847e/ah847e00.pdf).
The genetic variability in Iranian buffalo breeds have been recently studied and findings based on the Linkage Disequilibrium (LD) obtained from the Affymetrix 90 K SNP genotypes [19] showed a close genetic relation between AZE and KHU due to a high LD consistency across the two populations and a lower similarity when the comparison involved MAZ with both AZE and KHU due to variability in the LD trend within breed. These results are consistent with those found in an additional Principal Component Analysis (Supplementary Figure  S1) based on SNPs genotypes showing a clear clustering of the three breeds. Additionally, the F ST statistics here calculated as in [3], shows a differentiation among the three breeds with similar F ST values: the lowest between AZE vs KHU (F ST = 0.017 our study, F ST = 0.021 [3]); an intermediate between AZE vs MAZ (F ST = 0.045 our  [3]); the highest between KHU and MAZ (F ST = 0.058 our study, F ST = 0.045 [3]).
To improve the knowledge on these populations, we investigated the genomic structure of the Iranian buffaloes through the analysis of the CNVs, in order to provide additional information that could be used for breeding and conservation programs of these populations, as concluded by Clop et al. [20].
A genome wide CNVs detection has been here performed resulting in a total of 9550 CNVs and 2.995 CNVRs in the three breeds. The AZE and KHU show about 50% of their CNVR as singleton, while MAZ breed 73%. The large singleton_CNVRs proportion for MAZ is most likely due to the reduced sample size that does not permit to identify a larger proportion of pop_CNVRs in this breed. Excluding singleton_CNVRs the CNVRs resulted 1605 in the three breeds. Among the 1605 pop_ CNVRs, 302 were identified across breeds in at least 5% of individuals within breed, covering about 1.97% of the buffalo genome (calculated considering the no redundant 203 CNVRs). Out of these latter CNVRs, 22 (10.8%) are those shared by all populations, identifying a common genomic structural background of among these buffaloes. In fact, when considering the common_ CNVRs counts (those shared among breeds), the highest number of common_CNVRs are found in AZE having the lowest proportion of pop_CNVRs (30.7%) respect to those identified for KHU (48.9%) and for MAZ (44.1%). MAZ breed shared a higher number of CNVRs with AZE respect to those resulted in common with KHU. This result suggest that the moderate level of admixture identified for MAZ and AZE by [6], is confirmed here and graphically visualizable in the PCA (Fig. 4), where MAZ buffaloes are mainly distributed close to AZE individuals. The PCA showed in Fig. 4 is also showing that CNVs are not the same across the three breeds as they do partially overlap: in fact only 22 CNVRs are in common among the three breeds, while 31, 65 and 30 were found only in the AZE, KHU and MAZ respectively, as showed by Venn diagram in Fig. 3. CNVs are in fact a different class of markers respect to SNP and it is expected that PCA is discriminating the population not in the same exact manner.
The most represented common_CNVR, defined by the CNVs identified in 209 buffaloes of all the three breeds (Table 3), mapped on chr13. It does not show a specific     CNV pattern of state in a particular breed, i.e. only gain or only loss, but is having a complex behavior. A part of this region harbors 21 LOC gene IDs annotated consecutively over the genome (i.e. from 14,252,993 to 17, 645,518 bp), all correspond to the same multidrug resistance-associated protein 4-like protein coding gene, also known as ATP-binding cassette sub-family C member 4 (ABCC4). This gene appears to be involved in several basic metabolic pathways including resistance/ susceptibility to intestinal nematodes [21], feed efficiency [22], and marbling score [23]. Recently, in a study conducted on dairy buffaloes, for the ABCC4 gene it has been also highlighted its possible contribution on reproduction traits, resulting among the top genes associated with number of services per conception [24]. Additionally, the expression of ABCC4 gene increases in pregnant cattle and pigs' endometrium and it could be important to support pregnancy given its role in prostaglandin efflux from cells [25,26].
Other two common_CNVRs were defined by CNVs called in more than 100 individuals. The first region mapped on chr8 (at 38,899,124-39,035,018 pb) and the second one on chr16 (at 49,791,674-50, 216,803 bp). Only this latter region, resulted complex in AZE and KHU and gain in MAZ, harbors genes (ABCC8, USH1C, MYOD1, OTOG, KCNC1, SERGEF), which all resulted close (i.e. surrounding 200 kb) to the significant SNPs affecting Nellore age at first calving [27]. Also, the MYOD1 gene is a component of myogenic regulatory factors involved in myoblast differentiation and in concordance with [28,29] results, this gene may potentially affecting meat production.
The CNVRs in which genes or QTL were not annotated are those identified in lower number of individuals (within breed). For these samples the CNVs defining CNVRs also resulted with a large proportion of complex state (21% for AZE; 17% for KHU, and 3% for MAZ, excluding singleton_CNVRs. Recent literature support that CNV may have a role in selection mechanism in addition to SNPs and that CNV changes may indicate that artificial selection may cause difference in genetics and phenotypes among breeds [30,31]. We could speculate that the lack of a strong directional selection for a specific trait did not favor the increase in copies for specific genes as occurred in human [32], dogs [33] or polar bears [34], where the dietary shift produced an increase in DNA copy number where the AMY gene family was involved, an example of positive selection on CNV. In livestock populations where the directional selection is not focused towards a unique direction or for a specific trait, the proportion of complex CNVRs is comparable to the ones here found. This is the case of creole cattle in Mexico where the proportion of complex CNVRs was 16% [35] and of the Aosta Red Pied (Valdostana Red Pied) where the selection goal contemporarily focuses on milk, meat and adaptation to summer pasture practice, i.e. adaptation to harsh environment [36]. A further evidence is provided in the avian species as for strongly selected populations, as in chicken, where a very low proportion of complex CNVRs, 0 to 5% was found, while in a the non-selected Mexican creole poultry population the proportion of complex CNVRs was up to 14% [14].
Within the CNVRs here found in gain in state (i.e. defined by CNVs all duplicated), several annotated genes with a well-known associated phenotype in cattle or in other species were found in addition to those reported in Tables 3 and 5. Regarding the PCSK6 gene there are evidences of an association with follicle development in human [37], while for GALNTL6 with feed efficiency and growth traits in cattle [38], with saturated fatty acids profile in intramuscular fat of the longissimus thoracis muscle of Nellore [39], and with cow and heifer conception rate [40]. For the same breeds, gain regions also harbored GALNT6 and FYN genes, both involved in reproduction traits in cattle (https://www.teagasc.ie/ m e d i a / w e b s i t e / p u b l i c a t i o n s / 2 0 1 0 / FertilityGeneExpression_5517.pdf) and in mouse [41], respectively. The FYN gene, considered by [42] as thermotolerant gene, seems to have a role in cow conception and early embryo development in cattle. Finally, PCSK6 and SNRPA1, together with PLXNA2, mapping within an AZE gain CNVR, resulted lying within a positive selection signature region identified using SNP as genetic markers in creole breeds [30]. The PLXNA2 gene has been also associated with cattle temperament [43]. For the MAZ breed, there are few gain CNVRs harboring genes: this can be related to the very low number of samples available in this study. Among these regions, the one located on chr1 overlaps EPHB3, a gene resulted associated with muscling at weaning (MW) and muscling at yearling (MY) in Bos indicus populations [44]. Also, EPHB3 maps in a selection signature region identified by V ST analysis based on CNVs performed through a comparison between Valdostana Red Pied vs Italian Brown Swiss, a double proposal and a dairy cattle breed, respectively [36].
According to the annotation analysis performed with DAVID Database, buffalo CNVRs are enriched in genes (n = 334 recognized IDs) mainly involved general biological processes (Supplementary Table S3). Also, a total of 80 different "QTL_Trait-Terms" associated with 59 genes have been identified and classified in 6 major QTL trait categories (Fig. 4 and Tables 3 and 4), of which the most represented are Milk (i.e. Milk compositionfat and -protein) and Production Traits (i.e. Growth) ( Tables 3 and 4, Supplementary Table S3). We did not observe a prevalence of a particular QTL_Trait Term (taking in to account the differences in CNVRs and annotated genes counts) in one of the three breeds, except for AZE -General Milk-and for MAZ -Reproductive QTL_Trait Terms.

Conclusions
The knowledge of genomic variation in the water buffalo species is still very limited and most of the recent findings still rely on comparison with cattle species. This work provides a step forward in the biological interpretation of genomic variation in the buffalo species. As the CNVs are known to be mostly non-neutral markers, these results may contribute to interpret genomic variation within and among the buffalo populations, that can be used to provide insights about their recent selection and adaptation to environment.
We may speculate that the presence of the set of genes and QTL traits harbored in the CNVRs here mapped could be linked with the buffalo's natural adaptive history, i.e. to their ability to adapt to diverse and severe environmental conditions (different for AZE and MAZ respect to KHU) and may be occurred because these bovid in recent time have started a selection program for milk yield, that is a primary food source from this species.

Sampling
A total of 384 Iranian Buffalo raw genotyping data (i.e. cel.files) obtained using the Axiom® Buffalo Genotyping Array 90 K (Affymetrix, Santa Clara, CA, USA) were available for the three breeds [4,6,9] and utilized as input files in order to obtain the Log R Ratio (LRR) values for each sample. As extensively described in [4] and in [19], the SNPs data comes from populations sampled with the aim to be representative of each breed. The AZE breed (252 samples) was sampled in the East and West-Azarbaijan, Ardebil and Gilan provinces, the KHU (110 samples) in Khuzestan and Kermanshah provinces, and the MAZ (22 samples) from Miankaleh wildlife sanctuary of Mazandaran province.
A quality control of raw intensity files using the standard protocol in the Affymetrix Power Tools package (www.affymetrix.com) was performed in order to guarantee a high quality of obtained data. Individuals with a value of call rate less than 97% and Dish Quality Control less than 82% were removed. After quality control a total of 9 low quality samples have been identified and not used.
The marker positions of the Genotyping Array 90 K array were recently updated and based on the newly released University of Adelaide water buffalo assembly (UOA WB v. 1; https://www.ncbi.nlm.nih.gov/assembly/ GCF_003121395.1).
A total of 35,114 SNPs was filtered out, in part for their undefined chromosomal locations, and in part because a proportion of SNPs had more than one nearby probeset: for these latter the Axiom Analysis suite picked the best performing probeset per each SNP. A total of 70,230 SNPs was then retained in the analysis mapped on the Bubalus bubalis (UOA_WB_1) genome assembly.
To confirm the clustering of individuals to the three populations, the genetic diversity within and among breeds was explored using SNP genotypes by Principal Component Analysis (PCA) and by the pairwise Fixation Index obtained according to the pipelines of Golden Helix (SVS) 8.8.4 software (Golden Helix Inc., Bozeman, MT, USA).
This study did not require approval from the Animal Care and Use Committee as we use already available data obtained in previous researches compliant [4,6,9] with rules and regulations for animal sampling.

CNVs, CNVRs detection and subsequent analyses
The CNV detection was performed on the 24 autosomes, using the Copy Number Analysis Module (CNAM) of SVS (Golden Helix, Bozeman, MT, USA), through the univariate analysis based on LRR values obtained using the Axiom® CNV Summary Tool software (www.affimetrix.com). A quality assurance of LRR raw data and filtering of outlier samples was performed before CNV calling with the SVS software through: i) the overall distribution of derivative log ratio spread (DLRS) values as described by [45]; ii) the GC-wave factor (GCWF) that measures the GC-content causing the fluctuation of a signal intensity file [46]. A total of 14 samples were excluded during the quality assurance because of their high DLRS and GCWF values. Consequently, the CNV mapping was performed on a final dataset of 361 samples (n. 242 -AZE, n. 100 -KHU, and n. 19 -MAZ) including as parameters in CNAM: maximum 100 segments per 10,000 markers; minimum of 3 marker per segment; 2000 permutations per pair with a p-value cut off of 0.005.
CNV regions (CNVRs) at population level were obtained by merging CNVs that overlapped by at least 1 bp in at least two animals using the -mergeBed command of Bedtools [47]. CNVRs were then cataloged as gain, loss and complex (i.e. CNVRs comprising both gain and loss CNVs) regions. A CNV identified in only one individual was classified as singleton_CNVR. Finally, the "-intersectBed" command of Bedtools software was employed to catalogue CNVRs as "pop_CNVRs" and "common_CNVRs" if they have been mapped in only one population (no intersection among CNVRs identified in more than one breed) or they resulted in common among breeds (partat least 50%or full overlapping), respectively. Only CNVRs found in at least 5% of the individuals of a breed were considered to infer statistics at population level and for the gene annotation.
The complete list of buffalo protein coding genes was downloaded from NCBI online Database (https://www. ncbi.nlm.nih.gov/genome/browse/#!/proteins/791/374666 |Bubalus%20bubalis/). Genes with official "gene name ID" and LOC genes associated with a protein coding gene name (excluding uncharacterized ones) were annotated within the detected CNVRs using the Bedtools "-inter-sectBed" command. Gene Ontology terms (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analyses were performed using the DAVID Bioinformatic Database (https://david.ncifcrf.gov).
As the Quantitative Trait Loci (QTL) database for the buffalo species is not available, the QTL associated to the genes here found in the CNVRs were identified in the cattle QTL database (QTLdb: https://www. animalgenome.org/cgi-bin/QTLdb/BT/search) by gene name, using the "Search by associated gene" option of QTLdb.
In order to disclose diversification of the three buffalo breeds based on CNVs, a Principal Component Analysis (PCA) using Past software [48]. Sample-CNV genotypes were coded as "deletion", "duplication", and "normal" states for each of the identified CNVRs and used in the PCA analysis.