Genomic population structure and prevalence of copy number variations in South African Nguni cattle

Background Copy number variations (CNVs) are modifications in DNA structure comprising of deletions, duplications, insertions and complex multi-site variants. Although CNVs are proven to be involved in a variety of phenotypic discrepancies, the full extent and consequence of CNVs is yet to be understood. To date, no such genomic characterization has been performed in indigenous South African Nguni cattle. Nguni cattle are recognized for their ability to sustain harsh environmental conditions while exhibiting enhanced resistance to disease and parasites and are thought to comprise of up to nine different ecotypes. Methods Illumina BovineSNP50 Beadchip data was utilized to investigate genomic population structure and the prevalence of CNVs in 492 South African Nguni cattle. PLINK, ADMIXTURE, R, gPLINK and Haploview software was utilized for quality control, population structure and haplotype block determination. PennCNV hidden Markov model identified CNVs and genes contained within and 10 Mb downstream from reported CNVs. PANTHER and Ensembl databases were subsequently utilized for gene annotation analyses. Results Population structure analyses on Nguni cattle revealed 5 sub-populations with a possible sub-structure evident at K equal to 8. Four hundred and thirty three CNVs that formed 334 CNVRs ranging from 30 kb to 1 Mb in size are reported. Only 231 of the 492 animals demonstrated CNVRs. Two hundred and eighty nine genes were observed within CNVRs identified. Of these 149, 28, 44, 2 and 14 genes were unique to sub-populations A, B, C, D and E respectively. Gene ontology analyses demonstrated a number of pathways to be represented by respective genes, including immune response, response to abiotic stress and biological regulation processess. Conclusions CNVs may explain part of the phenotypic diversity and the enhanced adaptation evident in Nguni cattle. Genes involved in a number of cellular components, biological processes and molecular functions are reported within CNVRs identified. The significance of such CNVRs and the possible effect thereof needs to be ascertained and may hold interesting insight into the functional and adaptive consequence of CNVs in cattle. Electronic supplementary material The online version of this article (doi:10.1186/s12864-015-2122-z) contains supplementary material, which is available to authorized users.


Background
Copy number variants (CNVs) are segments of DNA that are 1 kb or larger in size and display a variable copy number relative to a reference genome, hence comprising deletions, duplications and insertions [1]. A number of recent studies demonstrated CNVs to be prevalent in bovine genomes [2,3]. CNVs are reported to affect a greater percentage of genomic sequences and have been identified in regions covering a number of genes that are recognized to play a role in cattle environmental responses and adaptation [4]. CNV region (CNVR) incidence also demonstrates some tendancy to parallel breed history and breed formation patterns [4,5].
The development and focus on intense selection programs have greatly enhanced the genetic improvement of a number of domesticated cattle breeds worldwide. Understanding the multiple components of functional breed diverstiy have important implications for breed management and genetic improvement practices, especially in breeds that are locally adapted and have not undergone intense artificial selection. South African Nguni cattle represent such a distinct, conserved, Sanga type cattle breed that has undergone little synthetic breeding [6,7]. Having endured natural selection pressures from a variety of disease agents and harsh climatic conditions, Nguni cattle have proven to prevail in suboptimal environmental circumstances [8]. These indigenous South African cattle are also recognized for their small frame size and diversely patterned and multicoloured hides.
The availability of two cattle reference genomes (Btau_4.0 and UMD3.0) (The Bovine Genome sequencing and analysis consortium, [9]) and the development of genomewide single nucleotide polymorphism (SNP) genotyping arrays has enabled new avenues of research in bovine genomics. Although SNPs have been the primary focus of variant screening and association analyses, the recent development of CNV discovery tools utilizising both sequencing and SNP data hold opportunity for the in depth investigation into the prevalence of additional types of genomic variation [10][11][12].
The role that CNVs play within breeds to ensure diversity and adaptation has not yet been investigated. Nguni cattle have undergone scant synthetic breeding and are well adapted to their primary environment. With CNVs demonstrating a possible correspondence with breed diversity and adaptation, Nguni cattle present a valuable breed in which to investigate CNV prevalence and distribution.
This study investigated the population structure, haplotype block structure and the occurance and distribution of CNVs in Nguni cattle of South Africa using genotype data from the Ilumina Bovine SNP50K panel. Extensive linkage disequilibrium studies have been performed in cattle [13,14]. Haplotype block (HPB) structure studies are however not as widespread [15]. The characterization of HPB structures at the population level contribute towards understanding the nature of non-linear association between phenotypes and genes [15]. This study determined the prevalence of CNVs within Nguni cattle followed by an analysis of their distribution within the different ecotypes inferred by population structure analysis. The prevalence of HPB structures in CNV formation was also investigated.

SNP quality control
The Illumina Bovine SNP50 beadchip v2 comprising of 54,609 markers was utilized in the study (Illumina Inc., San Diego, CA). Of these 54,609, 54,060 SNP probes map to the most current UMD 3.0 bovine reference genome. After genotyping, a total of 1,340 variants were removed due to missing genotypes, and a further 11,232 variants were removed due to having a minor allele frequency of less than 0.02 and an additional 1,724 variants with a call rate of less than 95 % in the sampled population. In summary, 40,313 SNPs remained after applying extensive quality control (QC) pruning.
Population structure analysis Population structure QC The 40,313 SNPs that remained after QC were further pruned for linkage disequilibrium (LD) using a threshold of r 2 = 0.1. LD trimming resulted in another 29,836 SNPs pruned from the dataset, resulting in a final set of 10,477 SNPs used in the downstream analysis. Of the 492 animals sampled, 230 demonstrated an identy by descend (IBD) value of greater than 0.25 with animals within the dataset and were subsequently removed. Two hundred and sixty two unrelated animals thus remained for population structure analyses. Previous research suggests Nguni cattle populations to comprise of up to 9 different eco-types [6]. This estimation was then used to perform for a cross validation for 10 different K values. Standard error estimates for K ranged from 0.545 for K = 1 to 0.527 for K = 5 (Fig. 1).

Population structure statistics and classification
Organization of the data according to ancestry percentages, demonstrated 5 distinct sub-population clusters (Fig. 2). Instead of exhibiting the typical "v" shape graph which congests at the optimal K, the K graph demonstrated a "w" type of formation, with K equal to 8 (K8) following closely behind the optimal of K5. Admixture between sub-populations was evident. Sub-populations were assigned alphabetical tags. Nguni cattle have only recently been incorporated into synthetic breeding schemes, and for many years subsisted under natural selection pressures [16]. It can thus be expected that crossing between eco-types would be evident. The observed clustering may therefore be subsequent to such crossing between ecotypes or an indication of subpopulations that diverged more recently from one another. It is however, important to note that the ecotype structure of the studied animals was unknown upon sampling of animals used in the analyses. Discriminant Principle component analyses (DPCA) also demonstrated 5 clusters within the 262 Nguni animals and is presented in Figs. 3 and 4.

Haploblock analysis Haploblock statistics
A haplotype block is a combination of allelles that are linked on a common chromosome and inherited concurrently from a single ancestor [17]. Five hundred and fourty one haplotype blocks were identified across all 492 animals. Of these, 297 covered 3 or more SNPs. HPBs ranged in length from 84 base pairs on chromosome 8 to 199,730    Table 1). The average length of the haplotype blocks was 79,686.68 (SD ± 67,651.42) base pairs across chromosomes with a total HPB length of 41.5Mbp. Large amounts of variation in haplotype structure and size between chromosomes were observed. Chromosome 1, 2, 3 and 8 exhibited the most haplotype blocks at 43, 33, 37 and 30 respectively (Table 1). Althought the largest HPB was found on chromosome 1, chromosome 10 contained the highest average HPB length of 123 kbps and the second highest percentage of its genome comprising of HPBs (Table 1). Previously, a negative correlation was reported between the average HPB length and recombination rate [18], and there also exists evidence of differences in recombination rates between cattle breeds [19].
The smallest number of haplotype blocks were identified on chromosomes 22, 27 and 28, with chromosome 22 exhibiting the smallest percentage of its length consisting of HPBs. The exact boundaries of HPBs are not resilient to variations in SNP density as the average size of HPBs may decrease with the greater sequence coverage of the HPB that results from elevated marker density [20]. Khatkar et al. [21] reported 727 haplotype blocks covering more than 3 SNPs in 1000 Holstein-Friesian bulls using 9195 SNPs in Hardy-Weinberg equilibrium mapped to the Btau 3.1 bovine assembly. Haploblocks reported in this study were on average 1 kb larger than those reported by Khatkar et al. [21].

Haploblock gene ontology
Haplotype blocks have discrete boundaries that are defined by recombination hotspots [22]. In the past HPB analyses were primarily used to identify tag SNPs [23]. In this study 232 genes were present within the 541 HPB identified (Additional file 1). Five genes, including Bos taurus fat mass and obesity associated (FTO), family with sequence similarity 155 (FAM155A), Glypican (GPC5), Na+/K+ transporting ATPase interacting 2 (NKAIN2), UDP-N-acetyl-alpha-D-galactosamine:polypeptide N-acetylgalactosaminyltransferase-like 6 (GALNTL6) and cysteine conjugate-beta lyase 2 (CCBL2) were covered by two separate haplotype blocks lying in close proximity to each other (Additional file 1). We used gene ontology (GO) terms to classify these genes into a number of biological process, molecular functions and cellular components. Furthermore, we used the PANTHER database to identify protein features associated with GO terms (Additional file 2). A total of 122 genes involved in metabolic processes and 143, 226 and 188 genes involved in biological regulation and biological process and cellular processes respectively were positioned within HPB regions ascertained. Of interest were genes involved in immune system process (18), immune response (7), immune system development (9) and positive regulation of response to stimulus (17) (Additional file 2). Gibson et al. [24], utilised exome-chip data to demonstrate patterns of linkage disequilibrium and subsequent haplotype structure to be informative of gene function and possible relationships between genes and specific phenotype clusters. Nguni cattle are suited to survive in harsh environmental conditions with enhanced disease and parasite resistance as well as heat tolerance [25]. It is therefore not surprising that genes involved in processes like immunity and stimulus responses lie within the HPBs identified.

CNV identification CNV model quality control
As with all current CNV detection methodologies deducing copy number variations from SNP data encompasses a number of areas at which error can be introduced and ascertainment biases presented [26; 27]. The bovine SNP50 beadchip is limited to detected variations in the copy numbers of sequences present in the reference population that was used to design the probes, while it does not provide details regarding the location of duplicated copies [28]. A number of factors influence the accuracy of CNV breakpoint detection, including batch effects, population stratification, experimental differences and the robustness of the statistical model [29]. SNPs utilized were also selected to have a minimum minor allele frequency and tend to be those that segregrate within multiple breeds [30]. The tendency of SNP arrays to demonstrate greater sensitivity to deletions than duplications is particularly note worthy in areas with insufficient probe density to use B allele frequency measurements which may result in the majority of the smaller CNV events being deletion events partially owing to an ascertainment bias [28]. With this in mind, four models utilizing different filtering stringencies were used to identify CNVs in Nguni cattle (see Methods) and are presented in Table 2. Four hundred and thirty three CNVs were identified by all four filtering techniques in 231 animals ( Table 2). Discrepancies in the number of CNVs identified by each of the models was evident. Model 1 identified 353 CNVRs in the 379 animals that had a average length of 259 kb ( Table 2). Inclusion of the gcmodel enabled additional animals to pass QC filtering and subsequently corresponded with an elevated number of CNVs being identified. Great variation in the size and number of CNVRs has been reported in cattle [31,32]. CNVs in this study ranged from 30 kb to 1 Mb in size (Table 3). All models demonstrated a similar pattern of CNV numbers across animals, although models 3 and 4 determined a number of novel CNVs. All CNVs identified by models 1 and 2 were identified by either model 3 or 4 or by both (Fig. 5).

CNV statistics
Only those CNVs identified by all models were utilized for further analyses, to ensure validity of variable regions. Only 326 animals passed the PennCNV filtering. A total of 334 CNVRs were identified across models in 231 of these animals ( Table 2). CNVR were between 30 kb and 1.2 Mb in length (Table 3). We identified 90 animals that contained a single copy number variation in their entire genome. One animal contained 22 CNVs in its genome. The average number of CNVs per animals was 2.61 (SD ± 2.63) which is similar to the 3.2 CNVs  (Table 4). Single copy deletions were identified in most of the animals while only 1 animal had a double copy duplication. This discrepancy in copy number of CNV may be an artifact of the PennCNV algorithm which has been seen to identify many more deletions than duplications [33]. SNP array platforms tend to also demonstrate reduced precision in detecting single copy gains relative to deletions, of which this may be an artifact [28]. Jiang et al. [32]      demonstrated the potential for certain regions of the genome to be more susceptible to copy number variations within cattle breeds. The form and exact locality of these CNVs may be what contributes to the nature and degree of variation exhibited by gene expression of adjacent genes. Fadista et al. [36] reported CNV distribution in cattle to reflect chromosomal size with the most CNVs being identified on the largest chromosomes. Our data, however does not follow this pattern entirely. Chromosome 6 had the greatest number (18) of CNVs while chromosome 18 contained no CNVs (Table 3). This reflects findings of Guryev et al. [37], who reported chromosome 18 to be a "cold spot for CNVs" in rats. Chromosome 18 together with chromosomes 5, 27 and 29 are reported to demonstrate a preponderance of segmental duplications in the bovine genome [2]. A noticeable feature of CNVs, particularly larger CNVs, is their prevalence in regions with known segmental duplications [35]. Also known as low copy repeats (LCRs), these segmental duplications are duplicated fragments of DNA that are more than 1 kb in size and can be found either on the homologous chromosome or on a separate, nonhomologous chromosome with a minimum of 90 % sequence identity [38]. In this study we identified 11, 0, 6 and 4 CNVRs on chromosomes 5, 18, 27 and 29 respectively. SNPs were reported as being sparse in regions of segmental duplications and may explain the comparatively lower numbers of CNVs on these chromosomes [39]. Segmentally duplicated domains are known to encode protein products that play a prominent role in species adaptation [40], which makes identification of CNVs in these regions crucial. Techniques such as next generation sequencing may be more suitable for the detection of CNVs, particularly on chromosomes previously reported to harbour low number of CNVs.

Gene ontology
Four hundred and fifty eight genes located within 10 Mb of CNVRs were identified. A number of genes including Milk fat globule-EGF factor 8 protein (MFGE8), collagen type XIII alpha 1 (COL13A1), cystic fibrosis transmembrane conductance regulator (CFTR), Bradykinin receptor B1 (BDKRB1), prostaglandin-endoperoxide synthase 2 (PTGS2), major histocompatibility complex, class Irelated (MR1), Platelet/endothelial cell adhesion molecule 1 (PECAM1) and leucine rich repeat and fibronectin type III domain containing 5 (LRFN5) involved in immune system response or B-cell mediated immunity were overrepresented within identified CNVs (Additional file 3). Copy number variations in immune related genes have previously been linked to disease [36]. Variation in the genes comprising the major histocompatibility complex have been reported to play a pivotal role in the predisposition of cattle to diseases such as dermatophilosis, mastitis and tick born infections [41]. Stothard et al., [42] reported CNVs that are closely associated with immune and lactation genes. Bickhart et al. [35] reported that 15 of the 25 most variable copy number genes they identified, had functions associated with immune response and host defense, such as defensin, interferon and GIMAP (GTPase and IMAP) families. Anhidrotic ectodermal dysplasia in cattle is associated with a deletion that may range between 2 and 160 kb of the genome and includes third exon of the EDA gene [43]. Flisikowski et al. [44] demonstrated a 110 kb microdeletions in the MER1 repeat containing imprinted transcript 1 ( MIMT1) gene region to be linked to the incidence of abortions and stillbirths in cattle. A 2.8 kb deletion in the solute carrier family 4 (anion exchanger), member 2 ( SLC4A2) gene was reported by [45] to cause osteopetrosis in Red Angus cattle. Two causal deletions in the claudin 16 ( CLDN-16) gene were linked to renal tubular dysplasia in Japanese black cattle [46]. Sixteen CNVRs were detected in 8 or more animals in this study (Table 5 and Additional file 4). These CNVRs contained a number of genes involved in immune system processes, cell communication, response to toxic substances and cell communication. The CNVR on chromosome 1 located between base pair 104,798,012 and 105,264,358 observed in multiple animals contained the sucarseisomaltase (SI), intestine-specific gene (Additional file 4). Nguni cattle are reported to exhibit a superior feed conversion rate when compared to other indigenous breeds [47].
CNVs have potential to not only change gene dosage and structure, but may modify gene regulation as well as expose recessive alleles [48]. A total of 458 genes were located adjacent to (within 10 Mb), or within an identified CNV. Comparison of those genes contained within CNVRs identified within this study with those identified within other breeds [5,29,30] revealed 402 (87 %) genes that were unique to the Nguni (Additional file 5). The only gene identified close to a CNVR in all four studies was immunoglobulin lambda-like polypeptide 1 (IGLL1). IGLL1 is one of the polypeptides of the immunoglobulin light chain gene pool in domestic cattle that play a role in B cell production [49]. This gene lies adjacent to its associated colute carrier (SLC) polypeptide [49]. Immunoglobulins are the molecular mediators of the adapative humoral response of jawed vertebrates (Gnathostomata). The evident variation in copy number at this gene in a number of bovine breeds may explain the variation in the adaptive immunity evident between breeds, but further investigations into the role of this CNV needs to be ascertained. The Bos taurus pregnancy-associated glycoprotein (MGC157405) gene is the only gene represented across CNVRs of Hou et al. [5], Bickhart et al. [35] and this study and forms part of the cellular defense response. Ten genes are shared between this study and that of Hou et al. [5] and Bae et al. [34], including O-fucosylpeptide 3-beta-N-acetylglucosaminyltransferase (LFNG) and ADP-ribosylation factor-like 6 (ARL6) that are both involved in metabolic and cellular processes. B cell mediated immunity, mesoderm development and cell communication pathways also demonstrate representation by genes shared (Additional file 3). Twenty nine genes located within the Nguni CNVRs were also reported to be associated with CNVRs in Korean cattle [34] (Additional file 5). Overlapping genes were associated with a number of biological processes including positive regulation of cell proliferation, cell communication, detection of stimulus, cellular process, metabolic process and susceptibility to natural killer cell mediated cytotoxicity (Additional files 3 and 5). Thirteen of the genes associated with CNVRs in this study overlap with genes covered by CNVRs reported by Hou et al. [5] in a variety of cattle breeds, including African Breeds. The funtional annotation of these 13 genes were associated with immune system processes, cell communication and lipid metabolic processes (Additional files 3 and 5).
Five of the genes identified within CNVs in this study were also identified by Bae et al. [34] in 265 Korean cattle (Additional file 5) while another 5 corresponded to findings of Hou et al. [5] in multiple different Indicine, Taurine, Composite and African breeds. Bickhart et al. [35] speculated that the distinctions in selected breeds for specific traits could be linked to specific CNVs and that discrepancies in CNVs and subsequent CNVRs between different breeds could thus be expected. The greatest amount of gene overlap was between this study and that by Hou et al. [5]. This corresponds with the proposition of CNVs segregating within breeds as they analysed the greatest variety of cattle breeds (366 Taurine, 46 Composite, 70 Indicine and 39 African cattle) within their study.
Additional file 6 demonstrate biological process, cellular component and molecular functions that were represented by genes covered within CNVRs or lying within close proximity of CNVRs identified by all four models. The biological pathways with the greatest number of genes represented included biological process, primary metabolic process, cellular metabolic process, primary to stimulus and cellular process. Nervous system development (p = 0.008), single-organism behaviour biological pathways (p = 0.003) and dendrite cellular component (p = 0.05) demonstrated significant (p ≤ 0.05) overrepresentation. Genes involved in these processors were evident in CNVRs identified in all ecotypes. Hansen [50] denoted metabolic regulatory ability that results in a reduction in body temperature to be one of the factors that contribute to superior thermotolerance within cattle species. Whether the presence of CNVs at these genes may relate to the enhanced ability of Nguni cattle to handle harsh environmental conditions needs further investigation. Nonsignificant overrepresentation by CNV genes in 3055 biological processes, 593 molecular functions and 391 cellular components was evident. These systems included cellular response to transforming growth factor beta stimulus, regulation of B cell proliferation and positive regulation of viral release from host cell functions. Previous findings have demonstrated CNVRs to be located in areas containing genes associated with environmental responses like sensory, defense and immunological functions and regulatory processors [31,51]. Similar patterns are evident within Nguni cattle and suggest CNVs to potentially play an important role in the adaptative traits evident in Nguni cattle populations.

CNVs and population structure
CNV characteristics for each subpopulation are presented in Table 6. Sub-population A had the highest average number of CNVs per animal while sub-population D had the smallest average CNV length. Sub-population A had the greatest number of animals in the study (n = 103) and also presented with the most CNVRs (n = 121) ( Table 6). A number of CNVRs were shared between populations. The most widespread CNVR was identified on chromosome 6, covering the protocadherin 7 (PCDH7) and cysteine-rich hydrophobic domain 2 (CHIC2) genes and present in subpopulations A, B, C and E (Table 7). Increasing evidence has suggested that CNVs play a primary role in interindividual diversity [52], attributing to both normal phenotypic variation and major variations in complex traits such as susceptibility to disease [53,54]. Within Nguni cattle sub-populations a broad array of phenotypes are evident with great variations in coat colour, behaviour and immune response being evident [6]. As little research into the genotypic makeup of the Nguni ecotypes has been performed, little is known about what differentiates these ecotypes on a genomic scale. Eighteen CNVRs were identified in multiple animals and are reported in Table 5. On closer inspection of these CNVRs, some noteworthy association can be seen. The CNVR located on chromosome 1 (chr1:104798012-105264358) was identified in 7 animals. Four of the animals belong to sub-population A while 10 of the 11 animal genomes containing the CNVR on chromosome 4 (chr4:108834886-109130345) belonged to sub-population A. CNVR chr6:71910076-72118486 was present in 13 animals with 6 and 5 animals from sub-populations A and C respectively.
Two hundred and eighty eight genes were identified to be associated with CNVRs in sub-populations A, B, C, D and E (Table 7). A number of genes only identified within specific sub-populations were present (Table 7). Sub-population A has the most (149) unique genes that are not recorded in the other sub-population groups. The ataxin 7-like 3B (ATXN7L3B) and tumor necrosis factor and alpha-induced protein 8 (TNFAIP8) genes were present in CNVRs in sub-populations B, C and E and A, C and E respectively and play a role in the immune system process, and the response to stress.

CNVs and haplotype blocks
Thirty four HPBs lay either within, across or adjacent to CNVRs identified within the Nguni cattle population (Additional file 7). Half of these occurances were at CNVR sites that were present in multiple individuals, with one such CNVR on chromosome 1 that was present in 17 animals (Additional file 7). Another HPB overlaped a CNVR associated with genes Ly1 antibody reactive homolog (LYAR), neuron-specific protein family member 1 (NSG1), otopetrin 1 (OTOP1), syntaxin 18 (STX18), transmembrane protein 128 (TMEM128), WD repeat domain 1 (WDR1) and zinc finger and BTB domain containing 49 (ZBTB49) was present in 12 animals. Genes present in CNVRs that overlap or share cut-off points with HPBs contributed to a number of biological, cellular and molecular pathways (Fig. 6). Of the biological pathways, metabolic processes demonstrated the greatest gene representation. Other interesting biological pathways represented by genes covered by both HPB and CNVR were the immune system processes, biological regulation and cellular processes. Four cellular component pathways demonstrated representation. Of the molecular pathways represented, protein binding transcription factor had the greatest number of genes denoted within CNVR-HPB  CNVs have been reported to be in LD with surrounding SNPs, demonstrating conserved long-range haplotypes [55]. Meiotic crossing over hotspots flanked by recombinationally inert DNA is thought to be a major contributing factor in the presence of haplotype block structures [56]. Whether the mechanisms involved in meiotic recombination crossing-over may play a role in the variations in copy number is something that could be looked into as the exact mechanisms of CNV formation is yet to be fully understood.

Conclusions
Population structure analyses revealed the presence of 5 subpopulations with some degree of admixture occuring between groups. A total of 334 CNVRs were identified and characterized within the genome of 492 Nguni cattle. Different filtering techniques were modelled. The inclusion of the gcmodel with the higher waviness stringency proved to demonstrate the greatest repeatability with CNVs identifed across models.
Eighteen CNVRs were identified in multiple animals. Among these regions, segregation within as well as across sub-population groups was evident. Specific CNVRs may play a role in the variation exhibited among Nguni ecotypes. Some of these CNVRs may also be distinct to Nguni cattle, contributing towards some of the distinctive phenotypic traits for which they are recognized. Until the twentieth century, Nguni cattle were primarily exposed to natural selection pressures and subsequently exhibit enhanced adaptive traits together with broad phenotypic diversity. Genes within CNVs demonstrated overrepresentation in a number of biological, molecular and cellular pathways and may therefore be potential contributors to the phenotypic diversity evident in Nguni cattle populations.

Sample collection and data generation
Blood samples were collected in 10 ml EDTA VACU-ETTE® tubes by means of venal puncture of the caudal vein from 492 Nguni animals distibuted across South Africa (Fig. 7). Genomic DNA was extracted by means of the Qiagen DNeasy Blood and Tissue Kit from the blood samples. The quantity and quality of extracted DNA was assessed by means of the Qubit and those samples exhibiting a minimum concentration of 50 μl were subsequently genotyped with the Illumina BovineSNP50 BeadChip (Illumina Inc., San Diego, CA) containing 54,001 highly informative markers that uniformly span the bovine genome. Illumina BovineSNP50 BeadChip SNP markers were designed based on the Btau 4.0 reference genome. Markers were clustered and genotyped by means of Illumina GenomeStudio v2.0 software. Fifty four of the genotyped samples were derived from a previous study [7] and were approved for this research by the University of Pretoria Ethical Committee (E087-12).

SNP quality assessment
SNP quality control and sample pruning was performed by means of Plink (version 1.9) [57]. SNPs with a minor allele frequency of greater than 0.02 and/or genotype rate of less than 0.95 were filtered from the dataset.

Determination of population structure
One of the SNPs was removed for each pair of SNPs demonstrating an LD of greater than 0.1 in a sliding window of 30 SNPs. Relationship-based pruning was performed and one member of each pair of animals with an observed genomic relatedness of greater than 0.25 was removed from further analyses to correct for population stratification [58]. ADMIXTURE analyses software [59] were subsequently used to determine population structure of unrelated animals. ADMIXTURE was run from K = 2 to K = 10 and a cross-validation procedure was used to ascertain the best k. That k-value that generated the lowest cross-validation standard error was determined as being the most probable population substructuring. Q estimate matrices barplots were generated with R [http://cran.r-project.org] for each value of k, and animals were sorted according to ecotypes based on this population structure. A discriminant analysis of principle components (DAPC) was performed using adegenet 2.0.0 in R [60]. In the absence of group priors, DAPC infers genetic clusters from sequential K-means and model selections. The find.clusters script was utilized to determine clusters with a maximum of 9 groups. The cumulative variance against the number of retained principle components (PCs) (Fig. 4), demonstrated the greatest amount of variance being explained by 100 PCs which were therefore utilized in conjunction with 2 discriminant functions (Fig. 4) to determine group clustering. A scatterplot of the DPCA was subsequently generated.

Analysis of HPB
PLINK software (http://pngu.mgh.harvard.edu/purcell/ plink, [57]) was utilized to impute haplotypes based on single SNP tests for each of the 29 bovine autosomes of 492 Nguni animals. Variants were pruned for LD using an independent pairwise parameters of window size 30, step size 5 and a r 2 threshold of 0.1. Haplotype blocks were estimated using Haploviews interpretation of Gabriel et al. [61] for each of the 29 bovine autosomes under PLINK's default block settings. Gene ontology analyses of HPB regions was performed against the Bos Taurus reference gene list by means of the PANTHER databases [62].

Generation of CNV calls and CNV filtering
The Log R ratio, B allele frequency, G type, chromosome and position were exported from GenomeStudio for each animal for analyses using PennCNV [12]. PennCNV has outperformed a number of CNV detection packages on multiple occasions demonstrating a greater specificity , sensitivity for CNV calling and reasonably little bias [26,63]. PennCNV utilizes a first order Hidden Markov Model, which assumes that the hidden copy number state at each SNP is subject to the copy number state of the most preceding SNP for high resolution CNV discovery with whole genome SNP genotyping data [12]. The Viterbi algorithm is subsequently utilized to determine the most probable sequence of hidden states chromosome by chromosome [12]. A dynamic programming algorithm, the Viterbi algorithm was applied to predict the Viterbi path which generates the most probable sequence of hidden states representing discrete copy numbers along the chromosomes [64].
The PennCNV compile_pfb script [12] was utilized to create a pfb file from the data. The detect_cnv.pl was run to detect CNVs on 29 autosomes. A number of animals (125) exhibited an absolute genomic waviness factor of greater than 0.04. GC content within 1 Mb region (500 K per side) surrounding each marker was calculated and utilized to create the bovine gcmodel. A second analyses including the -gcmodel option was also run for comparative purposes.
In order to minimize the rate of false positives, extensive quality control was applied by means of the fil-ter_cnv.pl script [12]. Two separate filtering criterions were utilized. By means of Golden Helix SVS software, the median DLRS and GCWF values, were utilized to determine the upper outlier threshold set at 1.5 interquartile range (IQRs) from the third quartile of all DLRS and GCWF values respectively. Upper outlier thresholds of 0.318 and 0.072 for DLRS and GCWF were thus determined. The second filtering was also performed utilizing more stringent standards where only those CNVs that demonstrated a standard deviation (SD) less than 0.3, B allele drift of less than 0.01 and waviness factor [65] of less than 0.04 were kept.

Statistical analyses
Bioinformatic tools together with Microsoft Excel software were utilized to organize and analyse the data. A python script developed in house merged overlapping and adjacent CNVs to form CNVRs. Pivot tables summarized data statistics.

Gene ontology analyeses
RefGene and RefLink annotations (USCS, downloaded on http://genome.ucsc.edu/goldenpath/gbdDescription-sOld.html) were used to identify genes located within a 10 Mb window surrounding a CNV. Norris & Whan [66] have shown that CNVs have a demonstrated effect on surrounding genes in a number of species. Overlapping CNVs were aggregated to delineate a set of copy number variation regions (CNVRs) [27]. The coincidence of CNVs and corresponding genes identified by the different models was visualized by means of the Pangloss Venn diagram generator (VENNY [67], http://www.pangloss.com/seidel/Protocols/venn4.cgi and GeneVenn http://genevenn.sourceforge.net/vennresults.php). The hypothesis that genes were over or under represented in PANTHER pathways, biological processes, cellular components and molecular functions was tested by means of the bonferoni correction on the pantherdb.org. Bos taurus gene ontologies were ascertained by means of the Ensembl and PAN-THER databases.