Polymorphism identification and functional annotation
A genome-wide SNP and INDEL identification in the Brazilian white-egg layer (CC) and broiler (TT) lines was performed to obtain a detailed map of genetic variation in these lines. The initial variant call resulted in approximately 16 million SNPs and 2 million INDELs with an average density of 16.84 SNPs/kb and 1.72 INDELs/kb. Our results are consistent with recent studies, which reported similar densities in chicken [6,7,8, 23].
After filtration, a total of 13.93 million SNPs and 1.36 million INDELs were retained, with more variants detected from the broiler line, including heterozygous variants.
The transition and transversion (TS/TV) ratio for SNPs increased after the filtration process, showing that SNP filtration improved the TS/TV ratio, as higher ratios indicate better accuracy and lower false-positive rates [24]. In chicken, similar TS/TV ratios were reported [25].
Heterozygous variants are more difficult to be identified than homozygous ones. It has been estimated that a sequencing coverage of 6–10 X is sufficient to detect 99% of all variants (homozygous and heterozygous); however, some studies reported that a higher coverage (> 20 X) is necessary for detecting 99% of heterozygous SNPs [26, 27].
A recent study [28] analyzing the same Brazilian layer and broiler lines using reduced-representation-sequencing also reported a higher proportion of heterozygous SNPs in the broiler line. The greater variability and heterozygosity observed in the broiler line was likely due to its broad genetic background, which is similar to other commercial chickens and includes the White Plymouth Rock, New Hampshire and White Cornish breeds [15]. On the other hand, the layer line utilized in this study is a white-egg laying hen, which originated solely from the White Leghorn breed [15].
Even though there are millions of SNPs and INDELs along the genome, only a few are expected to explain the phenotypic differences observed between the two lines. As expected, annotation of the variants showed that most of them belonged to non-coding regions of the genome, and just 1.2% of the SNPs and 0.2% of the INDELs were located in exonic regions, including 122,502 non-synonymous SNPs. Non-synonymous SNPs and SNPs located in regulatory regions are known to have the highest phenotype impact [29].
Although, most of the variants were located in non-coding regions, we identified 7255 intolerant non-synonymous SNPs, 512 stopgain/loss SNPs, 1381 frameshift and 1094 non-frameshift INDELs that may alter protein functions.
Both frameshift and non-frameshift INDELs affect protein sequence by being located within coding regions. Frameshift mutations, however, can also have a major deleterious impact on protein function as these alter the amino-acid reading frame. On the other hand, non-frameshift INDELs do not change the reading frame, but they can still alter the protein function by inserting or deleting one or more amino-acids in the protein sequence. The impact of INDELs on transcription has not been well investigated, but it has been suggested that this may be considerable [30]. Non-coding INDELs, e.g. located in 1 kb down/upstream regions, can affect gene regulation by modulating transcription or silencing genes [30].
The chicken genome is still poorly annotated for functional non-coding regions [31], which can be involved in important biological functions as was shown in humans [32]. Because of this, it is currently difficult to predict the potential effect of variants located in non-coding regions, particularly in potentially important regions e.g. UTR, up/downstream, splicing and ncRNA.
Metabolic pathway analyses of polymorphisms of potential functional impact
In order to gain insights into how the polymorphisms identified could explain the phenotypic changes observed in the two Brazilian chicken lines, metabolic pathways of genes harboring variants of high potential functional effects were investigated.
The analysis of all genes containing intolerant SNPs from the white-egg layer and the broiler lines combined resulted in eight networks related mainly to connective and metabolic disorders, embryonic development, cardiovascular diseases, and carbohydrate metabolism.
The effects of the potential disruptions of these important pathways caused by intolerant SNPs may explain various disorders observed in commercial broiler and layer chickens. Broiler chickens, for instance, have been selected for weight gain, feed efficiency, breast and carcass yields, which results in a substantial increase of growth in a short period. This particular scenario for meat-type chickens can result in heart ventricular hypertrophy and overload of the cardiopulmonary system, leading to pulmonary hypertension syndrome (ascites), which can cause chicken mortality or whole carcass condemnation during slaughter [33].
On the other hand, layer chickens, which have been mainly selected for high egg production, have a reduced body weight than the broiler chickens and can develop diseases such as fatty liver hemorrhagic syndrome (FLS) and osteoporosis [3]. FLS is a result of excessive accumulation of fat in the liver when lipoprotein transport is disrupted during high egg production and cause haemorrhage and sudden death [34].
When the analysis was performed using genes harboring intolerant SNPs specific to a line, clearly different networks were obtained for each line. In the layer line, networks were identified mainly related to reproductive system development, cellular functions, and the endocrine system. The latter in layers is particularly important for reproductive aspects, which involve different hormones and their receptors e.g. growth hormone (GH), prolactin (PRL), thyroids, luteinizing hormone (LH), melatonin, and others [35, 36]. For instance, GH is a regulator of ovarian and oviductal functions in chicken, acting in the modulation of LH and the synthesis of insulin-like growth factors [35]. Also, prolactin is involved with different physiological processes that are important for poultry reproduction; hence, PRL is a candidate gene for egg production in chicken [36].
In the broiler line, networks were obtained mainly related to lipid and carbohydrate metabolism, metabolic and dermatologic diseases, neural development, and organism injury. It is well known that commercial broiler lines gain much more weight and muscle mass than layer chickens. It was observed that at 41 days of age, the Brazilian broiler chicken line weighed on average 2395 g, while the layer chicken line weighed only 513 g, when reared as broilers, resulting in about a five-fold difference between these two lines. Also, the breast yield for the broiler line was 6% higher compared to the layer line [15, 16]. This difference in muscle development and growth between broilers and layers can be observed in the early stages of chicken embryos as previously reported [3].
Another important difference between these two lines is lipid metabolism and fat deposition. Broiler chickens accumulate more fat than layers, and lipid metabolism differences were detected during chicken embryogenesis, e.g. higher triglyceride content in the liver of broilers [3]. It was observed that at 41 days of age, the Brazilian broiler chicken line presented 2.41% of abdominal fat percentage, while the layer chicken line displayed 0.16% [16]. However, so far, the differences between broiler and layer chicken lines regarding adipose growth are not well understood [37].
Frameshift and non-frameshift INDELs are very important as well, because they are located in coding regions, and both can affect protein sequence. The metabolic pathways obtained from the analysis of coding INDELs showed pathways mainly associated with diseases and disorders. Many diseases in humans have been found to be associated with INDELs (e.g. different types of cancer), and it was estimated that INDELs are responsible for 24% of the inherited diseases in humans [38]. Previous studies in chicken have also found INDELs to be associated with various health-related problems, such as retinal degeneration and embryonic mortality [39], production traits, such as egg production [40], and performance and carcass traits [41]. The frameshift and non-frameshift INDELs identified in the present study could also be involved with health problems, but further association studies need to be performed to elucidate their effects.
Genome-wide scan for selection signatures
Intensive artificial selection can cause high degree of genetic differentiation between populations in specific genomic regions, which can result in selection footprints. There are different methods and tools available to detect evidence of selection. Traditionally, the Fst method has been a suitable choice based on allele frequency between populations [5].
Fst values were obtained from SNP and INDEL datasets separately using an overlapping sliding window of 20 kb, which was also utilized in previous studies in chicken [42, 43]. Although some studies used 40 kb windows [4, 7, 44], we decided to use 20 kb size-windows because previous studies reported that haplotype blocks in chicken have around 10 kb. The haplotype blocks described in chicken have different sizes depending on the line analyzed. In one study, haplotype block size was different in commercial and non-commercial broiler and layer chickens, but the length was typically less than 10 kb [42]. Also, traditional and village chickens were evaluated, and a median block size of 11–12 kb was observed [43]. Another reason to use 20 kb windows was to obtain a better resolution of the regions combined with a sufficient number of variants in each window.
We obtained a sufficient number of markers per window with an average of 268 SNPs/window and 26 INDELs/window. In similar studies in chicken, Qanbari et al. [44] obtained a mean of 199 SNPs per 30–40 kb window; Stainton et al. [10] had < 30 SNPs per window, and Gholami et al. [11] analyzed windows of 40 SNPs. Detecting regions under selection with Fst methods requires at least 10 samples [45]. Moreover, presence of a large number of markers significantly increases the power of the analysis [45]. Even though we only analyzed 14 individuals per line, our method is robust as NGS data were used with dense marker sets.
A majority of the windows had low Fst values (< 0.3), although a few windows showed extremely high Fst values (> 0.9). Our results corroborate with similar Fst-based studies in chicken. For instance, Gholami et al. [11] obtained Fst values using the Wright method in commercial layer chickens and observed average values, depending on the breed, between 0.09 and 0.27. Stainton et al. [10], like in our study, utilized the Weir and Cockerham’s pairwise Fst estimator and obtained mean values between 0.015 and 0.17 depending on the broiler line analyzed, reflecting the fact that each broiler line had slightly different selection criteria which resulted in a range of broiler lines with different characteristics.
Separate Fst analyses, with SNPs and INDELs, were performed to check if different putative signature regions would be detected. Most of the signature regions identified were found to be common between the two analyses (SNPs and INDELs), with a high level of correlation. This shows that even though the INDEL dataset was only ~ 10% of the SNP dataset in size, both datasets detected practically the same regions under selection. Our study is the first in chicken to use INDEL variants to detect footprints of selection. There are, however, a few studies in other species, for example in humans [46]. In this study, INDELs have been used for identification of selection signatures, and they suggested that regions surrounding INDELs are more frequently involved in recent selective sweeps [46]. We believe that INDEL variants are also a good option for selection signatures analyses even with a reduced density than SNPs, but further studies are necessary to evaluate it in more detail.
Genetic drift is a well-known factor that can also cause divergences in genomic regions. It is almost impossible to determine if a putative selection signature was caused by drift [47]. Besides, a combination of genetic drift and selection could be responsible for eliciting signals similar to selection footprints [48]. Despite that, there are different ways to minimize the false discovery of selective sweeps, such as the use of a stringent Fst cut-off or checking the overlap of those regions with QTLs [47, 49]. In our study, different steps were utilized to minimize the detection of false positive regions due to drift: (i) stringent Fst cut-off was applied (> 99% Fst); (ii) windows with less than 10 SNPs or 5 INDELs were excluded and weighted Fst values were used; (iii) putative selection signature regions were overlapped with known QTL regions from chicken QTLdb and permutation tests were applied to check the significance of these overlaps (results presented and discussed in sections below); (iv) putative selection signature regions were further interpreted based on different downstream analyses such as annotation and pathway enrichment analysis of genes within the signature regions (presented and discussed in sections below); and (v) Fst analyses were performed utilizing SNP and INDEL datasets separately to obtain more accurate results.
Candidate genes under selection for fat deposition and muscle development
We analyzed all genes under putative selection using network and gene enrichment analyses, and gene function by literature search, and we identified candidate genes for fat deposition and muscle development. There were 307 from SNP-based analysis and 363 from INDEL-based analysis. Candidate genes associated with fat deposition were ADCY2/9, APOA1/4/5, AKAP6, CLPS, IGFBP2, ITPR2, LPGAT1, LPIN1, and PLA2R1. Candidate genes associated with muscle development were ACTC1, AKAP6, IGFBP2, IGF1R, NOS1, MAPK13/14 and MYOCD.
The AKAP6 gene is present in both muscle development and lipid metabolism pathways. A recent study suggests that this gene is an important regulator of muscle regeneration and myoblast differentiation in mice [50], in addition to its role in cardiac functions [51]. This gene seems to play a major role in the regulation of different processes, and further studies in chicken are necessary to investigate its role in greater details.
The 30 kb selective sweep region (GGA7:22,790,001–22,820,000), covering the gene IGFBP2 identified in our study (with variants mainly fixed in the layer line), was previously reported in a comparison of various layer chickens [11]. This gene is known to be related to growth [52] and fat deposition [53] from previous studies in chicken.
A putative selection signature region of 20 kb size was identified (GGA10: 16,210,001–16,230,000) with 24 INDELs overlapping the gene IGF1R with variants mainly fixed in the layer line. Recent studies in chicken, using the same method described here, also identified the same gene under selection in a commercial broiler [10] and layer [12] lines. This gene plays an important role in chicken growth [54] and may also be related to reproduction in layers [55].
One selective sweep of 40 kb (GGA28: 3,860,001–3,900,000) overlapping the insulin receptor gene (INSR) was identified in our study with variants mainly fixed in the broiler line (Table 4). This region has been previously reported in studies with layers [44], broilers [4], and both broilers and layers [7]. Also, INSR was associated with growth in chicken in a previous study [56].
Candidate genes under selection for other important traits in poultry
Apart from fat deposition and muscle development traits, genes within the putative signature selection regions were found to be candidates for other important traits in chicken, such as cardiac, skeletal and embryonic development, broodiness, energy metabolism, and reproduction. For instance, the MEIS2 gene is related to the development of brain, heart, eyes, cartilage and hematopoiesis [57]. In our study, 17 adjacent windows were detected as putative selection signatures (GGA5: 30,490,001–30,690,000) harboring the MEIS2 gene, which were merged into one large region of 200 kb, possibly indicating a strong selection region. In the study of Gheyas et al. [7], they identified broiler-specific SNPs in the MEIS2 gene.
In addition, two genes, ITPR2 and VIPR1, were identified in the putative selection signature regions, which are possibly related to layer reproductive traits. One recent GWAS study in chicken revealed that the ITPR2 gene was associated with eggshell ultrastructure [58]. On the other hand, the VIPR1 gene was found to be associated with broodiness [59], and egg number in chicken [60].
Certain regions or genes are known to be selected in domesticated chickens, such as TSHR (affecting reproductive behaviors) and the BCDO2 (for yellow skin) as have been discussed in a number of previous studies [4, 7, 44]. These genes, however, were not picked up in our study as we applied Fst-based approach for detection of selection footprints between meat-type (broiler) and white-egg layer lines. However, both regions were checked in more detail in both lines. In the TSHR and BCDO2 genes, there were 812 and 187 SNPs, respectively and 57 and 23 INDELs, respectively. Most of these variants were fixed (homozygous) in both lines, especially in the BCDO2 gene, showing that these selection signatures are clearly present in both chicken lines.
Overlap of selection signatures with known QTLs
In order to gain confidence in detecting selection signatures and linking these regions to potential phenotypes, we compared the identified selection signature regions with known QTL regions according to the Animal QTLdb [21].
Approximately 97% of the selection signature regions overlapped with one or more QTLs for different traits. This provides an independent support for the validity of the selection signature regions detected in this study. These QTLs corresponded to 190 traits related to fat deposition, growth, carcass, egg production and quality, bones, blood parameters, and organs, among others.
In addition, a statistical test was performed to check if overlaps of the selection signature regions with QTLs were significant and not just by chance. This comparison revealed that a higher proportion of the putative signature regions identified in this study were located in QTL regions associated with relevant traits in chicken, and should be further investigated.
The results showed that almost all the selection signature regions overlapped with different QTLs, providing additional confidence of the putative selection regions identified, and also may be an indication of genetic correlation between traits due to pleiotropy.
SNP and INDEL validation
Availability of variant information from previous NGS studies and high throughput genotype data from 600 K arrays offer an excellent opportunity to carry out cross-study and cross-platform (NGS vs array) comparisons to achieve greater confidence in the variant and genotype calls. We took this opportunity to validate SNPs and INDELs detected in the present study.
SNPs were validated by comparing NGS data against two datasets: 15 M SNPs detected in a previous study by Gheyas et al. [7] and SNP data from 600 K Affymetrix® Axiom® HD genotyping array from two chickens.
We validated more than 78% of our SNPs by comparison with the catalog from Gheyas et al. [7]. This provides a strong validation of the SNPs detected in our study.
In addition, we observed high levels of concordance with 600 K SNP array data. We noticed a slightly higher percentage of genotype concordance for homozygous SNPs than for heterozygous ones, which were also observed previously in a study with bovine variants [61].
The majority of the discordant genotypes checked had poor genotype quality in our NGS data even though their variant qualities were high. A possible explanation of this could be that the sequencing coverage was not sufficient in these cases for accurate prediction of the genotype.
We observed low error rates in our study (considering the non-concordance of genotypes between NGS and array data for SNPs), showing that alignment and SNP calling were efficient.
The 600 K array allowed us to validate a large number of functional variants, intolerant non-synonymous and stopgain/loss SNPs, thereby providing confidence in their detection.
Moreover, INDELs identified in this study were also validated by comparison with two major studies [8, 22], which have generated the largest catalogs of INDELs in chicken. We observed that 77% of the INDELs reported in this study were common with the two previous studies evaluated, and 62% had the same alleles. Similar results were observed before based on the same type of comparison [8]. The lower percentage of concordance, when allele information was considered, is an indication that the INDEL alleles differ among different chicken populations, probably because a large proportion of INDELs actually consists of microsatellite like tandem repeats. This comparison, with the two major INDEL studies in chicken [8, 22], shows that the methodologies used in this study were efficient for detecting and filtering INDELs, and ~ 23% of novel INDELs were detected.