- Research article
- Open Access
Genome wide signatures of positive selection: The comparison of independent samples and the identification of regions associated to traits
© Barendse et al; licensee BioMed Central Ltd. 2009
- Received: 04 February 2008
- Accepted: 24 April 2009
- Published: 24 April 2009
The goal of genome wide analyses of polymorphisms is to achieve a better understanding of the link between genotype and phenotype. Part of that goal is to understand the selective forces that have operated on a population.
In this study we compared the signals of selection, identified through population divergence in the Bovine HapMap project, to those found in an independent sample of cattle from Australia. Evidence for population differentiation across the genome, as measured by FST, was highly correlated in the two data sets. Nevertheless, 40% of the variance in FST between the two studies was attributed to the differences in breed composition. Seventy six percent of the variance in FST was attributed to differences in SNP composition and density when the same breeds were compared. The difference between FST of adjacent loci increased rapidly with the increase in distance between SNP, reaching an asymptote after 20 kb. Using 129 SNP that have highly divergent FST values in both data sets, we identified 12 regions that had additive effects on the traits residual feed intake, beef yield or intramuscular fatness measured in the Australian sample. Four of these regions had effects on more than one trait. One of these regions includes the R3HDM1 gene, which is under selection in European humans.
Firstly, many different populations will be necessary for a full description of selective signatures across the genome, not just a small set of highly divergent populations. Secondly, it is necessary to use the same SNP when comparing the signatures of selection from one study to another. Thirdly, useful signatures of selection can be obtained where many of the groups have only minor genetic differences and may not be clearly separated in a principal component analysis. Fourthly, combining analyses of genome wide selection signatures and genome wide associations to traits helps to define the trait under selection or the population group in which the QTL is likely to be segregating. Finally, the FST difference between adjacent loci suggests that 150,000 evenly spaced SNP will be required to study selective signatures in all parts of the bovine genome.
- Single Nucleotide Polymorphism
- Residual Feed Intake
- Allele Substitution Effect
- Taurine Breed
The goal of genome wide analyses of polymorphisms is to achieve a better understanding of the link between genotype and phenotype. The study of a large number of polymorphisms spread across the genome will reveal aspects of the genetic structure of the population, including, in some cases, evidence of adaptive selection across the genome [1, 2]. Furthermore, if the individuals in the sample are measured for a range of traits, genome wide association (GWA) studies between the polymorphisms and the trait values can lead to the genetic dissection of traits [3, 4]. This applies in particular to complex traits, where genetic and environmental factors combine to produce the phenotype [5–7]. A concordance between SNP showing evidence of genetic selection and association to a trait may help define the phenotype that is under positive selection and may provide some evidence to support the association , assuming that samples from populations that segregate the genetic variability in question have been included. There are studies of a few genes that give credibility to the approach [9, 10].
Genome wide studies of genetic selection (GWGS) are generally performed separately to GWA despite the potential advantages of combining the information. The reasons for this separation are mainly operational. GWA studies are sampled to study a set of traits, and population stratification is avoided or tightly controlled . Often the studies are restricted to a particular population. Much effort goes into replication of results in an independent sample of the same or similar populations, with a strong effort to perform meta-analyses across data sets. Alternatively, studies of selection are recommended to use the most highly differentiated populations available . These are not necessarily the best populations for gene discovery for any particular trait. One example is the Human HapMap project, which reported the sampling of 3.1 × 106 single nucleotide polymorphisms (SNP) across the genome , studied in three highly divergent population samples. Many genomic regions showed signatures of selection, some identified using more than one method of analysis. However, it is not known what confidence could be placed in these signatures of selection if a different set of divergent populations was used, or if intermediate populations were included.
The Bovine HapMap project was started with the goal of understanding the genetic structure and history of an important livestock species. The expectation was that this study would provide information on the processes of domestication, insights that are not currently available from the study of humans or other animal species. To do this, animals from 19 breeds and two outgroup species were sampled from five continents and all three major branches of cattle . This large number of groups is quite different to the Human HapMap project. The 501 animals were genotyped for a panel of approximately 4 × 104 SNP and a variety of analyses were performed, including a genome wide assessment of positive selection. Several methods were used to infer evidence of positive selection with many locations identified by more than one method. These methods were 1) the analysis of population divergence using FST [2, 15], 2) the analysis of ancestral states in conjunction with extended haplotype homozygosity  of derived alleles using the iHS method , and 3) the modelling of the distribution of allele frequencies along a chromosome under the expectation of a selective sweep analysed using a composite likelihood ratio (CLR) . However, these methods respond to different signals in the data. The lack of concordance is therefore not evidence of the absence of selection. It is unknown what proportion of these signals would be reproducible in another set of population samples.
This study had two main aims. The first was to compare genome wide signatures of selection found in the Bovine HapMap project with genotypes collected from a similar set of animals for which phenotypes are known . The second was to explore the relationship between signals of selection and associations to traits. The traits were intramuscular fatness, residual food intake and meat yield, which have biological and economic importance. They relate to obvious characteristics of the phenotype and are under artificial selection in cattle. The Australian samples include eight of the breeds used in the Bovine HapMap study. We used FST to study selection partly because 1) it is a robust easily calculated measure with a long history [2, 12, 15, 20, 21], 2) the amount of ancestral information on SNP is low in cattle , resulting in sparse information for iHS, and 3) the joint minor allele frequency distribution is not known in all breeds making the CLR method complex to implement in cattle.
The number of animals per breed, the breed codes and coding schemes
data code no2
FST group code3
British taurine purebred
British taurine purebred
British taurine composite
British taurine purebred
European taurine purebred
European taurine purebred
European taurine crossbred
Brown Swiss Holstein
Channel Isle purebred
Channel Isle purebred
Channel Isle European crossbred
Channel Isle European crossbred
British taurine purebred
British European Scandinavian crossbred
Brown Swiss cross
Channel Isle crossbred
Australian Friesian Sahiwal
excluded from FST
repeats and unknown crossbred
The beef cattle samples and their trait measurements have been described previously [19, 23, 24]. In short, the beef sample consisted of 189 individuals from 142 sires chosen to avoid close relatives. There were one to four offspring per sire and a median of one offspring per sire. This sub-sample was chosen from a larger group of 1472 animals in half-sib pedigrees representing 308 sires. The detailed description of the method of choosing the sub-sample has been published . The dairy samples were collected as part of an industry survey of cattle tick burden of 2494 cattle in North Eastern Australia. The sub-sample of 189 animals consisted of offspring of 138 sires and 174 dams, with a median of one offspring per sire and a range of 1–4 offspring per sire. They were chosen for genotyping from 5 properties where tick counts had been obtained over several seasons. The animals were chosen to be from as many breeds and sire lines as possible, irrespective of their tick burdens. They were also chosen because they had full pedigree records in the Australian Dairy Herd Information Service (ADHIS) database with full milk yield performance. Because of this, obtaining equal numbers for each breed with such a specification was not possible, because the breeds do not have equal representation in Australia. Aliquots of 200 μL of blood were processed to DNA using the QIAamp DNA mini kit using the manufacturer's instructions (QIAGEN GmbH, D-40724, Hilden, Germany).
The Bovine HapMap sample, genotypes and quality control have been described previously . In brief, samples were obtained for the taurine beef breeds Angus, Charolais, Hereford, Limousin, Piedmontese, Red Angus, and Romagnola; the taurine dairy breeds Brown Swiss, Guernsey, Holstein, Jersey, and Norwegian Red; the zebu Brahman, Gir, and Nelore; the sanga N'Dama breed; the zebu-taurine composite breeds of Beefmaster and Santa Gertrudis; and the sanga-zebu composite Sheko breed. These samples were genotyped for approximately 4 × 104 SNP derived from the Bovine Genome Sequencing Project. Each breed sample consisted of 24 animals including two trios of sire, dam and offspring. The rest were animals as unrelated as possible using a pedigree analysis. The exceptions were the Holstein, the Limousin and the Red Angus, which had samples of 53, 42 and 12 respectively. For FST calculation the offspring were excluded. The correlation of FST with and without these offspring across all loci was r = 0.997. A link to the SNP can be found at ftp://ftp.hgsc.bcm.tmc.edu/pub/data/Btaurus/snp/Btau20060804 associated with the Baylor College of Medicine and the description of the SNP is subject to the Fort Lauderdale agreement . The order and distance of SNP along the bovine genome was taken from the current assembly in Genbank (Btau4.0).
The genotyping of the Australian sample was performed using the MegAllele™ Genotyping Bovine 10 K SNP Panel , a fully described set of SNP, by ParAllele Inc. on an Affymetrix GeneChip Scanner 3000, yielding an average spacing of 325 kb between SNP. Further details of the SNP can be found at the link ftp://ftp.hgsc.bcm.tmc.edu/pub/data/Btaurus/snp/Btau20050310/. The bulk of the SNP on the SNP array were obtained by comparing the genome sequence of the Hereford animal to the partial sequence of the Holstein (72.4%) and the Angus (15%) animal, with 7.5% cSNP (coding SNP) obtained from the Interactive Bovine in silico SNP database , and the rest from the partial sequence of the Limousin (3.1%) and the Brahman (2%) animal. The major difference in breed origin of SNP in the two SNP sets is an increased proportion of SNP obtained by comparing the genome sequence of the Hereford animal to the partial genome sequence of the Norwegian Red (33%), the Brahman (13%) and the Jersey (2%) animal, with a consequent decrease in the percentages of SNP obtained from an animal of another breed . In summary, the SNP in both SNP sets are mainly the pairwise differences between a taurine beef and dairy animal, with either a small or a moderate percentage of SNP that are differences between a taurine and a zebu beef animal. The beef and dairy samples had been genotyped separately so the combined data set was rescored by ParAllele to ensure that genotypes were consistent. Data quality assurance was investigated using duplicate samples (unknown to the genotyper) and the genotyping of a selection of loci using two alternative technologies. Repeat genotyping of the same individual showed a 99.72% concordance rate. All samples with more than 10% missing data were excluded and then all loci with more than 10% of missing data were excluded. The rationale for this is that DNA that is not of high quality will be more likely to have incorrect genotype calls and have more missing data. The same will apply to SNP with poor assays, which are more likely to have false or indeterminate calls, and so are also more likely to have missing data. The rejection of 10% was not arbitrary; the number of loci excluded was plotted against percent missing data and the cut off point was determined as the point in round numbers along the curve where the rate of change in the number of excluded loci became linear with every increased percent of missing data. This left 8859 SNP out of the panel of 9276 scored SNP. These SNP were tested for departures from Hardy-Weinberg equilibrium within breed, and 4.99% of comparisons were significant at the 5% level.
To determine the clustering of individuals in this study, a principal component analysis (PCA) was performed on the individual genotypes of the 378 animals using the 1604 SNP that had no missing data. The data were analysed using a covariance matrix in SPLUS [28–31]. Full records, without missing data, were used because individuals with missing data may appear as outliers . The reduction in the number of loci has the benefit of reducing the correlation between genotypes due to linkage disequilibrium (LD); the average gap between adjacent loci increased from 357 kb to 1.74 Mb and the number of SNP separated by less than 10 kb dropped from 2462 to 187. The first principal components (PC) were highly significant and these were plotted against each other, encoding each coordinate with the breed or crossbred identity of the animal as contained in the database. During the analysis we found an Angus and a Murray Grey individual clustered with the zebu-taurine composite animals. Those animals were found to have a mismatch between the DNA sample and breed designations in the Beef CRC database. They were excluded from the FST analysis although we do not expect that the exclusion would make much difference to the analysis because of the small numbers involved and the amount of variation explained by each principal component. The clustering was performed for two reasons. Firstly, lumping of breeds was necessary in the dairy animals because some of the sample sizes were small (N < 10). It has been shown that it is best to avoid small samples and to use samples of equal size when comparing estimates of FST . Having similar groups will generate less variable estimates of FST, while lumping groups that have divergent genotypic frequencies should act to reduce FST. Nevertheless, it is the distribution of FST across the genome not the absolute value that is important in identifying evidence of positive selection . Secondly, the clustering identifies the most divergent breed samples in a convenient graphical output. Comparison of the most divergent two to four groups will encompass most of the variability captured by FST . The analysis of a common set of divergent breeds will allow a baseline comparison between the Bovine HapMap and this study.
To determine a genome wide pattern of positive selection, and to compare this to the Bovine HapMap sample, the FST at each locus was calculated  in the same way as in the Bovine HapMap sample. All FST values in this study are per locus values. Means and standard deviations were calculated not to find an unbiased average across all the data but to measure the dispersion of per locus FST in the sample. SNPs that were genotyped in all breeds were used. FST, the average heterozygosity and average allele frequency for each SNP were plotted (Additional File 1). Examination of this figure allows one to see whether the FST values were likely to be affected by significant patterns in genotypic frequencies. The FST values were then plotted against genome location and values were averaged over 8 SNP in a sliding window, as in the Bovine HapMap project. Signatures of selection can be recognized when adjacent SNPs all show high FST , due to the hitch-hiking effect , implying divergent selection between breeds, or where adjacent SNPs all show low FST, implying balancing selection between breeds [15, 21]. Several comparisons were made, varying the breeds and the SNP that were included. The baseline comparison was for the three breeds Angus, Brahman and Holstein, which are the most divergent samples in the PCA in this study, and which were sampled in the Bovine HapMap study. The Pearson correlation coefficient was calculated between the FST values from this study and those from the Bovine HapMap study. The FST values in the top and bottom 2.5% of the distribution in both data sets were compared. To determine whether these loci were significant, the confidence limits for FST were calculated for both data sets. To obtain confidence limits for the per locus value of FST, 1,000 bootstrap samples were taken each consisting of 1,000 FST values sampled at random from the loci within each data set. The average FST was calculated for each bootstrap sample and the standard deviation of the resulting average FST values was used to set confidence limits [34, 35]. To describe the relationship between the location and diversity along a chromosome, the difference between the FST (δFST) of pairs of immediately adjacent SNP (SNP pairs) was compared to the distance in base pairs between them. Each interval between SNP pairs was used once. SNP pairs were binned into those below 1 kb, then into 10 kb bins up to 100 kb, and then into 100 kb bins up to 1.2 Mb. The mean δFST and standard error were plotted against mean distance between the SNP pairs of each bin.
The additive and allele substitution effects for the traits residual feed intake (RFI), percent meat yield (yield), and percent intramuscular fat (IMF) in beef cattle were calculated as previously described and results published elsewhere [19, 36, 37]. In brief, residual trait values were obtained from a mixed model that incorporated fixed environmental effects as well as the random effect of breed (herd of origin) and sire, thereby accounting for possible stratification in the data, using ASReml . The additive effect (a) is half the distance between the two homozygotes, the dominance effect (d) is the difference between the heterozygotes and the average of the two homozygotes, and the allele substitution effect (α) = a + d(q-p) where p+q = 1 and p, q are the allele frequencies . Standard errors were estimated using 1000 bootstrap samples . In this study the estimates for beef traits were reported within taurine breeds (TEM temperate ANG, HFD, MGY, SHN) and within zebu plus tropical composites (TRO tropical BEL, BRM, SGT). Breed type was used because the sample sizes within each breed would be too small for meaningful analysis. As the trait values were adjusted for the effects of breed and contemporary group, a combined analysis should not produce spurious results. These results were searched using the list of SNP that had extreme FST values.
The number of association tests with high extreme FST versus low extreme FST was compared to the goodness of fit expectations for the same sample size using a chi-square test. The expected frequency was taken as the ratio of the observed number of SNP in the low versus the high extreme.
Comparing this plot of PC1 and PC2 to the plot of the first two components in the Bovine HapMap study, PC1 also separated the zebu from the taurine breeds in that study while PC2 partly separated the taurine breeds as a series of partly overlapping clusters. PC1 clearly separated the one sanga breed, the N'Dama, from the zebu breeds, to the same distance as that between the zebu and taurine breeds. PC2 clearly separated the taurine from sanga breeds. There were no purebred sanga animals among the Australian samples. The one major difference was the Hereford breed in the Bovine HapMap study, which was located well away from the other breeds as a loose, flat cluster. Of the other taurine breeds in the Bovine HapMap study, the Angus and Holstein were the furthest apart in the plot of PC1 and PC2.
The per locus FST values in different data sets
To determine the effect of the specific loci on the distribution of FST, we compared the windowed FST values in this study to the windowed FST values in the Bovine HapMap study. The windows are for 8 adjacent loci so the composition and density of loci contributing to each reference point along the genome differed in the two data sets. For all loci the correlation was r = 0.346 for the all breed FST values and r = 0.391 for the three breed FST values. Comparing the correlation of three breed windowed FST values of the Bovine HapMap data and the Australian data to the correlation of the un-windowed values for those data shows a reduction in the variance explained of 76% (r = 0.391 vs r = 0.787). Comparing the all breed windowed to un-windowed FST values in the same way shows a reduction in the variance explained of 68% (r = 0.346 vs r = 0.615). This reduction was smaller than the three breed comparison but the full breed comparison includes not only differences in SNP between studies but also differences in the number and composition of breeds.
To determine the importance of differences in SNP density between the two studies, windowed FST values for BTA 6, 14 and 25 were compared to the other chromosomes. These three chromosomes have 2–3 times higher density than the other chromosomes in the Bovine HapMap data, which in their turn have a 2–3 times higher density in the Bovine HapMap data than in the Australian data. For all breeds, comparing the Australian to the Bovine HapMap data, the correlations were r = 0.382 for comparisons of BTA6, 14 and 25 combined and r = 0.344 for the other chromosomes combined. If the loci that are windowed are only the common ones between the two studies, that is, no difference in density, then the correlation between windowed FST values is r = 0.640, essentially the same as the un-windowed values (r = 0.615).
We calculated the confidence interval for the per locus FST to be considered significant using bootstrap sampling. The 99.9% confidence interval for the per locus FST in our data was 0.094 ± 0.062 while the confidence interval in the Bovine HapMap data for the same loci was 0.141 ± 0.121. In the Australian sample, the top 2.5% corresponded to a threshold FST = 0.224 and the bottom 2.5% corresponded to a threshold FST = 0.015, both outside the confidence interval for that dataset. In the Bovine HapMap data, the top 2.5% corresponded to a threshold FST = 0.284, which is outside the 99.9% confidence interval and the bottom 2.5% corresponded to a threshold FST = 0.039, which is outside a 99% confidence interval for that dataset. There were 94 loci that had an FST above the upper thresholds in both data sets, or 1.28% of 7298 SNP. There were 35 loci that had an FST below the lower thresholds in both data sets. For the SNP above the threshold in both data sets, the 94 SNP were located in 71 genomic regions of 1 Mb containing one or more SNP with high FST (Additional File 2). For the SNP below the threshold in both data sets, almost none of those SNP were close to another SNP with low FST. Some of the low FST values are negative, which may occur when estimating an FST value near zero, but which may also occur in some cases where the expected variance calculated from the average allele frequency for the entire sample is sufficiently less than the sample variance in allele frequencies across populations, or which may occur in some cases where the heterozygote frequencies within populations are higher than expected given the allele frequencies.
Trait associations and high FST values
Significant associations to traits across 10 Mb of BTA2
In this study we report a high level of concordance between the FST values calculated from this study and the Bovine HapMap data for the same SNP and breeds. When the same divergent breeds, namely, the Angus, Brahman and Holstein, are compared, the correlation in per locus FST values between the Australian data and the Bovine HapMap data was high. In both data sets we found that calculating FST using a large number of breeds resulted in per-locus FST values that were not as broadly dispersed around a mean value as those FST values calculated using only the three divergent breeds. This suggests that using many breeds or population groups acts to reduce the variability in estimated per-locus FST values.
However, there are major differences between the two studies in which regions of the genome show the most divergence when all breeds and all loci are used. Many of the strongest signals of selection in the Bovine HapMap data were not found in our data, and the pattern of the signal across chromosomes was visibly different. The similarity between the FST values declined significantly when all breeds were used, amounting to 40% of the variance compared to when only the three divergent breeds were used; there are only eight breeds in common to both studies. The lower correlation with different breeds may suggest that each study had identified signals of divergence particular to the genetic history of those breeds, some of which may be due to selection. The relationship between FST values also decreases substantially when the composition of SNP differed, amounting to 76% of the variance in these data sets when the three divergent breeds are compared.
As there were only 7298 loci in common to both studies, a large proportion of the difference in FST values between two studies will be due to different SNP compositions in the data. For the common loci, the identity and density are the same, and, because the common loci form most of the loci in the Australian set, density and identity are confounded when there is no difference in density between the two studies. Nevertheless, above that baseline, the three higher density chromosomes of the Bovine HapMap data (BTA6, 14 and 25), showed essentially the same relationship to the Australian data as the rest of the data set. This suggests that SNP identity was more important than density in explaining the differences between the two studies. The similarity between the FST values between the two datasets declined significantly when all breeds and SNP were used, amounting to 68% of the variance in that comparison. This is less than the 76% when only the Angus, Holstein and Brahman breeds of the two data sets are used. This difference suggests an interaction between breeds and SNP. The composition of SNP in the two panels is neither of uniform ascertainment nor of exactly the same ascertainment as each other. It is not uniform because SNP were primarily generated by comparing a small number of animals, each of a different breed, to the Hereford sequence, and it was different, because there were slightly more breeds used to define SNP in the Bovine HapMap data and a much larger percentage of zebu SNP .
The first two axes of the PCA separated breeds in a similar way in the Australian and the Bovine HapMap data with one major exception. The similarities are that the first axis clearly separated zebu from taurine and the second axis partially separated meat and dairy within European breeds in both data sets. The first axis also clearly separated zebu from purebred sanga in the Bovine HapMap data, which were at the same co-ordinates on PC1 as the European taurine group. The second axis in the Bovine HapMap study clearly separated the sanga (African taurine) from the European taurine. The differences between European cattle were minor compared to the differences between European and African cattle. Based on the Bovine HapMap data, for diversity studies, a minimum three breed panel would probably have to consist of one taurine, one sanga and one zebu breed. Sanga breeds are not common in the developed world. For breeds that are commonly available one could identify the Angus, Brahman and Holstein as a minimum group for studies of breed or locus divergence based on the similarity between the two studies. However, the greater difference between Angus and Holstein compared to other European breeds may be a consequence of the SNP ascertainment, and further study might be needed before a minimum panel of common breeds could be specified. The major difference between the two studies was that we found that the reference breed for SNP discovery, the Hereford, was located exactly where it was expected, at the intersection of PC1 and PC2, because all breeds would in effect be compared to it (cf. above). This location of the Hereford cluster differs substantially from the analysis reported in the Bovine HapMap study. There the Hereford formed a flattened outlier group well separated from the other breeds. In that report, when only the loci originating from the Jersey breed were analysed in the PCA, the Jersey animals were clustered outside the main group of taurine animals. This may indicate that the shifting clusters could be a generic factor associated with the analysis. While we have not explicitly modelled causes for the difference by performing a PCA on the Bovine HapMap data ourselves, if that explanation is correct then it may suggest that the use of a very large number of loci may lead to the unusual clustering of animals that acted as the source of the polymorphisms.
Higher densities of SNP will be required to study genome wide signatures of selection in cattle. LD causes correlation between genotypes , so FST values that are more similar at a close distance are likely to be similar due to LD. Our results show that SNP separated by more than 20 kb have δFST the same as those separated by more than 1 Mb. This is consistent with the decay in LD as measured by r2 for these  as well as the Bovine HapMap data , where r2 reaches a lower asymptote at around 20–30 kb. These results suggest that for cattle, FST values that are consistently high or low over distances much greater than 20 kb are likely to be due to factors other than baseline LD, and presumably some of these will be due to positive selection . The density of markers required to demonstrate similarity beyond 20 kb implies SNP spaced at less than 20 kb, so that the relationship of δFST and distance between SNP can be quantified. It will require at least 150,000 evenly spaced SNP to quantify genome wide signals in cattle. As SNP are not evenly spaced in current SNP chips, it would require several fold more SNP than that to explore all regions at that minimum density.
The list of the SNP with the most divergent FST values is a resource for the further investigation of the effects of positive selection on the bovine genome. Comparing ends of the distribution in two studies does not necessarily mean that those values are more divergent than expected . Due to the correlation of genotypes in two samples from the same population, very similar FST values can be expected unless there are differences in the populations or errors in the way the genotypes are collected. Instead, the loci in this list have FST outside the bootstrap confidence limits  for both data sets. This indicates that they are unlikely to be merely the result of stochastic events during the evolution and domestication of cattle. One would predict that the process of domestication and breed formation would result in more divergent selection than balancing selection. This is because breeds would be formed and then selected for particular characteristics. Breeds are selected to be different to each other. We found that there were more than twice as many divergent than convergent loci when the Australian and Bovine HapMap data were compared to each other, which is consistent with that prediction. These SNP represent genomic regions that should be of general interest, because they represent the overlap of two studies that do not have the same breed composition. Many of these loci are located within 1 Mb of each other. One question that greater SNP density will be able to answer is the detail of differences in FST around these SNP. That should help define groups of divergent or convergent FST, which in turn would help to define the size and shape of these prospective signatures of positive selection.
There appears to be significantly more loci with very high FST that are associated with RFI, yield and IMF than those with very low FST. IMF is strongly correlated to overall fatness of the animal and is negatively correlated with yield, larger RFI values have a positive relationship to IMF and a negative relationship to measures of muscle area in cattle, while the mean values for these traits show large differences between breeds [42–44]. Yield, fatness and efficiency have been valued for hundreds of years in some breeds . Much of this divergence is expected to be the result of artificial selection, because major differences in size and shape are known to have occurred during the domestication of cattle when compared to the extinct wild aurochs . These breeds each have a specification of what the animals should look like and how they should perform that includes yield and fatness. One would therefore expect that some of the loci associated with these traits should show strong differences in genotype frequency between breeds, which would show up in a calculation of FST. It is interesting that of the SNP with divergent FST five out of six of those with effects on more than one of these traits have effects that are consistent with the known correlations between RFI, yield and IMF.
Examination of a region with enough density of SNP shows that association to the trait is not just found for markers with high FST. Associations are found for SNP that do not have highly skewed allele frequencies, and associations are not just limited to a particular subsection of the sample. This suggests that these results are not merely a spurious intersection of high FST and trait association. Functional evaluation of these and other SNP will be needed to show where the causative mutations lie in this region. Full genetic evaluation of these regions will require more SNP, larger samples and more breeds. Nevertheless, our data show that signatures of selection are a useful adjunct to trait mapping.
The region associated with the peak in FST on bovine chromosome 2 contains several genes that have been associated with selective sweeps in humans. The genes R3HDM1 (R3H domain containing 1) and ZRANB3 (Zinc finger, RAN binding domain containing 3) are associated with these cattle SNP. The functions of these genes are inferred based on their sequence structures, and references to them point to the bioinformatics literature (e.g. Gene Ontology AmiGO http://amigo.geneontology.org/ R3HDM1 GO:0004866 endoproteinase inhibitor checked 5 May 2008 ). For these particular SNP with high FST values, most breeds are homozygous for the same allele, but the Hereford, the Santa Gertrudis and the Belmont Red have a moderate allele frequency of the alternative allele. The Hereford breed has been selected for efficiency and growth from its inception as a breed in the 18th century and continues to be more efficient than other British beef breeds . Polymorphisms in this region may therefore contribute to the differences seen in the Hereford. This region has long been known to be associated with positive selection in humans due to the lactase gene (LCT) and human adaptation to drinking milk in adulthood [9, 10]. It is unlikely that cattle are selected for lactase persistence because breeding stock are invariably weaned. Recently, R3HDM1  was also shown to be under positive selection in European humans, it is not divergent just through hitchhiking to LCT. These results in cattle provide a starting point for the kinds of trait that might be associated with the signal of selection in humans.
Comparing the Australian and Bovine HapMap samples we found differences in the presumptive selective signatures when different breeds or SNP are used. This leads to the conclusion that a full description of signatures of selection in a species will require a large number of populations to be sampled, not just those that have the most divergent genotypes or phenotypes. In addition, SNP obtained from different genetic sources should also be used. A large amount of the variance in FST across the genome is due to differences in SNP composition between studies, which suggests that evidence for selection in a region depends on the SNP that are included in a study. It also suggests that when the same genetic region is compared in different studies the same SNP should be used otherwise a large difference between selection signatures may be reported. These results also suggest that genome wide studies of selection are a useful adjunct to genome wide association studies. We found that useful signatures of selection can be obtained even when many of the sub-samples are not particularly divergent or cannot be clearly separated using a principal component analysis. If a sample used in a genome wide association study was genetically stratified, this stratification could be used to analyse gene frequency or FST distributions in sections of the genome in different breeds or populations. Such an analysis may provide evidence for a selective signature or may even help to identify which populations are likely to contain the QTL even before functional nucleotide polymorphisms were identified. While many real QTL will not be accompanied by a signal of selection, such a signal will indicate a clear relationship between selection on some aspect of the phenotype and the effects of variation at a locus. The identification of the trait under selection will require animals of the appropriate breed or group measured for a wide range of traits. The use of evidence of presumptive selection on a genome wide scale means that panels of SNP will need to be several fold larger in cattle than currently available. Although several panels are now available that include 3.4–5.4 × 104 SNP in cattle (eg. http://www.illumina.com bovinesnp50, [14, 50]), given the variation in FST with distance between SNP, two to three times as many evenly spaced SNP will be needed to study signals of selection in all regions in detail.
This manuscript is a companion paper to the Bovine HapMap project. We thank the Bovine HapMap Consortium for access to the Bovine HapMap genotypes prior to publication. The comments of three anonymous reviewers greatly improved the analysis and manuscript. We thank the owners of the Beef CRC database, Commonwealth Scientific and Industrial Research Organization, the Queensland Department of Primary Industries and Fisheries, the New South Wales Department of Primary Industries and the University of New England, for access to the records and DNA of the Australian samples. Dairy Australia, and the members of the Cooperative Research Agreement, CSIRO, QDPI&F, NSWDPI and UNE, Meat and Livestock Australia and Catapult Genetics, co-funded the collection of genotypes.
- Gibbs RA, Belmont JW, Hardenbol P, Willis TD, Yu FL, Yang HM, Ch'ang LY, Huang W, Liu B, Shen Y, et al: The International HapMap Project. Nature. 2003, 426 (6968): 789-796. 10.1038/nature02168.View ArticleGoogle Scholar
- Weir BS, Cardon LR, Anderson AD, Nielsen DM, Hill WG: Measures of human population structure show heterogeneity among genomic regions. Genome Res. 2005, 15: 1468-1476. 10.1101/gr.4398405.PubMed CentralView ArticlePubMedGoogle Scholar
- Ozaki K, Ohnishi Y, Iida A, Sekine A, Yamada R, Tsunoda T, Sato H, Sato H, Hori M, Nakamura Y, et al: Functional SNPs in the lymphotoxin-α gene that are associated with susceptibility to myocardial infarction. Nat Genet. 2002, 32 (4): 650-654. 10.1038/ng1047.View ArticlePubMedGoogle Scholar
- Kruglyak L: Prospects for whole-genome linkage disequilibrium mapping of common disease genes. Nat Genet. 1999, 22 (2): 139-144. 10.1038/9642.View ArticlePubMedGoogle Scholar
- Boerwinkel E, Chakraborty R, Sing CF: The use of measured genotype information in the analysis of quantitative phenotypes in man. I. Models and analytical methods. Ann Hum Genet. 1986, 50: 181-194. 10.1111/j.1469-1809.1986.tb01037.x.View ArticleGoogle Scholar
- Risch N, Merikangas K: The future of genetic studies of complex human diseases. Science. 1996, 273 (5281): 1516-1517. 10.1126/science.273.5281.1516.View ArticlePubMedGoogle Scholar
- Geldermann H: Investigations on inheritance of quantitative characters in animals by gene markers. I. Methods. Theor Appl Genet. 1975, 46: 319-330. 10.1007/BF00281673.View ArticlePubMedGoogle Scholar
- Ross-Ibarra J, Morrell PL, Gaut BS: Plant domestication, a unique opportunity to identify the genetic basis of adaptation. Proc Natl Acad Sci (USA). 2007, 104 (suppl 1): 8641-8648. 10.1073/pnas.0700643104.View ArticleGoogle Scholar
- Bersaglieri T, Sabeti PC, Patterson N, Vanderploeg T, Schaffner SF, Drake JA, Rhodes M, Reich DE, Hirschhorn JN: Genetic signatures of strong recent positive selection at the lactase gene. Am J Hum Genet. 2004, 74: 1111-1120. 10.1086/421051.PubMed CentralView ArticlePubMedGoogle Scholar
- Tishkoff SA, Reed FA, Ranciaro A, Voight BF, Babbitt CC, Silverman JS, Powell K, Mortensen HM, Hirbo JB, Osman M, et al: Convergent adaptation of human lactase persistence in Africa and Europe. Nat Genet. 2007, 39 (1): 31-40. 10.1038/ng1946.PubMed CentralView ArticlePubMedGoogle Scholar
- WTCCC: Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007, 447 (7 June 2007): 661-678.Google Scholar
- McDonald JH: Detecting natural selection by comparing geographic variation in protein and DNA polymorphisms. Non-Neutral Evolution. Edited by: Golding B. 1994, New York: Chapman and Hall, 88-100.View ArticleGoogle Scholar
- Frazer KA, Ballinger DG, Cox DR, Hinds DA, Stuve LL, Gibbs RA, Belmont JW, Boudreau A, Hardenbol P, Leal SM, et al: A second generation human haplotype map of over 3.1 million SNPs. Nature. 2007, 449 (7164): 851-U853. 10.1038/nature06258.View ArticlePubMedGoogle Scholar
- Gibbs RA, Taylor JF, Van Tassell CP, Barendse W, Eversole KA, Gill CA, Green RD, Hamernick DL, Kappes SM, Lien S, et al: Genome wide survey of SNP variation uncovers the genetic structure of cattle breeds. Science. 2009, 324 (5926): 528-532. 10.1126/science.1167936.View ArticlePubMedGoogle Scholar
- Cavalli-Sforza LL: Population structure and human evolution. Proc R Soc Lond Ser B-Biol Sci. 1966, 164: 362-379. 10.1098/rspb.1966.0038.View ArticleGoogle Scholar
- Sabeti PC, Reich DE, Higgins JM, Levine HZP, Richter DJ, Schaffner SF, Gabriel SB, Platko JV, Patterson NJ, McDonald GJ, et al: Detecting recent positive selection in the human genome from haplotype structure. Nature. 2002, 419: 832-837. 10.1038/nature01140.View ArticlePubMedGoogle Scholar
- Voight BF, Kudaravalli S, Wen XQ, Pritchard JK: A map of recent positive selection in the human genome. PLoS Biology. 2006, 4 (3): e72-10.1371/journal.pbio.0040072.PubMed CentralView ArticlePubMedGoogle Scholar
- Nielsen R, Williamson S, Kim Y, Hubisz MJ, Clark AG, Bustamente C: Genomic scans for selective sweeps using SNP data. Genome Res. 2005, 15: 1566-1575. 10.1101/gr.4252305.PubMed CentralView ArticlePubMedGoogle Scholar
- Barendse W, Reverter A, Bunch RJ, Harrison BE, Barris W, Thomas MB: A validated whole genome association study of efficient food conversion. Genetics. 2007, 176 (3): 1893-1905. 10.1534/genetics.107.072637.PubMed CentralView ArticlePubMedGoogle Scholar
- Wright S: The genetical structure of populations. Ann Eugen. 1951, 15: 323-354.View ArticlePubMedGoogle Scholar
- Lewontin RC, Krakauer J: Distribution of gene frequency as a test of the theory of the selective neutrality of polymorphisms. Genetics. 1973, 74 (1): 175-195.PubMed CentralPubMedGoogle Scholar
- Hanotte O, Bradley DG, Ochieng JW, Verjee Y, Hill EW, Rege JEO: African pastoralism: Genetic imprints of origins and migrations. Science. 2002, 296 (5566): 336-339. 10.1126/science.1069878.View ArticlePubMedGoogle Scholar
- Upton W, Burrow HM, Dundon A, Robinson DL, Farrell EB: CRC breeding program design, measurements and database: methods that underpin CRC research results. Aust J Exp Agr. 2001, 41: 943-952. 10.1071/EA00064.View ArticleGoogle Scholar
- Perry D, Shorthose WR, Ferguson DM, Thompson JM: Methods used in the CRC program for the determination of carcass yield and beef quality. Aust J Exp Agr. 2001, 41: 953-957. 10.1071/EA00092.View ArticleGoogle Scholar
- Sharing data from large-scale biological research projects: A system of tripartite responsibility. [http://www.genome.gov/Pages/Research/WellcomeReports0303.pdf]
- Hardenbol P, Yu FL, Belmont J, MacKenzie J, Bruckner C, Brundage T, Boudreau A, Chow S, Eberle J, Erbilgin A, et al: Highly multiplexed molecular inversion probe genotyping: over 10,000 targeted SNPs genotyped in a single tube assay. Genome Res. 2005, 15 (2): 269-275. 10.1101/gr.3185605.PubMed CentralView ArticlePubMedGoogle Scholar
- Hawken RJ, Barris WC, McWilliam SM, Dalrymple BP: An interactive bovine in silico SNP database (IBISS). Mamm Genome. 2004, 15 (10): 819-827. 10.1007/s00335-004-2382-4.View ArticlePubMedGoogle Scholar
- Venables WN, Ripley BD: Modern Applied Statistics with S-PLUS. 2000, New York: Springer Verlag New York, Inc, 3Google Scholar
- Morrison DF: Multivariate Statistical Methods. 1976, Tokyo: McGraw-Hill Kogakusha, Ltd, SecondGoogle Scholar
- Patterson N, Price AL, Reich D: Population structure and eigenanalysis. PLoS Genet. 2006, 2 (12): e190-10.1371/journal.pgen.0020190.PubMed CentralView ArticlePubMedGoogle Scholar
- Anon: S-Plus 5 for Unix Guide to Statistics. 1998, Seattle: Mathsoft Data Analysis Products DivisionGoogle Scholar
- Weir BS, Cockerham CC: Estimating F-statistics for the analysis of population structure. Evolution. 1984, 38 (6): 1358-1370. 10.2307/2408641.View ArticleGoogle Scholar
- Maynard-Smith J, Haigh J: The hitch-hiking effect of a favourable gene. Genet Res. 1974, 23: 23-35.View ArticleGoogle Scholar
- Weir BS: Genetic Data Analysis II. 1996, Sunderland, MA: Sinauer Associates, Inc, SecondGoogle Scholar
- Efron B: Better bootstrap confidence intervals. J Am Stat Assoc. 1987, 82 (397): 171-185. 10.2307/2289144.View ArticleGoogle Scholar
- Barendse W, Harrison BE, Hawken RJ, Ferguson DM, Thompson JM, Thomas MB, Bunch RJ: Epistasis between calpain 1 and its inhibitor calpastatin within breeds of cattle. Genetics. 2007, 176 (4): 2601-2610. 10.1534/genetics.107.074328.PubMed CentralView ArticlePubMedGoogle Scholar
- Barendse W, Reverter-Gomez A: A method for assessing traits selected from longissimus dorsi peak force, intramuscular fat, retail beef yield and net feed intake in bovine animals. WO2007012119. Australia. 2007, 1-224.Google Scholar
- Gilmour AR, Thompson R, Cullis BR: Average information REML: An efficient algorithm for variance parameter estimation in linear mixed models. Biometrics. 1995, 51 (4): 1440-1450. 10.2307/2533274.View ArticleGoogle Scholar
- Fisher RA: The correlation between relatives on the supposition of Mendelian inheritance. Trans Roy Soc Edin. 1918, 52: 399-433.View ArticleGoogle Scholar
- Efron B, Tibshirani R: Statistical data analysis in the computer age. Science. 1991, 253: 390-395. 10.1126/science.253.5018.390.View ArticlePubMedGoogle Scholar
- Hill WG, Robertson A: Linkage disequilibrium in finite populations. Theor Appl Genet. 1968, 38: 226-231. 10.1007/BF01245622.View ArticlePubMedGoogle Scholar
- Reverter A, Johnston DJ, Ferguson DM, Perry D, Goddard ME, Burrow HM, Oddy VH, Thompson JM, Bindon BM: Genetic and phenotypic characterisation of animal, carcass, and meat quality traits from temperate and tropically adapted beef breeds. 4. Correlations among animal, carcass, and meat quality traits. Aust J Agric Res. 2003, 54 (2): 149-158. 10.1071/AR02088.View ArticleGoogle Scholar
- Gregory KE, Cundiff LV, Koch RM: Genetic and phenotypic (co)variances for growth and carcass traits of purebred and composite populations of beef cattle. J Anim Sci. 1995, 73: 1920-1926.PubMedGoogle Scholar
- Robinson DL, Oddy VH: Genetic parameters for feed efficiency, fatness, muscle area and feeding behaviour of feedlot finished beef cattle. Livest Prod Sci. 2004, 90 (2–3): 255-270. 10.1016/j.livprodsci.2004.06.011.View ArticleGoogle Scholar
- Allen RL: Domestic Animals. 1856, New York: C.M. Saxton & CompanyGoogle Scholar
- Zeunier FE: A History of Domesticated Animals. 1963, London: Hutchinson & Co., LtdGoogle Scholar
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al: Gene Ontology: tool for the unification of biology. Nat Genet. 2000, 25 (1): 25-29. 10.1038/75556.PubMed CentralView ArticlePubMedGoogle Scholar
- Gregory KE, Swiger LA, Sumption LJ, Koch RM, Ingalls JE, Rowden WW, Rothlisberger JA: Heterosis effects on growth rate and feed efficiency of beef steers. J Anim Sci. 1966, 25: 299-310.Google Scholar
- Sabeti PC, Varilly P, Fry B, Lohmueller J, Hostetter E, Cotsapas C, Xie XH, Byrne EH, McCarroll SA, Gaudet R, et al: Genome-wide detection and characterization of positive selection in human populations. Nature. 2007, 449 (7164): 913-U912. 10.1038/nature06250.PubMed CentralView ArticlePubMedGoogle Scholar
- Charlier C, Coppieters W, Rollin F, Desmecht D, Agerholm JS, Cambisano N, Carta E, Dardano S, Dive M, Fasquelle C, et al: Highly effective SNP-based association mapping and management of recessive defects in livestock. Nat Genet. 2008, 40 (4): 449-454. 10.1038/ng.96.View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.