A genome-wide SNP-association study confirms a sequence variant (g.66493737C>T) in the equine myostatin (MSTN) gene as the most powerful predictor of optimum racing distance for Thoroughbred racehorses

Background Thoroughbred horses have been selected for traits contributing to speed and stamina for centuries. It is widely recognized that inherited variation in physical and physiological characteristics is responsible for variation in individual aptitude for race distance, and that muscle phenotypes in particular are important. Results A genome-wide SNP-association study for optimum racing distance was performed using the EquineSNP50 Bead Chip genotyping array in a cohort of n = 118 elite Thoroughbred racehorses divergent for race distance aptitude. In a cohort-based association test we evaluated genotypic variation at 40,977 SNPs between horses suited to short distance (≤ 8 f) and middle-long distance (> 8 f) races. The most significant SNP was located on chromosome 18: BIEC2-417495 ~690 kb from the gene encoding myostatin (MSTN) [Punadj. = 6.96 × 10-6]. Considering best race distance as a quantitative phenotype, a peak of association on chromosome 18 (chr18:65809482-67545806) comprising eight SNPs encompassing a 1.7 Mb region was observed. Again, similar to the cohort-based analysis, the most significant SNP was BIEC2-417495 (Punadj. = 1.61 × 10-9; PBonf. = 6.58 × 10-5). In a candidate gene study we have previously reported a SNP (g.66493737C>T) in MSTN associated with best race distance in Thoroughbreds; however, its functional and genome-wide relevance were uncertain. Additional re-sequencing in the flanking regions of the MSTN gene revealed four novel 3' UTR SNPs and a 227 bp SINE insertion polymorphism in the 5' UTR promoter sequence. Linkage disequilibrium was highest between g.66493737C>T and BIEC2-417495 (r2 = 0.86). Conclusions Comparative association tests consistently demonstrated the g.66493737C>T SNP as the superior variant in the prediction of distance aptitude in racehorses (g.66493737C>T, P = 1.02 × 10-10; BIEC2-417495, Punadj. = 1.61 × 10-9). Functional investigations will be required to determine whether this polymorphism affects putative transcription-factor binding and gives rise to variation in gene and protein expression. Nonetheless, this study demonstrates that the g.66493737C>T SNP provides the most powerful genetic marker for prediction of race distance aptitude in Thoroughbreds.


Background
Thoroughbred horses have been selected for structural and functional variation contributing to speed and stamina during the three century development of the breed. The International Federation of Horseracing Authorities recognizes five distance categories: Sprint (5-6.5 furlongs [f], ≤ 1,300 m), Mile (6.51-9.49 f, 1,301-1,900 m), Intermediate (9.5-10.5 f, 1,901-2,112 m), Long (10.51-13.5 f, 2,114-2,716 m) and Extended (> 13.51 f, > 2,717 m) races (http://www. horseracingintfed.com) and it is widely recognized among horse breeders that variation in physical and physiological characteristics are responsible for variation in individual aptitude for race distance [1]. Although environment and training may contribute to the race distance for which a horse is best suited, the genetic contribution to the ability to perform optimally at certain distances is large; the heritability of best distance among Australian racehorses has been estimated as 0.94 ± 0.03 [2].
A principal characteristic contributing to the ability of a Thoroughbred to perform well in short distance, sprint races is the extent and maturity of the skeletal musculature. Sprinters are generally shorter, stockier animals with greater muscle mass than animals suited to endurance performance, and generally mature earlier.
Performance aptitude for speed and stamina has also been associated with muscle fibre type phenotypes [3,4] and metabolic adaptations to training [5]. Variation in cardiovascular function contributing to aerobic capacity may also play a role in distinguishing individuals suited to shorter or longer distance races.
We have previously reported a sequence polymorphism (g.66493737C>T) in the equine myostatin (MSTN) gene strongly associated (P = 4.85 × 10 -8 ) with optimum racing distance in Thoroughbred racehorses [6]. In several mammalian species, including cattle, sheep, dogs and horses, muscle hypertrophy phenotypes are associated with sequence variants in the MSTN gene [7][8][9][10][11]. Among horses that compete preferably in short distance (≤7 f ) races requiring exceptional speed, the C allele is twice as common than among horses that perform optimally in longer distance (>8 f ) races that require more stamina (0.72 and 0.36 respectively). On average the optimum racing distance for C:C horses was 6.2 ± 0.8 f, for C:T horses was 9.1 ± 2.4 f and for T:T horses was 10.5 ± 2.7 f. Furthermore, C:C horses have significantly greater muscle mass than T:T horses at two-years-old.
Skeletal muscle phenotypes clearly play a role in distinguishing distance aptitude, and there is a strong effect of MSTN genotype on distance [6]. However, heretofore, the effects of additional nuclear gene variants that may contribute to equine performance-related phenotypes have not been investigated. Therefore, we performed a genomewide SNP-association study using the EquineSNP50 Bead Chip genotyping array in a cohort of elite race winning Thoroughbred horses. Animals were separated into two distinct phenotypic cohorts comprising short distance (≤8 f ) and middle-long distance (>8 f ) race winners and genetic associations were evaluated using best race distance as a quantitative phenotype. This study was designed to identify additional genetic loci as indicators of race distance aptitude and to establish whether variation at the g.66493737C>T SNP was associated with inter-locus epistatic effects for race distance performance.

Ethics
This work has been approved by the University College Dublin, Ireland, Animal Research Ethics Committee.

Study animals and cohorts
A repository of registered Thoroughbred horse blood or hair samples (n > 1,400) was collected from stud farms, racing yards and sales establishments in Ireland, Great Britain and New Zealand during 1997 to 2008. Each sample was categorized based on retrospective racecourse performance records. Only horses with performance records in Flat races were included in the study. The study cohort comprised elite Thoroughbreds that had won at least one Group race (Group 1, Group 2 or Group 3) or a Listed race-the highest standard and most valuable elite Flat races are known as Group (Stakes) races and Listed races are the next in status. Only elite race winning horses were included as elite races are most likely to reflect the truest test for distance. Race records were derived from three sources [Europe race records: The Racing Post on-line database http://www.racingpost.co.uk; Australasia and South East Asia race records: Arion Pedigrees http://www.arion.co. nz; North America race records: Pedigree Online Thoroughbred database http://www.pedigreequery.com].
Each sample was assigned a best race distance which was defined as the distance (furlongs, f) of the highest grade of race won [note: 1 furlong = 1/8 mile = 201.2 meters]. When multiple races of the same grade were won, then the distance of the most valuable race, in terms of prize money, was used. A set of elite Thoroughbred samples (n = 118) was selected from the repository, mostly comprising samples procured in Ireland and Great Britain (i.e. n = 5 samples [n = 3 ≤ 8 f, n = 2 >8 f ] were collected in New Zealand); though some had won their best race in North America. Animals with excessive consanguinity (within two generations) were avoided and over-representation of popular sires within the pedigrees was minimized as far as possible. One hundred and seven sires were represented in the total sample set. Genomic DNA was extracted from either fresh whole blood or hair samples using a modified version of a standard phenol/chloroform method [12] or the Maxwell 16 automated DNA purification system (Promega, WI, USA). DNA samples were quantified using Quant-iT PicoGreen dsDNA kits (Invitrogen, Carlsbad, CA) according to the manufacturer's instructions and the DNA concentrations were adjusted to 20 ng/μl.
For the case-control investigation we compared two cohorts: samples were subdivided into short (≤ 8 f, n = 68) and middle-long (> 8 f, n = 50) distance elite race winning cohorts (Table 1).

Genotyping and quality control
Samples were genotyped using EquineSNP50 Genotyping BeadChips (Illumina, San Diego, CA). This array contains approximately 54,000 SNPs ascertained from the EquCab2 SNP database of the horse genome [13] and has an average density of one SNP per 43.2 kb. Genotyping was performed by AROS Applied Biotechnology AS, Denmark. The samples that were genotyped for this study were a subset of n = 187 samples genotyped in two separate batches (Batch 1, n = 96; Batch 2, n = 91). We included four pairs of duplicate samples in Batch 2 for QC purposes and observed greater than 99.9% concordance in the four pairs. In total, we successfully genotyped 53,795 loci. All samples had a genotyping rate of greater than 90%. We omitted SNPs which had a genotyping completion rate of less than 90%, were monomorphic or had minor allele frequencies (MAF) less than 5% in our samples from further analysis. We omitted 12,818 SNPs leaving 40,977 SNPs in our working build of the data and the overall genotype completion rate was 99.8%.

Statistical analyses
All statistical analyses, including tests of association were performed using PLINK Version 1.05 [14]. We compared genotype frequencies in short and middlelong distance cohorts, testing for trait association using χ 2 tests with two degrees of freedom. To test for population stratification, the pairwise identity-by-state (IBS) distance was calculated for all individuals. A permutation test was performed to investigate IBS differences among the short and middle-long distance cohorts. The linear regression model was used to evaluate quantitative trait association using best race distance (f) as the phenotype. We report uncorrected P-values (P unadj. ) and P-values following correction for multiple testing using the Bonferroni method (P Bonf. ). Manhattan and Q-Q plots were generated in R using a modified version of code. The regional association plot was generated in R using a modified version of code available at http:// www.broadinstitute.org.
Cohort-based association (short vs middle-long distance) and quantitative trait association tests were also performed for the g.66493737C>T SNP [6] and a novel 5'UTR MSTN SINE insertion (Ins227bp) polymorphism identified in this study. In addition, an analysis of genome-wide epistasis was performed in which the g.66493737C>T SNP was tested against all SNPs on the EquineSNP50 Genotyping BeadChip for epistatic interactions influencing best race distance. This test involved a linear regression analysis to investigate whether gene by gene interactions had a significant influence on best race distance. Linkage disequilibrium (LD) between g.66493737C>T and Ins227bp and between g.66493737C>T and all chromosome 18 SNPs on the EquineSNP50 Genotyping BeadChip was quantified as r 2 . A visual representation of haplotype blocks across a 1.7 Mb region on chromosome 18 was generated using Haploview [15,16].

Re-sequencing MSTN flanking sequences
PCR primers were designed to cover~2 kb of the 5'UTR and~2 kb of the 3' UTR of MSTN genomic sequence using the PCR Suite extension to the Primer3 web-based primer design tool [17,18] (Table S1). Fifteen unrelated Thoroughbred DNA samples (g.66493737C>T, n = 5 C:C; n = 5 C:T, n = 5 T:T) were included in a resequencing panel to identify novel sequence variants. Bidirectional DNA sequencing of PCR products was performed by Macrogen Inc. (Seoul, Korea) using AB 3730xl sequencers (Applied Biosystems, Foster City, CA). Sequence variants were detected by visual examination of sequences following alignment using Consed version 19.0 [19].

Bioinformatics
The software tool MatInspector [20] was used to search for transcription factor binding site consensus sequences present in 300 bp of the MSTN 5' UTR region in which a novel SINE insertion (Ins227bp) polymorphism was detected. To investigate possible microRNA (miRNA) regulation of MSTN gene expression we screened the equine MSTN gene and flanking sequences for putative miRNA binding sites. A list of 407 predicted equine miRNAs [21] were inputted into the online tool DIANA microtest http://diana.pcbi.upenn.edu/cgi-bin/micro_t. cgi and a 14.7 kb segment containing the equine MSTN gene and~5 kb of upstream and downstream sequence was inputted as the target sequence. SNPInspector [20] was used to investigate transcription factor binding sites at the g.66493737C>T locus.
Genotyping the Chr18g.66495327Ins227bp66495326 (Ins227bp) polymorphism A PCR-based assay for allele size discrimination was used to genotype the Ins227bp polymorphism in n =

Genome-wide SNP-association study
In a cohort-based genotype-phenotype investigation we compared two cohorts: short (≤8 f ) and middle-long (>8 f ) distance elite race winners. The GWAS results, sorted by chromosome, are shown in Figure 1. The most significant SNP was on chromosome 18 (BIEC2-417495, P unadj. = 6.96 × 10 -6 ) and five of the top ten SNPs were located together spanning a 2.4 Mb region on chromosome 18 (chr18:64725066-67186093). However, no SNP in this analysis reached genome-wide significance following correction for multiple-testing.
Pairwise IBS values were used to investigate population stratification between the short and middle-long cohorts. While on average phenotypically concordant pairs of individuals were more similar than phenotypically discordant pairs (P = 0.034), the overall difference between the two groups was negligible (< 0.0002).
Using the linear regression model, we considered best race distance as a quantitative phenotype and observed the same peak of association on chromosome 18 (chr18:65809482-67545806) (Figure 2; Additional File 1). The top eight SNPs encompassed a 1.7 Mb region on chromosome 18 ( Figure 3) and seven reached genomewide significance following correction for multiple testing (P Bonf. < 0.05). The most significant SNP was also the most significant in the cohort-based analysis: BIEC2-417495 (P unadj. = 1.61 × 10 -9 ; P Bonf. = 6.58 × 10 -5 ).

Candidate performance-associated genes
We investigated candidate genes in the 1.7 Mb (Chr18:65809482-67545806) region on chromosome 18 that encompassed the seven SNPs that reached genomewide significance. Eleven protein coding genes were

Polymorphism detection in equine MSTN flanking sequences
We previously identified SNPs in intron 1 of the equine MSTN gene by re-sequencing the coding and intronic sequence [6]. However, genomic sequence or structural variation in the flanking regions was not investigated. Therefore, for the present study we re-sequenced 2,251 bp (chr18:66494683-66496834) of the 5' UTR and 2,155 bp (chr18:66488052-66490207) of the 3' UTR of the MSTN gene (Additional File 2) and identified four novel SNPs in the 3' UTR and a SINE insertion polymorphism in the 5' UTR. An overview of sequence and structural variation in the equine MSTN gene and flanking sequences is provided in Additional File 3. Polymorphisms in the 3' UTR of the MSTN gene have been associated with muscle hypertrophy in sheep and are considered likely to function via creation of de novo target sites for the microRNAs (miRNA) miR-1 and miR-206 [22]. Therefore, using a set of equine miRNAs (n = 407) described by Zhou and colleagues [21] we investigated the presence of putative miRNA binding sites within~5 kb upstream and downstream flanking sequences of the MSTN gene. Five putative miRNA binding sites were identified, though none was polymorphic: i.e. no putative miRNA binding site was associated with any of the eight SNP alleles.
Following re-sequencing in the 5' UTR region of the MSTN gene, we identified a 227 bp insertion polymorphism at chr18:66495327-[Insertion227bp]-66495326 (henceforth referred to as Ins227bp), located 146 bp from the start of exon 1 (Exon1Start: 66495180). The insertion sequence is as follows: Figure 2 Manhattan plot of P-value for quantitative trait GWAS using best race distance as phenotype. The y-axis plots -log 10 (P-values) and the x-axis plots the physical position of the SNPs sorted by chromosome and chromosome position. A peak of association on chromosome 18 (chr18:65809482-67545806) encompassed a~1.7 Mb region (shown in Figure 3). Seven of the chromosome 18 SNPs remained significant following correction for multiple testing. The most significant SNP was BIEC2-417495 (P Bonf. = 6.58 × 10 -5 ).
A BLAST search identified the insertion sequence as a horse-specific repetitive DNA sequence element (SINE) known as ERE-1 [23]. Also, MatInspector analysis indicated that the insertion may disrupt an E-box motif.
In 14 of the 15 sequenced samples, the Ins227bp allele was in concordance with the C-allele at g.66493737C>T. As complete concordance was not observed, we genotyped a set of n = 165 samples to determine the extent of concordance between the Ins227bp and g.66493737C>T polymorphisms. We performed parallel association tests for the same set of samples to evaluate the relative performance of the two polymorphisms as predictors of optimum racing distance. The g.66493737C>T SNP performed better in an association test with best race distance (P = 5.24 × 10 -13 ) than the Ins227bp polymorphism (P = 5.54 × 10 -10 ). Analysis of the sequence surrounding g.66493737C>T indicated that alternate alleles may result in the gain of a putative Homeobox C8/Hox-3alpha transcription factor binding site and/or the disruption of putative Distal-less homeobox 3, E2F and Pdx1 transcription factor binding sites.

Linkage disequilibrium
Pairwise tests of linkage disequilibrium (LD) were performed between g.66493737C>T and Ins227bp, and between g.66493737C>T and the 1,373 chromosome 18 SNPs represented on the genotyping array. LD was highest between g.66493737C>T and BIEC2-417495 (r 2 = 0.86). LD between g.66493737C>T and Ins227bp was r 2 = 0.73. Seven discrete haplotype blocks were identified in the 1.7 Mb peak of association on chromosome 18. The g.66493737C>T SNP was included in block 3; BIEC2-417495 was included in block 6 ( Figure 4).

Discussion
We have previously described an association between optimum racing distance and a SNP (g.66493737C>T) in the equine MSTN gene in Thoroughbred Flat racehorses [6]. Candidate gene approaches are designed considering a priori hypotheses and do not allow the opportunity for evaluation of the effect of the gene in the context of the entire genome, nor do they allow for the identification of other genes contributing to the phenotype [24,25]. Therefore, employing a hypothesis-free approach we investigated genome-wide influences on optimum racing distance by conducting a genome-wide SNP-association study in a cohort of elite Thoroughbred racehorses.
The genomic region on chromosome 18 containing the MSTN gene was the highest ranked region in the and the Ins227bp polymorphism (circle) ranging from one SNP upstream and one SNP downstream of the seven SNPs significantly associated with optimum racing distance following correction for multiple testing. The y-axes plot -log 10 (P-values) for each SNP (diamonds) and r 2 (blue line) between g.66493737C>T and all other SNPs. The x-axis plots the physical position of each SNP in the region. The best SNP, g.66493737C>T, is indicated with a blue diamond. Each SNP is color coded according to the strength of LD with g.66493737C>T: r 2 ≥ 0.8, red; r 2 ≥ 0.5 < 0.8, orange; r 2 ≥ 0.2 < 0.5, yellow; r 2 < 0.2, white.
Hill et al. BMC Genomics 2010, 11:552 http://www.biomedcentral.com/1471-2164/11/552 GWAS for best racing distance, reaching genome-wide significance for a set of seven SNPs within a 1.7 Mb region. The best SNP (BIEC2-417495) and the second best SNP (BIEC2-417372) were 692 kb and 28 kb from the MSTN gene, respectively. We searched the region for other plausible candidate genes and identified the NGFI-A binding protein 1 (EGR1 binding protein 1) gene (NAB1) located~170 kb from BIEC2-417495. The product of the NAB1 gene is highly expressed in cardiac muscle and has been reported to be a transcriptional regulator of cardiac growth [26]. Its principal role is in its interaction with the early growth response 1 (EGR-1) transcriptional activator that is involved in regulation of cellular growth and differentiation [27].
We considered NAB1 as a strong candidate gene to influence an athletic performance phenotype as we have previously identified EGR-1 mRNA transcript alterations (+1.6-fold, P = 0.014) in skeletal muscle immediately following a bout of treadmill exercise in untrained Thoroughbred horses [28]. Twelve SNPs located within the NAB1 genomic sequence (chr18:g.66995249-67021729) are documented in the EquCab2 SNP database, and three are contained on the EquineSNP50 Genotyping BeadChip. After correction for multiple testing, there were no detectable associations between the three NAB1 SNPs and the trait (BIEC2-417453, P unadj. = 0.0007, rank 144; BIEC2-417454, P unadj. = 0.0012, rank 210; and BIEC2-417458, P unadj. = 0.0032, rank 421). Therefore, we did not further consider NAB1 as a potential major contributor to variation in optimum racing distance.
Results from analyses of gene expression generated since our initial report of an association between MSTN genomic variation and optimum racing distance in Thoroughbreds support the hypothesis that the MSTN gene is functionally relevant to racing performance variation. In a transcriptome-wide investigation using digital gene expression (DGE) technology, we identified the greatest alteration in mRNA abundance in transcripts from MSTN in Thoroughbred skeletal muscle following a ten-month period of exercise training. Seventy-four annotated transcripts were differentially expressed between pre-and post-training states and among the 58 genes with decreased expression, MSTN mRNA transcripts were the most significantly reduced (-4.2-fold, P = 0.0043) [29].
Consequently, we focused on comprehensively evaluating variation in the MSTN gene by re-sequencing~2 kb of the 3' and 5' flanking sequences. Four novel 3' UTR SNPs and a 227 bp SINE insertion (Ins227bp) polymorphism located 146 bp upstream of the coding region start site were identified. We investigated whether the 3' UTR SNPs may abrogate existing or create de novo putative miRNA binding sites, as has been described for MSTN influenced phenotypic variation in Texel sheep [22]. However, there was no evidence for alterations in putative miRNA binding sites. Next, because of the close proximity to the transcriptional start site, we considered the Ins227bp polymorphism as a strong functional candidate contributing to variation in racing performance. However, a comparative evaluation of association using the same set of samples (n = 165) demonstrated that the g.66493737C>T SNP displayed a stronger association (P = 5.24 × 10 -13 ) with best race distance than the Ins227bp polymorphism (P = 5.54 × 10 -10 ).
An evaluation of LD showed that the strongest association was between g.66493737C>T and the most significant SNP in the present study, BIEC2-417495. A comparison of trait association in the same set of samples (n = 118) confirmed the superior power of the g.66493737C>T SNP (P = 1.02 × 10 -10 ) in the prediction of best race distance when compared with BIEC2-417495 (P unadj. = 1.61 × 10 -9 ). The significance values and genotype frequencies for the top SNPs in the GWAS and the g.66493737C>T SNP are shown in Table 2. In addition, we investigated whether g.66493737C>T may interact with other SNPs represented on the EquineSNP50 genotyping array; however, no significant interaction was observed to influence best race distance (P > 0.0001 for all interactions). Therefore, the effect of genotype on racing phenotype is highly likely a result of the previously reported variation in the MSTN gene at locus g.66493737C>T.
It is important to note that the sample size used for the present study is relatively small. However, the results of the quantitative trait GWAS demonstrate that the sample size used was sufficient to detect a major genetic effect such as that manifested at the MSTN locus. A lower sample size requirement for GWAS in the Thoroughbred is supported by population genomics analyses of this population in comparison to other horse breeds. These demonstrate that the extent of LD in the Thoroughbred is significantly greater than that measured in other horse populations, being comparable to LD estimates in inbred dog breeds [13].The high LD in Thoroughbreds is a reflection of low effective population size, which enables detection of associations with smaller sample sizes.
The mechanism by which the g.66493737C>T sequence variant may affect the muscle phenotype in horses is not clear; however we propose a direct effect of the SNP on the control of myocyte development. Myostatin is a growth and differentiation factor (GDF8) that functions as a negative regulator of skeletal muscle mass development and results in hypertrophied muscle phenotypes in a range of mammalian species, including horse. Consistent with this role myostatin has been shown to repress the proliferation and differentiation of cultured myocytes [30][31][32]. The proliferation of myoblasts is determined by the control and progression of the cell cycle, a role which has been assigned to members of the E2F family of transcription factors [33]. The g.66493737C>T SNP is located within the sequence of a putative E2F transcription factor binding site in intron 1 of the MSTN gene. It may therefore be plausible to propose a mechanism by which allele-specific binding of E2F to myostatin influences the growth and development of myocytes following signalling from upstream effector proteins such as retinoblastoma protein [34]. Genotype-specific gene expression studies will shed light on the allele-specific effect on function.

Conclusion
This study represents the first genome-wide investigation of sequence variant association with an athletic performance phenotype in Thoroughbred racehorses. It complements a recent DGE transcriptome-wide investigation of functional responses to exercise training in Thoroughbred skeletal muscle, which identified MSTN mRNA transcripts as the most significantly altered following a ten month period of training. The present study provides clear evidence that the previously reported polymorphism in equine MSTN at locus g.66493737C>T is the most powerful genome-wide predictor of optimum racing distance in Thoroughbred horses.

Additional material
Additional file 1: Quantitative association test results for best race distance. Unadjusted and FDR corrected P-values are given.
Additional file 2: PCR and sequencing primers for re-sequencing MSTN flanking regions. Primers were designed to cover~2 kb of the 3' UTR and~2 kb of the 5' UTR of the MSTN gene.  A11: genotype frequency for homozygotes (allele 1) in the population (n = 118); A12: genotype frequency for heterozygotes; A22 genotype frequency for homozygotes (allele 2). Correction for multiple testing was not applied for g.66493737C>T; however, the association remains stronger (P Bonf. = 4.18 × 10 -6 ) after application of a correction factor.