Recent genomic heritage in Scotland
© Amador et al. 2015
Received: 7 November 2014
Accepted: 1 May 2015
Published: 6 June 2015
The Generation Scotland Scottish Family Health Study (GS:SFHS) includes 23,960 participants from across Scotland with records for many health-related traits and environmental covariates. Genotypes at ~700 K SNPs are currently available for 10,000 participants. The cohort was designed as a resource for genetic and health related research and the study of complex traits. In this study we developed a suite of analyses to disentangle the genomic differentiation within GS:SFHS individuals to describe and optimise the sample and methods for future analyses.
We combined the genotypic information of GS:SFHS with 1092 individuals from the 1000 Genomes project and estimated their genomic relationships. Then, we performed Principal Component Analyses of the resulting relationships to investigate the genomic origin of different groups. We characterised two groups of individuals: those with a few sparse rare markers in the genome, and those with several large rare haplotypes which might represent relatively recent exogenous ancestors. We identified some individuals with likely Italian ancestry and a group with some potential African/Asian ancestry. An analysis of homozygosity in the GS:SFHS sample revealed a very similar pattern to other European populations. We also identified an individual carrying a chromosome 1 uniparental disomy. We found evidence of local geographic stratification within the population having impact on the genomic structure.
These findings illuminate the history of the Scottish population and have implications for further analyses such as the study of the contributions of common and rare variants to trait heritabilities and the evaluation of genomic and phenotypic prediction of disease.
KeywordsGeneration Scotland Principal component analysis Genetic ancestry Admixture Rare variants Population structure
The Generation Scotland Scottish Family Health Study (GS:SFHS) is a family-based genetic epidemiology study which includes 23,960 participants in ~7,000 family groups from across Scotland. Participants were recruited by letter of invitation from general practitioner lists to provide a representative sampling of the population. There was no selection on the basis of medical status or history. All were interviewed and clinically assessed for a wide range of health-related traits and environmental covariates, and linked also to routine health records [1, 2]. Ten thousand of these participants, of whom ~6,000 are not known to be directly related, also have genotypic information for ~700 K SNPs. The cohort was designed as a resource for genetic and health-related research. So far, projects based on this cohort are underway to study the genetics of depression, the prediction of cardiovascular disease risk, or the role of specific genes in asthma, bronchitis and emphysema, but the potential uses of GS:SFHS are much wider (http://www.generationscotland.org/). The GS:SFHS cohort is a family-based study, and a fair proportion of individuals in the population will be related, some unknowingly. This structure will allow the shared variation between individuals within families to be disentangled into its genetic and environmental components and should facilitate accurate estimations of heritability. In addition, genome-wide association studies to be carried using the GS:SFHS cohort will need to adequately account for population substructure. Without proper correction, false-positive results can arise as a consequence of stratification differences, particularly between cases and controls, because of allele frequency differences or differences in LD patterns [3–5]. As genomic relationships are an integral part of the statistical methods used to unravel or utilise trait variation and affect their performance [5, 6] a detailed genomic description of the structure of GS:SFHS is a prerequisite to the application of these mapping and prediction methods. Furthermore, such description will shed some light on the demographic history, the existence and characterization of hidden ancestral structure and the amount and origin of the variability in the population.
Our aim was to provide an accurate genomic description of the GS:SFHS cohort, which could reflect as well the history of Scottish population. For that purpose, we developed several genomic approaches using the information of ~700 K SNPs in the individuals of the cohort. Our objectives were to: 1) place GS:SFHS in the context of other human populations by exploring their genetic variation and establish the ancestry of participants, 2) identify the extent and the origin of rare haplotypes in GS:SFHS individuals, 3) explore genetic differentiation within GS:SFHS, 4) analyse homozygosity in the GS:SFHS cohort, 5) identify the extent of geographic differentiation within the sample, and 6) apply the knowledge gained from these studies to identify a core set of samples to select the most appropriate for different future analyses.
These exhaustive analyses will not only reflect an accurate picture of the demographic history of Scotland, but also will have implications for our further studies using the GS:SFHS cohort in terms of the genomic differentiation found among its individuals, (e.g., introgression detected, homozygosity, etc.) which will help select the most appropriate groups of individuals for each future analysis. These studies could include a detailed analysis of the contributions of common and rare variants to trait heritabilities, haplotype mapping, and evaluation of genomic and phenotypic prediction of disease.
Placing Generation Scotland in a global genomic context
Origin, location, number of individuals and given values for latitude and longitude for the different populations in the GS + 1 kG data set
African ancestry individuals in SW US
Utah residents with N. and W. European ancestry
Han Chinese in Beijing
Han Chinese South
Colombian in Medellin, Colombia
Finnish individuals in Finland
British individuals in England and Scotland
Iberian populations in Spain
Japanese individuals in Tokyo, Japan
Luhya individuals in Webuye, Kenya
Mexican ancestry individuals in LA California
Puerto Rican in Puerto Rico
Tuscan individuals in Tuscany, Italia
Yoruba individuals in Ibadan, Nigeria
The first and second principal component separated the African (different coloured squares), East Asian (different coloured rhomboids) and European (different coloured circles) populations (Fig. 1a). American populations (different coloured triangles) spread over the plot between the three main groups (Africans, East Asians and Europeans). African ancestry individuals living in the US (ASW, red squares) were more spread towards the European populations than the two populations from Africa (Luhya and Yoruba) which remained separated in a more consistent group, owing to the history of European contributions to ASW. The East Asian populations formed a tight group clearly independent of the rest. When looking at the plot of GS:SFHS within this graph (Fig. 1b), it can be noticed that a very small but obvious proportion of GS:SFHS individuals leaned towards the African population, whereas a similarly small but also obvious proportion were closer to the East Asian populations. This analysis suggests that some GS:SFHS participants have mixed ethnic backgrounds. For example, 64 individuals have values for PC1 or PC2 more than 6 standard deviations away from the mean. They are plotted as open blue circles in Fig. 1b.
If we look at further PCs, we observed, as shown in Fig. 1, that the third and fourth PCs reflected some variation within the Mexican population from LA (MXL) and African origin populations, respectively (Fig. 1c). For both eigenvectors GS:SFHS individuals were located together with the 1 kG European populations.
The fifth PC separated the Italian population from Tuscany from the rest of the populations (Fig. 1d). However, some individuals from GS:SFHS clustered with this Italian group consistent with them having some similar ancestry.
In the sixth PC of the GRM of GS + 1 kG, a different pattern appears. For this PC the individuals from most of the populations are separated in three clear clusters, mostly detected in GS:SFHS because of its larger sample size. The SNPs causing this clustering are located in chromosome 8, in the 8p23.1 region. A known common inversion polymorphism is located in this region [7, 8] suggesting that it might be the cause of the clustering pattern we observe.
Generation Scotland in a European genomic context
Genetic structure due to rare alleles
To thoroughly explore the ancestries of the individuals in the GS:SFHS cohort and gain insight into the potential origins of the outliers we analysed the patterns of allele sharing described by genomic relationship coefficients. It is clear from the formulation of the standard estimator (1, see Methods and Additional file 1: Figure S2) that shared rare alleles can have a substantial influence on estimates in the GRM. To measure the influence of the rare alleles in the relationships between individuals of GS:SFHS we used three scores:  an individual marker score that measures how each marker contributes in an individual to its relationships with the rest of the population;  a pair marker score that measures how each marker affects the relationship between a particular pair of individuals; and  a rarity score to measure the overall amount of rare variants that an individual has. The details of these calculations are provided in the Methods section.
Genomic relationship coefficients between several pairs of individuals using different thresholds for the computation of the GRM
The peaks for the individual marker scores in each of the graphs (Fig. 3a and b) represent areas where the individuals carry some rare alleles (p ≤ 0.005). The SNPs causing the inflated relationship are represented by the pair marker score in Fig. 3c. The rare alleles that both individuals share are located in chromosomes 2, 4, 6 and 9 which demonstrate common peaks in Fig. 3a and b. Figure 3d shows the rarity score of individual 40280 plotted as a cumulative score, where number and magnitude of changes in the slope, as well as the total rarity score value, are greater than for individual 11786. The remaining pairs in Table 2 showed a similar pattern of sharing when plotting their individual and pair scores.
The same graphs are plotted in Additional file 1: Figure S4 for two randomly selected GS:SFHS individuals. The number of peaks is lower than in the previous plot, with a few sparse high values, and no peaks in the pair score. Both individuals’ rarity score are considerably smaller than the previous shown in Fig. 3.
Distribution of the peaks detected using the rarity scores in windows of 50 SNPs in the two groups of individuals (outliers and non-outliers)
Number of peaks per individual
Total coverage of peaks per individual (Mb)
Number of peaks per individual
Total coverage of peaks per individual (Mb)
These values can help characterise the two groups of individuals. The non-outliers have a few sparse and small peaks in their genome potentially reflecting a point mutation (in most of the cases the window score is due to one single marker at a low frequency). On the other hand, the individuals defined as outliers can reflect a different kind of ancestry: the size of the peaks suggests that these individuals have a more admixed background, due to having a recent foreign ancestor.
We checked the self-reported origin of the 74 individuals detected as outliers and although the amount of non-disclosed information is slightly higher than in the whole data set, the recorded origin of their grandparents did not indicate that they came from outside the UK.
Individual ancestries using ADMIXTURE
To provide additional evidence of our results, we analysed a subset of markers in approximate linkage equilibrium with the software ADMIXTURE  to estimate the proportion of ancestral populations in GS + 1 kG individuals assuming 3 ancestral populations, i.e., African, Asian and European (K = 3). Additional file 1: Figure S7 shows the proportion of each origin in the individuals of 1 kG and GS:SFHS. The correlation between the estimated proportion of African origin and the percentage of the genome covered by peaks was 0.94 showing that our estimates correspond well with percentage of African genes in the individuals (Additional file 1: Figure S8).
Areas of high linkage disequilibrium in Generation Scotland
The plots of PC three and/or four showed a three-cluster pattern for some of the chromosomes. In some cases the three groups were clearly distinguishable (chromosomes 4, 7, 8, 15), and in some others the pattern was created by combining both eigenvectors together or it became less clear (chromosomes 6, 10, 11, 12, 19, 21, 22). Sometimes two different clustering patterns could be seen in both the third and fourth PCs (chromosome 8, 12, 15) (Additional file 1: Figure S9).
The SNPs responsible for the PCs clustering were chromosome-wide for the first and second PCs for all chromosomes, showing a very similar pattern (except for chromosome 6 and 8) and to the clustering of individuals obtained for the original genome-wide PCA (Fig. 1b). This shows that the differences driven by the first two eigenvectors between all the individuals of GS:SFHS and the rest of the origins are genome-wide and consistent between chromosomes, representing differences across populations.
A summary of a literature review aiming to identify plausible causes of the other observed three-cluster patterns in the different PCs can be found in Additional file 2: Table S1, including genes located in the identified regions. These areas appear to be regions of high linkage disequilibrium, so that individuals inherited long haplotypes and that caused the clustering pattern observed. This linkage disequilibrium could be due to different reasons, such as selection, including past selective sweeps around new alleles/mutations, or long inversions, or chance. In chromosome 2, the SNPs causing the observed clustering for PC3 (Additional file 1: Figure S9) is likely due to the Lactase gene which has been shown to display clines of variation due to selection . In addition, the SNPs in the region 8p23.1 (chromosome 8 inversion) caused the pattern in PC 2 and 3 of chromosome 8; and in chromosome 6 the MHC was identified as the cause the clustering in PCs 1 to 6. This area has previously been suggested to display selection pressure associated variation . MHC associated SNPs showed a lot of influence in the whole genome PCA, accounting for the variation explained by several PCs. Some centromeric areas were also detected as responsible for the variance explained in several PCs. With respect of the other three group patterns detected, we found several genes located in the areas involved (Additional file 2: Table S1) that could be responsible for the clustering. In the case of chromosome 15 PC3, only one gene is located in the responsible area, ALDH1A2 (Aldehyde dehydrogenase 1 family, member A2).
Runs of homozygosity in Generation Scotland
In order to explore the patterns of homozygosity in GS:SFHS, we estimated the length and location of homozygous segments in each individual of the cohort (Additional file 1: Figure S10). Additional file 1: Figure S10a shows the distribution of the number of segments per individual and Additional file 1: Figure S10b the distribution of the total length of the homozygous segments per individual. Additional file 1: Figure S10c shows the number of ROHs compared to the total length of ROHs and Additional file 1: Figure S10d the proportion of individuals with one or more ROHs of a given length. The distribution of the homozygous segments in GS:SFHS is similar to those described in other studies in European populations [13, 14]. When comparing the results with those obtained for the controls of the Scottish Colon Cancer Study (SOCCS, 984 subjects from Scotland not known to have colon cancer) we observed that the percentage of large ROHs described in GS:SFHS is slightly higher than that found in SOCCS, which is also a mainland Scottish population .
The analysis also revealed that one female individual carried a completely homozygous chromosome 1 suggesting an instance of uniparental disomy. Genomic information from other related individuals pointed out that she had inherited two copies of the paternal chromosome 1.
To evaluate the possible association between ROH and the rarity score in the GS:SFHS individuals, we performed a linear regression between the total length of homozygous segments or the number of runs per individual and the values for the rarity score. The rarity score showed a negative correlation with both total length and number of ROH (Additional file 1: Figure S11) suggesting that more homozygous individuals would have less rare haplotypes.
Population structure within Generation Scotland
Areas in Scotland, number of individuals in the cohort born in each of the areas, number of individuals with the four grandparents coming from that area, and values of latitude and longitude used for each of the areas in the regression analyses
Argyll & Bute
Dumfries & Galloway
Perth & Kinross
Results of the multiple linear regressions between the PC and the values of latitude and longitude in 6739 unrelated individuals of GS:SFHS
Results of the multiple linear regressions between the PC and the values of latitude and longitude of the grandparents in 1113 individuals of GS:SFHS
Impact of population structure and geographic location on health-related traits
To explore the impact of the region of “genomic” origin (defined using predicted longitude and latitude) and the area of residence of the individuals in a variety of health-related traits, we used two different mixed linear models. We explored the variance explained by a similarity matrix constructed using the values of predicted latitude and longitude (i.e., the values in the matrix represent “genomic” proximity between individuals). We fitted a mixed linear model as in  including a GRM and the similarity matrix (Geo), estimated using GCTA  the proportion of the phenotypic variance captured by each of these matrices, and compared them when including or not the current area of residence as a fixed effect. The results of the analysis are shown in Additional file 1: Table S3 for body mass index (BMI), fat, waist-to-hip ratio (WHR) and high density lipoprotein (HDL). In all scenarios, the matrix Geo explained a small but significant amount of the trait variance and this remained the case when current area of residence was included in the model.
We also explored the effects of sex, age, age2, predicted latitude and longitude (as covariates) and the current area of residence when fitted in a mixed linear model with a GRM. Results are shown in Additional file 1: Table S4 for the four traits. The results show that longitude is significant for fat and WHR with the effect being reduced values further east in both traits. Some of the areas of residence show also significant effects.
Historic and demographic events leave their signatures in the genomes of populations. These genomic marks allow us to track relationships, reconstruct introgression events and find different patterns in the populations that can be linked with structural variation or reflect areas in high linkage disequilibrium. This study provides a detailed picture of the genetic structure within the GS:SFHS cohort and some historic demographic events that have shaped the genomes of its individuals.
Our analyses showed some expected patterns together with some valuable information regarding GS:SFHS individuals. The first and second principal components of the genomic relationship matrix in the GS + 1 kG population separated the European from the African and East Asian populations respectively. They also showed that a few individuals from the Scottish population have a variable pattern of genes from different ancestries, with part of this group of individuals having more than a 4 % of non-European origin according to the rare allele analysis. These non-European regions are too large to be explained as random mutations or genotyping errors. It is widely accepted that the Eurasian populations were originated from a single “out of Africa” event , and that the African populations are genetically more diverse than European and Asian samples because of this bottleneck . This means that rare alleles in Europe may be common in Africa. Gabriel et al.  estimated that half of the human genome exists in blocks of 44 kb or larger in European populations with a maximum of 173 kb blocks. Even considering that the blocks in GS:SFHS could be larger, the length of the regions described by Gabriel et al.  still suggests that the “out of Africa” bottleneck cannot explain haplotypes of up to 7 Mb with rare variants, indicating that the outliers have a relatively recent ancestor likely coming from an African population. Also, Wall and Pritchard  reviewed several studies of haplotype blocks in human populations and showed the correlation between the recombination rate and the length of the regions. We checked the recombination rate in some of the regions that we detected and it was close or larger to the 1.2 cM/Mb expectation , which indicates that the long non-European regions we uncover are not the result of regions of low recombination. These individuals should probably be removed from the data set for subsequent analyses, such as common and rare variants contribution to heritability, haplotype mapping, genomic prediction of disease, etc., otherwise allele frequencies differences could bias results.
The rest of the population appears to be more homogeneous when analysing the results of the PCA and the rare allele analysis. They show an average proportion of peaks of 0.07 % and a more compact clustering pattern. This 0.07 % of peaks is likely to reflect, at least partially, point mutations that occurred in recent generations. Some of these mutations could also be genotyping errors, but we cannot differentiate them.
The pattern of rare alleles detected is highly influenced by the SNPs genotyped so it is likely that the ascertainment bias of the chip has an impact on our results. Whereas with a different chip different variants might have been detected as coming from other populations, we do not expect spurious results arising because of this in the rare allele analysis.
The correlation between the estimated proportion of African origin using ADMIXTURE and the percentage of the genome covered by peaks was high (0.94) showing that our estimates correspond well with percentage of African genes in the individuals (Additional file 1: Figure S8). Our estimates of Asian ancestry seem to be underestimated when compared to the ADMIXTURE results.
Despite the homogeneity of the majority of the population (i.e., excluding the outliers), it would probably still be useful to correct for some subjacent population structure in further analyses by including the eigenvectors as covariates in the models in future association studies or heritability estimations .
Regarding the homozygosity patterns, the distribution of the number of homozygous segments and their length in GS:SFHS was similar to that described in previous studies in other European populations (Figure S10) [13, 14]. However, the proportion of individuals carrying large ROHs was slightly higher than found previously in the SOCCS population , particularly for those ROH between 2.5 and 5 Mb (frequency in GS:SFHS 50 %, frequency in SOCCS <30 %). Rarity scores showed a negative correlation with the number and total length of ROH as expected, since the more admixed individuals are in general less homozygous.
We identified a group of individuals with Italian ancestry in GS:SFHS through their similarity with the Tuscan individuals in the 1000 Genomes populations (Fig. 1d). This is consistent with a large influx of Italian immigration to Scotland at the end of XIX century and beginning of XX century (1880–1920 Scotland’s Census). The Italians in GS:SFHS appear as a distinct group from the rest of the population in the PCA, but they do not seem to differ much when analysing their rare alleles. The values of their rarity scores do not appear as outliers in the population. This implies that there are not big differences in their allele frequencies when comparing to the rest of GS:SFHS (consistently with the fact that they are also a European population). As the architecture of allelic frequencies is the same, we can assume that the underlying architecture of complex traits will be very similar as well. We consider that these individuals could be included for most of the future analyses if the population structure is accounted for using whole genome marker information. Regarding the rest of the European populations, GS:SFHS individuals are located between Britain and Orcadian populations as expected and further away from more southern populations.
The genome-wide information in GS:SFHS allowed us to discover some other variation. The inversion in chromosome 8 is one of the largest polymorphic inversions found in humans (~4.5 Mb) with a frequency of ~20–50 % in European populations, ~59 % in the Yoruba and ~12–27 % in Asians [7, 8]. It can be predicted from genotypes [7, 22] and it was previously identified via principal components by Zou et al. . This information suggests that selection against crossovers in the region maintains the linkage disequilibrium pattern . Previous studies that detected this inversion, were capable of detecting a smaller (900 Kb) inversion in chromosome 17 (17q21) with a 20 % frequency in Europeans [12, 22, 23]. We did not find any signal for any of the SNPs in any of the principal components in chromosome 17 pointing at that area.
The chromosome 8 inversion proves that a linkage disequilibrium region can be detected using genotypes and principal component analysis. The similar clustering patterns in the rest of the chromosomes discovered through the chromosomal PCA suggest regions of between 260 Kb (15q21.3) and ~7.5 Mb (6p22.3-p21.32) in high linkage disequilibrium. This long haplotype patterns could be informative of selective sweeps in the past that have not been broken down by recombination. For some of these areas expected results pointed out the variation due to selection of the lactase gene (chromosome 2) or the Major Histocompatibility Complex in chromosome 6 . The other areas identified contain several genes, and it is difficult to pinpoint more accurately the specific gene or region that causes the observed clustering pattern, which could suggest some selective pressure over it.
As others, we have shown that the geographic differentiation between populations from different origins (Africa, East Asia, and Europe) is detected in the PCA but we have also found that the geographic origin within Scotland has a significant impact on the principal components. We were able to predict with relative high accuracy the origin of the individuals from their genomic data. The small number of samples from the islands (Western Isles, Orkney and Shetland), and hence their limited influence in the PCA, made that prediction of the origin of these individuals was somehow poorer than that of individuals from regions with more genotyped participants. Most of the families participating in the study were recruited in big cities (which usually gather people from different origins creating a non-homogeneous sample) and the information from the grandparents’ origin is often not complete. This could affect the results obtained from our origin prediction analysis, for which prediction was more efficient at separating individuals from the South-West from those of the North-East of Scotland (separating mostly Glasgow, Dundee and Aberdeen).
When we explored the impact of predicted location based on information from the regions of origin of the individual’s grandparents (i.e., “genomic” origin) and current place of residence on four health-related traits (BMI, fat, WHR and HDL), we observed that genomic origin had an impact on these traits. In the four cases, the similarity matrix based on geographic proximity significantly captured some phenotypic variance, irrespective of including current region of residence in the model. Also, when examining the effects of predicted latitude and predicted longitude we observed that longitude had a significant effect for fat and WHR but latitude did not, when both variables were jointly fitted. It has to be noticed that the distribution of the samples over Scotland makes latitude and longitude correlated to each other (r = 0.63). After adding to the model the current region of residence of the individuals, the predicted longitude of genomic origin was still significant, and similar results were obtained when we replaced current region of residence with region of birth (results not shown). Although these associations with genomic origin as predicted by grandparental birthplace and are not removed by adjustment for either an individual’s place of birth or current place of residence, it is possible that they are the result of undetected stratification in the sample. This could include, for example, persistence of cultural transmission of lifestyle or dietary habits associated with grandparental origins.
All these results will have an important impact over future studies that use the GS:SFHS cohort and correspond well with known Scottish demographic history. We have performed a thorough analysis of genomic data in GS:SFHS, applying standard methods such as the use of PCA, but also inspecting the results of such analyses in depth We have discovered some individuals with mixed ancestries that should be removed for future studies (e.g., GWAS) and we have characterised some other that can remain in the cohort.
The data were obtained from the Generation Scotland: Scottish Family Health Study (GS:SFHS) . Ethical approval for the study was given by the NHS Tayside committee on research ethics (reference 05/s1401/89). Governance of the study, including public engagement, protocol development and access arrangements, was overseen by an independent advisory board, established by the Scottish government. Research participants gave consent to allow both academic and commercial research.
Individuals were genotyped with the Illumina OMNIExpress chip (706,786 SNPs). We used GenABEL version 1.7-6  and PLINK version 1.07  to exclude SNPs that had a missingness >2 % and a Hardy-Weinberg Equilibrium test P < 10−6. Duplicate samples, individuals with gender discrepancies and those with more than 2 % missing genotypes were also removed. After this quality control, the data set consisted in 9889 individuals (4085 males and 5804 females) with multiple degrees of kinship (~6000 non related), genotyped for 646,127 SNP spread over the 22 autosomes. The recorded information about the locations of origin of these individuals is shown in Additional file 1: Figure S14.
In order to investigate ancestry, the GS:SFHS data set was merged to the data of 1092 individuals of different origins (Table 1) from the 1000 Genomes Project (1 kG) . The resulting data set (GS + 1 kG) had 10,981 individuals genotyped for 635,190 markers common to all populations spread over the 22 autosomes.
To further investigate the relationships of GS:SFHS with other European populations, another data set was created. 200 individuals (randomly selected) from Generation Scotland and the samples with European origin in the 1 kG Project (CEU, FIN, GBR, IBS and TSI, see Table 1) were merged with two new populations. 150 individuals from a Croatian population from the Dalmatian island of Korcula (KOR)  and 150 individuals from another Scottish population from the Orkney Islands (ORK) . Korcula received ethical approval from the Ethics Committee of the Medical School, University of Split and the NHS Lothian (South East Scotland Research Ethics Committee). The ORCADES study (referred as ORK), received ethical approval from the NHS Orkney Research Ethics Committee and North of Scotland Research Ethics Committee. All participants signed informed consent prior to participation.
This data set (GS + EU) consisted in 879 individuals genotyped for 97,648 markers in approximate linkage equilibrium and with minor allele frequencies (MAF) larger than 0.01.
Genomic relationship matrix
Relationships were estimated from the genotyped SNP data and summarised into a genomic relationship matrix (GRM) using GCTA version 1.13 . In the first instance, we used all autosomal SNP data that had passed quality control (regardless of MAF). In order to investigate the effect of the allele frequency spectrum of the SNPs used to compute the GRM on relationship estimation, we computed five extra GRMs with different SNP MAF thresholds in the merged population GS + 1 kG. The number of markers corresponding to the thresholds applied is shown in Additional file 1: Table S2. We also computed a new GRM in the GS:SFHS population, after removing areas in high linkage disequilibrium (LD). To do this, we removed all SNPs from chromosome 6 (to remove the influence of Major Histocompatibility Complex) and those located in area 8p23.1 (to remove the influence of a large polymorphic inversion). Then, we generated a pruned subset of markers in approximate linkage equilibrium by using the default indep-pairwise command in PLINK . We also excluded related individuals using the grm-cutoff command in GCTA  with a value of 0.025 (i.e., removing iteratively one of a pair of individuals with a relationship coefficient larger than 0.025). An extra GRM was also computed in the GS + EU population. These GRMs were used as input for the Principal Component Analyses described in the following section that aim to assign ancestral origins to GS:SFHS participants, and to establish the nature of information on relatedness conveyed by markers of different allele frequencies. In addition, to establish the contribution of each chromosome to the observed structure, twenty two extra GRMs (chromosomal GRMs) were calculated i.e., one for each autosome, using only the markers in one chromosome (from 1 to 22) without excluding any marker that had passed quality control irrespective of its frequency in the GS + 1 kG merged population.
Principal component analyses
Principal component analysis (PCA) is a widely used method to convert a set of observations for different (and possible related) variables into values of linearly uncorrelated variables called principal components or eigenvectors. If we apply it to the GRM, we can use it to determine, control and correct for population structure, as usually done in genome-wide association studies. Resulting clusters will reflect individuals that group together because of higher genomic similarities . We carried out an eigenanalysis of the GRM of the GS:SFHS population. In order to put into context the variation observed in GS:SFHS (and any potential outliers in the population, as per ), we used the populations in the 1 kG data as outgroups, and carried out a PCA of the GS + 1 kG population (of the GRM computed using all markers), and a PCA of the GS + EU population. In addition we performed 22 eigenanalysis using the 22 chromosomal GRM. All analyses were conducted in ACTA version 0.9 . We calculated and used in further analyses the first 20 eigenvectors or principal components (PCs) per analysis.
For some of the analyses, to locate the specific areas of the genome causing the different clustering patterns in the different PCs, genome-wide association analyses of the principal components were performed. We tested for association and estimated the effect of each SNP on the values observed for the eigenvectors of interest. In each analysis, the values of each principal component were analysed as a phenotype in a linear model fitting each SNP in turn and including sex, age and age2 as covariates using the linear function in PLINK v1.07 .
Genetic structure due to rare alleles
Where N is the total number of markers, x ij is the genotype of individual j at marker i (coded as 0, 1 or 2, representing the number of copies of the less frequent allele) and p i is the minor allele frequency (MAF) of marker i in the population.
This makes markers with a low MAF in GS:SFHS (smaller values of p i ) contribute to a greater extent to the rarity score. The rarest allele of those markers with low MAF in GS:SFHS are more likely to have come from a different population, and we hypothesise that they have come (“been introgressed”) into the Scottish population from another, distinct, population. To further analyse the structure of rare areas in the genome of GS:SFHS individuals, we estimated the number of rare peaks per individual as follows: we divided each chromosome in non-overlapping windows of 50 SNPs and we calculated the mean rarity score in the window as the mean of all the rarity score values of the all the markers in a window. We defined as rare peaks those windows with a mean rarity score larger than 50 (e.g., those where the harmonic mean allele frequency across the 50 loci in the window is less than approximately 0.02). We assumed that windows with a mean rarity score larger than 50 separated by less than 10 other windows (i.e., less than 500 SNP) belonged to the same rare peak that also included all the in-between windows. We tested several window lengths and thresholds and obtained similar results for all the analyses.
Estimation of individual ancestries using ADMIXTURE
We used the software ADMIXTURE  to estimate the proportion of ancestral populations in GS + 1 kG individuals. We reduced the number of markers in the GS + 1 kG dataset to remove linkage disequilibrium. A subset of markers (81,981 SNPs) was generated by using the default indep-pairwise command in PLINK. Then, we estimated the proportion African, Asian and European origin (K = 3) in all individuals of GS + 1 kG.
Runs of homozygosity
To analyse the pattern of homozygosity in the GS:SFHS cohort, a screen for runs of homozygosity (ROH) was performed using PLINK v1.07  with the same parameters as in McQuillan et al. . The program slides a moving a window of 5,000 Kb (with a minimum of 50 SNPs) and locates and estimates the length of stretches of homozygous genotypes across the genome. The analysis allows one heterozygote genotype (allowing for example for genotyping errors) and 5 missing genotypes per window for the region to still be considered homozygous. The maximum gap between consecutive SNPs to be considered in the same ROH was 100 Kb.
Influence of the geographic origin in the genomics of GS:SFHS
To evaluate the correlation between geographic origin of the individuals and the principal components in the analyses performed, we calculated the regression between the different values of the PCs and the values of latitude and longitude assigned to the origin of the individuals. We used the first 20 whole-genome PCs to develop two different sets of analyses. Individuals were given a value of latitude and longitude according to their birth place, as shown in Table 4, and the multiple regressions of each PCs over both values were calculated. We also extracted a subset of individuals with four grandparents having the same origin (1113 individuals). Fig. 6a shows the distribution of the origins over a map of Scotland. The size of the points represents the number of individuals with four grandparents from each area. We then performed a multiple linear regression between the values of the first 20 PCs in the 1113 individuals and the latitude and longitude of the grandparents’ origins. For these analyses the individuals detected as having Italian origin and those with mixed ethnic backgrounds detected in the GS + 1 kG PCA were removed in order to avoid bias in the geographic component. We also used the values of the PCs to predict latitude and longitude in these individuals by fitting a linear model regressing latitude or longitude against the values of the PCs for each individual. All regression analyses were performed using R version 3.0.1. .
Impact of population structure and geographic location on health related traits
We also explored the impact of the predicted genomic origin of the individuals and their area of residence on four health-related traits: body mass index (BMI), fat, waist-to-hip ratio (WHR) and high density lipoprotein (HDL). We used the values of the PCs to predict latitude and longitude of genomic origin in the unrelated individuals subset using the model detailed in the previous section (Influence of the geographic origin in the genomics of GS:SFHS). Then, for each trait, we used GCTA  to fit two different mixed linear models.
We estimated the effects of the covariates: sex, age, age2, predicted latitude, predicted longitude and current area of residence (based on the different Postal Codes, see regions in Additional file 1: Table S4) and explored if the origin of the individuals estimated from their genomic data (i.e., the predicted geographic coordinates, that represent an average of the origin of the four grandparents) has an impact on the traits.
In this model we fitted sex, age, age2 as covariates and compared the estimates of σ g 2 (V(GRM)) and σ s 2 (V(GEO)) with those obtained fitting the same covariates and including also current place of residence in order to explore if there is a change in the variance explained by those models.
The authors want to acknowledge the Medical Research Council (MRC) UK for funding and the three reviewers for their useful comments. Generation Scotland has received core funding from the Chief Scientist Office of the Scottish Government Health Directorates CZD/16/6 and the Scottish Funding Council HR03006. We are grateful to all the families who took part, the general practitioners and the Scottish School of Primary Care for their help in recruiting them, and the whole Generation Scotland team, which includes interviewers, computer and laboratory technicians, clerical workers, research scientists, volunteers, managers, receptionists, healthcare assistants and nurses. Genotyping of the GS:SFHS samples was carried out by the Genetics Core Laboratory at the Wellcome Trust Clinical Research Facility, Edinburgh, Scotland and was funded by the UK MRC. CROATIA-Korcula: The CROATIA-Korcula study was funded by grants from the Medical Research Council (UK), European Commission Framework 6 project EUROSPAN (Contract No. LSHG-CT-2006-018947) and Republic of Croatia Ministry of Science, Education and Sports research grants to I.R. (108-1080315-0302). We would like to acknowledge the invaluable contributions of the recruitment team in Korcula, the administrative teams in Croatia and Edinburgh and the people of Korcula. The SNP genotyping for the CROATIA-Korcula cohort was performed in Helmholtz Zentrum München, Neuherberg, Germany. Orkney Complex Disease Study (ORCADES): ORCADES was supported by the Chief Scientist Office of the Scottish Government, the Royal Society, the MRC Human Genetics Unit, Arthritis Research UK and the European Union framework program 6 EUROSPAN project (contract no. LSHG-CT-2006-018947). DNA extractions were performed at the Wellcome Trust Clinical Research Facility in Edinburgh. We would like to acknowledge the invaluable contributions of Lorraine Anderson and the research nurses in Orkney, the administrative team in Edinburgh and the people of Orkney.
- Smith BH, Campbell A, Linksted P, Fitzpatrick B, Jackson C, Kerr SM, et al. Cohort profile: Generation Scotland: Scottish Family Health Study (GS:SFHS). The study, its participants and their potential for genetic research on health and illness. Int J Epidemiol. 2012;42(3):689–700.PubMedView ArticleGoogle Scholar
- Smith BH, Campbell H, Blackwood D, Connell J, Connor M, Deary IJ, et al. Generation Scotland: the Scottish Family Health Study; a new resource for researching genes and heritability. BMC Med Genet. 2006;7:74.PubMed CentralPubMedView ArticleGoogle Scholar
- Tian C, Gregersen PK, Seldin MF. Accounting for ancestry: population substructure and genome-wide association studies. Hum Mol Genet. 2008;17(R2):R143–50.PubMed CentralPubMedView ArticleGoogle Scholar
- O’Dushlaine CT, Morris D, Moskvina V, Gill M, Corvin A, Wilson JF, et al. Population structure and genome-wide patterns of variation in Ireland and Britain. Eur J Hum Genet. 2010;18(11):1248–54.PubMed CentralPubMedView ArticleGoogle Scholar
- Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet. 2006;38(8):904–9.PubMedView ArticleGoogle Scholar
- Freedman ML, Reich D, Penney KL, McDonald GJ, Mignault AA, Patterson N, et al. Assessing the impact of population stratification on genetic association studies. Nat Genet. 2004;36(4):388–93.PubMedView ArticleGoogle Scholar
- Salm MPA, Horswell SD, Hutchison CE, Speedy HE, Yang X, Liang L, et al. The origin, global distribution, and functional impact of the human 8p23 inversion polymorphism. Genome Res. 2012;22(6):1144–53.PubMed CentralPubMedView ArticleGoogle Scholar
- Broman KW, Matsumoto N, Giglio S, Martin CL, Roseberry JA, Zuffardi O, et al. Common long human inversion polymorphism on chromosome 8p. In: Goldstein D, editor. Science and statistics: a festschrift for Terry Speed IMS Lecture notes—Monograph series, vol. 40. Beachwood, OH: Institute of Mathematical Statistics; 2003. p. 237.View ArticleGoogle Scholar
- Novembre J, Johnson T, Bryc K, Kutalik Z, Boyko AR, Auton A, et al. Genes mirror geography within Europe. Nature. 2008;456(7219):274–4.View ArticleGoogle Scholar
- Alexander DH, Novembre J, Lange K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 2009;19(9):1655–64.PubMed CentralPubMedView ArticleGoogle Scholar
- The Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447(7145):661–78.PubMed CentralView ArticleGoogle Scholar
- Zou F, Lee S, Knowles MR, Wright FA. Quantification of population structure using correlated SNPs by shrinkage principal components. Hum Hered. 2010;70(1):9–22.PubMed CentralPubMedView ArticleGoogle Scholar
- McQuillan R, Leutenegger A-L, Abdel-Rahman R, Franklin CS, Pericic M, Barac-Lauc L, et al. Runs of homozygosity in European populations. Am J Hum Genet. 2008;83(3):359–72.PubMed CentralPubMedView ArticleGoogle Scholar
- Kirin M, McQuillan R, Franklin CS, Campbell H, McKeigue PM, Wilson JF. Genomic runs of homozygosity record population history and consanguinity. PLoS One. 2010;5(11), e13996.PubMed CentralPubMedView ArticleGoogle Scholar
- Yang J, Benyamin B, McEvoy BP, Gordon S, Henders AK, Nyholt DR, et al. Common SNPs explain a large proportion of the heritability for human height. Nat Genet. 2010;42(7):565–9.PubMed CentralPubMedView ArticleGoogle Scholar
- Yang J, Lee SH, Goddard ME, Visscher PM. GCTA: a tool for genome-wide complex trait analysis. Am J Hum Genet. 2011;88(1):76–82.PubMed CentralPubMedView ArticleGoogle Scholar
- Stringer CB, Andrews P. Genetic and fossil evidence for the origin of modern humans. Science. 1988;239(4845):1263–8.PubMedView ArticleGoogle Scholar
- Tishkoff SA, Williams SM. Genetic analysis of African populations: Human evolution and complex disease. Nat Rev Genet. 2002;3(8):611–21.PubMedGoogle Scholar
- Gabriel SB, Schaffner SF, Nguyen H, Moore JM, Roy J, Blumenstiel B, et al. The structure of haplotype blocks in the human genome. Science. 2002;296(5576):2225–9.PubMedView ArticleGoogle Scholar
- Wall JD, Pritchard JK. Haplotype blocks and linkage disequilibrium in the human genome. Nat Rev Genet. 2003;4(8):587–97.PubMedView ArticleGoogle Scholar
- Jensen-Seaman MI, Furey TS, Payseur BA, Lu YT, Roskin KM, Chen CF, et al. Comparative recombination rates in the rat, mouse, and human genomes. Genome Res. 2004;14(4):528–38.PubMed CentralPubMedView ArticleGoogle Scholar
- Cáceres A, Sindi SS, Raphael BJ, Cáceres M, González JR. Identification of polymorphic inversions from genotypes. BMC Bioinf. 2012;13:28.View ArticleGoogle Scholar
- Stefansson H, Helgason A, Thorleifsson G, Steinthorsdottir V, Masson G, Barnard J, et al. A common inversion under selection in Europeans. Nat Genet. 2005;37(2):129–37.PubMedView ArticleGoogle Scholar
- Aulchenko YS, Ripke S, Isaacs A, van Duijn CM. GenABEL: an R library for genome-wide association analysis. Bioinformatics. 2007;23(10):1294–6.PubMedView ArticleGoogle Scholar
- Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81(3):559–75.PubMed CentralPubMedView ArticleGoogle Scholar
- The 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature. 2010;467(7319):1061–73.PubMed CentralView ArticleGoogle Scholar
- Polasek O, Marusic A, Rotim K, Hayward C, Vitart V, Huffman J, et al. Genome-wide Association Study of Anthropometric Traits in Korcula Island, Croatia. Croat Med J. 2009;50(1):7–16.PubMed CentralPubMedView ArticleGoogle Scholar
- Lee AB, Luca D, Klei L, Devlin B, Roeder K. Discovering genetic ancestry using spectral graph theory. Genet Epidemiol. 2010;34(1):51–9.PubMedGoogle Scholar
- Gray A, Stewart I, Tenesa A. Advanced complex trait analysis. Bioinformatics. 2012;28(23):3134–6.PubMedView ArticleGoogle Scholar
- R Core Team. R: a language and environment for statistical computing. Vienna: R Foundation for Statistical Computing; 2014.Google Scholar
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.