Genetic Structure of the Spanish Population
- Javier Gayán†1,
- José J Galan†1,
- Antonio González-Pérez†1,
- María Eugenia Sáez†1,
- María Teresa Martínez-Larrad2,
- Carina Zabena2,
- M Carmen Rivero1,
- Ana Salinas1,
- Reposo Ramírez-Lorca1,
- Francisco J Morón1,
- Jose Luis Royo1,
- Concha Moreno-Rey1,
- Juan Velasco1,
- José M Carrasco1,
- Eva Molero1,
- Carolina Ochoa1,
- María Dolores Ochoa1,
- Marta Gutiérrez1,
- Mercedes Reina1,
- Rocío Pascual1,
- Alejandro Romo-Astorga1,
- Juan Luis Susillo-González1,
- Enrique Vázquez1,
- Luis M Real1,
- Agustín Ruiz1 (corresponding author) and
- Manuel Serrano-Ríos2 (corresponding author)
© Gayán et al; licensee BioMed Central Ltd. 2010
Received: 14 December 2009
Accepted: 25 May 2010
Published: 25 May 2010
Genetic admixture is a common caveat for genetic association analysis. Therefore, it is important to characterize the genetic structure of the population under study to control for this kind of potential bias.
In this study we have sampled over 800 unrelated individuals from the population of Spain, and have genotyped them with a genome-wide coverage. We have carried out linkage disequilibrium, haplotype, population structure and copy-number variation (CNV) analyses, and have compared these estimates of the Spanish population with existing data from similar efforts.
In general, the Spanish population is similar to the Western and Northern Europeans, but has a more diverse haplotypic structure. Moreover, the Spanish population is also largely homogeneous within itself, although patterns of micro-structure may be able to predict locations of origin from distant regions. Finally, we also present the first characterization of a CNV map of the Spanish population. These results and original data are made available to the scientific community.
The large genotyping studies of the last decade have revolutionized genetic studies. Our current ability to characterize the human genome is unprecedented [1–3], and is contributing to improving our understanding of the genetic etiology of common diseases.
Genetic admixture is one of the caveats for genetic association studies, and has fostered the comparative study of the genetic structure of different human populations. A large number of studies are underway to identify the similarities and differences among existing human populations [2, 3]. These studies started by comparing the general human populations such as Africans, Asians and Europeans, but have recently focused on the more specific subgroups within them [5–8]. It seems that, as genetically similar as humans are, we can now tune the genetic "microscope" so that subtle genetic differences among related subpopulations can be detected, even among regions within a country [10, 11].
The Neocodex Biobank and Genome Research Consortium is planning a number of genome-wide association studies (GWAS) in several complex phenotypes. Our basic and general strategy will consist of the systematic comparison of a well-characterized population-based control dataset against a number of datasets of complex phenotypes, such as metabolic syndrome, osteoporosis, Alzheimer's disease, colorectal cancer or multiple sclerosis. Therefore, it is markedly important to select individuals representative of the genetic diversity co-existent in Spain and to make an in-depth genomic characterization of these control individuals that will serve as a reference panel for future GWAS.
As an initial step of our investigation, we decided to characterize the genetic structure of the Spanish population using high density SNP arrays. This study lays an essential base for future GWAS, by identifying potential sources of bias that may affect experimental results and that could increase the noise and false positive rate of GWAS in our population. Furthermore, this work begins the characterization of common copy number variants (CNVs) in our population that might interfere with association studies in discrete regions of the genome or that may themselves be related to the phenotypes.
In this study, we have analyzed linkage disequilibrium (LD) patterns and haplotype blocks in the population of Spain, and compared them to Western and Northern Europeans. We have also estimated population stratification and substructure, and have identified CNVs in this sample of the Spanish population.
801 Spanish individuals were genotyped with the Affymetrix Nsp I 250 K chip, from which 166,588 SNPs passed the quality control filters, and were used in the LD, haplotypic and structure analyses described below. In addition, genotype data from the HapMap project were used for comparison purposes: we selected the genotypes from the same chip for 60 unrelated CEU individuals. Moreover, subsets of HapMap individuals with European, African, and Asian ancestry were employed in the principal components analysis.
LD and haplotypic structure
Beyond pair-wise LD patterns, haplotype blocks give a more global description of LD structure. In this sample that represents the population of Spain, we have estimated 33,037 haplotype blocks in the 22 autosomal chromosomes. A list of haplotype blocks, including haplotype frequencies, LD between adjacent haplotypes, and multiallelic LD between adjacent blocks, is included as Additional File 1. Each block covers 3.97 SNPs on average, ranging from small blocks of only 2 SNPs to some very large blocks of as many as 64 SNPs. This largest block is located in chromosome 17q21.31:41,097,235-42,177,829, between rs17760577 and rs199535. This 17q21.31 region is a gene-rich region (including CRHR1 and MAPT) exhibiting large LD blocks (approximately 623 kb) in the HapMap Phase II dataset in all populations studied (YRI, CEU, and JPT+CHB), and with an interesting evolutionary history involving a large inversion.
[Table: total block size (bp), mean block size (bp), and block coverage (%); values not preserved.]
Nonetheless, it is noteworthy that a larger portion of the genome is covered by blocks in the Spanish sample (28%), than in the CEU sample (24%). Again, the percentage of chromosomes covered by blocks is quite variable across the chromosomes, ranging between 12.12% for chromosome 19 and 37.34% for chromosome 6q.
Population stratification was analyzed with the STRUCTURE and EIGENSOFT software packages. Two sets of SNPs were analyzed: Subset A consists of 2,050 unlinked SNPs, while subset B includes 102,850 SNPs selected under less stringent criteria for marker relatedness.
A total of 11,743 CNVs were identified in our sample set (14.70 CNVs per individual on average). To avoid as many false-positive results as possible, we consider here only those 623 CNVs present in at least three individuals (Additional File 2).
Overall, those CNVs span 70.64 Mb of human autosomal genome and chromosome X. Mean (SD) and median sizes for those variants are 194.02 (205.26) Kb and 150.70 Kb, respectively, with a range of 10.15 Kb to 2,475.57 Kb. Population frequency ranges from 0.37% to 44.94%, but only 214 CNVs have frequencies above 1%. The largest share of the CNVs detected are copy number gains (47.51%), followed by copy number losses (26.64%), and regions showing both copy number gains and losses (25.84%). We did not detect any difference in mean population frequencies among copy number states. However, copy number losses are smaller than copy number gains (147.33 Kb versus 203.82 Kb; Mann-Whitney U test p < 0.01).
Most of the CNVs identified in this study (83.31%) overlap fully or partially with previously described structural variants. The mean (SD) and median nucleotide coverage of identified CNVs by previous CNVs (those included in DGV) are 60% (44%) and 87%, respectively. There is a positive correlation between the population frequency of the CNVs and their base pair coverage by previously detected structural variants (Spearman's rho = 0.24; p < 0.01). We detected 104 new CNVs (16.69%) and none of them were above 7.37% of population frequency.
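The recurrence filter and the population frequencies described above can be illustrated with a minimal sketch. The function name, the CNV identifiers and the sample IDs are hypothetical; this is not the CNAT workflow used in the study, only the counting logic under the assumption that each CNV region comes with a list of carrier individuals.

```python
def recurrent_cnv_frequencies(carriers_by_cnv, n_samples, min_carriers=3):
    """Return {cnv_id: population frequency in %} for CNVs seen in enough samples."""
    frequencies = {}
    for cnv_id, carriers in carriers_by_cnv.items():
        n_carriers = len(set(carriers))  # count each individual only once
        if n_carriers >= min_carriers:
            frequencies[cnv_id] = 100.0 * n_carriers / n_samples
    return frequencies


# Hypothetical CNV identifiers and sample IDs, for illustration only.
example = {
    "cnv_a": ["s1", "s2", "s3"],  # three carriers: kept
    "cnv_b": ["s4", "s5"],        # two carriers: dropped
}
print(recurrent_cnv_frequencies(example, n_samples=799))
```

With 799 samples in the CNV set, three carriers correspond to a frequency of about 0.375%, which is in line with the lowest frequency reported above.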
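The nucleotide coverage of a CNV by previously described variants can be sketched as an interval-overlap calculation. The code below is an illustration under stated assumptions, not the Galaxy-based procedure used in the study: intervals are half-open (start, end) pairs on a single chromosome, and the coordinates are invented.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open intervals (start, end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def coverage_fraction(cnv, known_variants):
    """Fraction of the CNV's bases covered by at least one known variant."""
    cnv_start, cnv_end = cnv
    covered = 0
    for start, end in merge_intervals(known_variants):
        overlap = min(cnv_end, end) - max(cnv_start, start)
        if overlap > 0:
            covered += overlap
    return covered / (cnv_end - cnv_start)


# Hypothetical coordinates, for illustration only.
example_cnv = (1000, 2000)
example_known = [(500, 1200), (1100, 1500), (1900, 2500)]
print(coverage_fraction(example_cnv, example_known))
```

Merging the known variants first avoids counting the same base twice when several database entries overlap each other.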
It has been proposed that genomic regions flanked by segmental duplications (SD) (i.e. genomic stretches from 1 to 400 Kb in length with > 0.90 similarity) are susceptible to structural variations by nonallelic homologous recombination (NHR). To investigate whether NHR could account for some of the CNVs in our sample set, we calculated the percentage of CNVs included in genomic rearrangement hotspots. These regions were defined as DNA stretches of 50 Kb to 10 Mb in length, flanked by intrachromosomal SD > 10 Kb in size, similarly to Sharp et al. (2005). Indeed, we found that 217 CNVs (34.83%) are included in rearrangement hotspots. Interestingly, we observed that those CNVs are statistically more frequent than CNVs located outside rearrangement hotspots (mean frequencies 2.47% and 1.34%, respectively; Mann-Whitney U test p < 0.01). In addition, the percentages of copy number states are statistically different between those two groups of CNVs, since those within rearrangement hotspots present a higher percentage of copy number gains and losses (both) when compared to CNVs outside these regions (42.86% and 16.75%, respectively; Pearson X2 = 50.37; p < 0.01).
To analyse the impact of CNVs on genomic functional elements, we created a gene interval map comprising 22,738 known genes (refseqs) on the autosomes and chromosome X. 553 CNVs (88.76%) overlap at least one gene interval. It has been suggested that deletions are biased away from genes. We observed that the mean number of genes is lower in copy number losses (mean = 3.31; SD = 3.05) when compared to copy number gains (mean = 4.14; SD = 5.01) but this difference does not reach statistical significance in our sample set. We identified 154 CNVs overlapping 125 loci included in the morbidmap list ftp://ftp.ncbi.nlm.nih.gov/repository/OMIM/.
The impact of CNVs on genomic surveys was also assessed by analysing Hardy-Weinberg equilibrium (HWE) and missing genotype data. Only a small proportion of the markers with HWE deviations (4.08%) and markers with missing genotypes above 0.10 (2.28%) are included in CNV regions.
This work has generated over 200 million genotypes, making it the largest study of this kind in Spain. Detailed information on the genetic structure of the Spanish population will serve as a reference framework for future GWAS in Spain, and will be shared with other researchers via external National Public Health evaluation and approval.
We have characterized the genetic structure of the population of Spain, describing genome-wide LD patterns, haplotype blocks, population structure and copy-number variants in a sample of over 800 unrelated Spanish individuals. The individuals that participated in the study were recruited by a random sampling approach from a cross-sectional population-based epidemiological survey from eight locations in Spain, representing different geographical locations across the country (South, Central, North-East and North-West). The recruiting centers include both small rural clinics as well as large hospitals close to major metropolitan areas. Individuals that reported a different nationality were not included in the study. Therefore, the sample can be considered as representative of the general Spanish population.
These samples were genotyped at Neocodex with an Affymetrix Nsp I 250 K chip. The high call-rate (99.1%) speaks of the high quality of the genotyping performed. Although there are now commercial genotyping chips that provide a more complete coverage of the genome, at the starting point of this project this Nsp I 250 K chip was the best possible choice, and provides enough genotype information for the current project.
LD and haplotypic structure
A major finding of the present study is that the Spanish population is generally similar to the CEU HapMap sample (of Northern and Western Europe origin), but also largely homogeneous within itself. Numerous pieces of evidence point to this conclusion. For example, a significant proportion of the SNPs analyzed were monomorphic (2.3%) or rare (10.2%), even in this large sample of 801 individuals. In comparison, the CEU dataset yielded 15.1% of monomorphic or rare SNPs, but in a much smaller sample of only 60 individuals (over 13 times smaller). This large number of SNPs with no or very little variability is a sign of the homogeneity of the Spanish population.
The LD patterns observed in this sample of the Spanish population are similar to the patterns observed in the CEU HapMap sample. This is not surprising since the level of genetic differentiation within Europe is small. We detect LD extending over large distances in the Spanish population, but less than in the CEU sample. We have also found a large number (33,037) of haplotype blocks. These blocks are generally located close to the blocks detected in the CEU sample, but in the Spanish sample there are more blocks, and they are smaller on average. These findings could be an artefact due to the difference in sample size between the two samples, but may indeed be reflecting the more complex origin of the current Spanish population. Indeed, these results confirm the suggestion that the Spanish population has more haplotypic diversity than Northern/Western Europeans. This is a possible scenario, given that the Iberian Peninsula has been under large and long-lasting migratory influences, and admixture, from other European, Mediterranean, and North African populations.
Another interesting finding is that a larger portion of the genome is covered by blocks in the Spanish sample (28%) than in the CEU sample (24%). This finding is again probably due to the larger Spanish sample, so that the 1602 chromosomes analyzed probably revealed more rare haplotypes, therefore enlarging the proportion of the chromosome covered by haplotype blocks. This extra block coverage in the Spanish sample may prove useful for association studies, although this is probably a characteristic of other large homogeneous samples.
These results suggest that the general Spanish population, as characterized in the present study by sampling from eight different cities widely spaced across Spain, is generally similar to other European populations, although more genetically diverse than Western and Northern Europeans. Moreover, the Spanish population is remarkably homogeneous within itself in terms of global genetic structure. In view of these results, the population of Spain is sufficiently genetically similar to the CEU sample so that the CEU HapMap dataset could be used to infer genotypes for the Spanish population. Nonetheless, in spite of their general similarity, there are substantial differences between these two European subgroups, and therefore imputed data from the HapMap study may not describe some particular genetic patterns of the Spanish population. The dataset in this study can be extremely useful to compare allele and haplotype frequencies against the CEU sample, and to estimate the confidence of imputed genotypes in all regions of the genome. It is important to note here that some of the differences found between the Spanish and the CEU samples may be due to the difference in size between the two samples.
The results of our population structure analyses are consistent with no major population stratification present in this sample of the Spanish population. This result is reassuring since individuals reporting nationalities other than Spanish were excluded from the study. Both STRUCTURE and PC results with a set of 2,050 uncorrelated SNPs showed no evidence of genetic diversity in the sample.
In addition, we were able to analyze fine structure within this sample by running PC analysis using a large set of markers (102,850 SNPs). The results of this second analysis are also consistent with prior reports that were able to predict locations of origin within a 700 km radius using different European populations [5, 7], and other studies that found subtle differences between locations within a country [10, 11]. In our sample, following a similar strategy, we were able to differentiate between the two most geographically distant centers. Furthermore, these observed differences seem to correspond to the same geographical axis that has been previously found in European populations. This fine structure can be the result of genomic regions that show strong geographic variation and may be more evident in small, rural or isolated samples than in major cities where subpopulations tend to mix. This potential source of bias should be taken into account in association studies. It is worth noting that our sample of the Spanish population was quite homogeneous, and the genomic inflation factor (based on median chi-squared), as estimated by the software Plink, was exactly 1, as expected when only one population is being analyzed. Nevertheless, specific genomic regions still need to be carefully reviewed.
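As an illustration of the genomic inflation factor mentioned above, the following sketch computes the ratio between the median of the observed 1-df chi-squared association statistics and the median of the chi-squared distribution with one degree of freedom (approximately 0.4549). The function name and the input values are hypothetical, and this is not Plink's own code.

```python
from statistics import median

# Median of the chi-squared distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation(chi2_stats):
    """Genomic inflation factor: observed median chi-squared / expected median."""
    return median(chi2_stats) / CHI2_1DF_MEDIAN


# Hypothetical per-SNP test statistics, for illustration only.
print(genomic_inflation([0.1, 0.4549364, 2.0]))
```

A value close to 1 indicates that the bulk of the test statistics follows the null distribution, whereas values well above 1 suggest stratification or other systematic inflation.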
We have also defined the first CNV map in the Spanish general population. According to our data, 2.35% of the human genome (autosomes and chromosome X) is susceptible to structural variants. This estimate is in line with previously published studies analysing structural variants with the Affymetrix platform [19–21].
We detected a wide range of CNV population frequencies, although only 34.35% of these variants had a population frequency above 1%. 301 of the CNVs described in this work are fully covered by previously described structural variants. In addition, another ten CNVs have 90% or more of their nucleotides represented in previous CNVs. These 311 CNVs are therefore supported by at least one independent study. The remaining 312 CNVs are also included in Additional File 2 but for descriptive purposes only. These CNVs need to be confirmed in independent datasets. Indeed, because we have analyzed 799 samples, some of these CNVs could be low frequency or population specific variants which went undetected in previous studies with smaller sample sets.
We have confirmed in this study that nonallelic homologous recombination (NHR) could explain the origin of about 33% of CNVs. Interestingly, those CNVs are more frequent than other variants out of rearrangement hotspot regions and they represent 46.50% of all CNVs detected in this study. Regardless of the frequency of NHR events, we estimate that a considerable proportion of CNVs in the normal population may be a consequence of NHRs.
Most of the CNVs detected in our study overlap with known genes, and of those, 157 CNVs (25.22%) overlap with 125 disease loci. This observation is in agreement with previous results. For instance, the 38,406 structural variant regions included in DGV overlap 1183 disease loci. There exist several plausible reasons for these observations, such as the existence of false positives in CNV genome-wide surveys, inaccurate disease-frequency estimates, embryonic lethality effect for homozygous deletions of specific genes, misclassification of samples as normal controls, and rescue of the altered gene function by other related gene product.
In our sample set, only 43 (6.90%) CNVs overlapping disease loci have a population frequency above 1%, and none of them include homozygous deletions. From those, only three CNVs exceed 10% in population frequency and all of them are completely covered by previously described structural variants. Two of these CNVs are contiguous on chromosomal region 15q11.2, one of the most unstable regions in the human genome. These two CNVs overlap the genes hect domain and RLD 2 (HERC2) associated with skin, hair and eye pigmentation (OMIM: 227220), and BCL8 B-cell CLL/lymphoma 8 (BCL8) which has been implicated potentially in B-cell lymphoma (OMIM: 601889). The third CNV is located at 19p13.13 and overlaps with the gene RNASEH2A ribonuclease H2, subunit A (RNASEH2A) whose mutations may be responsible for the Aicardi-Goutieres syndrome (OMIM: 610333). Interestingly, this is a severe autosomal recessive disorder that mimics in utero viral infections and therefore its real incidence could be underestimated.
All these data suggest that some disease loci could be located within genomic regions that are prone to structural alterations. This observation has potential implications on the molecular diagnosis and on the disease frequency estimations of the phenotypes.
Finally, our results suggest that structural variants could be responsible for a small percentage of the Hardy-Weinberg deviations and missing genotypes commonly observed in genome-wide surveys. Therefore, it is advisable to consider the existence of such structural variants for specific SNPs when genome-wide association studies (GWAS) are performed.
In summary, we have performed a deep characterization of our reference control population for GWAS and confirmed that the Spanish population is sufficiently homogeneous to conduct genetic association studies with minor risk of population stratification. In addition, the results obtained, together with other concomitant efforts underway in other European countries, will be useful to shed light on the nature of European genetic diversity and the Spanish population genomic history. Complete data and further details of our study, including raw genotypes, can be accessed after external Ethical Committee review and Public administrative authorisation.
The dataset includes 825 unrelated individuals recruited by a random sampling approach from a cross-sectional population-based epidemiological survey performed in eight different cities of Spain, including Alicante, Arévalo (Ávila), Avilés (Asturias), Málaga, Mérida (Badajoz), Segovia, Talavera (Toledo), and Vic (Barcelona). The recruiting centers include both small rural clinics as well as large hospitals close to major metropolitan areas from across the country (South, Central, North-East and North-West). Individuals that reported a different nationality were not included in the study. Therefore, the sample can be considered as representative of the general Spanish population. The goal of the survey was to investigate the prevalence in the Spanish population of anthropometric and physiological parameters related to obesity and other components of the metabolic syndrome [25, 26]. The sample includes a total of 450 males (54.5%), and 375 females (45.5%), with an average age of 52 (SD = 8.84) years old, and a range 34-76.
Identity-By-State (IBS) sharing can identify sample duplications or related individuals. Genome-wide IBS estimates suggested the presence of 19 pairs of siblings, two sibling trios, and one parent-offspring pair, and therefore 24 individuals were removed to eliminate these relationships. The remaining samples (N = 801) used in this study grouped together in a broad cluster of diverse ranges of relatedness. All study subjects gave their written informed consent to participate in the study. The study protocol was approved by the Ethics Committee of the Hospital Clínico San Carlos of Madrid.
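The idea behind IBS-based relatedness screening can be sketched as follows. This is a simplified illustration, not the Plink or GRR implementation: genotypes are assumed to be coded as 0, 1 or 2 copies of a reference allele, with None for missing calls, and the function name and data are hypothetical.

```python
def ibs_similarity(genotypes_1, genotypes_2):
    """Mean proportion of alleles shared identical-by-state between two samples.

    Genotypes are allele counts (0, 1 or 2); None marks a missing call.
    At each SNP the pair shares 2 - |g1 - g2| alleles out of 2.
    """
    shared = 0
    n_snps = 0
    for g1, g2 in zip(genotypes_1, genotypes_2):
        if g1 is None or g2 is None:
            continue  # skip SNPs not called in both samples
        shared += 2 - abs(g1 - g2)
        n_snps += 1
    if n_snps == 0:
        raise ValueError("no SNPs called in both samples")
    return shared / (2 * n_snps)


# Hypothetical genotype vectors, for illustration only.
print(ibs_similarity([0, 1, 2, None], [0, 2, 0, 1]))
```

Duplicated samples give values at or very near 1, and close relatives stand out as pairs whose genome-wide IBS is clearly higher than the bulk of unrelated pairs.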
In addition, for the LD and haplotype analysis, 60 unrelated individuals from the CEU HapMap dataset were also used for comparison with the Spanish sample. The CEU dataset is composed of Utah residents with ancestry from northern and western Europe, and whose samples were collected by CEPH in 1980. For these individuals, we selected only the same SNPs that were genotyped in the Spanish sample. The same quality control process was applied to both datasets. Moreover, datasets from HapMap phase 3 release 3 http://hapmap.ncbi.nlm.nih.gov/ were also employed in the Principal Components analysis. More precisely, we used HapMap datasets of unrelated individuals with European (CEU: Utah residents with ancestry from northern and western Europe, and TSI: Toscani in Italy), African (ASW: African ancestry from Southwest USA; LWK: Luhya in Webuye, Kenya; MKK: Maasai in Kinyawa, Kenya; YRI: Yoruba in Ibadan, Nigeria) and Asian ancestry (CHB: Han Chinese in Beijing, China; CHD: Chinese in Metropolitan Denver, Colorado; GIH: Gujarati Indians in Houston, Texas; JPT: Japanese in Tokyo, Japan).
DNA extraction from frozen peripheral blood was performed in a MagNa Pure LC Instrument (Roche Diagnostics), using MagNa Pure LC DNA Isolation Kit (Roche Diagnostics) in accordance with the manufacturer's instructions.
Genotyping and Quality Control
All samples were genotyped using the Affymetrix Nsp I 250K chip, which includes 262,264 SNP markers (256,512 on autosomes, 5705 on sex chromosomes, and 47 control markers).
This chip provides a good coverage of the genome with an average SNP density of 1 SNP every 11 kb (median 1 SNP per 5 kb), and an average heterozygosity of 0.3. Genotypes were read and called with standard Affymetrix software (GCOS, GTYPE, Genotyping Console, BRLMM) using default parameters, and exported as linkage-format files.
All SNPs in the autosomal chromosomes were subjected to quality control filters, specifically a minor allele frequency (MAF) equal to or larger than 10%, a SNP call-rate equal to or larger than 90%, and a p-value for Hardy-Weinberg equilibrium (HWE) larger than 10xE-4. Regarding the minor allele frequency, 2.3% of the SNPs were monomorphic, 10.2% were rare alleles (MAF = 0-1%), and 20.4% were low-frequency alleles (MAF = 1-10%). Moreover, all samples yielded a call rate above 93%, as required by the BRLMM software. The average sample call rate was 99.1%, with a range 93.9-99.8%. In addition, the average SNP call rate was 99.1%, with a range 68.2-100%. 0.8% of the SNPs had call-rates below 90%, and 2.6% had call-rates between 90-95%. Finally, for our sample of 801 individuals we decided, based on simulations and Q-Q plots, that 10xE-4 was a sensible HWE cut-off value. We found that 2.0% of the SNPs had a p-value for the HWE test lower than 10xE-4.
In summary, 67.0% of the SNPs had a MAF ≥ 10%, 99.2% had call-rates above 90%, and 98% passed the HWE test. Overall, 64.9% (166,588) of all autosomal SNPs passed our quality control.
Plink was employed to manage the datasets and perform quality control filters such as call rate, MAF, HWE, and Identity-By-State (IBS) estimates. GRR was also employed to estimate IBS and visualize the resulting relationships.
LD and haplotype blocks were estimated with Haploview. Pair-wise LD was measured with Lewontin's standardized deviation coefficient (D') and with the pair-wise correlation coefficient (r2). Haplotypes were estimated using the Gabriel definition with all default parameters as implemented in Haploview. LD and haplotype blocks were analyzed for each chromosome separately. Moreover, due to computer (RAM) limitations, each arm of the first six chromosomes was analyzed independently.
We explored the presence of population stratification in our study sample by using two available software packages: STRUCTURE and EIGENSOFT. In order to run STRUCTURE, a small subset of unlinked markers was selected using Plink, by excluding all SNPs with a pair-wise genotypic r2 greater than 1.1% with sliding windows of 200 SNPs (with increments of 5 SNPs between windows). A total of 2,050 SNPs (subset A) from the 166,559 that passed the quality control were identified. STRUCTURE uses a model-based clustering method for analyzing multilocus genotype data to infer population structure and assign individuals to populations. We tested different scenarios assuming a different number of underlying populations (k from 1 through 4), allowing a large number of iterations (25 K in the burn-in period followed by 500 K repetitions). We estimated the mean log likelihood of the data for a given k (referred to as L(K)) in each run. Furthermore, we performed multiple runs for each value of k, computing the overall mean L(K) and its standard deviation.
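The three per-SNP filters can be illustrated with a short sketch that works from genotype counts. Several points are assumptions rather than the study's procedure: the function and parameter names are hypothetical, the HWE cut-off written "10xE-4" in the text is read here as 1e-4, and HWE is tested with a simple 1-df chi-squared goodness-of-fit test for illustration, which may differ from the test Plink applies.

```python
import math


def snp_qc(n_aa, n_ab, n_bb, n_missing,
           min_maf=0.10, min_call_rate=0.90, hwe_p_min=1e-4):
    """Apply MAF, call-rate and HWE filters to one SNP given genotype counts."""
    called = n_aa + n_ab + n_bb
    total = called + n_missing
    if called == 0:
        return {"call_rate": 0.0, "maf": 0.0, "hwe_p": 1.0, "passes": False}
    call_rate = called / total
    p = (2 * n_aa + n_ab) / (2 * called)  # frequency of allele A
    maf = min(p, 1 - p)
    if maf == 0:
        hwe_p = 1.0  # monomorphic SNP: the HWE test is not informative
    else:
        expected = (called * p * p, 2 * called * p * (1 - p), called * (1 - p) ** 2)
        observed = (n_aa, n_ab, n_bb)
        chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
        # Upper-tail probability of a chi-squared variable with 1 df.
        hwe_p = math.erfc(math.sqrt(chi2 / 2))
    passes = maf >= min_maf and call_rate >= min_call_rate and hwe_p > hwe_p_min
    return {"call_rate": call_rate, "maf": maf, "hwe_p": hwe_p, "passes": passes}


# Hypothetical genotype counts for 801 individuals, for illustration only.
print(snp_qc(200, 400, 200, 1))
```

Keeping the thresholds as parameters makes it easy to see how many SNPs each filter removes on its own before combining them.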
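The two pair-wise LD measures can be written out for a pair of biallelic SNPs. This is a minimal sketch from the standard definitions, assuming that phased haplotype frequencies are already available (Haploview estimates them internally); the function and argument names are hypothetical.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (|D'|, r2) for two biallelic SNPs.

    p_ab: frequency of the haplotype carrying allele A at SNP 1 and B at SNP 2
    p_a, p_b: frequencies of allele A and allele B
    """
    if not (0 < p_a < 1 and 0 < p_b < 1):
        raise ValueError("both SNPs must be polymorphic")
    d = p_ab - p_a * p_b  # raw disequilibrium coefficient
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d_prime, r2


# Hypothetical frequencies: complete LD by D' but only modest r2.
print(ld_measures(0.3, 0.3, 0.6))
```

The example shows why both measures are reported: D' reaches 1 whenever one of the four haplotypes is absent, while r2 reaches 1 only when the two SNPs also have matching allele frequencies.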
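Summarising the replicate STRUCTURE runs amounts to computing, for each k, the mean and standard deviation of L(K) across runs. The sketch below assumes the per-run log likelihoods have already been collected into a dictionary; the values are invented, and picking the k with the highest mean L(K) is only a simple heuristic used here for illustration.

```python
from statistics import mean, stdev


def summarize_runs(log_likelihoods_by_k):
    """Return {k: (mean L(K), sample SD of L(K))} across replicate runs."""
    return {k: (mean(values), stdev(values))
            for k, values in log_likelihoods_by_k.items()}


def best_k(summary):
    """Pick the k with the highest mean L(K) (a simple heuristic)."""
    return max(summary, key=lambda k: summary[k][0])


# Hypothetical L(K) values from two runs per k, for illustration only.
runs = {1: [-100.0, -102.0], 2: [-110.0, -114.0]}
summary = summarize_runs(runs)
print(summary, best_k(summary))
```

Reporting the standard deviation alongside the mean helps to judge whether differences in L(K) between values of k exceed the run-to-run variability.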
Table: Long-range LD regions across the genome. These regions were excluded from subset B for the PC analysis. [Table body (column "Long-range LD Region") not preserved.]
Copy Number Variant (CNV) analyses were carried out in the full SNP set by using the Copy Number Analysis Tool (CNAT) v.4.0 software (Affymetrix, Santa Clara) following the manufacturer's instructions. We selected 25 control female samples from other ongoing projects as the reference group. All the samples in this study (n = 801) passed the IQR quality control with the exception of two samples that were removed from further CNV analyses. Overall, our CNV sample set is composed of 799 samples. Statistical analyses were carried out using Statistical Package for Social Sciences (SPSS) software v.13.0.
As online resources, we used the hg18.knownGene and hg18.refGene tables to build a gene interval map of the autosomes and chromosome X. The table hg18.dgv was used to retrieve information about structural variants from the Database of Genomic Variants (DGV, http://projects.tcag.ca/variation/) and the table hg18.genomicSuperDups to define rearrangement hotspot regions. All these tables were downloaded from the Table Browser at the UCSC Genome Bioinformatics resource http://genome.ucsc.edu/. Galaxy browser tools were used to manage genomic intervals http://main.g2.bx.psu.edu/. Information about OMIM genes and phenotypes was extracted from the mim2gene.txt and morbidmap.txt tables at the NCBI FTP site ftp://ftp.ncbi.nlm.nih.gov/repository/OMIM/.
We thank all the participants that have contributed their time, information, and samples to this study. This work was supported in part by Agencia IDEA, Consejería de Innovación, Ciencia y Empresa (830882); Corporación Tecnológica de Andalucía (07/124); Ministerio de Educación y Ciencia (PCT-A41502790-2007 and PCT-010000-2007-18); Programa de Ayudas Torres Quevedo del Ministerio de Ciencia en Innovación (PTQ2002-0206, PTQ2003-0549, PTQ2003-0546, PTQ2003-0782, PTQ2003-0783, PTQ2004-0838, PTQ04-1-0006, PTQ04-3-0718, PTQ06-1-0002). CIBER de Diabetes y Enfermedades Metabólicas Asociadas (CIBERDEM) is an ISCIII project.
- The International HapMap Consortium: The International HapMap Project. Nature. 2003, 426: 789-796. 10.1038/nature02168.View ArticleGoogle Scholar
- The International HapMap Consortium: A haplotype map of the human genome. Nature. 2005, 437: 1299-1320. 10.1038/nature04226.PubMed CentralView ArticleGoogle Scholar
- The International HapMap Consortium: A second generation human haplotype map of over 3.1 million SNPs. Nature. 2007, 449: 851-861. 10.1038/nature06258.PubMed CentralView ArticleGoogle Scholar
- Marchini J, Cardon LR, Phillips MS, Donnelly P: The effects of human population structure on large genetic association studies. Nat Genet. 2004, 36: 512-517. 10.1038/ng1337.PubMedView ArticleGoogle Scholar
- Lao O, Lu TT, Nothnagel M, Junge O, Freitag-Wolf S, Caliebe A, Balascakova M, Bertranpetit J, Bindoff LA, Comas D: Correlation between genetic and geographic structure in Europe. Curr Biol. 2008, 18: 1241-1248. 10.1016/j.cub.2008.07.049.PubMedView ArticleGoogle Scholar
- McEvoy BP, Montgomery GW, McRae AF, Ripatti S, Perola M, Spector TD, Cherkas L, Ahmadi KR, Boomsma D, Willemsen G: Geographical structure and differential natural selection among North European populations. Genome Res. 2009, 19: 804-814. 10.1101/gr.083394.108.PubMed CentralPubMedView ArticleGoogle Scholar
- Novembre J, Johnson T, Bryc K, Kutalik Z, Boyko AR, Auton A, Indap A, King KS, Bergmann S, Nelson MR: Genes mirror geography within Europe. Nature. 2008, 456: 98-101. 10.1038/nature07331.PubMed CentralPubMedView ArticleGoogle Scholar
- Price AL, Butler J, Patterson N, Capelli C, Pascali VL, Scarnicci F, Ruiz-Linares A, Groop L, Saetta AA, Korkolopoulou P: Discerning the ancestry of European Americans in genetic association studies. PLoS Genet. 2008, 4: e236-10.1371/journal.pgen.0030236.PubMed CentralPubMedView ArticleGoogle Scholar
- Auton A, Bryc K, Boyko AR, Lohmueller KE, Novembre J, Reynolds A, Indap A, Wright MH, Degenhardt JD, Gutenkunst RN: Global distribution of genomic diversity underscores rich complex history of continental human populations. Genome Res. 2009, 19: 795-803. 10.1101/gr.088898.108.PubMed CentralPubMedView ArticleGoogle Scholar
- Heath SC, Gut IG, Brennan P, McKay JD, Bencko V, Fabianova E, Foretova L, Georges M, Janout V, Kabesch M: Investigation of the fine structure of European populations with applications to disease association studies. Eur J Hum Genet. 2008, 16: 1413-1429. 10.1038/ejhg.2008.210.PubMedView ArticleGoogle Scholar
- Nelis M, Esko T, Magi R, Zimprich F, Zimprich A, Toncheva D, Karachanak S, Piskackova T, Balascak I, Peltonen L: Genetic structure of Europeans: a view from the North-East. PLoS One. 2009, 4: e5472. 10.1371/journal.pone.0005472.
- Pennisi E: Genetics. 17q21.31: not your average genomic address. Science. 2008, 322: 842-845. 10.1126/science.322.5903.842.
- Evanno G, Regnaut S, Goudet J: Detecting the number of clusters of individuals using the software STRUCTURE: a simulation study. Mol Ecol. 2005, 14: 2611-2620. 10.1111/j.1365-294X.2005.02553.x.
- Sharp AJ, Locke DP, McGrath SD, Cheng Z, Bailey JA, Vallente RU, Pertz LM, Clark RA, Schwartz S, Segraves R: Segmental duplications and copy-number variation in the human genome. Am J Hum Genet. 2005, 77: 78-88. 10.1086/431652.
- Conrad DF, Andrews TD, Carter NP, Hurles ME, Pritchard JK: A high-resolution survey of deletion polymorphism in the human genome. Nat Genet. 2006, 38: 75-81. 10.1038/ng1697.
- The Wellcome Trust Case Control Consortium: Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007, 447: 661-678. 10.1038/nature05911.
- Helgason A, Yngvadottir B, Hrafnkelsson B, Gulcher J, Stefansson K: An Icelandic example of the impact of population structure on association studies. Nat Genet. 2005, 37: 90-95.
- Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ, Sham PC: PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007, 81: 559-575. 10.1086/519795.
- Komura D, Shen F, Ishikawa S, Fitch KR, Chen W, Zhang J, Liu G, Ihara S, Nakamura H, Hurles ME: Genome-wide detection of human copy number variations using high-density DNA oligonucleotide arrays. Genome Res. 2006, 16: 1575-1584. 10.1101/gr.5629106.
- Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, Andrews TD, Fiegler H, Shapero MH, Carson AR, Chen W: Global variation in copy number in the human genome. Nature. 2006, 444: 444-454. 10.1038/nature05329.
- Zogopoulos G, Ha KC, Naqib F, Moore S, Kim H, Montpetit A, Robidoux F, Laflamme P, Cotterchio M, Greenwood C: Germ-line DNA copy number variation frequencies in a large North American population. Hum Genet. 2007, 122: 345-353. 10.1007/s00439-007-0404-5.
- Hegele RA: Copy-number variations and human disease. Am J Hum Genet. 2007, 81: 414-415; author reply 415. 10.1086/519220.
- Locke DP, Jiang Z, Pertz LM, Misceo D, Archidiacono N, Eichler EE: Molecular evolution of the human chromosome 15 pericentromeric region. Cytogenet Genome Res. 2005, 108: 73-82. 10.1159/000080804.
- Crow YJ, Leitch A, Hayward BE, Garner A, Parmar R, Griffith E, Ali M, Semple C, Aicardi J, Babul-Hirji R: Mutations in genes encoding ribonuclease H2 subunits cause Aicardi-Goutieres syndrome and mimic congenital viral brain infection. Nat Genet. 2006, 38: 910-916. 10.1038/ng1842.
- Lorenzo C, Serrano-Rios M, Martinez-Larrad MT, Gabriel R, Williams K, Gonzalez-Villalpando C, Stern MP, Hazuda HP, Haffner SM: Was the historic contribution of Spain to the Mexican gene pool partially responsible for the higher prevalence of type 2 diabetes in Mexican-origin populations? The Spanish Insulin Resistance Study Group, the San Antonio Heart Study, and the Mexico City Diabetes Study. Diabetes Care. 2001, 24: 2059-2064. 10.2337/diacare.24.12.2059.
- Martinez-Larrad MT, Fernandez-Perez C, Gonzalez-Sanchez JL, Lopez A, Fernandez-Alvarez J, Riviriego J, Serrano-Rios M: [Prevalence of the metabolic syndrome (ATP-III criteria). Population-based study of rural and urban areas in the Spanish province of Segovia]. Med Clin (Barc). 2005, 125: 481-486. 10.1157/13080210.
- Duan S, Zhang W, Cox NJ, Dolan ME: FstSNP-HapMap3: a database of SNPs with high population differentiation for HapMap3. Bioinformation. 2008, 3: 139-141.
- Abecasis GR, Cherny SS, Cookson WO, Cardon LR: GRR: graphical representation of relationship errors. Bioinformatics. 2001, 17: 742-743. 10.1093/bioinformatics/17.8.742.
- Barrett JC, Fry B, Maller J, Daly MJ: Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics. 2005, 21: 263-265. 10.1093/bioinformatics/bth457.
- Gabriel SB, Schaffner SF, Nguyen H, Moore JM, Roy J, Blumenstiel B, Higgins J, DeFelice M, Lochner A, Faggart M: The structure of haplotype blocks in the human genome. Science. 2002, 296: 2225-2229. 10.1126/science.1069424.
- Pritchard JK, Stephens M, Rosenberg NA, Donnelly P: Association mapping in structured populations. Am J Hum Genet. 2000, 67: 170-181. 10.1086/302959.
- Patterson N, Price AL, Reich D: Population structure and eigenanalysis. PLoS Genet. 2006, 2: e190. 10.1371/journal.pgen.0020190.
- Giardine B, Riemer C, Hardison RC, Burhans R, Elnitski L, Shah P, Zhang Y, Blankenberg D, Albert I, Taylor J: Galaxy: a platform for interactive large-scale genome analysis. Genome Res. 2005, 15: 1451-1455. 10.1101/gr.4086505.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.