This work has generated over 200 million genotypes, making it the largest study of its kind in Spain. Detailed information on the genetic structure of the Spanish population will serve as a reference framework for future GWAS in Spain, and will be shared with other researchers subject to external National Public Health evaluation and approval.
We have characterized the genetic structure of the population of Spain, describing genome-wide LD patterns, haplotype blocks, population structure and copy-number variants in a sample of over 800 unrelated Spanish individuals. The individuals who participated in the study were recruited by random sampling from a cross-sectional population-based epidemiological survey at eight locations in Spain, representing different geographical regions of the country (South, Central, North-East and North-West). The recruiting centers include both small rural clinics and large hospitals close to major metropolitan areas. Individuals who reported a different nationality were not included in the study. Therefore, the sample can be considered representative of the general Spanish population.
These samples were genotyped at Neocodex with an Affymetrix Nsp I 250 K chip. The high call rate (99.1%) attests to the quality of the genotyping. Although there are now commercial genotyping chips that provide more complete coverage of the genome, at the start of this project the Nsp I 250 K chip was the best available choice, and it provides enough genotype information for the current project.
LD and haplotypic structure
A major finding of the present study is that the Spanish population is generally similar to the CEU HapMap sample (of Northern and Western European origin), but also largely homogeneous within itself. Several lines of evidence point to this conclusion. For example, a significant proportion of the SNPs analyzed were monomorphic (2.3%) or rare (10.2%), even in this large sample of 801 individuals. In comparison, the CEU dataset yielded 15.1% monomorphic or rare SNPs, but in a much smaller sample of only 60 individuals (over 13 times smaller). This large number of SNPs with little or no variability is a sign of the homogeneity of the Spanish population.
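The role of sample size in this comparison can be illustrated with a simple calculation. The sketch below is not part of the analysis pipeline of this study; it assumes that chromosomes are sampled independently, and the function name and the example allele frequency of 1% are chosen purely for illustration. It gives the probability that an allele of a given population frequency is not observed at all, so that the SNP would appear monomorphic in the sample.

```python
def prob_allele_unobserved(allele_freq, n_individuals):
    """Probability that an allele is absent from a sample of diploid individuals.

    Assumption (illustrative only): the 2 * n_individuals chromosomes are
    independent draws from a population with the given allele frequency.
    """
    if not 0.0 <= allele_freq <= 1.0:
        raise ValueError("allele_freq must be between 0 and 1")
    if n_individuals < 0:
        raise ValueError("n_individuals must be non-negative")
    n_chromosomes = 2 * n_individuals
    return (1.0 - allele_freq) ** n_chromosomes


if __name__ == "__main__":
    # Sample sizes taken from the text: 60 CEU and 801 Spanish individuals.
    for label, n in (("CEU", 60), ("Spanish", 801)):
        p_missed = prob_allele_unobserved(0.01, n)
        print(f"{label}: n={n}, P(1% allele unobserved) = {p_missed:.3g}")
```

Under these assumptions, an allele with a frequency of 1% would be missed in roughly 30% of samples of 60 individuals, but almost never in a sample of 801 individuals. A small sample is therefore expected to show more apparently monomorphic SNPs for reasons unrelated to population homogeneity.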
The LD patterns observed in this sample of the Spanish population are similar to those observed in the CEU HapMap sample. This is not surprising, since the level of genetic differentiation within Europe is small. We detect LD extending over large distances in the Spanish population, but less than in the CEU sample. We have also found a large number (33,037) of haplotype blocks. These blocks are generally located close to the blocks detected in the CEU sample, but in the Spanish sample the blocks are more numerous and smaller on average. These findings could be an artefact of the difference in sample size between the two samples, but they may also reflect the more complex origin of the current Spanish population. Indeed, these results are in agreement with the suggestion that the Spanish population has more haplotypic diversity than Northern/Western Europeans. This is a plausible scenario, given that the Iberian Peninsula has been subject to large and long-lasting migratory influences, and admixture, from other European, Mediterranean, and North African populations.
Another interesting finding is that a larger portion of the genome is covered by blocks in the Spanish sample (28%) than in the CEU sample (24%). This finding is again probably due to the larger Spanish sample: the 1602 chromosomes analyzed probably revealed more rare haplotypes, thereby enlarging the proportion of the chromosome covered by haplotype blocks. This extra block coverage in the Spanish sample may prove useful for association studies, although it is probably a characteristic of other large homogeneous samples as well.
These results suggest that the general Spanish population, as characterized in the present study by sampling from eight widely spaced locations across Spain, is generally similar to other European populations, although more genetically diverse than Western and Northern Europeans. Moreover, the Spanish population is remarkably homogeneous within itself in terms of global genetic structure. In view of these results, the population of Spain is sufficiently similar genetically to the CEU sample that the CEU HapMap dataset could be used to infer genotypes for the Spanish population. Nonetheless, in spite of their general similarity, there are substantial differences between these two European subgroups, and therefore imputed data from the HapMap study may not describe some particular genetic patterns of the Spanish population. The dataset in this study can be extremely useful for comparing allele and haplotype frequencies against the CEU sample, and for estimating the confidence of imputed genotypes in all regions of the genome. It is important to note here that some of the differences found between the Spanish and the CEU samples may be due to the difference in size between the two samples.
The results of our population structure analyses are consistent with the absence of major population stratification in this sample of the Spanish population. This result is reassuring, since individuals reporting nationalities other than Spanish were excluded from the study. Both Structure and PC results with a set of 2,050 uncorrelated SNPs showed no evidence of genetic differentiation within the sample.
In addition, we were able to analyze fine structure within this sample by running PC analysis using a large set of markers (102,850 SNPs). The results of this second analysis are also consistent with prior reports that were able to predict locations of origin within a 700 km radius using different European populations [5, 7], and with other studies that found subtle differences between locations within a country [10, 11]. In our sample, following a similar strategy, we were able to differentiate between the two most geographically distant centers. Furthermore, these observed differences seem to correspond to the same geographical axis that has been previously found in European populations. This fine structure can be the result of genomic regions that show strong geographic variation, and may be more evident in small, rural or isolated samples than in major cities where subpopulations tend to mix. This potential source of bias should be taken into account in association studies. It is worth noting that our sample of the Spanish population was quite homogeneous, and the genomic inflation factor (based on median chi-squared), as estimated by the software Plink, was exactly 1, as expected when only one population is being analyzed. Nevertheless, specific genomic regions still need to be carefully reviewed.
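For reference, the median-based genomic inflation factor can be computed as sketched below. This is a minimal illustration of the standard definition (the median of the observed 1-df chi-squared association statistics divided by the median of the chi-squared distribution with 1 degree of freedom, approximately 0.4549). It is not the code used by Plink, and the function and constant names are our own.

```python
from statistics import median

# Median of the chi-squared distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation_factor(chi2_stats):
    """Median-based genomic inflation factor (lambda).

    chi2_stats: iterable of 1-df chi-squared association statistics,
    one per SNP. A value near 1 is consistent with no inflation of the
    test statistics; values well above 1 suggest stratification or
    other systematic bias.
    """
    stats = list(chi2_stats)
    if not stats:
        raise ValueError("at least one test statistic is required")
    return median(stats) / CHI2_1DF_MEDIAN


if __name__ == "__main__":
    # Hypothetical statistics, for illustration only.
    example = [0.02, 0.15, 0.4549364, 1.3, 3.9]
    print(f"lambda = {genomic_inflation_factor(example):.3f}")
```

Because the estimate depends only on the median, a small number of strongly differentiated regions can coexist with a value of 1, which is why individual regions may still require review.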
We have also defined the first CNV map in the Spanish general population. According to our data, 2.35% of the human genome (autosomes and chromosome X) is susceptible to structural variants. This estimate is in line with previously published studies analysing structural variants with the Affymetrix platform [19–21].
We detected a wide range of CNV population frequencies, although only 34.35% of these variants had a population frequency above 1%. Of the CNVs described in this work, 301 are fully covered by previously described structural variants. In addition, another ten CNVs have 90% or more of their nucleotides represented in previous CNVs. These 311 CNVs are therefore supported by at least one independent study. The remaining 312 CNVs are also included in Additional File 2, but for descriptive purposes only. These CNVs need to be confirmed in independent datasets. Indeed, because we have analyzed 799 samples, some of these CNVs could be low-frequency or population-specific variants that went undetected in previous studies with smaller sample sets.
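The two support criteria used above (full coverage, or at least 90% of nucleotides represented in previous variants) can be expressed as an interval-coverage calculation. The sketch below is illustrative only and is not the procedure used to build Additional File 2. It assumes that all intervals lie on the same chromosome, that coordinates are half-open (start inclusive, end exclusive), and that coverage is measured against the union of the previously described variants. All function names and labels are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def covered_fraction(cnv, known_variants):
    """Fraction of the CNV's nucleotides covered by the union of known variants."""
    start, end = cnv
    length = end - start
    if length <= 0:
        raise ValueError("CNV must have positive length")
    covered = 0
    for k_start, k_end in merge_intervals(known_variants):
        overlap = min(end, k_end) - max(start, k_start)
        if overlap > 0:
            covered += overlap
    return covered / length


def support_level(cnv, known_variants, partial_threshold=0.9):
    """Classify a CNV by how much of it is represented in known variants."""
    fraction = covered_fraction(cnv, known_variants)
    if fraction == 1.0:
        return "fully_covered"
    if fraction >= partial_threshold:
        return "mostly_covered"
    return "unsupported"


if __name__ == "__main__":
    # Hypothetical coordinates, for illustration only.
    print(support_level((100, 200), [(50, 150), (140, 300)]))
    print(support_level((100, 200), [(100, 195)]))
    print(support_level((100, 200), [(0, 50)]))
```

Measuring coverage against the union of known variants is one possible reading of "fully covered"; requiring a single previously described variant to span the whole CNV would be a stricter alternative.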
We have confirmed in this study that nonallelic homologous recombination (NHR) could explain the origin of about 33% of CNVs. Interestingly, those CNVs are more frequent than other variants outside rearrangement hotspot regions, and they represent 46.50% of all CNVs detected in this study. Regardless of the frequency of NHR events, we estimate that a considerable proportion of CNVs in the normal population may be a consequence of NHR.
Most of the CNVs detected in our study overlap with known genes, and of those, 157 CNVs (25.22%) overlap with 125 disease loci. This observation is in agreement with previous results. For instance, the 38,406 structural variant regions included in DGV overlap 1183 disease loci. There are several plausible reasons for these observations, such as false positives in CNV genome-wide surveys, inaccurate disease-frequency estimates, embryonic lethality of homozygous deletions of specific genes, misclassification of samples as normal controls, and rescue of the altered gene function by other related gene products.
In our sample set, only 43 (6.90%) CNVs overlapping disease loci have a population frequency above 1%, and none of them include homozygous deletions. Of those, only three CNVs exceed 10% in population frequency, and all of them are completely covered by previously described structural variants. Two of these CNVs are contiguous on chromosomal region 15q11.2, one of the most unstable regions in the human genome. These two CNVs overlap the genes hect domain and RLD 2 (HERC2), associated with skin, hair and eye pigmentation (OMIM: 227220), and B-cell CLL/lymphoma 8 (BCL8), which has been potentially implicated in B-cell lymphoma (OMIM: 601889). The third CNV is located at 19p13.13 and overlaps the gene ribonuclease H2, subunit A (RNASEH2A), whose mutations may be responsible for Aicardi-Goutieres syndrome (OMIM: 610333). Interestingly, this is a severe autosomal recessive disorder that mimics in utero viral infections, and therefore its real incidence could be underestimated.
All these data suggest that some disease loci could be located within genomic regions that are prone to structural alterations. This observation has potential implications for molecular diagnosis and for estimates of the frequency of the associated disease phenotypes.
Finally, our results suggest that structural variants could be responsible for a small percentage of the Hardy-Weinberg deviations and missing genotypes commonly observed in genome-wide surveys. Therefore, it is advisable to consider the existence of such structural variants for specific SNPs when genome-wide association studies (GWAS) are performed.
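To illustrate how a structural variant could produce a Hardy-Weinberg deviation, the sketch below applies a standard 1-df chi-squared goodness-of-fit test to genotype counts at a single biallelic SNP. It is an illustration under stated assumptions rather than the test used in this study: the function name and the genotype counts are hypothetical, and the scenario assumes that carriers of a heterozygous deletion spanning the SNP are called as homozygous for the remaining allele.

```python
# Critical value of chi-squared with 1 degree of freedom at alpha = 0.05.
CHI2_1DF_CRITICAL_05 = 3.841


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-squared statistic for departure from Hardy-Weinberg proportions.

    n_aa, n_ab, n_bb: observed counts of the AA, AB and BB genotypes.
    Expected counts are derived from the observed allele frequencies.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("at least one genotype call is required")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p                      # frequency of allele B
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


if __name__ == "__main__":
    # Hypothetical counts: genotypes in Hardy-Weinberg proportions.
    in_equilibrium = hwe_chi_square(25, 50, 25)
    # Hypothetical counts: same alleles, but some true heterozygous or
    # deletion-carrying individuals are called as homozygotes.
    with_deletion = hwe_chi_square(35, 30, 35)
    for label, stat in (("equilibrium", in_equilibrium), ("with deletion", with_deletion)):
        flag = "deviates" if stat > CHI2_1DF_CRITICAL_05 else "consistent"
        print(f"{label}: chi2 = {stat:.2f} ({flag} at alpha = 0.05)")
```

An apparent excess of homozygotes of this kind at a SNP lying inside a known CNV is one pattern that may warrant checking for a structural variant before the SNP is discarded or interpreted.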