Recent human evolution has shaped geographical differences in susceptibility to disease
© Marigorta et al; licensee BioMed Central Ltd. 2011
Received: 12 August 2010
Accepted: 24 January 2011
Published: 24 January 2011
Searching for associations between genetic variants and complex diseases has been a very active area of research for over two decades. More than 51,000 potential associations have been studied and published, a figure that keeps increasing, especially with the recent explosion of array-based Genome-Wide Association Studies. Even if the number of true associations described so far is high, many of the putative risk variants detected so far have failed to be consistently replicated and are widely considered false positives. Here, we focus on the world-wide patterns of replicability of published association studies.
We report three main findings. First, contrary to previous results, genes associated with complex diseases present lower degrees of genetic differentiation among human populations than average genome-wide levels. Second, also contrary to previous results, the differences in replicability of disease-associated loci between Europeans and East Asians are highly correlated with genetic differentiation between these populations. Finally, highly replicated genes present increased levels of high-frequency derived alleles in European and Asian populations when compared to African populations.
Our findings highlight the heterogeneous nature of the genetic etiology of complex disease, confirm the importance of the recent evolutionary history of our species in current patterns of disease susceptibility and could cast doubts on the status as false positives of some associations that have failed to replicate across populations.
The discovery of genetic variants that increase susceptibility to disease represents one of the greatest challenges for epidemiology and genomics. Detailed knowledge about the etiology of many diseases keeps accumulating and in the near future it will help to improve disease management. After decades of research in genetic epidemiology, more than 51,000 different association studies for human diseases have been published and 11,501 genes have been reported to be associated with disease, as recorded up to December 2010 in the HuGENet browser. Moreover, thanks to recent technological advances, there has been a surge of genome-wide association studies (GWAS) that simultaneously study hundreds of thousands of SNPs over the whole genome [3–5]. For instance, most GWAS recorded in the HuGENet browser have been published recently, from 2008 on (812 out of 935 by December 15th, 2010).
In spite of their success, genetic association studies for common complex diseases often suffer from a lack of reproducibility. Only a very small number of risk variants have shown a consistent pattern of positive replication across independent studies [4–8]. Different confounding factors may be the source of these inconsistencies. Two well-known sources of lack of replicability are reduced statistical power due to small (and varying) sample sizes [5, 9, 10] and population stratification. Other potential sources include disease heterogeneity, since some complex diseases might comprise similar entities with shared symptoms but different genetic architectures; hidden age-varying effects; biased ascertainment of genetic markers; and publication bias. To overcome these confounding factors, the NCI-NHGRI working group on replication in association studies published a set of recommendations for establishing the credibility of true positive disease-associated genetic variants. One of their crucial recommendations is that an association must be replicated in an independent sample of individuals before it can be considered statistically trustworthy.
However, a true association could fail to be replicated due to heterogeneity in the genetic architecture of the disease under study, particularly when replication studies are carried out in populations with different evolutionary histories. Indeed, many common SNPs present significantly different frequencies among human populations or even appear to be polymorphic only in certain populations (i.e. they are population-specific SNPs). For instance, the six possible pairwise comparisons of the allele frequencies of 63,012 genic SNPs among 4 different populations (Hispanics, African Americans, Asian Americans and European Americans) show that, although most SNPs (from 72% to 96%) are present in the two compared populations, only 44% to 72% of these shared variants have allele frequencies >10% (i.e. are common) in both populations. Furthermore, a resequencing survey in a sample of 90 individuals from 6 world-wide populations showed that only 56% of common SNPs were already present in the HapMap database. Finally, 25 out of 43 meta-analyses of complex disease-associated variants showed heterogeneity in allele frequency among human populations.
It is thus reasonable to hypothesize that differences in the evolutionary history of loci associated with disease could have led to a non-homogeneous world-wide distribution of genetic risk variants. In this scenario, replication studies of risk alleles would frequently fail because of a true heterogeneity in the genetic architecture of common diseases. Previous studies have partially addressed the role of heterogeneity of genetic ancestry in association studies, without positive results. Lohmueller et al. analyzed population differentiation patterns between populations of European and African ancestry in 48 highly replicated disease-associated SNPs. Also, Myles et al. analyzed the world-wide allelic distribution of 25 disease-associated SNPs from the WTCCC genome-wide scan. Finally, Adeyemo et al. checked for differences in allele frequencies among 11 HapMap populations for 621 SNPs that had been associated with disease in GWAS performed on individuals of European ancestry. In all three studies, with the exception of some extreme differences in a few variants, disease-associated SNPs presented levels of differentiation among populations that were equivalent to the genome-wide average.
To date, however, no general study has tested whether inter-population genetic heterogeneity has affected the replication rates of association studies. Here, we aim to evaluate this hypothesis. Ideally, the study should be carried out by means of a comprehensive meta-analysis of GWAS data. However, there is still a bias in the populations chosen for this kind of association study, since the great majority of them (≈90%) have been carried out on individuals of European ancestry. In addition, most of these GWAS use mixed panels of individuals from different regions in Europe, making it impossible to assign the replication status of disease variants across populations within Europe.
In contrast, classical association studies based on candidate genes have been performed in great numbers all over the world and their results are publicly available. The Genetic Association Database (GAD) is one of the largest repositories of association studies carried out during the last 25 years. Analyzing that dataset, we find that risk variants from genes that diverged most between human populations present lower rates of replication. In contrast, world-wide distributed risk alleles appear to be located in loci that do not show population-specific patterns of genetic variability. These results point towards a role of the recent evolutionary history of human populations in shaping genetic risk for complex diseases and suggest that part of the disease variants that have not been replicated might be true risk alleles, at least in some populations.
Analyses of the Global Set - Global vs. Pairwise FST
A first analysis showed that the disease-associated genes in the Global Set (n = 403 genes) present significantly lower inter-population genetic differentiation than equivalent sets of autosomal genes (FST = 0.083 vs. FST = 0.1045, resampling test, p-value < 10⁻⁴).
Next, we analyzed the relationship between levels of population differentiation and the replicability of disease associations. We detected a tendency towards negative correlations between FST and replicability. The tendency is only visible when testing the most reliable associations, those with many studies or with longer genes (Additional Files 4, 5 and 6), and it holds regardless of the method used to compute FST (average genic FST, by SNP or only tagSNPs). Thus, although there was a trend towards lower replicability of associations between disease and genes with high global FST, most correlations were non-significant and the correlations between replicability and global FST lacked consistency.
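The logic of such a resampling test can be sketched as follows. This is only an illustration under stated assumptions, not the procedure used in the study: the function name `resampling_p_value`, the plain random draws (with no matching of gene properties, which "equivalent sets" may well imply) and the toy FST values are all hypothetical.

```python
import random


def resampling_p_value(disease_fst, genome_fst, n_resamples=10_000, seed=0):
    """One-sided resampling p-value for 'disease genes have lower mean FST'.

    Assumption: random gene sets of the same size are drawn without
    replacement from the genome-wide list, with no further matching.
    """
    rng = random.Random(seed)
    n = len(disease_fst)
    observed = sum(disease_fst) / n
    as_low = 0
    for _ in range(n_resamples):
        sample = rng.sample(genome_fst, n)
        if sum(sample) / n <= observed:
            as_low += 1
    # Add-one correction so the estimate is never exactly zero.
    return (as_low + 1) / (n_resamples + 1)


# Hypothetical toy data: 1,000 genic FST values, with the "disease" set
# deliberately taken from the low end of the distribution.
genome_fst = [i / 10000 for i in range(1000)]
disease_fst = genome_fst[:20]
p_value = resampling_p_value(disease_fst, genome_fst, n_resamples=1000)
print(p_value)
```

With 10,000 resamples the smallest attainable p-value under this correction is about 10⁻⁴, which is the order of magnitude reported above.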
Analyses of the Continental Set
[Table: Summary of Spearman's correlation coefficients between FST and ϕ (the discordance in replicabilities) for the 37 associations from the Continental Set. Only the following values survive from the table body, in their original order: 7 × 10⁻⁵; 2.58 × 10⁻²¹; 2.02 × 10⁻⁴; 4.5 × 10⁻⁶; 6.85 × 10⁻⁴.]
These results suggest that differences in the continental replicabilities of disease associations (in Europe and East Asia) tend to occur in disease-associated genes that show an increased amount of genetic differentiation between human populations. Still, different confounding factors could be shaping this correlation. For instance, a recent study based on HapMap data has shown that the degree of differentiation in the frequency of SNPs in different human populations depends on the functional role of the SNPs. Within genes, for instance, non-synonymous SNPs show the lowest amount of genetic differentiation among populations while SNPs located in 3'-UTR, 5'-UTR and intronic regions show an increased level of population differentiation. This trend was also observed in our data: intronic SNPs have a mean FST of 0.117 (n = 3,590, 96.9% of the total) while exonic (synonymous and non-synonymous) SNPs have a mean FST of 0.063 (n = 63, 1.7% of the total). Therefore, variable contributions of different SNP classes to high and low replicabilities may be driving the correlations between FST and ϕ. Moreover, the fact that the GAD pools association studies performed over a wide range of years and under many different conditions (such as sample size) constitutes another potential source of confounding.
To try to control for these potential sources of bias, we performed a multiple forward stepwise regression analysis to determine which variable or combination of variables best explained variance in ϕ. We introduced eight possible predictors in the model, the average genic FST together with seven potential confounding factors: (1) total number of SNPs in the gene (related to gene length); (2) the percentage of intronic SNPs; (3) total number of studies in the association; (4) total number of studies performed in Europe; (5) total number of studies performed in East Asia; (6) the average sample size of studies; and (7) the average year of study publication for each association. In total, 564 association studies were surveyed (Additional File 7).
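The idea behind forward stepwise selection can be sketched as follows. This is a minimal illustration, not the statistical procedure actually run: the entry criterion here is a fixed gain in R² (`min_gain`, an assumption standing in for whatever entry test was used), only three of the eight predictors are shown, and all names and values are hypothetical toy data.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def r_squared(columns, y):
    """R^2 of an ordinary least squares fit of y on an intercept plus columns."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    xtx = [[sum(row[p] * row[q] for row in design) for q in range(k)] for p in range(k)]
    xty = [sum(row[p] * y[i] for i, row in enumerate(design)) for p in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def forward_stepwise(predictors, y, min_gain=0.02):
    """Greedily add the predictor that raises R^2 most, until the gain is small."""
    selected, best_r2 = [], 0.0
    remaining = dict(predictors)
    while remaining:
        scores = {
            name: r_squared([predictors[s] for s in selected] + [col], y)
            for name, col in remaining.items()
        }
        name = max(scores, key=scores.get)
        if scores[name] - best_r2 < min_gain:
            break
        selected.append(name)
        best_r2 = scores[name]
        del remaining[name]
    return selected, best_r2


# Hypothetical toy data: phi depends on FST and SNP count, not on study count.
predictors = {
    "fst": [0.02, 0.05, 0.08, 0.10, 0.12, 0.15, 0.18, 0.20],
    "n_snps": [10, 80, 30, 120, 20, 60, 150, 40],
    "n_studies": [4, 6, 5, 9, 4, 7, 8, 5],
}
phi = [2 * f + 0.001 * s for f, s in zip(predictors["fst"], predictors["n_snps"])]
selected, r2 = forward_stepwise(predictors, phi)
print(selected, round(r2, 3))
```

In this toy example the uninformative predictor is never entered, which mirrors how a confounder that explains no additional variance in ϕ drops out of the final model.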
[Table: Summary of multiple regression analysis for the Continental Set. Only the row labels and some values survive from the table body; they are listed here in their original order.]
a) Full set
a.1) Two predictors (n = 33): Gene Length (number of SNPs); 5 × 10⁻⁷; 7 × 10⁻⁶; FST (population differentiation)
a.2) One predictor (n = 31; NRG1/PARK2 out): FST (population differentiation)
b) Conservative set
b.1) Two predictors (n = 26): FST (population differentiation); 9.9 × 10⁻⁷; 1.1 × 10⁻⁵; Gene Length (number of SNPs); 2.14 × 10⁻⁴
b.2) One predictor (n = 24; NRG1/PARK2 out): FST (population differentiation)
It is still possible that this correlation could have arisen from a lack of statistical power. For instance, an association study in East Asians could have failed to replicate a previous association found in Europeans if, with similar sample size, the tested genes harbored markers with lower allele frequencies in the replication population. We calculated the percentage of SNPs per gene from the Continental Set that were very rare in one continent while being common in the other, that is, the percentage of SNPs that are common in only one continent (see Methods). This percentage of extreme-frequency SNPs was not correlated with ϕ (ρ = 0.138, p < 0.443, n = 33), but it was positively correlated with FST (ρ = 0.469, p < 0.006, n = 33). We then repeated the multiple forward stepwise regression with this statistic added as another explanatory variable for ϕ. However, the same models as above arose (see Table 2), and this statistic was discarded as an explanatory variable of ϕ. Thus, we can exclude the possibility that FST explains the differences in replicability between Europe and East Asia just as a by-product of lack of statistical power.
Finally, to further validate the correlation between ϕ and FST, we performed a marker-based analysis in which we studied the associated variants themselves and not the genes that contain them. After manual scrutiny of the 444 papers that reported the 37 associations in the Continental Set, we established which genetic marker had been analyzed in each study, and ascertained that 54 different SNPs that were associated in these studies were available for FST analysis (Additional File 8). Again, we found a positive correlation between the discordance in continental replicabilities measured by ϕ and the FST of the selected markers (ρ = 0.286, p < 0.036, n = 54).
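For readers unfamiliar with the statistic, Spearman's ρ is simply Pearson's correlation computed on ranks, with tied values sharing their average rank. The sketch below shows this with standard-library Python only; the function names and the five toy (FST, ϕ) pairs are hypothetical and are not taken from the data analyzed here.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy values: per-association FST and discordance index phi.
fst = [0.02, 0.05, 0.08, 0.11, 0.20]
phi = [0.0, 0.1, 0.3, 0.2, 0.6]
rho = spearman_rho(fst, phi)
print(round(rho, 3))
```

Because only ranks enter the calculation, the statistic is insensitive to the skewed distribution that FST values typically show.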
Ancestral and derived alleles
[Table: Population-specific test on the long-term evolutionary status for the SNPs from the 890 associations from the Global Set. Only the following labels and values survive from the table body, in their original order: East Asians (CHB); East Asians (JPT); (n = 441); (n = 441); 4.89 × 10⁻⁵; 3.19 × 10⁻⁴; 1.43 × 10⁻⁴.]
We have analyzed the role of genetic heterogeneity among human populations in the replicability of genetic association studies. To address this question, we have measured the degree of population differentiation in loci that have shown differential patterns of association with disease, as reported in the Genetic Association Database. We report three main results. First, SNPs harbored in genes associated with complex disease present lower FST values than the rest of genic SNPs in the genome; second, there is a negative correlation between the replicability of studies associating genes with disease and the FST values of the associated genes in European and East Asian populations; and, third, in the same populations, high replicability genes present increased levels of high-frequency derived alleles. These findings would confirm the importance of the recent evolutionary history of our species in the current patterns of susceptibility to complex diseases.
Given the large number of false positives reported in association studies [4, 6–8], a relevant starting issue is the adequacy of the GAD for our analysis. In that respect, two points must be noted. First, replication studies, which are the focus of our manuscript, are in fact a way to assess how likely previous associations are to be false positives. A good part of our study would be unnecessary if every association ever reported had been a true positive. In that sense, the known presence of both false and true positives in the database prompted the particular series of analyses presented here. The approach will be different when enough GWAS data are available since, given current standards in the field, it is false negatives that dominate in these studies [6, 25, 26]. Second, even if the "low replicability" category contains a mixture of false and true positives, it is clear that the studies with the highest replication rates will correspond to true positives. Indeed, it has been known for quite some time that a considerable number of genetic variants have been consistently associated with complex diseases. For example, a review of 25 associations by Lohmueller et al. found an excess of replications in classical association studies that cannot be explained by false positives. Moreover, a recent paper by Siontis et al. shows that a good number of the associations detected in non-GWAS classical association studies (mostly those extensively studied) have been replicated in recent GWAS (41 of 291 with a p < 10⁻⁷). Neither of these results would have been obtained if highly replicated associations had been false positives.
Our first observation of lower FST values in genes associated with complex disease is relevant to the adaptive history of these genes. It is well known that purifying selection is the main force driving the evolution of genes related to Mendelian disorders, as they tend to harbor lower levels of polymorphism. In contrast, complex-disease associated genes seem to be under different pressures, with mixed evolutionary signals. Overall, our observation of lower levels of population differentiation is suggestive of purifying selection. These findings contradict results from other authors who did not detect differences in FST values of disease-associated variants relative to genome-wide levels [19–21]. However, these previous studies focused on variants instead of genes and, therefore, could only muster small sample sizes. Myles et al. and Lohmueller et al. studied, respectively, 25 and 48 SNPs, with the resulting lack of statistical power. More recently, the study by Adeyemo and Rotimi was able to collect 621 disease-associated SNPs. As expected, they found SNPs with both very large and very low FST values across populations. However, they focused on average FST values per disease and did not test their global average FST of 0.105.
In any case, our finding of low average FST values in 403 genes that have been associated with disease is still inconclusive. Since our data mainly come from classical (non genome-wide) association studies, our observation may have different causes, some of them spurious. Of course, a true extensive role of purifying selection governing the evolution of these genes is a possibility; but it is also possible that certain classes of genes with particular average selective pressures tend to be involved in complex diseases; or that there has been a human bias towards the inclusion of certain categories of genes in association studies. Indeed, when tested for functional enrichment of PANTHER Biological Process categories (see Additional File 10), complex-disease genes from the Global Set showed an enrichment for the category "Immunity and defense" (corrected p < 2.11 × 10⁻⁴⁰) and an array of "signaling"-related categories, such as "Signal transduction", "Cell surface receptor mediated signal transduction" and "Cell communication" (corrected p-values = 9.71 × 10⁻⁴⁰, 2.22 × 10⁻³⁰ and 1.21 × 10⁻²³, respectively), but these results can be the consequence of any one of the causes mentioned above, or of several of them.
In a previous analysis of the Genetic Association Database, Amato et al. found a trend that seems opposite to the one we report here. Namely, they detected increased levels of population differentiation in disease-associated genes when compared to genome-wide base levels. However, a careful analysis shows that our results are consistent with those of Amato et al. and that the apparent contradiction is due to their analysis criteria differing from ours in two key aspects. First, their set of "disease genes" was composed of genes positively associated with disease at least once while, to avoid noise, we only included associations that had been studied four or more times (n = 1,793 vs. n = 403). Second, Amato et al. used as the FST value representative of each gene the maximum FST value of any of the SNPs within that gene. In contrast, we averaged the FST values of all the SNPs in a gene. This second difference is crucial: when we repeat our analysis using the "maximum FST" method we do find marginally significant increased levels of population differentiation in disease genes (FST = 0.366, n = 403 vs. FST of 0.345, n = 18,671, p-value < 0.022, Mann-Whitney test). The reverse is also true: when we analyze the gene set from Amato et al. with our "average FST" approach, we detect significantly lower population differentiation than genome-wide autosomal levels (FST = 0.097, n = 1,631 vs. FST of 0.104, n = 17,443, p-value < 4.4 × 10⁻⁵, Mann-Whitney test).
The fact that using either "maximum FST" or "average FST" leads to different results raises the question of which approach is more accurate. We believe our method to be more appropriate because "disease genes" are longer on average and therefore tend to harbor more SNPs than the average gene (34.8% more, with an average of 101.48 SNPs, n = 403 vs. an average of 75.28 SNPs, n = 18,671, p-value < 3.1 × 10⁻¹⁴, Mann-Whitney test). In fact, there is a strong positive correlation between the number of SNPs a gene harbors and the maximum FST value these SNPs can reach (ρ = 0.527, p < 10⁻⁵⁰, n = 19,074), while the correlation is much weaker with the gene-specific average FST (ρ = 0.094, p < 10⁻³⁹, n = 19,074). As a result, the maximum FST is more biased by gene length than the average FST. Therefore, an approach based on the average FST seems to be more accurate for our data, in the sense that the average FST of a gene is a better proxy of the amount of genetic differentiation at a given locus.
Our second main observation is that genetic heterogeneity across human populations varies greatly among loci associated with complex diseases. These loci present different degrees of population differentiation depending on their replicability and on the consistency of replicabilities between Europeans and East Asians. These two populations are more similar for loci that contain variants which have been repeatedly associated with disease in different studies, while greater genetic differences are found in loci whose disease variants have not been consistently replicated. These observations can have at least three sources. First, it is possible that different statistical power in different populations is contributing to the correlation between continental replicability and FST. For this to happen, genetic variants that have been associated with disease in a given population would have to be rare in other parts of the world. However, we found no evidence of loci with low consistency of replicability having more SNPs with extreme frequencies (common in one population while rare in the other). Alternatively, recent theoretical studies demonstrate that rare variants may create spurious or synthetic associations at certain common alleles. If rare causal variants make a substantial contribution to disease risk and if different populations present different genealogies, the spurious associations detected in each population would differ and replicability patterns may differ. This scenario would point to an important role for rare variants in the etiology of complex diseases. However, it is difficult to see how highly replicated associations could be spurious, and we did observe a stronger correlation between FST and consistency of replicability for associations that have been replicated in at least 50% of the studies.
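The length bias of the maximum can be illustrated with a small simulation. This is a sketch under a deliberately simple assumption, namely that every SNP's FST is drawn independently from the same uniform distribution regardless of gene length; real FST distributions are skewed, and the function names and parameters below are hypothetical.

```python
import random


def gene_fst_summaries(snp_fst):
    """Return the (average, maximum) FST across the SNPs of one gene."""
    return sum(snp_fst) / len(snp_fst), max(snp_fst)


def simulate(n_snps, n_genes=200, seed=1):
    """Mean of per-gene average FST and of per-gene maximum FST.

    Assumption: SNP-level FST values are independent draws from a
    uniform distribution on [0, 0.5), identical for short and long genes.
    """
    rng = random.Random(seed)
    means, maxima = [], []
    for _ in range(n_genes):
        snps = [0.5 * rng.random() for _ in range(n_snps)]
        mean_fst, max_fst = gene_fst_summaries(snps)
        means.append(mean_fst)
        maxima.append(max_fst)
    return sum(means) / n_genes, sum(maxima) / n_genes


short_mean, short_max = simulate(n_snps=10)
long_mean, long_max = simulate(n_snps=100)
print(round(short_mean, 3), round(long_mean, 3))
print(round(short_max, 3), round(long_max, 3))
```

Even though both gene classes share the same underlying distribution, the genes with more SNPs show a higher typical maximum while their averages stay close, which is the pattern the correlations above point to.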
The final explanation would be that certain variants are contributing to the risk for the disease in some populations but not in others. The range of factors underlying this possibility is not limited to purely genetic causes. For instance, some gene-environment interactions that have appreciable joint effects in complex diseases have been described  and environmental conditions vary widely across the planet. Thus, environmental variability among populations could have a role in the differential effect of genetic variants through populations that we have detected. In any case, the evolutionary history of humans would be such that some of the variants associated to disease would increase susceptibility differently in different populations.
Our study points to the heterogeneous genetic architecture of complex diseases, which, even if modulated by similar cellular and molecular pathways in all humans, may present intricate population differences regarding causal variants and loci. Although in most cases the behavior of susceptibility or protective variants is shared across populations, some differential effects for the same alleles in different populations have been established, like the European-specific protective effect against HIV-1 infection progression of the 32-bp deletion allele of the CCR5 gene [34–36] or the presence of two different haplotype blocks in the NRG1 gene that confer susceptibility to schizophrenia in European and East Asian populations, respectively. These differences could eventually lead to systematic differences among human populations in susceptibility to disease, and may underlie well-known cases, such as the differential susceptibility and prevalence of asthma between individuals of Mexican or Puerto Rican ancestry [38–40].
Usually, lack of replication of association mapping methods is thought to be due to the presence of confounding factors such as population stratification, lack of statistical power or publication bias. Therefore, stringent replication criteria are necessary to avoid false positives and to ultimately confirm that a certain genetic variant confers susceptibility to disease . However, the fact that the allelic architecture of disease may be different through human populations raises the issue of revisiting some genetic association studies for complex diseases, since some putatively false positives might hint at diseases whose etiology is geographically heterogeneous.
As to the causes of these differences, it has been previously shown that there is variation in the disease-susceptibility variants that are present in different populations. These differences have been attributed to changes in selective pressures over standing variation [41, 42] or to population-specific selective processes [43, 44]. Our results show that, when compared against low replicability genes, high replicability genes present lower FST values between Europeans and Asians, but high FST values between either of these populations and Africans. Together with the fact that derived alleles are more frequent in these high replicability genes in Asian and European populations, this suggests that replicability has been higher in loci whose allele frequencies changed in the ancestors of Europeans and Asians after they left Africa. It is tempting to speculate about a role of natural selection in shaping this pattern, which would fit with suggestions that selection leads, in some cases, to disease as a side effect of adaptation [41, 42]. However, our results could simply reflect genetic drift relaxing purifying selection in non-African populations. In fact, it has been shown that the bottleneck associated with the out-of-Africa event reduced the ability of purifying selection to purge deleterious alleles.
In summary, our results not only show that the evolutionary history of disease-associated loci (influenced either by demographic or by selective forces) plays a role in the genetic susceptibility to disease in Eurasians; they also cast doubt on the status as false positives of many associations that have not been widely replicated. Obtaining this picture has only been possible by analyzing more than 20 years' worth of classical association studies. We hope that the extension of GWAS to populations of non-European ancestry will, in time, allow systematic research on the world-wide distribution of genetic risk variants.
We used the Genetic Association Database (GAD, http://geneticassociationdb.nih.gov/, update December 29th, 2007), comprising over 39,000 records, to select genetic loci that contain variants associated with common diseases. The GAD reports the most important features of genetic association studies published over the last 25 years, including, among others, risk variant, gene name, disease phenotype, sample ethnic origins, known epistatic interactions, conclusion of the study, journal, year and submitter. Every record refers to an association; that is, if a given study analyzes k different markers from the same gene, the GAD stores them as k different records representing k different associations. However, the protocols of the GAD are hierarchically gene-centered, with fewer than 10% of the records providing systematic information about the actual marker analyzed. In other words, the database does not focus on studies of certain genetic markers but on associations between genes and disease phenotypes. Therefore, although ideally our aim was to distinguish marker-specific replicability patterns, we focused on associations between genes and diseases. A summary of the steps and filtering applied to the GAD records, which are explained in the following sections, is available in Figure 1 and Additional File 1.
First set of associations - Global Set
From the original database, we loaded into a local MySQL database those records (n = 17,355) that carried information on the final status of the association, with two possible states: positive or negative (association or lack of it, respectively). Then, all the associations between gene and disease (e.g., CTLA4 - diabetes type II) were selected. Next, we performed a global, manually supervised accuracy check to solve problems caused by the oversensitivity of our queries. Thus, those associations between the same gene and the same disease previously classified as different (such as "NOS3 - high blood pressure" and "NOS3 - hypertension") were clustered together. Also, typographical errors (e.g. "epilpsy" - "epilepsy") were corrected. At this point, our database was formed by 7,072 different associations between one gene and one disease. Although many associations had been studied several times (e.g., ADRB2 - Asthma, 59 times), most of them had been studied only once (4,491 associations, 63.5%).
Second set of associations - Continental Set
From the original database, we kept those records (n = 7,342) for which besides the final status of the association (Y/N), there was also information on the ancestry of the samples (e.g., European Americans from New York). For instance, four different records tested for association between markers at the AKT1 gene and schizophrenia: three of them were positive and based on individuals from Iran, Japan and the USA, while the last study, performed with Finnish individuals, was negative (GAD ID: 116446, 116448, 144228 and 144230, respectively).
Next, we classified each study according to the geographic origin of the individuals that took part in it. Incorporating consensus information on human evolutionary history [46, 47], six major geographic regions were considered: Africa, Europe, Middle East, East Asia, Oceania and America. For example, the four studies from the association between AKT1 and schizophrenia were classified into three categories: those performed on Finnish and USA individuals of European ancestry were grouped together and labeled as European (AKT1 - Schizophrenia - Europe - 2 studies); the study with Japanese individuals was labeled as East Asian (AKT1 - Schizophrenia - East Asia - 1 study) and the study with Iranian individuals was classified as Middle Eastern (AKT1 - Schizophrenia - Middle East - 1 study). More recent world-wide migrations were also considered (e.g. association studies on African American individuals were labeled as African). Moreover, we recovered further information from those studies that had an ambiguous label on the genetic ancestry of the samples (such as "Australian" or "Canadian") and only those for which more specific and unequivocal information was available were kept (e.g. the label "Caucasian" was assigned to the European category). Finally, those studies performed on a mixed panel of samples from different ethnic origins (e.g. "British individuals from Caucasian and Indian origins") were classified under the label of "Mixed", unless the study carried separate information on the association status (positive/negative) for each of the ethnicities present in the samples.
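The classification step can be pictured as a lookup from ancestry labels to regions. The sketch below is purely illustrative: the dictionary holds only the handful of labels mentioned in the text, the exact label strings and the `classify_region` helper are hypothetical, and the manual recovery of ambiguous labels and the "Mixed" category are not modeled (unknown or ambiguous labels are simply left unassigned).

```python
from collections import Counter

# Hypothetical label-to-region table built from the examples in the text.
REGION_BY_LABEL = {
    "finnish": "Europe",
    "european american": "Europe",
    "caucasian": "Europe",
    "japanese": "East Asia",
    "iranian": "Middle East",
    "african american": "Africa",
}


def classify_region(label):
    """Map a sample ancestry label to a region, or 'Unassigned' if unknown."""
    return REGION_BY_LABEL.get(label.strip().lower(), "Unassigned")


# The four AKT1 - schizophrenia studies described above.
akt1_labels = ["Iranian", "Japanese", "European American", "Finnish"]
akt1_by_region = Counter(classify_region(label) for label in akt1_labels)
print(dict(akt1_by_region))
```

Ambiguous labels such as "Australian" fall through to "Unassigned" here, which corresponds to records that would need further manual inspection before being kept.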
At this point, the 7,342 records from the Continental Set were classified into 4,979 different associations connecting one gene and one disease and classified into continental populations: 2,136 associations were labeled as European, 1,775 as East Asian, 287 as Mixed, 131 as African, 65 as Middle Eastern, 39 as Amerindian, 11 as Oceanian and 535 were left unassigned.
Replicability Index Assignation
To measure the replicability of a given association, we calculated the proportion of positive studies compared to the total number of studies of the association. However, since a reliable replicability index can only be estimated if associations have been studied several times, we defined an arbitrary cutoff of four studies, so that only associations that had been studied at least four times were considered. After applying these criteria, the Global Set was finally formed by the 890 gene-phenotype associations that had been studied at least 4 times (out of 7,072 initial associations, Additional File 2).
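The index described above can be sketched in a few lines of Python. This is only an illustration: the record layout (gene, disease, positive/negative flag) and the function name are assumptions and do not reflect the actual GAD export format or the scripts used in the study.

```python
from collections import defaultdict

MIN_STUDIES = 4  # arbitrary cutoff stated in the text


def replicability_index(records, min_studies=MIN_STUDIES):
    """Return {(gene, disease): proportion of positive studies}.

    `records` is an iterable of (gene, disease, is_positive) tuples.
    This layout is an assumption made for the example.
    """
    counts = defaultdict(lambda: [0, 0])  # [positive studies, total studies]
    for gene, disease, is_positive in records:
        counts[(gene, disease)][0] += int(is_positive)
        counts[(gene, disease)][1] += 1
    # Keep only associations studied at least `min_studies` times
    return {
        key: pos / total
        for key, (pos, total) in counts.items()
        if total >= min_studies
    }


# Toy input: the four AKT1 - schizophrenia records (3 positive, 1 negative)
# plus a hypothetical association that was studied only once.
records = [
    ("AKT1", "schizophrenia", True),
    ("AKT1", "schizophrenia", True),
    ("AKT1", "schizophrenia", True),
    ("AKT1", "schizophrenia", False),
    ("GENE_X", "disease_Y", True),
]
index = replicability_index(records)
```

In this toy input the AKT1 association passes the cutoff with three positive studies out of four, whereas the hypothetical single-study association is dropped.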
For the Continental Set, 238 associations (out of 4,979) remained after applying the same criterion of at least 4 studies per association. Most of the remaining associations had been studied in individuals from Europe (n = 129, 54.2%) and East Asia (n = 99, 41.6%). Only a few had been studied in African (n = 4, 1.7%) or Mixed (n = 6, 2.5%) individuals. Since 3 out of the 4 African associations (FCGR2A, NOS2 and TNF loci) came from studies of malaria, which is endemic to Africa, we decided to remove them from our analysis and focus on associations that had been widely studied in both Europe and East Asia (≥4 times in each). Thus, the final Continental Set was formed by the 37 overlapping associations consistently studied in both European and East Asian populations (Additional File 3).
Discordance Index for Replicability in the Continental Set - Cramer's Phi (ϕ)
We used Cramer's ϕ coefficient to calculate an index of discordance among the continental-specific replicabilities, so we could make use of the geographic information in the Continental Set. This statistic ranges from 0 to 1 and constitutes an unbiased estimator of the strength of association between two qualitative variables from a contingency table. For the Continental Set, these variables were "continent" (Europe or East Asia) and "positive and negative studies within continent". When ϕ = 0 there is no association between the two variables, indicating that the levels of replicability in the two continents under study were consistent (e.g. a replicability of 70% in European and 70% in East Asian populations). On the other hand, ϕ = 1 indicates a complete association between the degree of replicability and the continent of origin of the studied populations, that is, replicabilities were fully discordant between continents (e.g. the replicability was 0% in European studies and 100% in East Asian studies).
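For a 2 × 2 table (continent by study outcome), ϕ reduces to the familiar closed form |ad − bc| / √(product of the four marginal totals). A minimal Python sketch follows. The function name and argument order are hypothetical, and returning 0 when a marginal total is empty (e.g. all studies positive in both continents) is a convention assumed here, not one stated in the text.

```python
import math


def cramers_phi(pos_a, neg_a, pos_b, neg_b):
    """Phi coefficient for a 2 x 2 table of study outcomes by continent.

    pos_a / neg_a: positive and negative studies in continent A
    pos_b / neg_b: positive and negative studies in continent B
    """
    row_a = pos_a + neg_a
    row_b = pos_b + neg_b
    col_pos = pos_a + pos_b
    col_neg = neg_a + neg_b
    denom = row_a * row_b * col_pos * col_neg
    if denom == 0:
        # Assumed convention: an empty margin means no measurable discordance
        return 0.0
    return abs(pos_a * neg_b - neg_a * pos_b) / math.sqrt(denom)


concordant = cramers_phi(7, 3, 7, 3)   # 70% replicability in both continents
discordant = cramers_phi(0, 5, 5, 0)   # 0% in one continent, 100% in the other
```

The two calls mirror the examples given in the text: identical replicabilities give ϕ = 0, and completely opposite replicabilities give ϕ = 1.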
SNP polymorphism data from HapMap Project Phase 2 (release 22, April 2007) were selected to study genetic variability between human populations. Only genic SNPs as defined by ENSEMBL (Build 35) were retained for further analyses (n = 1,439,152 SNPs, from 19,176 genes). We downloaded all genotypes for all unrelated samples from the four HapMap populations: 60 CEU individuals (samples of Northern-European ancestry from the CEPH panel), 45 JPT individuals (from Tokyo, Japan), 45 CHB individuals (from Beijing, China) and 60 YRI individuals (Yorubans from Ibadan, Nigeria). Following previous works, JPT and CHB samples were clustered together due to their close genetic relationship (90 individuals, ASN from now on). We identified a total of 50,317 SNPs located in genes reported in the 890 associations from the Global Set, and a total of 6,092 SNPs within the 27 genes from the 37 associations in the Continental Set (no SNPs were found in 4 genes: APOE, HLA-DQA1, HLA-DQB1 and LTC4S). Finally, those SNPs that were monomorphic in both European and East Asian populations were removed from the Continental Set (final set, n = 3,710 SNPs).
Adjacent SNPs tend to be inherited together (such SNPs are said to be in Linkage Disequilibrium, or LD). Therefore, any measure of genetic differentiation calculated for a given SNP may be correlated with the signal from nearby SNPs in LD with it. Since our aim was to check the patterns of replicability and genetic differentiation at different genetic loci, variable SNP densities and LD patterns across genes might bias our estimates. To avoid this, we ascertained sets of representative SNPs (tagSNPs) for each block of LD in the genes under study. We used the SYSNPs browser (http://www.sysnps.org), which uses the Tagger algorithm, to select the tagSNPs of interest. We tagged each population separately (CEU, ASN and YRI) using an r² threshold of 0.8 and a minimum MAF of 0.1, considering only SNPs with a minimum genotyping call rate of 75% of the individuals. Finally, we selected those SNPs that appeared as tagSNPs in all three populations, with a final set of 6,582 and 538 tagSNPs for the Global and Continental Sets, respectively.
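To make the tagging step more concrete, the sketch below shows a simple greedy pairwise tagging procedure followed by the intersection across populations. It is a simplified stand-in written for illustration, not the Tagger or SYSNPs implementation: the function names, the dictionary layout for r² values and the SNP identifiers are all hypothetical.

```python
def greedy_tag_snps(snps, r2, threshold=0.8):
    """Greedy pairwise tagging (simplified illustration).

    snps: list of SNP identifiers
    r2:   dict mapping frozenset({snp_a, snp_b}) -> pairwise r^2
          (missing pairs are treated as r^2 = 0)
    """
    def captures(a, b):
        return a == b or r2.get(frozenset((a, b)), 0.0) >= threshold

    untagged = set(snps)
    tags = []
    while untagged:
        # Pick the SNP that captures the most still-untagged SNPs;
        # sorting makes ties resolve deterministically.
        best = max(sorted(untagged),
                   key=lambda s: sum(captures(s, t) for t in untagged))
        tags.append(best)
        untagged -= {t for t in untagged if captures(best, t)}
    return tags


# Hypothetical LD structure: snpA and snpB are in strong LD, snpC is not
snps = ["snpA", "snpB", "snpC"]
r2_ceu = {frozenset(("snpA", "snpB")): 0.9}
tags_ceu = greedy_tag_snps(snps, r2_ceu)

# Keep only SNPs selected as tagSNPs in all three populations
tags_asn = ["snpA", "snpC"]   # hypothetical per-population results
tags_yri = ["snpB", "snpC"]
shared_tags = set(tags_ceu) & set(tags_asn) & set(tags_yri)
```

Because tagSNPs are chosen independently in each population, the intersection can be much smaller than any single population's tag set, which is consistent with the reduction reported above.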
Population Differentiation (FST) Calculation
We used Wright's FST to measure genetic differentiation among populations. This statistic ranges from 0 to 1, quantifies the differences in allele frequencies among populations, and is the classical measure of genetic differentiation. Allele frequencies and FST values [53, 54] for each SNP were calculated with Arlequin v3.11 as implemented in SNPator, using the genotypes from the ASN, CEU and YRI populations for the Global Set and from the ASN and CEU populations for the Continental Set. Therefore, for each SNP we calculated three pairwise FST values (European-Asian, European-African and Asian-African) and a global FST value including the three HapMap populations. To test for genetic differentiation patterns in different genes, we computed FST in three different ways: (1) averaging the FST values of all SNPs in a gene; (2) using separately the FST value corresponding to each SNP; and (3) using for each gene only the FST values corresponding to its tagSNPs. Finally, to study how association studies performed in different continents could have failed to replicate due to lack of statistical power, we calculated the percentage of SNPs for each gene from the Continental Set that happened to be rare (MAF < 0.1) in one continent while common (MAF > 0.2) in the other continental population (see Table 2).
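As a rough illustration of these quantities, the sketch below computes FST for a biallelic SNP from allele frequencies using the textbook heterozygosity form, FST = (HT − HS) / HT, with populations weighted equally. This is an assumption made for simplicity: Arlequin uses variance-component estimators that correct for sample size, so its values will generally differ somewhat. All function names are hypothetical.

```python
def wright_fst(freqs):
    """Textbook FST for one biallelic SNP.

    freqs: frequency of the same allele in each population
           (populations are weighted equally, an assumption).
    """
    p_bar = sum(freqs) / len(freqs)
    h_total = 2 * p_bar * (1 - p_bar)          # expected heterozygosity, pooled
    if h_total == 0:
        return 0.0                             # monomorphic in all populations
    h_within = sum(2 * p * (1 - p) for p in freqs) / len(freqs)
    return (h_total - h_within) / h_total


def mean_gene_fst(snp_freqs):
    """Approach (1) in the text: average FST over all SNPs of a gene."""
    values = [wright_fst(f) for f in snp_freqs]
    return sum(values) / len(values)


def pct_rare_vs_common(snp_mafs, rare=0.1, common=0.2):
    """Percentage of SNPs rare in one population and common in the other.

    snp_mafs: list of (maf_pop_a, maf_pop_b) pairs.
    """
    flagged = sum(
        1 for a, b in snp_mafs
        if (a < rare and b > common) or (b < rare and a > common)
    )
    return 100.0 * flagged / len(snp_mafs)


# Toy frequencies (hypothetical values)
fst_fixed = wright_fst([0.0, 1.0])     # alternative alleles fixed
fst_equal = wright_fst([0.5, 0.5])     # identical frequencies
pct = pct_rare_vs_common([(0.05, 0.3), (0.3, 0.3), (0.25, 0.02), (0.15, 0.4)])
```

The last helper corresponds to the power-related check described at the end of the paragraph: a marker that is rare in one continent offers little power to detect an association there, even if the same marker is common and informative elsewhere.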
A marker-based analysis of the Continental Set
One of the pitfalls of the GAD database is that the actual markers tested in each study have rarely been recorded. Therefore, we focused on genes and summarized the replicability of each association by gene. However, the tendency of classical association studies to test only a few markers may have affected our replicability measures. Thus, we decided to perform an analysis based on the actual tested markers to help validate our findings. As surveying all the papers selected from the GAD seemed unfeasible, we focused on the 564 records (from 444 papers) that belong to the 37 associations from the Continental Set. For each record (see Additional File 7), we selected those variants that had been tested in at least 10% of the studies from each association. In total, we gathered 72 different polymorphisms, of which 54 were SNP markers. For each SNP, we gathered allele frequencies for Europeans and East Asians from either public databases (HapMap, ALFRED or dbSNP) or, if not available, from the paper with the largest sample size for each continental population. As in the gene-based analysis of the Continental Set, we calculated for each SNP the FST between Europeans and East Asians. Finally, we assigned to each marker the ϕ value of the association it belonged to. All the features of the selected markers are available in Additional File 8.
Ancestral vs. derived alleles
To study the role of long-term evolutionary pressures in disease-associated loci and the replication of association studies, we inferred the ancestral-derived status of each SNP using a phylogenetic parsimony criterion by means of orthologous alignments with chimpanzee (Pan troglodytes) and macaque (Macaca mulatta). Using the Ensembl v49 BlastZ-net alignments [57, 58] we reported the ancestral or derived status for the major allele (allele frequency ≥0.5) for all SNPs in each HapMap population (Additional File 9).
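The exact parsimony rule is not spelled out in the text, so the following sketch assumes a simple one: an allele is called ancestral only when chimpanzee and macaque carry the same base and that base matches one of the two human alleles; otherwise the SNP is left unresolved. Function names and inputs are hypothetical.

```python
def infer_ancestral(human_alleles, chimp, macaque):
    """Return the inferred ancestral allele, or None if unresolved.

    Assumed rule: both outgroups must agree with each other and with
    one of the human alleles.
    """
    if chimp == macaque and chimp in human_alleles:
        return chimp
    return None


def major_allele_status(allele_freqs, ancestral):
    """Classify the major allele (frequency >= 0.5) of one population.

    allele_freqs: dict such as {"A": 0.7, "G": 0.3}
    """
    if ancestral is None:
        return "unknown"
    major = max(allele_freqs, key=allele_freqs.get)
    return "ancestral" if major == ancestral else "derived"


# Hypothetical SNP: outgroups carry G, but A is the major allele
ancestral = infer_ancestral(("A", "G"), chimp="G", macaque="G")
status = major_allele_status({"A": 0.7, "G": 0.3}, ancestral)
```

Under this rule, a population in which the derived allele has reached a frequency of 0.5 or more is the kind of case counted as a high-frequency derived allele.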
Gene Ontology analysis
We used the "expression data analysis" service from the PANTHER database tools website. This utility allows users to "uncover statistically significant relationships between input data and gene or protein functions". We tested the whole list of complex-disease related genes from the Global Set (n = 403) against the full NCBI set of genes. By means of a binomial test, we obtained a Bonferroni-corrected p-value for under- or over-representation of each functional category for all Biological Processes.
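The logic of a binomial over- or under-representation test with Bonferroni correction can be sketched as follows. This is a generic illustration and not PANTHER's code: the one-sided tail is taken in the direction of the observed deviation, which is an assumption, and all names are hypothetical.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binomial_enrichment(k, n, p):
    """One-sided binomial p-value for a functional category.

    k: genes from the input list that fall in the category
    n: size of the input list (e.g. 403)
    p: fraction of reference genes that belong to the category
    """
    if k >= n * p:
        tail = sum(binom_pmf(i, n, p) for i in range(k, n + 1))
        return "over", min(1.0, tail)
    tail = sum(binom_pmf(i, n, p) for i in range(0, k + 1))
    return "under", min(1.0, tail)


def bonferroni(p_value, n_tests):
    """Multiply by the number of categories tested, capped at 1."""
    return min(1.0, p_value * n_tests)


# Toy example: all 10 input genes fall in a category covering half the reference
direction, p_raw = binomial_enrichment(10, 10, 0.5)
p_corrected = bonferroni(p_raw, n_tests=20)
```

With many functional categories tested at once, the Bonferroni step keeps the family-wise error rate under control at the cost of some power.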
A conservative dataset
In some analyses (where indicated in the text) we applied further filters in order to be even more conservative. First, we eliminated associations that, after many attempts, had failed to replicate at least 50% of the time, on the basis that these associations lacked credibility (after filtering, 710 associations remained in the Global Set and 26 associations in the Continental Set). In addition, for the estimation of the average genic FST, we filtered out any gene that had fewer than 10 SNPs in order to get more reliable measures of genetic distance. Finally, for the Global Set we applied varying thresholds on the number of studies, filtering out associations with fewer than 8, 10, 12, 14, 16, 18 or 20 studies, respectively.
Statistical analyses were performed using SPSS version 15.0 (SPSS, Inc., Chicago, IL) and scripts in R v2.10.1. To check whether the average FST of disease-associated genes in the Global Set differed significantly from the genome-wide average FST, we ran a resampling test with 10,000 sets of genes randomly chosen from the whole genome. Each set had the same number of genes (n = 403) as our Global Set. We then counted how many times 403 random genes chosen from the whole genome had an average FST value equal to or greater than the average FST of the Global Set.
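The resampling test can be outlined in Python as below. It is a sketch of the procedure as described, not the original R script: the function name, the fixed random seed and the toy per-gene FST values are assumptions.

```python
import random


def resampling_fraction(genome_fst, observed_mean, n_genes,
                        n_resamples=10_000, seed=0):
    """Fraction of random gene sets whose mean FST >= observed_mean.

    genome_fst: dict mapping gene name -> average FST (all genes)
    n_genes:    size of each random set (403 in the text)
    """
    rng = random.Random(seed)          # fixed seed only for reproducibility
    values = list(genome_fst.values())
    hits = 0
    for _ in range(n_resamples):
        sample = rng.sample(values, n_genes)   # sampling without replacement
        if sum(sample) / n_genes >= observed_mean:
            hits += 1
    return hits / n_resamples


# Toy genome of 100 hypothetical genes with FST values 0.00 ... 0.99
genome_fst = {f"gene{i}": i / 100 for i in range(100)}
frac_low = resampling_fraction(genome_fst, observed_mean=-1.0,
                               n_genes=10, n_resamples=100)
frac_high = resampling_fraction(genome_fst, observed_mean=2.0,
                                n_genes=10, n_resamples=100)
```

When the question is whether disease genes show lower differentiation than the genome average, a fraction close to 1 is the informative result, since it means that almost every random set is at least as differentiated as the disease gene set.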
We thank Josh M. Akey, Isabel Mendizabal and Olga Fernando for technical support, helpful comments and discussions. We also thank David Comas for help with the manuscript, and two anonymous reviewers for their valuable comments and constructive suggestions. Urko M. Marigorta is supported by a PhD fellowship from Universitat Pompeu Fabra. This work was partially supported by a grant to AN from the Ministerio de Ciencia e Innovación (Spain, BFU2006 15413-C02-01) and by the National Institute of Bioinformatics (http://www.inab.org), a platform of Genoma España.
- Bamshad M, Wooding S, Salisbury BA, Stephens JC: Deconstructing the relationship between genetics and race. Nature reviews. 2004, 5 (8): 598-609.
- McCarthy MI, Abecasis GR, Cardon LR, Goldstein DB, Little J, Ioannidis JP, Hirschhorn JN: Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nature reviews. 2008, 9 (5): 356-369. 10.1038/nrg2344.
- Yu W, Gwinn M, Clyne M, Yesupriya A, Khoury MJ: A navigator for human genome epidemiology. Nature genetics. 2008, 40 (2): 124-125. 10.1038/ng0208-124.
- Chanock SJ, Manolio T, Boehnke M, Boerwinkle E, Hunter DJ, Thomas G, Hirschhorn JN, Abecasis G, Altshuler D, Bailey-Wilson JE, et al: Replicating genotype-phenotype associations. Nature. 2007, 447 (7145): 655-660. 10.1038/447655a.
- WTCCC: Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007, 447 (7145): 661-678. 10.1038/nature05911.
- Ioannidis JP: Non-replication and inconsistency in the genome-wide association setting. Human heredity. 2007, 64 (4): 203-213. 10.1159/000103512.
- Ioannidis JP, Ntzani EE, Trikalinos TA, Contopoulos-Ioannidis DG: Replication validity of genetic association studies. Nature genetics. 2001, 29 (3): 306-309. 10.1038/ng749.
- Zintzaras E, Lau J: Trends in meta-analysis of genetic association studies. Journal of human genetics. 2008, 53 (1): 1-9. 10.1007/s10038-007-0223-5.
- Risch NJ: Searching for genetic determinants in the new millennium. Nature. 2000, 405 (6788): 847-856. 10.1038/35015718.
- Wang WY, Barratt BJ, Clayton DG, Todd JA: Genome-wide association studies: theoretical and practical concerns. Nature reviews. 2005, 6 (2): 109-118. 10.1038/nrg1522.
- Freedman ML, Reich D, Penney KL, McDonald GJ, Mignault AA, Patterson N, Gabriel SB, Topol EJ, Smoller JW, Pato CN, et al: Assessing the impact of population stratification on genetic association studies. Nature genetics. 2004, 36 (4): 388-393. 10.1038/ng1333.
- Burmeister M, McInnis MG, Zollner S: Psychiatric genetics: progress amid controversy. Nature reviews. 2008, 9 (7): 527-540. 10.1038/nrg2381.
- Lasky-Su J, Lyon HN, Emilsson V, Heid IM, Molony C, Raby BA, Lazarus R, Klanderman B, Soto-Quiros ME, Avila L, et al: On the replication of genetic associations: timing can be everything!. American journal of human genetics. 2008, 82 (4): 849-858. 10.1016/j.ajhg.2008.01.018.
- Zondervan KT, Cardon LR: The complex interplay among factors that influence allelic association. Nature reviews. 2004, 5 (2): 89-100.
- Frazer KA, Ballinger DG, Cox DR, Hinds DA, Stuve LL, Gibbs RA, Belmont JW, Boudreau A, Hardenbol P, Leal SM, et al: A second generation human haplotype map of over 3.1 million SNPs. Nature. 2007, 449 (7164): 851-861. 10.1038/nature06258.
- Bamshad M: Genetic influences on health: does race matter?. Jama. 2005, 294 (8): 937-946. 10.1001/jama.294.8.937.
- Wall JD, Cox MP, Mendez FL, Woerner A, Severson T, Hammer MF: A novel DNA sequence database for analyzing human demographic history. Genome research. 2008, 18 (8): 1354-1361. 10.1101/gr.075630.107.
- Ioannidis JP, Ntzani EE, Trikalinos TA: 'Racial' differences in genetic effects for complex diseases. Nature genetics. 2004, 36 (12): 1312-1318. 10.1038/ng1474.
- Lohmueller KE, Mauney MM, Reich D, Braverman JM: Variants associated with common disease are not unusually differentiated in frequency across populations. American journal of human genetics. 2006, 78 (1): 130-136. 10.1086/499287.
- Myles S, Davison D, Barrett J, Stoneking M, Timpson N: Worldwide population differentiation at disease-associated SNPs. BMC medical genomics. 2008, 1: 22. 10.1186/1755-8794-1-22.
- Adeyemo A, Rotimi C: Genetic variants associated with complex human diseases show wide variation across multiple populations. Public health genomics. 2010, 13 (2): 72-79. 10.1159/000218711.
- Need AC, Goldstein DB: Next generation disparities in human genomics: concerns and remedies. Trends Genet. 2009, 25 (11): 489-494. 10.1016/j.tig.2009.09.012.
- Becker KG, Barnes KC, Bright TJ, Wang SA: The genetic association database. Nature genetics. 2004, 36 (5): 431-432. 10.1038/ng0504-431.
- Barreiro LB, Laval G, Quach H, Patin E, Quintana-Murci L: Natural selection has driven population differentiation in modern humans. Nature genetics. 2008, 40: 340-345. 10.1038/ng.78.
- Park JH, Wacholder S, Gail MH, Peters U, Jacobs KB, Chanock SJ, Chatterjee N: Estimation of effect size distribution from genome-wide association studies and implications for future discoveries. Nature genetics. 2010, 42: 570-575. 10.1038/ng.610.
- Yang J, Benyamin B, McEvoy BP, Gordon S, Henders AK, Nyholt DR, Madden PA, Heath AC, Martin NG, Montgomery GW, et al: Common SNPs explain a large proportion of the heritability for human height. Nature genetics. 2010, 42 (7): 565-569. 10.1038/ng.608.
- Lohmueller KE, Pearce CL, Pike M, Lander ES, Hirschhorn JN: Meta-analysis of genetic association studies supports a contribution of common variants to susceptibility to common disease. Nature genetics. 2003, 33 (2): 177-182. 10.1038/ng1071.
- Siontis CM, Patsopoulos NA, Ioannidis JP: Replication of past candidate loci for common diseases and phenotypes in 100 genome-wide association studies. Eur J Hum Genet. 2010, 18: 832. 10.1038/ejhg.2010.26.
- Blekhman R, Man O, Herrmann L, Boyko AR, Indap A, Kosiol C, Bustamante CD, Teshima KM, Przeworski M: Natural selection on genes that underlie human disease susceptibility. Curr Biol. 2008, 18 (12): 883-889. 10.1016/j.cub.2008.04.074.
- Akey JM, Zhang G, Zhang K, Jin L, Shriver MD: Interrogating a high-density SNP map for signatures of natural selection. Genome research. 2002, 12 (12): 1805-1814. 10.1101/gr.631202.
- Amato R, Pinelli M, Monticelli A, Marino D, Miele G, Cocozza S: Genome-wide scan for signatures of human population differentiation and their relationship with natural selection, functional pathways and diseases. PloS one. 2009, 4 (11): e7927. 10.1371/journal.pone.0007927.
- Dickson SP, Wang K, Krantz I, Hakonarson H, Goldstein DB: Rare variants create synthetic genome-wide associations. PLoS biology. 2010, 8 (1): e1000294. 10.1371/journal.pbio.1000294.
- Hunter DJ: Gene-environment interactions in human diseases. Nature reviews. 2005, 6 (4): 287-298.
- Sabeti PC, Walsh E, Schaffner SF, Varilly P, Fry B, Hutcheson HB, Cullen M, Mikkelsen TS, Roy J, Patterson N, et al: The case for selection at CCR5-Delta32. PLoS biology. 2005, 3 (11): e378. 10.1371/journal.pbio.0030378.
- Gonzalez E, Bamshad M, Sato N, Mummidi S, Dhanda R, Catano G, Cabrera S, McBride M, Cao XH, Merrill G, et al: Race-specific HIV-1 disease-modifying effects associated with CCR5 haplotypes. Proceedings of the National Academy of Sciences of the United States of America. 1999, 96 (21): 12004-12009. 10.1073/pnas.96.21.12004.
- Hedrick PW, Verrelli BC: "Ground truth" for selection on CCR5-Delta32. Trends Genet. 2006, 22 (6): 293-296. 10.1016/j.tig.2006.04.007.
- Li D, Collier DA, He L: Meta-analysis shows strong positive association of the neuregulin 1 (NRG1) gene with schizophrenia. Human molecular genetics. 2006, 15 (12): 1995-2002. 10.1093/hmg/ddl122.
- Choudhry S, Ung N, Avila PC, Ziv E, Nazario S, Casal J, Torres A, Gorman JD, Salari K, Rodriguez-Santana JR, et al: Pharmacogenetic differences in response to albuterol between Puerto Ricans and Mexicans with asthma. American journal of respiratory and critical care medicine. 2005, 171 (6): 563-570. 10.1164/rccm.200409-1286OC.
- Salari K, Choudhry S, Tang H, Naqvi M, Lind D, Avila PC, Coyle NE, Ung N, Nazario S, Casal J, et al: Genetic admixture and asthma-related phenotypes in Mexican American and Puerto Rican asthmatics. Genetic epidemiology. 2005, 29 (1): 76-86. 10.1002/gepi.20079.
- Naqvi M, Thyne S, Choudhry S, Tsai HJ, Navarro D, Castro RA, Nazario S, Rodriguez-Santana JR, Casal J, Torres A, et al: Ethnic-specific differences in bronchodilator responsiveness among African Americans, Puerto Ricans, and Mexicans with asthma. J Asthma. 2007, 44 (8): 639-648. 10.1080/02770900701554441.
- Di Rienzo A, Hudson RR: An evolutionary framework for common diseases: the ancestral-susceptibility model. Trends Genet. 2005, 21 (11): 596-601. 10.1016/j.tig.2005.08.007.
- Di Rienzo A: Population genetics models of common diseases. Current opinion in genetics & development. 2006, 16 (6): 630-636.
- Fullerton SM, Bartoszewicz A, Ybazeta G, Horikawa Y, Bell GI, Kidd KK, Cox NJ, Hudson RR, Di Rienzo A: Geographic and haplotype structure of candidate type 2 diabetes susceptibility variants at the calpain-10 locus. American journal of human genetics. 2002, 70 (5): 1096-1106. 10.1086/339930.
- Young JH, Chang YP, Kim JD, Chretien JP, Klag MJ, Levine MA, Ruff CB, Wang NY, Chakravarti A: Differential susceptibility to hypertension is due to selection during the out-of-Africa expansion. PLoS genetics. 2005, 1 (6): e82. 10.1371/journal.pgen.0010082.
- Lohmueller KE, Indap AR, Schmidt S, Boyko AR, Hernandez RD, Hubisz MJ, Sninsky JJ, White TJ, Sunyaev SR, Nielsen R, et al: Proportionally more deleterious genetic variation in European than in African populations. Nature. 2008, 451 (7181): 994-997. 10.1038/nature06611.
- Rosenberg NA, Mahajan S, Ramachandran S, Zhao C, Pritchard JK, Feldman MW: Clines, clusters, and the effect of study design on the inference of human population structure. PLoS genetics. 2005, 1 (6): e70. 10.1371/journal.pgen.0010070.
- Rosenberg NA, Pritchard JK, Weber JL, Cann HM, Kidd KK, Zhivotovsky LA, Feldman MW: Genetic structure of human populations. Science (New York, NY). 2002, 298 (5602): 2381-2385. 10.1126/science.1078311.
- Sokal RR, Rohlf FJ: Biometry: The principles and practice of statistics in biological research. 1995, New York: W.H. Freeman, 3rd edition.
- Paschou P, Ziv E, Burchard EG, Choudhry S, Rodriguez-Cintron W, Mahoney MW, Drineas P: PCA-correlated SNPs for structure identification in worldwide human populations. PLoS genetics. 2007, 3 (9): 1672-1686. 10.1371/journal.pgen.0030160.
- Lorente-Galdos B, Medina I, Morcillo-Suárez C, Sangros R, Alegre J, Pita G, Vellalta G, Malats N, Pisano D, Dopazo J: SYSNPs (Select Your SNPs): a web tool for automatic/massive selection of SNPs. Int J of Data Mining and Bioinformatics.
- de Bakker PI, Yelensky R, Pe'er I, Gabriel SB, Daly MJ, Altshuler D: Efficiency and power in genetic association studies. Nature genetics. 2005, 37 (11): 1217-1223. 10.1038/ng1669.
- Wright S: The genetical structure of populations. Ann Eugenics. 1951, 15: 323-354.
- Weir BS: Genetic Data Analysis II: Methods for Discrete Population Genetic Data. 1996, Sunderland, MA: Sinauer Assoc, 2nd edition.
- Weir BS, Cockerham CC: Estimating F-statistics for the analysis of population structure. Evolution. 1984, 38: 1358-1370. 10.2307/2408641.
- Excoffier L, Laval G, Schneider S: Arlequin (version 3.0): An integrated software package for population genetics data analysis. Evolutionary bioinformatics online. 2005, 1: 47-50.
- Morcillo-Suarez C, Alegre J, Sangros R, Gazave E, de Cid R, Milne R, Amigo J, Ferrer-Admetlla A, Moreno-Estrada A, Gardner M, et al: SNP analysis to results (SNPator): a web-based environment oriented to statistical genomics analyses upon SNP data. Bioinformatics (Oxford, England). 2008, 24 (14): 1643-1644. 10.1093/bioinformatics/btn241.
- Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D: Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proceedings of the National Academy of Sciences of the United States of America. 2003, 100 (20): 11484-11489. 10.1073/pnas.1932072100.
- Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W: Human-mouse alignments with BLASTZ. Genome research. 2003, 13 (1): 103-107. 10.1101/gr.809403.
- Thomas PD, Campbell MJ, Kejariwal A, Mi H, Karlak B, Daverman R, Diemer K, Muruganujan A, Narechania A: PANTHER: a library of protein families and subfamilies indexed by function. Genome research. 2003, 13 (9): 2129-2141. 10.1101/gr.772403.
- Thomas PD, Kejariwal A, Guo N, Mi H, Campbell MJ, Muruganujan A, Lazareva-Ulitsky B: Applications for protein sequence-function evolution data: mRNA/protein expression analysis and coding SNP scoring tools. Nucleic acids research. 2006, 34 (Web Server issue): W645-650. 10.1093/nar/gkl229.
- R Development Core Team: R: A language and environment for statistical computing. 2008, Vienna, Austria: R Foundation for Statistical Computing.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.