 Methodology article
 Open Access
A correction for sample overlap in genome-wide association studies in a polygenic pleiotropy-informed framework
BMC Genomics volume 19, Article number: 494 (2018)
Abstract
Background
There is considerable evidence that many complex traits have a partially shared genetic basis, termed pleiotropy. It is therefore useful to consider integrating genome-wide association study (GWAS) data across several traits, usually at the summary statistic level. A major practical challenge arises when these GWAS have overlapping subjects. This is particularly an issue when estimating pleiotropy using methods that condition the significance of one trait on the significance of a second, such as the covariate-modulated false discovery rate (cmfdr).
Results
We propose a method for correcting for sample overlap at the summary statistic level. We quantify the expected amount of spurious correlation between the summary statistics from two GWAS due to sample overlap, and use this estimated correlation in a simple linear correction that adjusts the joint distribution of test statistics from the two GWAS. The correction is appropriate for GWAS with case-control or quantitative outcomes. Our simulations and data example show that without correcting for sample overlap, the cmfdr is not properly controlled, leading to an excessive number of false discoveries and an excessive false discovery proportion. Our correction for sample overlap is effective in that it restores proper control of the false discovery rate, at very little loss in power.
Conclusions
With our proposed correction, it is possible to integrate GWAS summary statistics with overlapping samples in a statistical framework that is dependent on the joint distribution of the two GWAS.
Background
The past decade of genomic research has been shaped by the advent of low-cost, high-throughput technology, enabling the examination of a large number of genetic variants, i.e. single nucleotide polymorphisms (SNPs), via the genome-wide association study (GWAS). The success of the GWAS approach has been limited, however, because SNPs identified by GWAS capture only a small fraction of the total heritability for any given complex trait. There is ongoing debate on how to detect this so-called ‘missing heritability’ [1, 2], including ideas based on integrating GWAS data across two or more traits which may share a polygenic signal (e.g. [3]). A shared polygenic signal may exist for traits with strong diagnostic overlap, and this has motivated the formation of cross-trait GWAS consortia such as the Psychiatric Genetics Consortium, which includes five psychiatric diseases, and the International Cancer Genome Consortium, which aims at finding oncogenes that might drive cancer growth in different sites. Seemingly unrelated phenotypes may also have a shared polygenic signal if they partially share a common genetic basis, termed pleiotropy [4]. Pleiotropic effects have been statistically detected in cross-trait analysis of GWAS, including schizophrenia and blood lipids [3], prostate cancer and blood lipids [5], and psychiatric disorders [6].
A major statistical challenge encountered when integrating GWAS data across traits is the widespread reuse of subjects between GWA studies, leading to non-independent data sets. First, power has been maximized by increasing sample sizes, often into the hundreds of thousands, via large meta-analyses conducted by worldwide consortia for complex traits such as coronary artery disease (CAD) [7], height [8] and blood pressure [9]. Second, phenotype definitions have become more specific and have moved towards endophenotypes (e.g. blood lipids [10]), which are often measured on the same set of individuals. This, together with the epidemiological overlap of many common diseases, has led to the reuse of subjects from one GWAS to another. For example, control samples have been reused for several different case definitions, often by design. The Wellcome Trust Case Control Consortium (WTCCC) [11] is one such consortium adopting this strategy. As another example, cases for one trait have been included in quantitative trait studies (e.g. CAD [7], blood lipids [10] and height [8]).
Addressing subject overlap is complicated by the fact that GWAS data is most often made available in the form of summary statistics, i.e. data over n samples is condensed into one summary statistic per SNP. GWAS summary statistics from studies with overlapping subjects cannot be made independent by removing these subjects. Aside from the issue of sample overlap, working at the summary statistics level has many advantages. When a sufficient statistic is used, this summary statistic contains all the information necessary for further inference. Also, it is computationally efficient to work with summary statistics simply because of their much smaller size compared to the genotype data. This is especially relevant for the integration of several genomic data sets. Importantly, in contrast to genotype data, summary statistics cannot be used to uniquely identify individuals. This allows easier distribution and storage. As a consequence, several consortia, such as the DIAGRAM Consortium for type 2 diabetes and the Global Blood Lipids Consortium, offer summary statistics covering the whole genome for free download on their homepages.
Lin and Sullivan [12] were the first to address the methodological challenge of integrating GWAS with overlapping subjects. Their contribution focused on integrating case-control GWAS using a meta-analysis framework. They do not provide a framework for integrating GWAS coming from different types of outcome variables (e.g. a case-control study and a quantitative trait study), nor do they provide a solution that applies in general to different statistical methodology. Han et al. [13] extend the Lin and Sullivan approach for cases and controls to a random-effects meta-analysis setting using a decoupling approach.
Two other approaches for meta-analysis of multiple traits while accounting for sample overlap are presented in [14, 15]. While these two approaches account for sample overlap in performing the meta-analysis, [16] introduce a test statistic, based on a similar derivation to that of Lin and Sullivan, that allows testing for overlapping samples or relatives when performing quality control of summary-level data.
There is growing interest in statistical methods that utilize the joint bivariate distribution of GWAS summary statistics for two traits because, in the presence of a shared polygenic signal, these methods may provide more power than traditional GWAS methodology. One such method is the covariate-modulated local false discovery rate (cmfdr) proposed by Ferkingstad et al. [17] and recently revisited and extended [18], where the fdr for the first study depends on a covariate, for example the GWAS summary statistics for a second pleiotropic trait.
Similarly, the tail-area based conditional false discovery rate [3] needs the joint distribution of two sets of GWAS summary statistics to identify SNPs with cross-phenotype associations. These methods may be seriously impacted by the spurious correlation due to overlap, but cannot be corrected on a SNP-by-SNP basis. Liley and Wallace [19] extend the conditional false discovery rate [3] to studies with overlapping controls. Their extension is specific to case-control studies and does not apply to the cmfdr or any other bivariate method.
The aims of this paper are threefold. First, we want to show the impact of overlap in samples on integrated analyses of genetic studies. We show that it can induce spurious correlation between the studies and thus seriously confound conclusions. Second, we expand on the work of Lin and Sullivan [12] and quantify the spurious cross-trait correlation due to overlap for both case-control studies and studies with quantitative traits. Third, we propose a correction based on a decorrelation transformation that adjusts the joint distribution of two GWAS and allows for the use of the corrected summary statistics in downstream analysis such as cmfdr. We demonstrate the impact of overlap in samples and the success of our proposed correction on synthetic data and on GWAS data from the Psychiatric Genetics Consortium (PGC).
Results
The impact of overlap in samples on the joint analysis of two genomic data sets
The overlap of samples between two GWAS induces spurious correlation in a bivariate analysis of the two data sets. We illustrate this spurious correlation in a simulation example. The simulation is based on two studies, 1 and 2, with d=100,000 SNPs of a minor allele frequency (MAF) drawn at random from the allele frequency distribution in the 1000 Genomes Project [20]. Genotypes are generated under the null model of no genetic association and accordingly are drawn from a binomial distribution with 2 trials and probability of success equal to the MAF. Each study has a continuous outcome that only depends on the error term (normal with mean 0 and standard deviation of 1). Study 1 and study 2 have n_{ C }=5,000 shared subjects and n_{ A }=n_{ B }=7,500 unique subjects respectively. Thus the total sample size per study is n_{1}=n_{2}=12,500. We then conduct a standard GWAS analysis (univariate linear regression, one SNP at a time) separately in study 1 and study 2.
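The simulation described above can be sketched in Python as follows. This is a scaled-down illustration under stated assumptions, not the code used for Fig. 1: the sample sizes and number of SNPs are reduced for speed, MAFs are drawn from a uniform distribution rather than from the 1000 Genomes Project, shared subjects are assumed to keep the same outcome value in both studies, and the function names `gwas_z` and `simulate_overlap` are hypothetical.

```python
import numpy as np

def gwas_z(X, y):
    """Per-SNP t-statistics from univariate linear regression of y on each column of X."""
    n = len(y)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Pearson correlation between each SNP and the outcome
    r = (Xc.T @ yc) / np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    # t-statistic of the slope in simple linear regression
    return r * np.sqrt((n - 2) / (1.0 - r ** 2))

def simulate_overlap(n_c=400, n_a=600, n_b=600, d=2000, seed=0):
    rng = np.random.default_rng(seed)
    maf = rng.uniform(0.05, 0.5, size=d)      # assumption: uniform MAF
    n_total = n_c + n_a + n_b
    # Null model: genotypes are Binomial(2, MAF) and independent of the outcome
    G = rng.binomial(2, maf, size=(n_total, d)).astype(float)
    y = rng.standard_normal(n_total)          # outcome is pure noise
    idx1 = np.arange(0, n_c + n_a)            # shared subjects + unique to study 1
    idx2 = np.r_[np.arange(0, n_c), np.arange(n_c + n_a, n_total)]
    z1 = gwas_z(G[idx1], y[idx1])
    z2 = gwas_z(G[idx2], y[idx2])
    # Expected correlation when cor(Y1, Y2) = 1 on the shared subjects
    expected = n_c / np.sqrt(len(idx1) * len(idx2))
    return z1, z2, expected

z1, z2, expected = simulate_overlap()
empirical = np.corrcoef(z1, z2)[0, 1]
print(f"expected {expected:.2f}, empirical {empirical:.2f}")
```

The default settings keep the same ratio of shared to total subjects as the simulation in the text (40% of each study), so the test statistics of the two null studies should be correlated at roughly that level.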
Figure 1a and b show that p-values for study 1 and for study 2 respectively follow a uniform distribution as expected. Assume we are interested in selecting the SNPs in study 2 on the basis of their significance in study 1. Figure 1c shows the p-values of study 2 for which the p-values in study 1 are smaller than 0.1. Finally, Fig. 1d displays a stratified QQ plot that plots the observed quantiles of the p-values of study 2 against the quantiles assumed under the null distribution. The strata are defined with respect to the p-values in study 1. These stratified QQ plots offer an intuitive way of visualizing dependencies between p-values of two different genetic studies. Although the data were generated without any genetic effects, we observe that the conditional distributions of p-values from study 2 given p-values in study 1 show strong enrichment for small p-values with respect to the second conditional phenotype. If we were unaware that these simulations were conducted under the null hypothesis, this leftward deflection of the stratified QQ plot could be falsely interpreted as a shared polygenic pleiotropic signal. Clearly, in the case of overlapping samples, pleiotropic effects would be confounded with the spurious effects due to sample overlap.
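The conditional enrichment seen in the stratified QQ plot can be illustrated numerically. The sketch below is an assumption-laden shortcut: instead of simulating genotypes, it draws null z-scores directly from a bivariate normal distribution with correlation 0.4 (the value implied by the overlap in the simulation above) and compares the unconditional rate of nominally significant SNPs in study 2 with the rate among SNPs selected on study 1.

```python
import numpy as np

rng = np.random.default_rng(1)
rho = 0.4  # assumed spurious correlation induced by sample overlap
cov = np.array([[1.0, rho], [rho, 1.0]])

# Null z-scores for two studies; no SNP has a true effect
z = rng.multivariate_normal([0.0, 0.0], cov, size=200_000)
z1, z2 = z[:, 0], z[:, 1]

hit2 = np.abs(z2) > 1.96    # two-sided p < 0.05 in study 2
sel1 = np.abs(z1) > 1.645   # two-sided p < 0.10 in study 1

unconditional = hit2.mean()       # close to the nominal 0.05
conditional = hit2[sel1].mean()   # inflated purely by the correlation

print(f"P(p2 < 0.05) = {unconditional:.3f}")
print(f"P(p2 < 0.05 | p1 < 0.10) = {conditional:.3f}")
```

Marginally, study 2 behaves as expected under the null; only the conditional rate is inflated, which is why the problem is invisible in a separate analysis of each study.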
Estimating the correlation of two test statistics due to overlap in samples
Details of this estimation are given in the “Methods” section. Consider two studies, k=1,2, both with continuous outcomes, y_{ ki }, i=1,…,n_{ k }. Assume some samples are shared, so that we can split the set of samples {1,…,n_{ k }} into two sets S_{ C }={1,…,n_{ c }} and S_{ A }={n_{ c }+1,…,n_{1}} for study 1 and similarly for study 2 with S_{ B }={n_{ c }+1,…,n_{2}}. S_{ C } are the shared samples and S_{ A } and S_{ B } are the samples unique to study 1 and study 2 respectively. The full set for study 1 is S_{1}=S_{ C }∪S_{ A } and for study 2 is S_{2}=S_{ C }∪S_{ B }. Denote with X_{ kig } the random genotypes for SNP g in sample i in study k, g=1,2,..,d, where d is typically some large number (≈10^{6}). Similarly, denote with X_{ kjg } the random genotypes in sample j. Then, cor(X_{1ig},X_{2jg})=1 if i=j∈S_{ C } for all SNPs g and we assume cor(X_{1ig},X_{2jg})=0 if i∈S_{ A } and j∈S_{ B } for all g.
Consider two regression models, one for each study for one SNP g at a time, Y_{1i}=α_{1g}+β_{1g}X_{1ig}+ε_{1ig} and Y_{2j}=α_{2g}+β_{2g}X_{2jg}+ε_{2jg} where i=1,..,n_{1}, j=1,..,n_{2}, and we assume all errors ε to be independent from each other and with zero mean. Under the null model (β_{ kg }=0) ∀k,g, if S_{ C } were an empty set (i.e. no shared subjects), then \(\text {cor}\left (\hat {\beta }_{1g},\hat {\beta }_{2g}\right) = 0\). But because of the shared samples S_{ C }, \(\rho = \text {cor}\left (\hat {\beta }_{1g},\hat {\beta }_{2g}\right) \neq 0\); that is, the overlap between samples introduces a correlation of the regression parameters which is only due to the overlap. Note that when analyzing study 1 and study 2 separately the analysis is unbiased; the bias due to overlap is only introduced in a joint analysis where ρ≠0 is neglected, as illustrated in Fig. 1.
Building on the work of Lin and Sullivan [12], we estimate the correlation ρ due to overlap in samples under the null model (β_{ kg }=0) ∀k,g, using the correlation between the maximum likelihood (ML) estimates for the regression coefficients for SNP g denoted by \(\hat {\beta }_{kg}\). The ML estimates are asymptotically Gaussian distributed with mean equal to the true coefficients β_{ kg } and variance equal to the inverse Fisher information.
We are also interested in combined analysis of GWAS summary statistics from other study designs, including those analyzed in a case-control study. Therefore, in the following we estimate ρ for three possible scenarios (Y_{1} and Y_{2} both quantitative; Y_{1} quantitative and Y_{2} binary; Y_{1} and Y_{2} both binary, where \(Y_{k}=\{Y_{k1},Y_{k2},\ldots,Y_{k{n_{k}}}\}\) for k=1,2). The ML-based derivations (see “Methods” section) result in the following estimated correlation due to sample overlap for each of the three possible study design pairings:

1. Quantitative phenotype in both study 1 and study 2. For each SNP g,
$$ \text{cor}(\hat{\beta}_{1g}, \hat{\beta}_{2g}) \approx \frac{n_{c}}{\sqrt{n_{1} \cdot n_{2}}} \text{cor}(Y_{1}, Y_{2}) $$ (1)
where n_{ c } is the number of overlapping samples in study 1 and 2, n_{1} is the sample size of study 1, and n_{2} the sample size of study 2, respectively. Note that under the null hypothesis of no SNP effect, this correlation does not depend on the MAF and is the same for every SNP. In this case the g subscript can be dropped and \(\text {cor}\left (\hat {\beta }_{1g},\hat {\beta }_{2g}\right)\) can instead be written as \(\text {cor}\left (\hat {\beta }_{1},\hat {\beta }_{2}\right)\), and this simplified notation is used from this point on.

2. Binary phenotype in study 1 and binary phenotype in study 2
$$ \text{cor}\left(\hat{\beta}_{1}, \hat{\beta}_{2}\right) \approx \frac{1}{\sqrt{n_{1}}\sqrt{n_{2}}} \times \left(n_{c0}\sqrt{\exp \{\alpha_{1} + \alpha_{2}\}} + \frac{n_{c1}}{\sqrt{\exp \{\alpha_{1} + \alpha_{2}\}}} \right) $$ (2)
where exp{α_{1}+α_{2}}≈n_{11}n_{21}/(n_{10}n_{20}) [12] and where we denote the number of cases in study 1 and 2 as n_{11} and n_{21} respectively, similarly n_{10} and n_{20} for the number of controls in study 1 and 2 respectively, and denote the overlap in controls by n_{c0} and in cases by n_{c1}.

3. Quantitative phenotype in study 1 and binary phenotype in study 2
$$ \text{cor}\left(\hat{\beta}_{1}, \hat{\beta}_{2}\right) \approx \frac{n_{c}}{\sqrt{n_{1} \cdot n_{2}}} \text{cor}_{pb}(Y_{1}, Y_{2}) $$ (3)
where cor_{ pb }(Y_{1},Y_{2}) equals the point-biserial correlation coefficient.
Note that the estimates \(cor\left (\hat {\beta }_{1},\hat {\beta }_{2}\right)\) in Eqs. 1 to 3 only estimate the spurious correlation due to sample overlap. This estimate differs from the total correlation between the observed test statistics which captures both the true correlation based on genetic architecture and the spurious correlation induced by sample overlap.
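Eqs. 1 to 3 are simple enough to evaluate by hand, but a small helper makes the inputs explicit. The following Python sketch is illustrative only; the function names and argument names are hypothetical. Eq. 3 has the same form as Eq. 1 with the point-biserial correlation in place of cor(Y_{1},Y_{2}), so the first function covers both cases.

```python
import math

def rho_quantitative(n_c, n1, n2, cor_y):
    """Eq. 1 (and Eq. 3 when cor_y is the point-biserial correlation)."""
    return n_c / math.sqrt(n1 * n2) * cor_y

def rho_case_control(n_c0, n_c1, n10, n11, n20, n21):
    """Eq. 2: n_c0/n_c1 are shared controls/cases, n_k0/n_k1 are
    the numbers of controls/cases in study k."""
    n1, n2 = n10 + n11, n20 + n21
    odds = (n11 * n21) / (n10 * n20)  # approximates exp(alpha_1 + alpha_2)
    return (n_c0 * math.sqrt(odds) + n_c1 / math.sqrt(odds)) / math.sqrt(n1 * n2)

# Simulation setting from the text: 5,000 shared of 12,500 subjects per study,
# assuming identical outcomes for shared subjects (cor(Y1, Y2) = 1)
print(rho_quantitative(5000, 12500, 12500, 1.0))

# Hypothetical shared-control design: two balanced studies that share all
# 1,000 controls and no cases
print(rho_case_control(1000, 0, 1000, 1000, 1000, 1000))
```

The first call gives 0.4 and the second gives 0.5; in balanced studies with fully shared controls, half of the samples in each study overlap and the induced correlation is correspondingly large.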
Decorrelation using the correlation due to overlap
In this paper we propose a decorrelation step to adjust the joint distribution of the summary statistics from two GWAS having overlapping subjects. Construct a matrix z with two rows and d columns, where d is the number of SNPs common to both studies; the first row holds the vector of summary statistics (z-scores) for the first study, z_{1}, and the second row holds the vector of z-scores for the second study, z_{2}. The decorrelation transform is defined as
$$ z_{\text{decor}} = C^{-1/2} z $$ (4)
where C is the 2×2 matrix with ones on its diagonal and the calculated correlation due to overlap on its off-diagonal.
In Fig. 2 we use the simulated data introduced in Fig. 1 and show how the proposed decorrelation step corrects for the correlation due to overlap and removes the spurious enrichment. The p-values for study 2 conditional on study 1 are uniformly distributed (Fig. 2c) and the inflation of the enrichment is removed (Fig. 2d).
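A minimal sketch of the decorrelation step in Python, assuming the transform multiplies the 2×d matrix of z-scores by the symmetric inverse square root of C (a standard whitening choice). The function name `decorrelate` is hypothetical, and the input z-scores are drawn directly from a bivariate normal distribution rather than from a simulated GWAS.

```python
import numpy as np

def decorrelate(z, rho):
    """Whiten a 2 x d matrix of z-scores given the correlation due to overlap."""
    C = np.array([[1.0, rho], [rho, 1.0]])
    # Symmetric inverse square root of C via its eigendecomposition
    vals, vecs = np.linalg.eigh(C)
    C_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
    return C_inv_sqrt @ z

rng = np.random.default_rng(0)
rho = 0.4
# Null z-scores with spurious correlation rho, arranged as 2 rows x d columns
z = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=100_000).T
z_decor = decorrelate(z, rho)

print("before:", np.corrcoef(z)[0, 1])
print("after: ", np.corrcoef(z_decor)[0, 1])
```

Because C has unit diagonal, the transformed rows keep unit variance under the null, so they can still be read as z-scores in downstream analyses.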
Performance of proposed decorrelation step in a covariate-modulated false discovery rate framework
We tested the performance of our proposed correction for sample overlap in a covariate-modulated fdr (cmfdr) [18] framework using a two-pronged approach. First, we quantified the impact of sample overlap on the actual false discovery proportion under different pleiotropic simulation scenarios and with different amounts of sample overlap. Second, we used individual-level (genotype-phenotype) data from the Psychiatric Genetics Consortium (PGC), which employed a shared control design for schizophrenia and bipolar disorder, to test our correction in a real data setting. Since we had access to the individual-level data, we were able to conduct a series of GWAS manipulating the extent of overlapping controls and compare the number of cmfdr-based “discoveries” to equally-powered non-overlapping control sets.
Simulated data
We simulated bivariate GWAS data under six different simulation scenarios: first under the null model, where genotype is independent from phenotype and then under five different pleiotropic scenarios:

1. Null model, no effect
2. Positive pleiotropy A
3. Positive pleiotropy B
4. Positive pleiotropy C
5. Positive pleiotropy plus univariate effects
6. Positive and antagonistic pleiotropy,
where positive pleiotropy A, B and C differ in the extent of polygenic structure.
We then used this simulated data to conduct synthetic GWAS for paired studies, first with no sample overlap and then again with sample overlap. For each study pair, we calculated the cmfdr for the first GWAS using the summary statistics from the second GWAS as a covariate. We did this both with and without our proposed correction for sample overlap and compared the false discovery proportion (FDP), i.e. the number of false discoveries divided by the total number of discoveries, before and after correction and to the non-overlapping GWAS.
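The FDP defined above can be computed as in the following Python sketch. The cmfdr values are made-up toy numbers, the function name `fdp_summary` is hypothetical, and returning an FDP of 0 when nothing is discovered is a convention assumed here rather than one stated in the text.

```python
import numpy as np

def fdp_summary(cmfdr, is_null, cutoff=0.05):
    """Return (FDP, FP, TP) for SNPs whose cmfdr falls below the cutoff."""
    cmfdr = np.asarray(cmfdr, dtype=float)
    is_null = np.asarray(is_null, dtype=bool)
    discovered = cmfdr < cutoff
    fp = int(np.sum(discovered & is_null))    # falsely rejected null SNPs
    tp = int(np.sum(discovered & ~is_null))   # correctly rejected non-null SNPs
    total = fp + tp
    fdp = fp / total if total > 0 else 0.0    # assumed convention: 0 if no discoveries
    return fdp, fp, tp

# Toy example: five SNPs, three of them truly null
cmfdr = [0.01, 0.20, 0.03, 0.04, 0.50]
is_null = [True, True, False, False, True]
fdp, fp, tp = fdp_summary(cmfdr, is_null)
print(fdp, fp, tp)
```

In a simulation the null status of every SNP is known, so FDP, FP and TP can be averaged over replicates; with real data only the total number of discoveries is observable.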
Simulation results
The main purpose of the simulation was to test the performance of our correction for sample overlap in a cmfdr framework with known null and non-null SNPs under different pleiotropic and polygenic scenarios and with different amounts of sample overlap.
Table 1 reports the mean false discovery proportion (FDP), mean number of falsely rejected null hypotheses (i.e. false positives (FP)) and mean number of correctly rejected non-null hypotheses (i.e. true positives (TP)) under different simulation scenarios with d=100,000 SNPs over 100 independent simulations, based on a cmfdr cutoff of 0.05 and using the summary statistics from study 2 as a covariate for study 1. This is reported for all six simulation scenarios. The null model simulation shows that, in the absence of any true genetic association and with non-overlapping samples, no SNPs reach the cmfdr cutoff of 0.05. In contrast, when samples overlap, a mean of 245 SNPs are below the cutoff, and thus are false positives. After applying our proposed correction to the GWAS with overlapping samples, all cmfdr values are again above the significance cutoff and no SNPs are deemed significant. For the simulation scenarios involving pleiotropic effects, 400 of the 100,000 SNPs were non-null except for positive pleiotropy B and C, where 1200 and 2200 were non-null respectively. For all pleiotropic scenarios, the FDP for the analysis using the non-overlapping studies shows that the fdr level is conservatively held, while the FDP for the overlapping set greatly exceeds the desired level of fdr control. After correcting the overlapping studies using the proposed decorrelation step, the fdr control is comparable to the non-overlapping, independent studies.
We performed an extended simulation using the “positive pleiotropy A” scenario, where we varied the amount of sample overlap. Table 2 and Fig. 3 give the FDP, TP and FP and clearly show that the impact of sample overlap is non-linear. The FDP increases at an increasing rate as the number of overlapping samples increases. After applying our correction for sample overlap to the overlapping studies, the fdr control is comparable to the non-overlapping, independent studies for all levels of sample overlap. The correction results in a small loss in power (TP), and this loss in power is more severe as the overlap increases.
In practice it may be difficult to calculate the exact overlap in samples or obtain an accurate estimate of Cor(Y_{1},Y_{2}) for continuous traits. We therefore tested the robustness of our proposed correction to the correlation used in the decorrelation step (Eq. 4). Using the positive pleiotropy A scenario, where \(cor\left (\hat {\beta }_{1},\hat {\beta }_{2}\right)=0.4\), we varied the correlation value used in Eq. 4 from 0.3 to 0.5. We find that our proposed correction is robust to the correlation value used in the decorrelation step, with the fdr level being conservatively held in all cases (Table 3).
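To get a feel for what a misspecified correlation does to the test statistics themselves, the residual correlation left after decorrelation can be computed in closed form. The Python sketch below assumes the decorrelation step uses the symmetric inverse square root of the assumed correlation matrix; the function names are hypothetical. It does not reproduce Table 3, which concerns fdr control rather than residual correlation.

```python
import numpy as np

def inv_sqrt_corr(rho):
    """Symmetric inverse square root of the 2 x 2 correlation matrix."""
    C = np.array([[1.0, rho], [rho, 1.0]])
    vals, vecs = np.linalg.eigh(C)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def residual_correlation(rho_true, rho_used):
    """Correlation remaining after whitening with rho_used when the
    z-scores are actually correlated at rho_true."""
    W = inv_sqrt_corr(rho_used)
    C_true = np.array([[1.0, rho_true], [rho_true, 1.0]])
    S = W @ C_true @ W.T   # covariance of the transformed z-scores
    return S[0, 1] / np.sqrt(S[0, 0] * S[1, 1])

for rho_used in (0.3, 0.35, 0.4, 0.45, 0.5):
    print(rho_used, round(residual_correlation(0.4, rho_used), 3))
```

With a true value of 0.4, using 0.3 leaves a residual correlation of about 0.11 and using 0.5 leaves about -0.125, so a moderately wrong input still removes most of the spurious correlation.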
Psychiatric Genetics Consortium (PGC) data with shared controls
We used the PGC data [21, 22] to test the performance of our proposed correction for sample overlap in a real data setting, where we varied the amount of overlap in the control group between the schizophrenia and bipolar studies, corresponding to an expected correlation of ρ = 0, 0.09, 0.18, 0.27, 0.36, 0.45. Using this series of GWAS summary statistics for bipolar disorder and schizophrenia, we calculated the cmfdr using the bipolar disorder summary statistics as the covariate for schizophrenia. The cmfdr calculations were done both for the raw data and for the data after correction for sample overlap.
PGC results
Which SNPs are null and which SNPs are non-null is unknown, so it is not possible to count the true and false positives. Instead, we can count the total number of SNPs below a given cmfdr threshold (TP+FP), and use the non-overlapping set as a reference point. In this case, we used a threshold of 0.05 and called all SNPs with a cmfdr below this threshold a discovery. Importantly, the number of controls is held constant across the different amounts of sample overlap. This rules out any differences in (TP+FP) that may be expected due to differences in power. There were on average 255 discoveries for the analysis with no overlapping controls, and significantly more discoveries were made when samples overlapped, as is evident from the non-overlapping confidence intervals for the scenario with no overlapping controls versus all overlapping scenarios (Table 4). After correction for sample overlap, the number of discoveries returned to a more comparable level, usually falling just below the number of discoveries made in the non-overlapping analysis.
Discussion
There is an increasing interest in combining GWAS data over multiple traits, often using data at the summary statistics level. Here we have proposed a practical and generally applicable approach for estimating the amount of correlation in the test statistics for two GWASs having overlapping subjects and having any type of outcome variable. Using simulation studies assuming various genetic architecture models, we have quantified the magnitude of the effect of sample overlap on the covariate-modulated fdr and have shown that sample overlap can greatly increase the false discovery proportion (FDP). Our proposed correction for sample overlap, which is an efficient pre-whitening transformation, restores the FDP to a level comparable to simulated scenarios with no sample overlap. Using data for bipolar disorder and schizophrenia from the Psychiatric Genetics Consortium, we show that increasing numbers of shared controls result in an increased number of “discoveries”, but these so-called discoveries are most likely false positives and indicate a loss of proper control of the false discovery rate.
Statistical methods for integrating GWAS data at the summary statistic level are well established. Examples of such methods are Fisher’s method [23], inverse-variance meta-analysis [23], the conjunctional false discovery rate [3], the covariate-modulated fdr [18] and Mendelian randomization [24]. These methods universally assume independent samples. Violation of this assumption will result in increased Type 1 error and biased effect estimates [24]. Lin and Sullivan [12] were the first to recognize the importance of the sample overlap problem in the context of cross-trait analysis of GWAS data. Their work is focused on correcting for sample overlap for case-control studies in the context of fixed-effects meta-analysis test statistics. Under the null hypothesis of no genetic effects, they derived the correlation between the maximum likelihood estimates for the logistic regression coefficients for a given SNP in study 1 and study 2 when there are partially overlapping subjects in case-control studies. Here we use the same approach to derive the correlation for a case-control GWAS paired with a quantitative trait GWAS, or for 2 quantitative trait GWASs. The spurious correlation due to sample overlap is derived under the null and quantifies the correlation which is solely induced by sample overlap and independent of any genetic effect. Others have recognized that the number of overlapping samples is not always known and have proposed methods for estimating the correlation due to overlap using summary statistics alone [14, 25]. These methods could be used for quantitative trait GWASs where in practice the correlation of the two phenotypes (Cor(Y_{1},Y_{2})) may be difficult to estimate. Our simulations show that our proposed correction is robust with respect to the assumed correlation due to overlap. Further, the impact of Cor(Y_{1},Y_{2}) on the correlation due to overlap increases as the extent of overlap increases.
In these cases it may be feasible to request an estimate of Cor(Y_{1},Y_{2}) from the relevant GWAS consortium. Regardless of which method is used to derive the correlation induced by sample overlap, here we propose a general framework to account for this spurious correlation in a simple and yet efficient preprocessing step. Spurious correlation between test statistics can be introduced not only by sample overlap, but also by including relatives in both studies. This results in an effective number of overlapping samples, a concept introduced in [16]. Our approach can be easily extended to account for relatedness by replacing n_{ c } with the effective number of overlapping samples.
Conclusions
Our goal was to provide a more general solution to the problem of cross-trait integration of GWAS that could be applied to statistical methods depending on the joint distribution of 2 GWASs. It is a practical approach in that it is easy to implement and results in transformed test statistics that can be used in different data integration methods. We show that in a cmfdr setting, our correction properly maintains fdr control.
Here we have contributed to the growing body of evidence showing that sample overlap needs to be taken into account when integrating data across different traits. We have shown, with both simulated and real data, that our flexible and adaptable adjustment for sample overlap works well in the context of the cmfdr.
Methods
Derivation of the estimates for correlation due to overlap
The correlation due to overlap in samples is derived from the correlation of the maximum likelihood (ML) estimates of the regression coefficients between two studies under the assumption of no genetic effect. We focus on one regression per SNP g and include the intercept and no other covariates. Focusing first on quantitative outcomes, consider two linear regressions, for one SNP g (we drop the index g), Y_{ k }=α_{ k }+β_{ k }X_{ k }+ε_{ k }. We assume all errors ε_{ k } to be independent from each other and with zero mean.
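Lin and Sullivan [12] show that for two case-control studies the covariance between the ML estimates of the logistic regression coefficients from study 1 and 2 can be approximated as \(Cov\left (\hat {\beta }_{1}, \hat {\beta }_{2}\right) \approx I_{1}^{-1}(\beta _{1}) Cov (U_{1}(\beta _{1}), U_{2}(\beta _{2})) I_{2}^{-1}(\beta _{2})\) where U_{ k } and I_{ k } are the score function and Fisher’s information with respect to β_{ k }. Since the variance of each ML estimate is approximately the inverse Fisher information, we use the above to further define the following correlation:
$$ Cor\left(\hat{\beta}_{1}, \hat{\beta}_{2}\right) \approx I_{1}^{-1/2}(\beta_{1}) \, Cov(U_{1}(\beta_{1}), U_{2}(\beta_{2})) \, I_{2}^{-1/2}(\beta_{2}) $$ (5)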
It is now straightforward to expand this result to include quantitative trait studies using the ML estimates from linear regression.
For linear regression the score function with respect to β_{ k } is given by \(U(\beta _{k}) = \frac {1}{\sigma _{k}^{2}} \sum _{i \in S_{k}} (y_{ki} - (\alpha _{k} + \beta _{k} x_{ki})) x_{ki}\) and the Fisher information is given by \(I(\beta _{k}) = \frac {1}{\sigma _{k}^{2}} \sum _{i \in S_{k}} x_{ki} x_{ki}\). Similarly, for logistic regression the score function with respect to β_{ k } is given by \(U(\beta _{k}) = \sum _{i \in S_{k}} \left (y_{ki} - \frac {\exp \{\alpha _{k} + \beta _{k} x_{ki}\}}{1+\exp \{\alpha _{k} + \beta _{k} x_{ki}\}}\right) x_{ki}\) and the Fisher information is given by \(I(\beta _{k}) = \sum _{i \in S_{k}} \frac {\exp \{\alpha _{k} + \beta _{k} x_{ki}\}}{(1+\exp \{\alpha _{k} + \beta _{k} x_{ki}\})^{2}} x_{ki} x_{ki}\).
We make the following assumptions:

1. Y_{ k } is independent of X_{ k }; that is, we assume the null model where there is no genetic effect in the data and β_{ k }=0 for all SNPs, k=1,2.
2. The overlapping samples have the same genotype in each study, x_{1i}=x_{2i} for i∈S_{ C }, for all SNPs.
3. Construct a variable H defined as \(H = E \left (X_{k}X_{k}^{T}\right)\). We can estimate H under the null hypothesis, and the following three estimates of H are approximately equal: \( n_{1}^{-1} \sum _{i \in S_{1}} x_{1i} x_{1i} \approx n_{2}^{-1} \sum _{i \in S_{2}} x_{2i} x_{2i} \approx n_{C}^{-1} \sum _{i \in S_{C}} x_{1i} x_{2i}\).
In case-control studies we assume y_{1i}=y_{2i} for i∈S_{ C } (in other words, cases in study 1 are cases in study 2). Thus Cor(Y_{1},Y_{2})=1 for the overlapping samples in case-control studies. For quantitative phenotypes we assume that we are able to derive appropriate estimates for Cor(Y_{1},Y_{2}) from epidemiology studies.
Correction for overlapping samples in studies with quantitative traits
In Eq. (5) we use the score function and the Fisher information derived in the linear regression model and arrive at
Assumption 2 allows us to replace the sums over x_{ ki } with H so \(Cor\left (\hat {\beta }_{1},\hat {\beta }_{2}\right) \approx (n_{1} H)^{-1/2} \times \frac {1}{\sigma _{1}}\frac {1}{\sigma _{2}} H \sum \limits _{i \in S_{C}} (y_{1i} - \alpha _{1}) (y_{2i} - \alpha _{2}) \times (n_{2} H)^{-1/2}\), which simplifies to \(Cor\left (\hat {\beta }_{1},\hat {\beta }_{2}\right) \approx \frac {1}{\sqrt {n_{1}}\sqrt {n_{2}}} \times \frac { \sum \limits _{i \in S_{C}}{(y_{1i} - \alpha _{1}) (y_{2i} - \alpha _{2})}}{\sigma _{1} \cdot \sigma _{2}}\). Multiplying by n_{ c }/n_{ c } we get: \(Cor\left (\hat {\beta }_{1},\hat {\beta }_{2}\right) \approx \frac {n_{c}}{\sqrt {n_{1}}\sqrt {n_{2}}} \times \frac { \frac {1}{n_{c}} \sum \limits _{i \in S_{C}} (y_{1i} - \alpha _{1}) (y_{2i} - \alpha _{2})}{\sigma _{1} \cdot \sigma _{2}}\). When individual-level data is available, this can be computed directly. But when only summary statistics are available, the correlation can be approximated as
$$ Cor\left(\hat{\beta}_{1}, \hat{\beta}_{2}\right) \approx \frac{n_{c}}{\sqrt{n_{1} \cdot n_{2}}} Cor(Y_{1}, Y_{2}) $$ (7)
where in practice we need to estimate Cor(Y_{1},Y_{2}) externally. A plot of Eq. 7 is given in Additional file 1: Figure S1.
Correction for overlapping samples in case-control studies
When the data refer to two case-control studies we give the result previously derived by Lin and Sullivan [12]. Let n_{c0} denote the number of overlapping controls in study 1 and 2, and n_{c1} denote the number of overlapping cases. First we derive Cov(U_{1}(β_{1}),U_{2}(β_{2})) using the score function from logistic regression, and the fact that y_{ ki }=1 for cases and y_{ ki }=0 for controls
It is easy to show that the right hand side of 8 is equal to \(\frac {1}{(1+\exp \{\alpha _{1}\})(1+\exp \{\alpha _{2}\})} \big \{ n_{c0}\exp \{(\alpha _{1}+\alpha _{2})\}+n_{c1} \big \} \frac {1}{n_{c}} \sum \limits _{i \in S_{C}} x_{1i} x_{2i}\). According to assumption 2 we can introduce H to obtain \(Cov (U_{1}(\beta _{1}), U_{2}(\beta _{2})) = \frac {1}{(1+\exp \{\alpha _{1}\})(1+\exp \{\alpha _{2}\})} \big \{ n_{c0}\exp \{(\alpha _{1}+\alpha _{2})\}+n_{c1} \big \} H\). In logistic regression under the null model there is a connection between the intercept and the log odds \(\exp \{\alpha _{k}\} = \frac {n_{kc0}}{n_{k}} / \left (1\frac {n_{kc0}}{n_{k}}\right) =n_{kc0}/n_{kc1}\).
From Eq. 5, it follows that
Correction for overlapping samples with one quantitative trait study and one case-control study
Finally, we consider the case where Y_{1} is quantitative and Y_{2} is binary. In Eq. (5) we use the score functions and the Fisher information derived in both the logistic and the linear regression model and arrive at
where p_{2} is the proportion of cases in the case-control study. Substituting in H, \(Cor\left(\hat{\beta}_{1}, \hat{\beta}_{2}\right) \approx \left(\frac{1}{\sigma^{2}_{1}} n_{1} H\right)^{-1/2} \times \frac{1}{\sigma_{1}^{2}} H \sum\limits_{i \in S_{C}} (y_{1i} - \alpha_{1})(y_{2i} - p_{2}) \times (p_{2}(1-p_{2}) n_{2} H)^{-1/2}\). This can be approximated as \(Cor\left(\hat{\beta}_{1}, \hat{\beta}_{2}\right) \approx \frac{n_{c}}{\sqrt{n_{1} \cdot n_{2}}} \text{Cor}_{pb}(Y_{1}, Y_{2})\), where Cor_{pb}(Y_{1},Y_{2}) is the point-biserial correlation coefficient, which needs to be estimated externally when only summary statistics are available.
Decorrelation
The focus here is correcting the bivariate distribution of GWAS test statistics for the correlation due to sample overlap. The test statistics may come from case-control studies or studies on quantitative traits. We also assume that the effect direction is known and that the summary statistics are given as Wald statistics, i.e. \(\hat{\beta}_{k}/se\left(\hat{\beta}_{k}\right)\), where \(se\left(\hat{\beta}_{k}\right)\) is the standard error for the regression coefficient of every SNP g, where as before we drop g from the notation. For large samples, Wald statistics approximately follow a standard normal distribution and as such are interpretable as z-scores.
Thus, our final dataset is a matrix z consisting of two rows and d columns, where d is the number of SNPs common to both studies. The first row holds the vector of z-scores for the first study, z_{1}, and the second row holds the vector of z-scores for the second study, z_{2}.
To correct for the overlap in samples and to remove the spurious correlation from the data we use a decorrelation transformation as described by Kessy et al. [26]. The transform is defined as
\(z_{decorr} = C^{-1/2} z\) (11)
where C is the 2×2 empirical correlation matrix of z, with r=cor(z_{1},z_{2}) on its off-diagonal. Note this is different from the Mahalanobis transform, which uses the covariance matrix in Eq. 11 instead of the correlation matrix C. After the transformation, the correlation matrix of z_{decorr} is a diagonal matrix. Importantly, this transformation maximizes the correlation between the original data and the transformed data and is thus the most suitable transformation, as it has the least impact on the data when performing pre-whitening [26].
Suppose that we want to decorrelate the test statistics of quantitative trait studies 1 and 2 but only for the amount of correlation due to sample sharing. Under the null hypothesis that a certain SNP g has no effect on the outcome in both studies, we know that \(\text{cor}\left(\hat{\beta}_{1}, \hat{\beta}_{2}\right)\) is given by Eq. 1 and this correlation is purely induced by sample sharing. We want to correct exactly for this spurious correlation. It can be shown that for sufficiently large n_{1} and n_{2}, \(\text{cor}\left(\hat{\beta}_{1}, \hat{\beta}_{2}\right) \approx \text{cor}(z_{1}, z_{2})\). Then under the null hypothesis we should correct z with
\(C = \begin{pmatrix} 1 & \frac{n_{c}}{\sqrt{n_{1} \cdot n_{2}}} \text{cor}(y_{1}, y_{2}) \\ \frac{n_{c}}{\sqrt{n_{1} \cdot n_{2}}} \text{cor}(y_{1}, y_{2}) & 1 \end{pmatrix}\) (12)
assuming the y_{k} are quantitative traits. Alternatively, C could be calculated using the methods of [25] or [14] if explicit information on the number of overlapping subjects is lacking.
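The decorrelation step can be sketched for the two-study case as follows. This is a minimal illustration, assuming the transform multiplies the 2×d matrix of z-scores by the inverse square root of a 2×2 correlation matrix with off-diagonal r; for such a matrix the inverse square root has a closed form, because its eigenvalues are 1+r and 1−r with eigenvectors (1,1)/√2 and (1,−1)/√2. The function names and the toy data are hypothetical.

```python
import math
import random


def inv_sqrt_corr(r):
    """Return (a, b) such that C^(-1/2) = [[a, b], [b, a]] for C = [[1, r], [r, 1]]."""
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    s_plus = 1 / math.sqrt(1 + r)    # inverse square root of eigenvalue 1 + r
    s_minus = 1 / math.sqrt(1 - r)   # inverse square root of eigenvalue 1 - r
    return (s_plus + s_minus) / 2, (s_plus - s_minus) / 2


def decorrelate(z1, z2, r):
    """Apply z_decorr = C^(-1/2) z to two equal-length lists of z-scores."""
    a, b = inv_sqrt_corr(r)
    d1 = [a * u + b * v for u, v in zip(z1, z2)]
    d2 = [b * u + a * v for u, v in zip(z1, z2)]
    return d1, d2


def pearson(x, y):
    """Sample Pearson correlation of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((u - mx) * (v - my) for u, v in zip(x, y))
    sxx = sum((u - mx) ** 2 for u in x)
    syy = sum((v - my) ** 2 for v in y)
    return sxy / math.sqrt(sxx * syy)


# Toy null z-scores whose only correlation (r = 0.4) mimics sample overlap.
random.seed(1)
r = 0.4
z1, z2 = [], []
for _ in range(20000):
    e1, e2 = random.gauss(0, 1), random.gauss(0, 1)
    z1.append(e1)
    z2.append(r * e1 + math.sqrt(1 - r * r) * e2)

d1, d2 = decorrelate(z1, z2, r)
print("before:", round(pearson(z1, z2), 3))
print("after: ", round(pearson(d1, d2), 3))
```

Because the symmetric inverse square root is used, each corrected z-score stays as close as possible to its original value, which matches the motivation given for choosing this transform over other whitening procedures.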
Simulation study
Simulation of genotype and phenotype. For all scenarios, we simulated d=100,000 independent SNPs with a MAF drawn at random from the observed distribution of MAF from the 1000 Genomes Project. The quantitative trait outcomes, Y_{1} (study 1 outcome) and Y_{2} (study 2 outcome), were simulated for n=20,000 individuals, i.e. n_{1}=n_{2}=10,000 individuals per study.
The six simulation scenarios differ in the simulation of the outcomes. For the null model, we simulate Y_{1} and Y_{2} as described in the example in the “Methods” section.
For all other simulation scenarios, Y_{1} and Y_{2} are dependent on both the error term and a given subset of SNPs. For the “positive pleiotropy A” scenario, the signal involves SNPs that are non-null for both Y_{1} and Y_{2}. We set 400 regression parameters not equal to zero (β=0.1 for 100 SNPs, β=−0.1 for 100 SNPs, β=0.15 for 100 SNPs, and β=−0.15 for 100 SNPs) with the same effect strength and direction on Y_{1} and Y_{2}. This gives 400 non-null SNPs and 99,600 null SNPs for both study 1 and study 2. Similarly for the “positive pleiotropy B” scenario, we increase the polygenicity and set 1200 regression parameters not equal to zero (β=0.1 for 100 SNPs, β=−0.1 for 100 SNPs, β=0.07 for 500 SNPs, and β=−0.07 for 500 SNPs) with the same effect strength and direction on Y_{1} and Y_{2}. For the “positive pleiotropy C” scenario, we increase the polygenicity again and set 2200 regression parameters not equal to zero (β=0.1 for 100 SNPs, β=−0.1 for 100 SNPs, β=0.05 for 1000 SNPs, and β=−0.05 for 1000 SNPs) with the same effect strength and direction on Y_{1} and Y_{2}.
For the “positive pleiotropy plus univariate effects in study 1” scenario, we introduce positive pleiotropy by setting 200 regression parameters not equal to zero (β=0.1 for 100 SNPs, β=−0.1 for 100 SNPs) with the same effect strength and direction on Y_{1} and Y_{2}. Additionally, we add a signal for 200 SNPs that is only present in study 1 (β=0.15 for 100 SNPs, β=−0.15 for 100 SNPs). In the final simulation scenario, we generate “positive and antagonistic pleiotropy” by setting 200 regression parameters not equal to zero (β=0.1 for 100 SNPs, β=−0.1 for 100 SNPs) with the same effect strength and direction on Y_{1} and Y_{2}, and additionally, we add 200 SNPs with opposing effect directions for study 1 and study 2 (β_{1}=0.15 and β_{2}=−0.15 for 100 SNPs, β_{1}=−0.15 and β_{2}=0.15 for 100 SNPs).
Generation of independent and overlapping studies. For each simulation scenario, we computed GWAS summary statistics for the ideal case of two studies with no overlap in samples. We refer to these as independent studies. Additionally, for each simulation scenario, we generated summary statistics for studies with n_{c}=5000 overlapping samples. In practice, we did this by randomly assigning 2500 subjects from study 1 to be included into study 2, and vice versa, resulting in n_{1}=n_{2}=12,500. These studies are referred to as the overlapping studies. Since the overlapping studies have more power than the independent studies, we also simulated independent studies with n_{1}=n_{2}=12,500 and refer to this as the independent studies with equal power.
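The construction of the overlapping studies can be sketched as a set operation on subject indices. This is only an illustration of the sampling scheme described above; the function name, the index layout (study 1 subjects first, study 2 subjects second) and the seed are assumptions.

```python
import random


def make_overlapping_studies(n_per_study=10000, n_swap=2500, seed=0):
    """Return (study1_ids, study2_ids) as sets of subject indices.

    Subjects 0..n_per_study-1 start in study 1 and the next n_per_study
    subjects start in study 2. Then n_swap randomly chosen subjects from
    each study are additionally included in the other study.
    """
    rng = random.Random(seed)
    base1 = range(0, n_per_study)
    base2 = range(n_per_study, 2 * n_per_study)
    study1 = set(base1) | set(rng.sample(base2, n_swap))
    study2 = set(base2) | set(rng.sample(base1, n_swap))
    return study1, study2


s1, s2 = make_overlapping_studies()
print(len(s1), len(s2), len(s1 & s2))  # 12500 12500 5000
```

The shared set contains 2500 subjects originally from study 1 and 2500 originally from study 2, which gives the stated n_{c}=5000 and n_{1}=n_{2}=12,500.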
In order to look at the effect of various amounts of sample overlap, we did an extended simulation using the “positive pleiotropy A” scenario, where the number of overlapping samples ranged from 500 to 5000, in steps of 500. In practice, we did this by randomly assigning 250, 500, 750, 1000, …, 2500 subjects from study 1 to be included into study 2, and vice versa. Thus the total overlap in samples adds up to n_{c}=500, 1000, 1500, 2000, …, 5000 subjects, and the sample size per group is n_{1}=n_{2}=10250, 10500, 10750, 11000, …, 12500.
In practice the correlation due to overlap may be subject to some estimation error. In order to test the robustness of the proposed correction, we varied the correlation value used in the decorrelation step for the “positive pleiotropy A” scenario. For this simulation, the correlation due to overlap is 0.4 but we varied the correlation value in the decorrelation step from 0.3 to 0.5.
Generation of GWAS test statistics and covariate-modulated fdr. For each simulation scenario, separately for each of study 1 and 2 (“independent”) and again for each of study 1 and 2 (“overlapping”), we fitted a univariate linear regression for each of the d=100,000 SNPs and summarised the effect of each SNP by its z-score, defined as the regression coefficient divided by its standard error. These z-scores are the final summary statistics used in further analysis. The summary statistics were then used to calculate the cmfdr for study 1 using the study 2 summary statistics as the covariate. This was done first for the independent studies and then again using the overlapping studies. The summary statistics for the overlapping studies were then corrected using Eqs. 11 and 12 (“corrected”). The number of true positives (TP), false positives (FP) and the false discovery proportion (FDP) were calculated using a cmfdr cutoff of 0.05.
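The per-SNP summary statistic can be sketched as an ordinary least squares fit of the outcome on a single genotype, followed by division of the slope by its standard error. The function name, the toy genotype generator, and the values `maf = 0.3` and `n = 10000` are illustrative assumptions, not the simulation code used for the study.

```python
import math
import random


def wald_z(x, y):
    """Wald z-score for the slope in the regression y = alpha + beta * x + error."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((u - mx) ** 2 for u in x)
    sxy = sum((u - mx) * (v - my) for u, v in zip(x, y))
    beta = sxy / sxx
    alpha = my - beta * mx
    # Residual variance with n - 2 degrees of freedom (intercept and slope).
    rss = sum((v - alpha - beta * u) ** 2 for u, v in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    return beta / se


# Toy example: one SNP coded 0/1/2 with an additive effect of 0.1 on the outcome.
random.seed(7)
maf, n = 0.3, 10000
genotype = [(random.random() < maf) + (random.random() < maf) for _ in range(n)]
outcome = [0.1 * g + random.gauss(0, 1) for g in genotype]
print(round(wald_z(genotype, outcome), 2))
```

Repeating this for every SNP in each study yields the two rows of the z-score matrix that enter the cmfdr and, for overlapping studies, the decorrelation step.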
For each of the simulation scenarios described above, we performed 100 replicates and report the average TP, FP and FDP for the following three settings:
1. independent study 1 and 2
2. uncorrected overlapping study 1 and 2
3. overlapping study 1 and 2 with the proposed correction
We define true positives as those SNPs where we introduced effects into the simulation, i.e. known non-null SNPs.
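The evaluation metrics can be sketched as a simple count over SNPs. The function name and the toy values are hypothetical, and two conventions are assumptions: a SNP counts as a discovery when its cmfdr is strictly below the cutoff, and the FDP is set to zero when there are no discoveries.

```python
def discovery_counts(cmfdr, is_nonnull, cutoff=0.05):
    """Return (TP, FP, FDP) for SNPs whose cmfdr falls below the cutoff.

    cmfdr      : sequence of cmfdr values, one per SNP
    is_nonnull : sequence of booleans, True where an effect was simulated
    """
    tp = fp = 0
    for value, nonnull in zip(cmfdr, is_nonnull):
        if value < cutoff:
            if nonnull:
                tp += 1
            else:
                fp += 1
    discoveries = tp + fp
    fdp = fp / discoveries if discoveries else 0.0
    return tp, fp, fdp


# Five toy SNPs: three pass the cutoff, one of which is a null SNP.
toy_cmfdr = [0.001, 0.20, 0.03, 0.04, 0.60]
toy_truth = [True, True, False, True, False]
print(discovery_counts(toy_cmfdr, toy_truth))  # (2, 1, 0.3333333333333333)
```

Averaging these three quantities over the 100 replicates gives the values reported for each setting.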
Psychiatric genetics consortium application
Data description. We were granted access to the raw genotype data for bipolar disorder cases, schizophrenia cases and controls from the Psychiatric Genetics Consortium (PGC) [21, 22]. The relevant institutional review boards or ethics committees approved the research protocol of the individual GWAS included in the PGC sample and all participants provided written informed consent. We used the PGC data to test the performance of our proposed correction for sample overlap in a real data setting, where we varied the amount of overlap in the control group between the schizophrenia and bipolar studies.
The data consist of n=9379 schizophrenia cases, n=6990 bipolar disorder cases and n=21,153 shared controls. Imputed genotypes in dosage format were available genome-wide, but we limited our analysis to 260,703 SNPs with MAF≥0.05 on chromosomes 1, 2 and 3 because of computational time. Using this dataset, we randomly selected 10,000 controls for schizophrenia, and then randomly selected 10,000 controls for bipolar disorder, of which 0, 2000, 4000, 6000, 8000 or 10,000 were drawn from the schizophrenia controls, corresponding to an expected correlation of ρ=0, 0.09, 0.18, 0.27, 0.36, 0.45 respectively between the GWAS summary statistics for bipolar disorder and schizophrenia. We repeated each of these conditions 10 times. We then conducted a standard GWAS for each of the 120 datasets (6 amounts of overlap × 2 types of cases × 10 repetitions) by conducting logistic regression in Plink (v1.07), adjusting for population stratification using the first two principal components. We then took the summary statistics from each GWAS and entered them pairwise into the cmfdr using the bipolar disorder summary statistics as the covariate for schizophrenia. The cmfdr calculations were done for both the raw data and also for the data after correction for sample overlap.
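The expected correlations quoted above can be checked with a short sketch. It assumes that the case-control correlation takes the form given by Lin and Sullivan [12], that no cases are shared between the two studies, and that study 1 is schizophrenia and study 2 is bipolar disorder; the function and argument names are hypothetical.

```python
import math


def case_control_overlap_correlation(n_c0, n_c1, n_10, n_11, n_20, n_21):
    """Correlation between two case-control test statistics due to shared subjects.

    n_c0, n_c1 : number of shared controls and shared cases
    n_10, n_11 : number of controls and cases in study 1
    n_20, n_21 : number of controls and cases in study 2
    """
    n_1 = n_10 + n_11
    n_2 = n_20 + n_21
    # Square root of the product of the case-to-control ratios of both studies.
    ratio = math.sqrt((n_11 * n_21) / (n_10 * n_20))
    return (n_c0 * ratio + n_c1 / ratio) / math.sqrt(n_1 * n_2)


# 9379 schizophrenia cases and 6990 bipolar disorder cases, 10,000 controls
# per study, with a varying number of controls shared between the studies.
for shared_controls in (0, 2000, 4000, 6000, 8000, 10000):
    rho = case_control_overlap_correlation(
        shared_controls, 0, 10000, 9379, 10000, 6990
    )
    print(shared_controls, round(rho, 2))
```

Under these assumptions the loop reproduces the stated values of ρ from 0 to 0.45, so the same function could supply the off-diagonal of C when correcting case-control summary statistics.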
References
Eichler EE, Flint J, Gibson G, Kong A, Leal SM, Moore JH, Nadeau JH. Missing heritability and strategies for finding the underlying causes of complex disease. Nat Rev Genet. 2010; 11:446–50.
Yang J, Bakshi A, Zhu Z, Hemani G, Vinkhuyzen AA, Lee SH, Robinson MR, Perry JR, Nolte IM, van Vliet-Ostaptchouk JV, et al. Genetic variance estimation with imputed variants finds negligible missing heritability for human height and body mass index. Nat Genet. 2015.
Andreassen OA, Thompson WK, Schork AJ, Ripke S, Mattingsdal M, Kelsoe JR, Kendler KS, et al. Improved detection of common variants associated with schizophrenia and bipolar disorder using pleiotropy-informed conditional false discovery rate. PLoS Genet. 2013; 9:1003455.
Solovieff N, Cotsapas C, Lee PH, Purcell SM, Smoller JW. Pleiotropy in complex traits: challenges and strategies. Nat Rev Genet. 2013; 14:483–95.
Andreassen OA, Zuber V, Thompson WK, Schork AJ, Betella F, Djurovic S, the PRACTICAL Consortium, et al. Identifying common genetic variants in blood pressure due to polygenic pleiotropy with associated phenotypes. Int J Epidemiol. 2014; 43(4):1205–14.
Chung D, Yang C, Li C, Gelernter J, Zhao H. GPA: a statistical approach to prioritizing GWAS results by integrating pleiotropy and annotation. PLoS Genet. 2014; 10(11):1004787.
Deloukas P, Kanoni S, Willenborg C, Farrall M, Assimes TL, Thompson JR, Ingelsson E, et al. Large-scale association analysis identifies new risk loci for coronary artery disease. Nat Genet. 2013; 45(1):25–33.
Allen HL, Estrada K, Lettre G, Berndt SI, Weedon MN, Rivadeneira F, Willer CJ, et al. Hundreds of variants clustered in genomic loci and biological pathways affect human height. Nature. 2010; 467(7317):832–8.
International Consortium for Blood Pressure Genome-Wide Association Studies, et al. Genetic variants in novel pathways influence blood pressure and cardiovascular disease risk. Nature. 2011; 478(7367):103–9.
Willer CJ, Schmidt EM, Sengupta S, Peloso GM, Gustafsson S, Kanoni S, Ganna A, et al. Discovery and refinement of loci associated with lipid levels. Nat Genet. 2013; 45(11):1274–83.
Burton PR, Clayton DG, Cardon LR, Craddock N, Deloukas P, Duncanson A, Kwiatkowski DP, et al. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007; 447(7145):661–78.
Lin DY, Sullivan PF. Meta-analysis of genome-wide association studies with overlapping subjects. Am J Hum Genet. 2009; 85:862–72.
Han B, Duong D, Sul JH, de Bakker PI, Eskin E, Raychaudhuri S. A general framework for meta-analyzing dependent studies with overlapping subjects in association mapping. Hum Mol Genet. 2016;049.
Zhu X, Feng T, Tayo BO, Liang J, Young JH, Franceschini N, Smith JA, et al. Meta-analysis of correlated traits via summary statistics from GWASs with an application in hypertension. Am J Hum Genet. 2015; 96(1):21–36.
Bolormaa S, Pryce JE, Reverter A, Zhang Y, Barendse W, Kemper K, Tier B, Savin K, Hayes BJ, Goddard ME. A multi-trait, meta-analysis for detecting pleiotropic polymorphisms for stature, fatness and reproduction in beef cattle. PLoS Genet. 2014; 10(3):1004198.
Chen GB, Lee SH, Robinson MR, Trzaskowski M, Zhu ZX, Winkler TW, Day FR, Croteau-Chonka DC, Wood AR, Locke AE, et al. Across-cohort QC analyses of GWAS summary statistics from complex traits. Eur J Hum Genet. 2017; 25(1):137.
Ferkingstad E, Frigessi A, Rue H, Thorleifsson G, Kong A. Unsupervised empirical Bayesian multiple testing with external covariates. Ann Appl Stat. 2008;714–35.
Zablocki RW, Schork AJ, Levine RA, Andreassen OA, Dale AM, Thompson WK. Covariate-modulated local false discovery rate for genome-wide association studies. Bioinformatics. 2014; 30(15):2098–104.
Liley J, Wallace C. A pleiotropy-informed Bayesian false discovery rate adapted to a shared control design finds new disease associations from GWAS summary statistics. PLoS Genet. 2015; 11(2):1004926.
1000 Genomes Project Consortium, et al. A global reference for human genetic variation. Nature. 2015; 526(7571):68.
Schizophrenia Working Group of the Psychiatric Genomics Consortium, et al. Biological insights from 108 schizophrenia-associated genetic loci. Nature. 2014; 511(7510):421–7.
Psychiatric GWAS Consortium Bipolar Disorder Working Group, et al. Large-scale genome-wide association analysis of bipolar disorder identifies a new susceptibility locus near ODZ4. Nat Genet. 2011; 43(10):977–83.
Evangelou E, Ioannidis JP. Meta-analysis methods for genome-wide association studies and beyond. Nat Rev Genet. 2013; 14(6):379–89.
Burgess S, Davies NM, Thompson SG. Bias due to participant overlap in two-sample Mendelian randomization. Genet Epidemiol. 2016; 40(7):597–608.
Province MA, Borecki IB. A correlated meta-analysis strategy for data mining ‘omic’ scans. In: Pac Symp Biocomput, vol. 18. 2013. p. 236–246. World Scientific.
Kessy A, Lewin A, Strimmer K. Optimal whitening and decorrelation. Am Stat. 2017; just accepted.
Acknowledgements
We acknowledge the following collaborators from the Schizophrenia Working Group of the Psychiatric Genomics Consortium: Stephan Ripke, Benjamin M. Neale, Aiden Corvin, James T. R. Walters, KaiHow Farh, Peter A. Holmans, Phil Lee, Brendan BulikSullivan, David A. Collier, Hailiang Huang, Tune H. Pers, Ingrid Agartz, Esben Agerbo, Margot Albus, Madeline Alexander, Farooq Amin, Silviu A. Bacanu, Martin Begemann, Richard A Belliveau Jr, Judit Bene, Sarah E. Bergen, Elizabeth Bevilacqua, Tim B Bigdeli, Donald W. Black, Richard Bruggeman, Nancy G. Buccola, Randy L. Buckner, William Byerley, Wiepke Cahn, Guiqing Cai, Murray J. Cairns, Dominique Campion, Rita M. Cantor, Vaughan J. Carr, Noa Carrera, Stanley V. Catts, Kimberly D. Chambert, Raymond C. K. Chan, Ronald Y. L. Chen, Eric Y. H. Chen, Wei Cheng, Eric F. C. Cheung, Siow Ann Chong, C. Robert Cloninger, David Cohen, Nadine Cohen, Paul Cormican, Nick Craddock, Benedicto CrespoFacorro, James J. Crowley, David Curtis, Michael Davidson, Kenneth L. Davis, Franziska Degenhardt, Jurgen Del Favero, Lynn E. DeLisi, Ditte Demontis, Dimitris Dikeos, Timothy Dinan, Srdjan Djurovic, Gary Donohoe, Elodie Drapeau, Jubao Duan, Frank Dudbridge, Naser Durmishi, Peter Eichhammer, Johan Eriksson, Valentina EscottPrice, Laurent Essioux, Ayman H. Fanous, Martilias S. Farrell, Josef Frank, Lude Franke, Robert Freedman, Nelson B. Freimer, Marion Friedl, Joseph I. Friedman, Menachem Fromer, Giulio Genovese, Lyudmila Georgieva, Elliot S. Gershon, Ina Giegling, Paola GiustiRodriguez, Stephanie Godard, Jacqueline I. Goldstein, Vera Golimbet, Srihari Gopal, Jacob Gratten, Lieuwe de Haan, Christian Hammer, Marian L. Hamshere, Mark Hansen, Thomas Hansen, Vahram Haroutunian, Annette M. Hartmann, Frans A. Henskens, Stefan Herms, Joel N. Hirschhorn, Per Hoffmann, Andrea Hofman, Mads V. Hollegaard, David M. Hougaard, Masashi Ikeda, Inge Joa, Antonio Julia, Rene S. 
Kahn, Luba Kalaydjieva, Sena KarachanakYankova, Juha Karjalainen, David Kavanagh, Matthew C. Keller, Brian J. Kelly, James L. Kennedy, Andrey Khrunin, Yunjung Kim, Janis Klovins, James A. Knowles, Bettina Konte, Vaidutis Kucinskas, Zita Ausrele Kucinskiene, Hana KuzelovaPtackova, Anna K. Kahler, Claudine Laurent, Jimmy Lee Chee Keong, S. Hong Lee, Sophie E. Legge, Bernard Lerer, Miaoxin Li, Tao Li, KungYee Liang, Jeffrey Lieberman, Svetlana Limborska, Carmel M. Loughland, Jan Lubinski, Jouko Lonnqvist, Milan Macek Jr, Patrik K. E. Magnusson, Brion S. Maher, Wolfgang Maier, Jacques Mallet, Sara Marsal, Manuel Mattheisen, Morten Mattingsdal, Robert W. McCarley, Colm McDonald, Andrew M. McIntosh, Sandra Meier, Carin J. Meijer, Bela Melegh, Ingrid Melle, Raquelle I. MesholamGately, Andres Metspalu, Patricia T. Michie, Lili Milani, Vihra Milanova, Younes Mokrab, Derek W. Morris, Ole Mors, Kieran C. Murphy, Robin M. Murray, Inez MyinGermeys, Bertram MullerMyhsok, Mari Nelis, Igor Nenadic, Deborah A. Nertney, Gerald Nestadt, Kristin K. Nicodemus, Liene NikitinaZake, Laura Nisenbaum, Annelie Nordin, Eadbhard O’Callaghan, Colm O’Dushlaine, F. Anthony O’Neill, SangYun Oh, Ann Olincy, Line Olsen, Jim Van Os, Psychosis Endophenotypes International Consortium, Christos Pantelis, George N. Papadimitriou, Sergi Papiol, Elena Parkhomenko, Michele T. Pato, Tiina Paunio, Milica PejovicMilovancevic, Diana O. Perkins, Olli Pietilainen, Jonathan Pimm, Andrew J. Pocklington, John Powell, Alkes Price, Ann E. Pulver, Shaun M. Purcell, Digby Quested, Henrik B. Rasmussen, Abraham Reichenberg, Mark A. Reimers, Alexander L. Richards, Joshua L. Roffman, Panos Roussos, Douglas M. Ruderfer, Veikko Salomaa, Alan R. Sanders, Ulrich Schall, Christian R. Schubert, Thomas G. Schulze, Sibylle G. Schwab, Edward M. Scolnick, Rodney J. Scott, Larry J. Seidman, Jianxin Shi, Engilbert Sigurdsson, Teimuraz Silagadze, Jeremy M. Silverman, Kang Sim, Petr Slominsky, Jordan W. Smoller, HonCheong So, Chris C. A.
Spencer, Eli A. Stahl, Hreinn Stefansson, Stacy Steinberg, Elisabeth Stogmann, Richard E. Straub, Eric Strengman, Jana Strohmaier, T. Scott Stroup, Mythily Subramaniam, Jaana Suvisaari, Dragan M. Svrakic, Jin P. Szatkiewicz, Erik Soderman, Srinivas Thirumalai, Draga Toncheva, Paul A. Tooney, Sarah Tosato, Juha Veijola, John Waddington, Dermot Walsh, Dai Wang, Qiang Wang, Bradley T. Webb, Mark Weiser, Dieter B. Wildenauer, Nigel M. Williams, Stephanie Williams, Stephanie H. Witt, Aaron R. Wolen, Emily H. M. Wong, Brandon K. Wormley, Jing Qin Wu, Hualin Simon Xi, Clement C. Zai, Xuebin Zheng, Fritz Zimprich, Naomi R. Wray, Kari Stefansson, Peter M. Visscher, Wellcome Trust CaseControl Consortium 2, Rolf Adolfsson, Ole A. Andreassen, Douglas H. R. Blackwood, Elvira Bramon, Joseph D. Buxbaum, Anders D. Borglum, Sven Cichon, Ariel Darvasi, Enrico Domenici, Hannelore Ehrenreich, Tonu Esko, Pablo V. Gejman, Michael Gill, Hugh Gurling, Christina M. Hultman, Nakao Iwata, Assen V. Jablensky, Erik G. Jonsson, Kenneth S. Kendler, George Kirov, Jo Knight, Todd Lencz, Douglas F. Levinson, Qingqin S. Li, Jianjun Liu, Anil K. Malhotra, Steven A. McCarroll, Andrew McQuillin, Jennifer L. Moran, Preben B. Mortensen, Bryan J. Mowry, Markus M. Nothen, Roel A. Ophoff, Michael J. Owen, Aarno Palotie, Carlos N. Pato, Tracey L. Petryshen, Danielle Posthuma, Marcella Rietschel, Brien P. Riley, Dan Rujescu, Pak C. Sham, Pamela Sklar, David St Clair, Daniel R. Weinberger, Jens R. Wendland, Thomas Werge, Mark J. Daly, Patrick F. Sullivan and Michael C. O’Donovan.
We acknowledge the following collaborators from the Bipolar Disorder Working Group of the Psychiatric Genomics Consortium: Mark Daly, Marcella Rietschel, Nicholas Craddock, John I. Nurnberger, Michael Gill, Keith Matthews, Jana Strohmaier, Devin Absher, Huda Akil, Adebayo Anjorin, Lena Backlund, Judith A. Badner, Jack D. Barchas, Thomas B. Barrett, Nick Bass, Michael Bauer, Frank Bellivier, Sarah E. Bergen, Wade Berrettini, Douglas Blackwood, Cinnamon S. Bloss, Michael Boehnke, Gerome Breen, William E. Bunner, Margit Burmeister, William Byerley, Sian Caesar, Kim Chambert, David W. Craig, Richard Day, Howard J. Edenberg, Amanda Elkin, Bruno Etain, Manuel A. Ferreira, I. Nicol Ferrier, Matthew Flickinger, Tatiana Foroud, Christine Fraser, Louise Frisen, Elliot S. Gershon, Katherine GordonSmith, Elaine K. Green, Tiffany A. Greenwood, Detelina Grozeva, Weihua Guan, Marian L. Hamshere, Martin Hautzinger, Maria Hipolito, Stephane Jamain, Edward G. Jones, Radhika Kandaswamy, John R. Kelsoe, James L. Kennedy, Daniel L. Koller, Phoenix Kwan, Mikael Landen, Niklas Langstrom, Mark Lathrop, Jacob Lawrence, Marion Leboyer, Phil H. Lee, Jun Li, Chunyu Liu, Falk W. Lohoff, Pamela B. Mahon, Melvin G. McInnis, Rebecca McKinney, Francis J McMahon, Andrew McQuillin, Sandra Meier, Fan Meng, Manuel Mettheisen, Philip B Mitchell, Jennifer Moran, Gunnar Morken, Thomas W. Muhleisen, Walter J. Muir, Richard M. Myers, Caroline M. Nievergelt, Vishwajit Nimgaonkar, Evaristus A. Nwulia, Urban Osby, Benjamin S. Pickard, Peter Propping, Emma Quinn, Soumya Raychaudhuri, John Rice, Martin Schalling, Alan F. Schatzberg, Peter R. Schofield, Nicholas J. Schork, Johannes Schumacher, Markus M. Schwarz, Ed Scolnick, Laura J. Scott, Paul D. Shilling, Erin N. Smith, David St. Clair, John Strauss, Szabocls Szelinger, Robert C. Thompson, John B. Vincent, Stanley J. Watson, Thomas F. Wienker, Richard Williamson, Stephanie H. Witt, Adam Wright, Wei Xu, Allan H. Young, Peter P.
Zandi, Peng Zhang, Sebastian Zollner, Anne E Farmer, Lisa Jones, Ian Jones, William B. Lawson, Susanne Lucae, Nicholas G. Martin, Peter McGuffin, Alan W. McLean, Grant W. Montgomery, Pierandrea Muglia, Bertram MullerMyhsok, James B. Potash, William A. Scheftner, Federica Tozzi, William H. Coryell, Shaun M. Purcell, Ole A. Andreassen, Srdjan Djurovic, Morten Mattingsdal, Danyu Lin, Valentina Moskvina, David A. Collier, Aiden Corvin, Frank Dudbridge, Hugh Gurling, Peter A. Holmans, Christina M. Hultman, George K. Kirov, Paul Lichtenstein, Kevin A. McGhee, Ingrid Melle, Derek W. Morris, Ivan Nikolov, Colm O’Dushlaine, Michael J. Owen, Hannes Petursson, Douglas Ruderfer, Engilbert Sigurdsson, Pamela Sklar, Kari Stefansson, Michael C. O’Donovan, Andrew McIntosh, Rene Breuer, Josef Frank, Stefan Herms, Wolfgang Maier, Manuel Mattheisen, Markus M Nothen, Michael Steffens, Jens Treutlein, Sven Cichon, Franziska Degenhardt, Thomas G. Schulze.
Funding
Verena Zuber is supported by the Wellcome Trust and the Royal Society (Grant Number 204623/Z/16/Z) and the UK Medical Research Council (Grant Number MC_UU_00002/7).
Availability of data and materials
For simulated data: The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.
For the Psychiatric Genetics Consortium (PGC) data: The data that support the findings of this study are available from the PGC but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Data are however available from the authors upon reasonable request and with permission of the PGC.
Author information
Contributions
ML, VZ: conception and design, data simulation, analysis and interpretation and manuscript writing. BKA and AF: conception and design, interpretation and manuscript writing. WKT: interpretation and manuscript writing. OAA: data access, interpretation and manuscript writing. The Schizophrenia and Bipolar Disorder Working Groups of the Psychiatric Genomics Consortium: data access. All authors have read and approved the manuscript.
Ethics declarations
Ethics approval and consent to participate
We did not collect any new samples for this study. The Psychiatric Genetics Consortium data used here have been previously published [21, 22] and were collected in accordance with ethical regulations in the partner countries and as defined in the original research publications (for schizophrenia see the Supplement of [21] and for bipolar disorder see the Supplement of [22]). The lead PI of each sample warranted that their protocol was approved by their local Ethical Committee. All subjects provided written informed consent. There were nearly 50 ethics committees that approved the contributed samples and these are listed in the Supplements of the original publications.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional file
Additional file 1
Plot of correlation due to overlap versus quantitative trait correlation. Supplemental Figure 1. Plot of the correlation due to overlap for two quantitative traits as a function of percent sample overlap and the correlation of the traits (Cor(Y_{1},Y_{2})). Here we assume the sample sizes for the two GWASs are equal. See Eq. 7. (PDF 40 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
LeBlanc, M., Zuber, V., Thompson, W.K. et al. A correction for sample overlap in genome-wide association studies in a polygenic pleiotropy-informed framework. BMC Genomics 19, 494 (2018). https://doi.org/10.1186/s12864-018-4859-7
DOI: https://doi.org/10.1186/s12864-018-4859-7
Keywords
 Data integration
 Meta-analysis with shared subjects
 Covariate-modulated false discovery rate
 Cross-phenotype association