Samples and applied data
In this study, we examined samples from 47 Danube Swabian individuals with well-documented family histories spanning 3–6 successive generations of unadmixed Swabian ancestry, supported by self-declared family history and the resulting pedigree trees. The sampled individuals live in the villages of Dunaszekcső and Bár, which lie along the Danube River in Southwest Hungary (Supplemental Fig. 1). Of the 47 samples, 29 were collected in Dunaszekcső and 18 in Bár. Nineteen individuals were male and 28 were female, giving an M/F ratio of 0.68. The Swabian population of these villages has remained mostly isolated from other ethnicities until today, providing an opportunity to study their genetic makeup and relationship with major European groups.
DNA was extracted from ethylenediaminetetraacetic acid (EDTA)-anticoagulated whole blood and genotyped on the Illumina Infinium Global Screening Array Beadchip platform, which contains 725 831 single-nucleotide polymorphisms (SNPs). Isolation, genotyping, and preliminary quality control of the samples were carried out by the third-party service provider Human Genomics Facility (HUGE-F) at the University of Rotterdam in the Netherlands. Quality control and data preparation of the marker data were carried out domestically using in-house scripts and the PLINK1.9 and 2.0 software packages [14, 15]. The data was filtered using Hardy-Weinberg equilibrium tests; in addition, SNPs with missing genotypes were removed from the dataset using the PLINK ‘geno’ flag with a threshold value of 0.1. All Swabian individuals passed these tests, and 665 073 SNPs remained in our Danube Swabian dataset.
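The missingness filter described above can be illustrated with a minimal sketch. It is not the in-house script used in the study: the SNP identifiers and genotype calls are invented toy data, and the function names are hypothetical. The sketch assumes that a SNP is removed only when its missing-call rate exceeds the threshold, which is how the PLINK ‘geno’ filter is documented to behave.

```python
from typing import Dict, List, Optional

# One call per sample: 0, 1 or 2 copies of an allele; None marks a missing call.
Genotype = Optional[int]


def missing_rate(calls: List[Genotype]) -> float:
    """Fraction of samples without a genotype call at one SNP."""
    return sum(c is None for c in calls) / len(calls)


def filter_by_missingness(
    snps: Dict[str, List[Genotype]], max_missing: float = 0.1
) -> Dict[str, List[Genotype]]:
    """Keep SNPs whose missing-call rate does not exceed max_missing."""
    return {
        snp_id: calls
        for snp_id, calls in snps.items()
        if missing_rate(calls) <= max_missing
    }


# Toy data (hypothetical): three SNPs typed in ten samples.
toy = {
    "rs_a": [0, 1, 2, 0, 1, 1, 0, 2, 1, 0],        # no missing calls
    "rs_b": [0, None, 2, 0, 1, 1, 0, 2, 1, 0],     # 10% missing, kept
    "rs_c": [None, None, 2, 0, 1, 1, 0, 2, 1, 0],  # 20% missing, removed
}
kept = filter_by_missingness(toy)
print(sorted(kept))
```

In this toy example only the SNP with 20% missing calls is dropped; the Hardy-Weinberg filter is a separate step and is not shown.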
This study belongs to a series of investigations that were approved by the National Ethics Board (ETT TUKEB) and by the Regional Ethics Committee of Pécs, and it follows the principles expressed in the Declaration of Helsinki.
Genome-wide autosomal marker data from other open genotype databases was also considered in the study. We used the 1000 Genomes Project (1KGP) and Human Genome Diversity Project (HGDP) datasets, which are openly available from their respective sources [9, 16,17,18]. We also considered population data from the open genome-wide marker data repository hosted on the server of the Estonian Biocentre [19, 20]. In addition, we used the Allen Ancient DNA Resource (AADR) dataset, which is openly available from the David Reich lab at Harvard University [21]. Populations from the European and Caucasus regions were taken from the HGDP and 1KGP datasets. Additional populations from the Estonian Biocentre included Hungarians, Romanians, and Germans. German samples were filtered according to preliminary PCA and ADMIXTURE analyses using 1KGP and HGDP data separately, since we discovered that some of the German samples are outliers possessing significant non-West European (presumably East European) genetic ancestry. These six German samples were removed from the German data prior to our analyses. Since the sampling of the German data was based on self-declaration, some of these individuals might not originate from Germany but from neighboring countries.
Principal component analysis-based population structure analysis
Population structure analysis and fixation index (Fst) matrix calculation were performed using the SMARTPCA software of the EIGENSOFT 6.1.4 package [22].
For the PCA, we first merged the Swabian samples with the European 1KGP groups: British samples from England and Scotland (GBR), Finnish samples from Finland (FIN), Iberian samples from Spain (IBS), Toscani from Italy (TSI), and Utah residents with Northern and Western European ancestry from the CEPH collection (CEU).
A second merged dataset, combining the Swabian samples with various European groups from the HGDP, AADR and Estonian Biocentre data, was also created and analyzed using PCA. This dataset includes the HGDP populations French, French Basques, Orcadians, North Italians, Sardinians, Tuscans, Russians and Adygei. The populations taken from the AADR were English, Scottish, Spanish, Finnish, and Greek, which also include the 1KGP samples of these groups. From the Estonian data, Hungarians, Romanians, and Germans were used.
The first (1KGP) dataset contained n = 556 individuals and 159 240 SNPs; the second dataset, featuring European populations from various repositories, contained n = 666 individuals and 106 121 SNPs. SNPs in strong background linkage disequilibrium (LD) were pruned out with the ‘indep-pairwise’ command of PLINK1.9, setting the r² threshold to 0.3. This step is necessary before the analyses because strong background LD can bias not only the PCA method but also expectation maximization-based ancestry estimation algorithms like ADMIXTURE and TreeMix, which were used in this study. After the pruning process, 149 979 and 79 757 SNPs remained in our first and second datasets, respectively. We used SMARTPCA with default settings; the σ-threshold was set to 6.0. Fst calculations were carried out on our second dataset using the “fstonly” option of the SMARTPCA software.
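The idea behind pairwise LD pruning can be sketched as follows. This is a simplified illustration and not PLINK's implementation: it uses the squared Pearson correlation of genotype dosages as r², compares each SNP only with previously kept SNPs inside a window counted in SNPs, and always drops the later SNP of a correlated pair. The window size is a hypothetical parameter (the source reports only the r² threshold of 0.3), and the genotypes are toy data.

```python
from typing import List


def r_squared(x: List[float], y: List[float]) -> float:
    """Squared Pearson correlation between two genotype dosage vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # monomorphic SNP: correlation undefined, treated as unlinked
    return cov * cov / (var_x * var_y)


def prune(genotypes: List[List[float]], window: int, threshold: float = 0.3) -> List[int]:
    """Return the indices of SNPs kept after greedy pairwise pruning."""
    kept: List[int] = []
    for i, snp in enumerate(genotypes):
        # Compare only with already-kept SNPs fewer than `window` positions away.
        nearby = [j for j in kept if i - j < window]
        if all(r_squared(snp, genotypes[j]) <= threshold for j in nearby):
            kept.append(i)
    return kept


# Toy data (hypothetical): three SNPs (rows) typed in six samples (columns).
toy = [
    [0, 1, 2, 0, 1, 2],  # SNP 0
    [0, 1, 2, 0, 1, 2],  # SNP 1: identical to SNP 0, so r² = 1
    [2, 0, 1, 1, 2, 0],  # SNP 2: r² = 0.25 with SNP 0
]
print(prune(toy, window=50))
```

Here SNP 1 is removed because it is perfectly correlated with SNP 0, while SNP 2 stays because its r² with SNP 0 is below 0.3.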
Maximum likelihood method-based ancestry estimation
Ancestry estimation was carried out with the ADMIXTURE 1.22 algorithm, a maximum likelihood estimation method that uses an expectation maximization approach [23]. We ran ADMIXTURE on our second dataset, which contains various European populations from different genotype data repositories. The analysis was run for numbers of clusters (K) from 2 to 10, and cross-validation was performed to find the best-fitting K for the relationships among the investigated populations.
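Selecting K by cross-validation amounts to choosing the run with the lowest cross-validation error. The sketch below assumes that each run's log contains a line of the form `CV error (K=3): 0.5187`, as ADMIXTURE prints when cross-validation is enabled; the error values are invented for illustration and are not results of this study, and the function names are hypothetical.

```python
import re
from typing import Dict

# Assumed log-line format: "CV error (K=<K>): <error>"
CV_LINE = re.compile(r"CV error \(K=(\d+)\): ([0-9.]+)")


def parse_cv_errors(log_text: str) -> Dict[int, float]:
    """Collect the cross-validation error reported for each K."""
    return {int(k): float(err) for k, err in CV_LINE.findall(log_text)}


def best_k(cv_errors: Dict[int, float]) -> int:
    """Return the K with the lowest cross-validation error."""
    return min(cv_errors, key=cv_errors.get)


# Toy log lines (hypothetical values).
toy_log = """\
CV error (K=2): 0.5210
CV error (K=3): 0.5187
CV error (K=4): 0.5195
"""
errors = parse_cv_errors(toy_log)
print(best_k(errors))
```

In practice the logs of all runs (here K = 2 to 10) would be concatenated or parsed one by one before taking the minimum.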
TreeMix was applied to this dataset alongside ADMIXTURE to describe the relationships of these populations with a maximum-likelihood tree, complementing the stacked-column ancestry estimates [24]. The size of the SNP blocks (-k flag) was set to 1000, and the algorithm was run multiple times to estimate 1–6 migration events in the data. For these investigations, we used the same pruned dataset that was created for PCA, with Uyghurs from the HGDP data added as an outgroup (n = 681, 79 757 SNPs).
Formal test of admixture
To test the relationship between Swabians and the other investigated populations in the second dataset, we used a formal test of admixture, the 4-population test. The qpDstat program from the ADMIXTOOLS 4.1 package was used for this purpose; as its name suggests, the test was implemented here as D-statistics [25]. For these calculations, we used the unpruned version of our second dataset. YRI from the 1KGP data was added to these tests as an outgroup. We tested unrooted phylogenetic trees containing YRI, Swabians, Hungarians and various European populations: Germans, English, French, Orcadian, Scottish, Spanish, North Italian, Toscani, Russian and Romanian. We applied the following setups of the ((W,X)(Y,Z)) unrooted trees:
((YRI,Hungarian)(Swabian,German)), ((YRI,Swabian)(Hungarian,German)), ((YRI,European Test)(Swabian,Hungarian)), ((YRI,European Test)(Swabian,German)). These tests were intended to show the relationship of Swabians to the Hungarian host population, to the Germans and to various European populations.
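For orientation, the D-statistic for a tree ((W,X)(Y,Z)) can be computed from population allele frequencies as the sum over SNPs of (w − x)(y − z), divided by the sum of (w + x − 2wx)(y + z − 2yz). The sketch below implements only this point estimate under that standard definition; it omits the block jackknife that qpDstat uses to obtain standard errors and Z-scores, and the allele frequencies are invented toy values.

```python
from typing import Sequence


def d_statistic(
    w: Sequence[float], x: Sequence[float], y: Sequence[float], z: Sequence[float]
) -> float:
    """Point estimate of D for ((W,X)(Y,Z)) from per-SNP allele frequencies."""
    numerator = 0.0
    denominator = 0.0
    for wi, xi, yi, zi in zip(w, x, y, z):
        numerator += (wi - xi) * (yi - zi)
        denominator += (wi + xi - 2 * wi * xi) * (yi + zi - 2 * yi * zi)
    return numerator / denominator


# Toy allele frequencies (hypothetical) at three SNPs.
freq_w = [0.10, 0.50, 0.80]  # outgroup, e.g. YRI
freq_x = [0.30, 0.60, 0.40]  # test population
freq_y = [0.20, 0.70, 0.50]
freq_z = [0.35, 0.55, 0.45]

d = d_statistic(freq_w, freq_x, freq_y, freq_z)
print(round(d, 4))
```

A value of D close to zero is consistent with the proposed tree. Under this definition, swapping Y and Z changes only the sign of D, and a significantly non-zero value (judged by the Z-score, not computed here) indicates excess allele sharing between one pair of populations across the two clades.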
Identity by descent and homozygosity by descent analyses
To assess the sources of ancestry in the investigated Swabian samples, we used the Refined IBD algorithm of Beagle 4.1 [26]. The software searches phased haplotype data for IBD segments between all pairs of individuals, which indicates the relative share of one population in the ancestry of the investigated population. To minimize SNP loss, we used an unpruned dataset for this test, consisting only of Swabians and the HGDP and Estonian Biocentre groups, with n = 601 individuals and 110 733 SNPs. Before the analysis, the data was converted to the format required by the algorithm using the PLINK1.9 software. The major alleles were set as the A2 allele, and the dataset was converted to Variant Call Format 4.1 with the PLINK/SEQ software [27]. The minimum segment length was set to 3 centimorgans (cM), and the IBD trim parameter was set to 10. The IBD scale parameter was calculated with the recommended formula \(\sqrt{n/100}\), since our data contained more than 400 individuals [26]. Using the inferred IBD segment data, we calculated the average pairwise IBD sharing between Swabians and various populations with the following formula according to Atzmon et al.:
$$Average\;pairwise\;IBD\;sharing=\frac{\sum_{i=1}^n\sum_{j=1}^m{IBD}_{ij}}{n\cdot m}$$
Here, \({IBD}_{ij}\) is the length of the IBD segment shared between individuals i and j, and n and m are the numbers of individuals in groups I and J, respectively [28].
We also calculated the average number and average length of IBD segments between Swabians and the various investigated populations.
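The summary statistics above can be sketched as follows. The function takes a list of inferred segments and two disjoint sets of individual identifiers and returns the average pairwise IBD sharing as defined by the formula, together with two further summaries. How the "average number" of segments was normalized is not stated in the text, so dividing by the number of individual pairs is an assumption of this sketch; the identifiers and segment lengths are toy data.

```python
from typing import Dict, Iterable, Set, Tuple

# (individual 1, individual 2, segment length in cM)
Segment = Tuple[str, str, float]


def ibd_summary(
    segments: Iterable[Segment], group_i: Set[str], group_j: Set[str]
) -> Dict[str, float]:
    """Summarize IBD sharing between two disjoint groups of individuals."""
    # Keep only segments shared between one member of each group.
    lengths = [
        length
        for a, b, length in segments
        if (a in group_i and b in group_j) or (a in group_j and b in group_i)
    ]
    n_pairs = len(group_i) * len(group_j)  # n * m in the formula
    total = sum(lengths)
    return {
        "avg_pairwise_sharing": total / n_pairs,
        "avg_segments_per_pair": len(lengths) / n_pairs,  # assumed normalization
        "avg_segment_length": total / len(lengths) if lengths else 0.0,
    }


# Toy data (hypothetical identifiers and lengths).
swabians = {"S1", "S2"}
germans = {"G1", "G2", "G3"}
toy_segments = [
    ("S1", "G1", 4.0),
    ("G2", "S1", 6.0),
    ("S2", "G3", 5.0),
    ("S1", "S2", 10.0),  # within-group segment, ignored
]
summary = ibd_summary(toy_segments, swabians, germans)
print(summary)
```

With two Swabian and three German toy individuals there are six pairs, so the 15 cM shared across groups gives an average pairwise sharing of 2.5 cM.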
Besides IBD segments, Refined IBD simultaneously detects homozygosity by descent (HBD) segments, which also allows us to infer the genome-wide autozygosity of the respective populations. This can indicate the degree of isolation and inbreeding of these groups. Therefore, the average length and number of HBD segments were also calculated.