Using the method described here, we demonstrate that, given a large collection of SNP markers and patterns of linkage disequilibrium modeled by a panel of haplotypes, two sets of non-overlapping, low-coverage sequencing data can be compared to determine whether they originated from the same individual.

### Additional methods

#### Algorithm

We begin with a SNP reference panel of haplotypes from a population chosen, based on a priori knowledge or previous analyses, to best represent the population from which the samples originated. For each sample, we examine reads that overlap SNP positions from the panel to identify the base at that position. Bases that do not match one of the two alleles from the reference panel are discarded. Positions to which multiple reads map are omitted. At low coverage, the majority of panel positions will have no observation.
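The filtering rules above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `collect_observations` and the input layout (one `(chrom, pos, base)` tuple per read overlapping a site) are hypothetical, and the sketch assumes that read depth is counted before allele filtering, a detail the text does not specify.

```python
from collections import Counter


def collect_observations(read_bases, panel_alleles):
    """Return {(chrom, pos): base} for panel SNPs covered by exactly one read.

    read_bases    -- iterable of (chrom, pos, base), one entry per read
                     overlapping that position (hypothetical input layout)
    panel_alleles -- dict mapping (chrom, pos) -> (allele_1, allele_2)
    """
    depth = Counter()
    last_base = {}
    for chrom, pos, base in read_bases:
        site = (chrom, pos)
        if site not in panel_alleles:
            continue  # not a SNP position in the reference panel
        depth[site] += 1
        last_base[site] = base

    observations = {}
    for site, n_reads in depth.items():
        base = last_base[site]
        # Assumption: omit sites covered by more than one read, then
        # discard bases matching neither panel allele.
        if n_reads == 1 and base in panel_alleles[site]:
            observations[site] = base
    return observations


panel = {("1", 100): ("A", "G"), ("1", 200): ("C", "T"), ("1", 300): ("A", "C")}
reads = [("1", 100, "A"), ("1", 200, "G"),
         ("1", 300, "A"), ("1", 300, "C"), ("1", 400, "T")]
observed = collect_observations(reads, panel)
```

In this toy input, only position 100 yields an observation: position 200 carries a base that matches neither panel allele, position 300 is covered by two reads, and position 400 is not in the panel.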

Next, we find all pairs of SNPs, one observed in each sample, that lie within a specified distance of each other on the same chromosome. A minimum distance is also enforced to ensure that the two base observations are never made from the same fragment in within-library comparisons. For each pair of base observations, denoted as *A* and *B*, we calculate the probability of that observation under two models. The first represents the probability of observing this combination of bases when the observations were made from independent chromosomes, i.e. two unrelated individuals. In this case, the two observations are independent and based solely on the frequencies of each allele in the population:

$$ {P}_2\left(A\wedge B\right)=f(A)f(B) $$

where *f*(*A*) and *f*(*B*) are the frequencies of alleles *A* and *B* in the population.

In the second model, the observations are made from a diploid individual, where there is an equal chance of the two observations originating from the same chromosome or from different chromosomes:

$$ {P}_1\left(A\wedge B\right)=\frac{1}{2}f(AB)+\frac{1}{2}f(A)f(B) $$

In the case where both observations are made from the same chromosome, the probability of observing alleles *A* and *B* is the frequency that *A* and *B* appear on the same chromosome in the population, i.e. the haplotype frequency, *f*(*AB*). Otherwise, the probability is the same as independently observing *A* and *B* on different chromosomes.

These two models are compared as a log-likelihood ratio, which is calculated as:

$$ \gamma \left(A,B\right)={ \log}_2\frac{P_1\left(A\wedge B\right)}{P_2\left(A\wedge B\right)} $$
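As a minimal sketch of the two models and their ratio, the following Python functions estimate *f*(*A*), *f*(*B*), and *f*(*AB*) from a haplotype panel and compute *γ*. The function names and the representation of the panel as a list of allele sequences are assumptions made for illustration only.

```python
import math


def panel_frequencies(haplotypes, i, j, a, b):
    """Estimate f(A), f(B), and f(AB) from a panel of haplotypes.

    haplotypes -- list of sequences; haplotypes[h][k] is the allele that
                  haplotype h carries at SNP index k (hypothetical layout)
    i, j       -- indices of the two SNPs
    a, b       -- the bases observed at SNP i and SNP j
    """
    n = len(haplotypes)
    f_a = sum(h[i] == a for h in haplotypes) / n
    f_b = sum(h[j] == b for h in haplotypes) / n
    f_ab = sum(h[i] == a and h[j] == b for h in haplotypes) / n
    return f_a, f_b, f_ab


def pair_llr(f_a, f_b, f_ab):
    """Log2 likelihood ratio of 'same individual' vs 'unrelated individuals'."""
    if f_a <= 0 or f_b <= 0:
        raise ValueError("observed alleles must have non-zero panel frequency")
    p_unrelated = f_a * f_b                   # P_2: independent chromosomes
    p_same = 0.5 * f_ab + 0.5 * f_a * f_b     # P_1: same diploid individual
    return math.log2(p_same / p_unrelated)


toy_panel = ["AC", "AC", "GT", "GT"]          # two SNPs in perfect LD
f_a, f_b, f_ab = panel_frequencies(toy_panel, 0, 1, "A", "C")
gamma_linked = pair_llr(f_a, f_b, f_ab)
```

Two limiting cases follow directly from the equations: under linkage equilibrium, *f*(*AB*) = *f*(*A*)*f*(*B*), so the two models coincide and *γ* = 0; when the observed alleles never occur together on a panel haplotype, *f*(*AB*) = 0 and *γ* = log2(1/2) = -1.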

Log-likelihood ratios are aggregated across the entire genome by summing *γ* over a set *S* of SNP pairs sampled from windows of a set size across the genome:

$$ \Lambda (S) = {\displaystyle \sum_{\left(A,B\right)\in S}}\gamma \left(A,B\right) $$

This step can be repeated as a bootstrapping approach to estimate the empirical distribution of the genome-wide aggregated log-likelihood ratio.
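The pairing, window sampling, summation, and resampling steps can be sketched as below. Several details are assumptions rather than statements from the text: the sketch draws one pair per window, assigns a pair to a window by its leftmost position, and takes `gamma` as a hypothetical callable that returns the log-likelihood ratio for a pair of positions.

```python
import random


def candidate_pairs(sites_1, sites_2, min_dist, max_dist):
    """All (p, q) with p from sample 1, q from sample 2, on one chromosome,
    whose separation lies in [min_dist, max_dist]."""
    return [(p, q) for p in sites_1 for q in sites_2
            if min_dist <= abs(p - q) <= max_dist]


def sample_pairs_by_window(pairs, window_size, rng):
    """Draw one pair at random from each window that contains any pair.
    Assumption: a pair belongs to the window of its leftmost position."""
    by_window = {}
    for p, q in pairs:
        by_window.setdefault(min(p, q) // window_size, []).append((p, q))
    return [rng.choice(by_window[w]) for w in sorted(by_window)]


def aggregate_llr(sampled_pairs, gamma):
    """Lambda(S): sum of gamma over the sampled set of pairs."""
    return sum(gamma(p, q) for p, q in sampled_pairs)


def bootstrap_llr(pairs, gamma, window_size, n_reps, seed=0):
    """Repeat the window sampling to obtain an empirical distribution."""
    rng = random.Random(seed)
    return [aggregate_llr(sample_pairs_by_window(pairs, window_size, rng), gamma)
            for _ in range(n_reps)]


pairs = candidate_pairs([100, 5000], [150, 5200, 9000],
                        min_dist=100, max_dist=1000)
replicates = bootstrap_llr(pairs, lambda p, q: 0.5, window_size=1000, n_reps=3)
```

The nested loop in `candidate_pairs` is quadratic and is kept only for readability; a genome-scale version would scan sorted positions instead.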

#### Simulations

We performed coalescent simulations to test our method free of base errors and under various demographic scenarios. We used the coalescent simulator ms [35] to simulate diploid individuals and population reference panels of haplotypes for comparison. For each replicate, we simulated 3000 independent segments of 500 kb in size for a total of 1.5 Gb. Segregating sites with minor allele frequencies lower than 10% were removed. Reference panels consisted of 200 haplotypes. For diploid individuals, we simulated base observations from low-coverage sequencing by randomly drawing an allele from segregating sites at a rate of 0.01. This was done separately for each chromosome, and sites where both alleles were observed were discarded, resulting in ~0.02-fold coverage. This process was repeated to construct multiple observation sets per individual.
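The observation-sampling step can be illustrated with a short Python sketch. It reads "sites where both alleles were observed" as sites drawn from both chromosomes of the individual, which is an interpretation of the text; the function name and the haplotype encoding are hypothetical.

```python
import random

N_SEGMENTS = 3000
SEGMENT_LENGTH = 500_000                        # 500 kb
TOTAL_LENGTH = N_SEGMENTS * SEGMENT_LENGTH      # 1.5 Gb per replicate


def simulate_observations(hap_1, hap_2, rate, rng):
    """Return {site_index: allele} for one simulated low-coverage data set.

    hap_1, hap_2 -- the individual's two haplotypes, as equal-length
                    sequences of alleles at the segregating sites
    rate         -- per-chromosome probability of observing a site
    """
    observations = {}
    for i, (allele_1, allele_2) in enumerate(zip(hap_1, hap_2)):
        seen_1 = rng.random() < rate
        seen_2 = rng.random() < rate
        if seen_1 and seen_2:
            continue  # observed from both chromosomes: discard the site
        if seen_1:
            observations[i] = allele_1
        elif seen_2:
            observations[i] = allele_2
    return observations


rng = random.Random(42)
hap_1 = "0101100110" * 100
hap_2 = "0011010101" * 100
obs = simulate_observations(hap_1, hap_2, rate=0.01, rng=rng)
```

Under this reading, a site is retained when exactly one chromosome is drawn, which happens with probability 2 × 0.01 × 0.99 = 0.0198, consistent with the ~0.02-fold coverage stated above.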

The simple single-population model used a constant effective population size of 10,000. The second model, representing the case in which the reference population and the samples originate from distinct populations, simulated an ancestral population of 10,000 that split 100, 500, 1000, 2000, or 4000 generations ago into two equal-sized populations of 10,000 each. The model of recent human history was based on parameters inferred by Gutenkunst et al. [26] (see Additional file 1).

#### Reference panels

All human sequence and reference panel data used in this study were downloaded from public sources (accession details listed below). Institutional review and ethical approval were not required for this research.

We constructed reference panels of single nucleotide polymorphisms (SNPs) using the 1000 Genomes Project Phase 1 data set [25]. We filtered for biallelic SNPs that were polymorphic in the target population (CEU, GBR, etc.) with a minimum minor allele count of 10. To avoid errors from mismapped reads, we restricted our panels to sites where all overlapping 35-mers are unique across hg19 according to the Duke Uniq 35 track from the Mappability tracks on the UCSC Genome Browser [36].
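The site filters can be expressed as two small predicates. This is a sketch under stated assumptions: the function names are hypothetical, allele counts are taken as already tabulated for the target population, and 35-mer uniqueness is represented abstractly as a set of start positions whose 35-mer is unique, rather than by parsing the actual track format.

```python
def passes_allele_filters(ref, alts, alt_count, n_haplotypes, min_mac=10):
    """True for a biallelic SNP whose minor allele count is at least min_mac.

    ref, alts    -- reference allele and list of alternate alleles
    alt_count    -- copies of the alternate allele in the target population
    n_haplotypes -- number of haplotypes sampled from that population
    """
    if len(alts) != 1:
        return False                      # not biallelic
    if len(ref) != 1 or len(alts[0]) != 1:
        return False                      # not a single-nucleotide variant
    minor_allele_count = min(alt_count, n_haplotypes - alt_count)
    return minor_allele_count >= min_mac


def all_kmers_unique(pos, unique_starts, k=35):
    """True if every k-mer overlapping position pos is unique.

    unique_starts -- set of start positions whose k-mer is unique in the
                     genome (hypothetical stand-in for the mappability track)
    """
    return all(start in unique_starts for start in range(pos - k + 1, pos + 1))


keep = (passes_allele_filters("A", ["G"], alt_count=40, n_haplotypes=170)
        and all_kmers_unique(100, set(range(66, 101))))
```

A site at position *p* is overlapped by the 35 k-mers starting at *p* − 34 through *p*, which is why the second predicate checks that whole range of start positions.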

#### Modern and ancient human sequence data

We obtained Illumina sequencing data for a European male (NA12891) and a European female (NA12892), sequenced as part of Platinum Genomes by Illumina, Inc. [27], from the National Center for Biotechnology Information Sequence Read Archive (accession IDs ERR194160 and ERR194161) [37].

Illumina sequencing data from DNA extracted from 12 samples from 11 Bronze Age Eurasian humans [19] were downloaded from the European Nucleotide Archive (project accession ID PRJEB9021) [38]. We downloaded mapped reads in BAM format for samples RISE109, RISE154, RISE240, RISE247, RISE480, RISE483, RISE507, RISE508, RISE510, RISE546, RISE554, and RISE586.

### Availability of supporting data

Software written for this manuscript is available at http://github.com/svohr/tilde.