Table 2 Detailed inspection of similar file pairs

From: Fast probabilistic file fingerprinting for big data

| Dataset | File pair and remarks | File sizes (in MB) | δ |
|---|---|---|---|
| E61/fa | Homo_sapiens.GRCh37.61.dna_rm.chromosome.HSCHR6_MHC_SSTO.fa | 166.04 | 0.015 |
| | Homo_sapiens.GRCh37.61.dna_rm.chromosome.HSCHR6_MHC_MANN.fa | 166.06 | |
| | These are two alternative haplotype "patch" files for the same chromosome locus. The dataset contains 11 other examples of similar file pairs with δ < 0.06 (when unpacked). All are related to the alternative haplotypes for the MHC locus. The next most similar pair of files has δ > 0.8. | | |
| GPL570/cel | GSM405175.CEL | 12.93 | 8e-6 |
| | GSM341406.CEL | 12.93 | |
| | The second file differs from the first by a single Affymetrix probe measurement. According to GEO metadata, the two files are simply different packagings of the same experimental data by two researchers. The GPL570 dataset contains 9 other examples of similar file pairs with δ < 0.002. The next most similar pair of files has δ > 0.3. | | |
| GPL570/cel.gz | GSM405175.CEL.gz | 4.31 | 6e-4 |
| | GSM341406.CEL.gz | 4.31 | |
| | A gzip-compressed version of the pair above. The same remarks apply. The most similar pair of actually different data files has δ > 0.9. | | |
| BioC2.7/BSGenome/u | BSgenome.Athaliana.TAIR.01222004/extdata/chr1.rda | 29.04 | 2e-4 |
| | BSgenome.Athaliana.TAIR.04232008/extdata/chr1.rda | 29.04 | |
| | Consecutive versions of the A. thaliana reference genome. The next most similar file pair in this dataset has δ > 0.5. Note that the compressed versions of the same files have δ > 0.9. | | |
  
  1. The table lists the suspiciously similar pairs of files from the studied datasets.