Table 2 Detailed inspection of similar file pairs

From: Fast probabilistic file fingerprinting for big data

Dataset File pair and remarks File sizes (in MB) δ
E61/fa Homo_sapiens.GRCh37.61.dna_rm.chromosome.HSCHR6_MHC_SSTO.fa 166.04 0.015
  Homo_sapiens.GRCh37.61.dna_rm.chromosome.HSCHR6_MHC_MANN.fa 166.06  
  These are two alternative haplotype "patch" files for the same chromosome locus. The dataset contains 11 other examples of similar file pairs with δ < 0.06 (when unpacked). All are related to the alternative haplotypes for the MHC locus. The next most similar pair of files has δ > 0.8.   
GPL570/cel GSM405175.CEL 12.93 8e-6
  GSM341406.CEL 12.93  
  The second file differs from the first by a single Affymetrix probe measurement. According to GEO metadata, the two files are simply different packagings of the same experimental data by two researchers. The GPL570 dataset contains 9 other examples of similar file pairs with δ < 0.002. The next most similar pair of files has δ > 0.3.   
GPL570/cel.gz GSM405175.CEL.gz 4.31 6e-4
  GSM341406.CEL.gz 4.31  
  A gzip-compressed version of the pair above; the same remarks apply. The most similar pair of genuinely different data files has δ > 0.9.   
BioC2.7/BSGenome/u BSgenome.Athaliana.TAIR.01222004/extdata/chr1.rda 29.04 2e-4
  BSgenome.Athaliana.TAIR.04232008/extdata/chr1.rda 29.04  
  Consecutive versions of the A. thaliana reference genome. The next most similar file pair in this dataset has δ > 0.5. Note that the compressed versions of the same files have δ > 0.9.   
  1. The table lists the suspiciously similar pairs of files from the studied datasets.