Table 1 Variability in biological data

From: Fast probabilistic file fingerprinting for big data

| Dataset | Description | File type | Number of files | Total size (GB) | Min file size (MB) | Max file size (MB) | δ |
|---|---|---|---|---|---|---|---|
| E61/dat | Ensembl v61 genome annotation (DAT) and DNA sequence (FASTA) files in both compressed (gzip) and uncompressed forms. | dat | 5544 | 169.57 | 5.04 | 1385.14 | 0.782 |
| E61/dat.gz | | dat.gz | 5544 | 42.92 | 1.02 | 400.21 | 0.996 |
| E61/fa | | fa | 1484 | 498.51 | 3.47 | 13306.96 | 0.015 |
| E61/fa.gz | | fa.gz | 1484 | 95.25 | 1.0 | 973.15 | 0.594 |
| GPL570/cel | Microarray files for the HG U133 Plus chip from GEO (all files of the GPL570 platform as of 03.2011). Affymetrix CEL and CHP format files, in compressed (gzip) and uncompressed forms. | cel | 59892 | 1022.29 | 1.92 | 173.27 | 0.000 |
| GPL570/cel.gz | | cel.gz | 59892 | 330.09 | 1.13 | 48.84 | 0.000 |
| GPL570/chp | | chp | 2535 | 63.30 | 1.67 | 36.50 | 0.209 |
| GPL570/chp.gz | | chp.gz | 2535 | 26.36 | 1.02 | 23.05 | 0.995 |
| BioC2.7/BSGenome | Raw DNA sequence from the Bioconductor package BSGenome, in compressed and uncompressed forms. | rda | 513 | 8.45 | 1.00 | 117.17 | 0.981 |
| BioC2.7/BSGenome/u | | unpacked | 513 | 32.41 | 1.62 | 447.40 | 0.000 |
| YaleTFBS/bedGraph4 | Raw ChIP-seq data from the YaleTFBS dataset of the ENCODE project. Four different file types, in both compressed and uncompressed forms. | bedGraph4 | 171 | 139.91 | 216.73 | 2447.62 | 0.924 |
| YaleTFBS/bedGraph4.gz | | bedGraph4.gz | 171 | 31.45 | 52.89 | 551.80 | 0.996 |
| YaleTFBS/fastq | | fastq | 388 | 541.99 | 199.25 | 4469.89 | 0.919 |
| YaleTFBS/fastq.gz | | fastq.gz | 388 | 160.75 | 49.55 | 1564.84 | 0.996 |
| YaleTFBS/tagAlign | | tagAlign | 520 | 279.45 | 79.95 | 2357.32 | 0.544 |
| YaleTFBS/tagAlign.gz | | tagAlign.gz | 520 | 96.70 | 27.86 | 815.63 | 0.994 |
| YaleTFBS/wig | | wig | 33 | 10.66 | 188.92 | 693.66 | 0.912 |
| YaleTFBS/wig.gz | | wig.gz | 33 | 3.27 | 59.76 | 207.93 | 0.996 |

  1. Measurements of δ-variability in several biological datasets. An exact description of the experiment is available in the Supplementary material online [14].