Table 1 Variability in biological data

From: Fast probabilistic file fingerprinting for big data

| Dataset | Description | File type | Number of files | Total size (GB) | Min file size (MB) | Max file size (MB) | δ |
|---|---|---|---|---|---|---|---|
| E61/dat | Ensembl v61 genome annotation (DAT) and DNA sequence (FASTA) files in both compressed (gzip) and uncompressed forms. | dat | 5544 | 169.57 | 5.04 | 1385.14 | 0.782 |
| E61/dat.gz | | dat.gz | 5544 | 42.92 | 1.02 | 400.21 | 0.996 |
| E61/fa | | fa | 1484 | 498.51 | 3.47 | 13306.96 | 0.015 |
| E61/fa.gz | | fa.gz | 1484 | 95.25 | 1.0 | 973.15 | 0.594 |
| GPL570/cel | Microarray files for the HG U133 Plus chip from GEO (all files of GPL570 platform as of 03.2011). Affymetrix CEL and CHP format files, in compressed (gzip) and uncompressed form. | cel | 59892 | 1022.29 | 1.92 | 173.27 | 0.000 |
| GPL570/cel.gz | | cel.gz | 59892 | 330.09 | 1.13 | 48.84 | 0.000 |
| GPL570/chp | | chp | 2535 | 63.30 | 1.67 | 36.50 | 0.209 |
| GPL570/chp.gz | | chp.gz | 2535 | 26.36 | 1.02 | 23.05 | 0.995 |
| BioC2.7/BSGenome | Raw DNA sequence from the Bioconductor package BSGenome, in compressed and uncompressed forms. | rda | 513 | 8.45 | 1.00 | 117.17 | 0.981 |
| BioC2.7/BSGenome/u | | unpacked | 513 | 32.41 | 1.62 | 447.40 | 0.000 |
| YaleTFBS/bedGraph4 | Raw ChIP-seq data from the YaleTFBS dataset of the ENCODE project. Four different file types, both in compressed and uncompressed forms. | bedGraph4 | 171 | 139.91 | 216.73 | 2447.62 | 0.924 |
| YaleTFBS/bedGraph4.gz | | bedGraph4.gz | 171 | 31.45 | 52.89 | 551.80 | 0.996 |
| YaleTFBS/fastq | | fastq | 388 | 541.99 | 199.25 | 4469.89 | 0.919 |
| YaleTFBS/fastq.gz | | fastq.gz | 388 | 160.75 | 49.55 | 1564.84 | 0.996 |
| YaleTFBS/tagAlign | | tagAlign | 520 | 279.45 | 79.95 | 2357.32 | 0.544 |
| YaleTFBS/tagAlign.gz | | tagAlign.gz | 520 | 96.70 | 27.86 | 815.63 | 0.994 |
| YaleTFBS/wig | | wig | 33 | 10.66 | 188.92 | 693.66 | 0.912 |
| YaleTFBS/wig.gz | | wig.gz | 33 | 3.27 | 59.76 | 207.93 | 0.996 |

  1. Measurements of δ-variability in several biological datasets. An exact description of the experiment is available in the online Supplementary material [14].