Table 1 Descriptive statistics of the gold standard sequencing datasets used in the study

From: Systematic benchmark of state-of-the-art variant calling pipelines identifies major factors affecting accuracy of coding sequence variant discovery

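| Sample | Type | Source (SRA ID) | Mean coverage | Fraction of 10x bases | Median called variant count^b | True variant count |
|---|---|---|---|---|---|---|
| HG001 | WGS | GIAB FTP | 22.2 | 0.987 | 20,422 | 20,444 |
| HG001 | WES | ERR1905890 | 248.8^a | 0.985^a | 19,875 | 20,444 |
| HG002 | WGS | GIAB FTP | 23.2 | 0.990 | 20,651 | 20,647 |
| HG002 | WES | SRR2962669 | 241.4 | 0.987 | 20,048 | 20,647 |
| HG003 | WGS | GIAB FTP | 23.2 | 0.987 | 20,623 | 20,660 |
| HG003 | WES | SRR2962692 | 203.9 | 0.987 | 20,046 | 20,660 |
| HG004 | WGS | GIAB FTP | 22.8 | 0.990 | 20,729 | 20,745 |
| HG004 | WES | SRR2962694 | 228.4 | 0.987 | 20,112 | 20,745 |
| HG005 | WGS | GIAB FTP | 37.3 | 0.998 | 20,650 | 20,620 |
| HG005 | WES | SRR2962693 | 195.5 | 0.985 | 19,969 | 20,620 |
| HG006 | WGS | GIAB FTP | 25.5 | 0.991 | 20,320 | 20,354 |
| HG006 | WES | SRR14724507 | 183.2 | 0.955 | 19,650 | 20,354 |
| HG007 | WGS | GIAB FTP | 25.6 | 0.992 | 20,483 | 20,526 |
| HG007 | WES | SRR14724506 | 176 | 0.957 | 19,793 | 20,526 |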

Coverage and variant statistics are given with respect to GIAB v4.2 high-confidence CDS regions.
^a Coverage values are given for the downsampled HG001 WES dataset (see Methods).
^b Variant counts were obtained by calculating the median number of variants discovered by the different pipelines.
Full statistics for each sample and tool combination are given in Supplementary Table S1.