Here you can find a full report about loading 62,000 samples for the Genomics England Research environment.

Platform

A 30-node Hadoop cluster ...

Data

62,000 genomes organised in...

* GRCh37 Germline - LOADING
- RD37 (~5,000 multisample VCF files)
* GRCh38 Germline - LOADED, STATS and ANNOTATED
- RD38 (~16,000 multisample VCF files)
- CG38 (~10,000 VCF files)
* GRCh38 Somatic - LOADED, STATS and ANNOTATED
- CS38 (~10,000 VCF files)

The platform used for this case study consists of a Hadoop cluster of 35 nodes (5 + 30) and an LSF queue system:

Node	Nodes	Cores	Memory (GB)
LSF queue node for load	10	12	364
Hadoop master nodes	5	28	216
Hadoop worker nodes	30	28	216
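As a rough aid for reading the table above, the following Python sketch computes the aggregate cores and memory per node group. It assumes the cores and memory columns are per-node figures, which the table does not state explicitly; the `NODE_GROUPS` and `aggregate` names are illustrative only.

```python
# Figures copied from the platform table: (node count, cores, memory in GB).
# ASSUMPTION: cores and memory are per-node values; the table does not say so.
NODE_GROUPS = {
    "LSF queue node for load": (10, 12, 364),
    "Hadoop master nodes": (5, 28, 216),
    "Hadoop worker nodes": (30, 28, 216),
}


def aggregate(count, cores, memory_gb):
    """Return (total cores, total memory in GB) for one node group."""
    return count * cores, count * memory_gb


# Aggregate capacity of every node group.
capacity = {name: aggregate(*spec) for name, spec in NODE_GROUPS.items()}

# Size of the Hadoop cluster itself (masters + workers), excluding the LSF queue.
hadoop_nodes = sum(
    spec[0] for name, spec in NODE_GROUPS.items() if name.startswith("Hadoop")
)

if __name__ == "__main__":
    for name, (cores, memory_gb) in capacity.items():
        print(f"{name}: {cores} cores, {memory_gb} GB")
    print(f"Hadoop nodes: {hadoop_nodes}")
```

Under that per-node assumption, the Hadoop node count from the table matches the 35 nodes (5 + 30) quoted in the text.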

Data

The data for this case study comprise a total of 64,078 samples divided into 4 datasets.

Dataset	Alias	Files	Samples	Samples per file	Variants
Rare Disease GRCh38	RD38	16,591	33,180	2.00	437,740,498
Cancer Germline GRCh38	CG38	9,167	9,167	1.00	286,136,051
Cancer Somatic GRCh38	CS38	9,589	9,589	1.00	398,402,166
Rare Disease GRCh37	RD37	5,329	12,142	2.28	298,763,059
Total		40,676	64,078		1,421,041,774
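The derived columns in the table above can be re-checked with a short Python sketch. The figures are copied from the table; the `DATASETS`, `samples_per_file` and `totals` names are illustrative and not part of any tool used in the case study.

```python
# Per-dataset figures copied from the table: (files, samples, variants).
DATASETS = {
    "RD38": (16_591, 33_180, 437_740_498),
    "CG38": (9_167, 9_167, 286_136_051),
    "CS38": (9_589, 9_589, 398_402_166),
    "RD37": (5_329, 12_142, 298_763_059),
}


def samples_per_file(files, samples):
    """Average number of samples per VCF file, rounded to two decimals."""
    return round(samples / files, 2)


def totals(table):
    """Column-wise sums of files, samples and variants over all datasets."""
    files = sum(row[0] for row in table.values())
    samples = sum(row[1] for row in table.values())
    variants = sum(row[2] for row in table.values())
    return files, samples, variants


ratios = {alias: samples_per_file(f, s) for alias, (f, s, _) in DATASETS.items()}
total_files, total_samples, total_variants = totals(DATASETS)

if __name__ == "__main__":
    for alias, ratio in ratios.items():
        print(f"{alias}: {ratio:.2f} samples per file")
    print(f"Total: {total_files:,} files, {total_samples:,} samples, "
          f"{total_variants:,} variants")
```

Note that the variants total is a plain sum of the per-dataset counts, so a variant present in more than one dataset is counted once per dataset.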

Loading Data

Query Performance

Common Queries

Clinical Queries
