Here you can find a full report of about loading 62,000 samples for Genomics England Research environment.

Platform

The platform used for this case study consists on a Hadoop Cluster of 35 nodes (5 + 30) and a LSF queue system:

Node	#nodes	cores	memory (GB)
LSF queue node for load	10	12	364
Hadoop master nodes	5	28	216
Hadoop worker nodes	30	28	216

Data

The data of this case study contains a total of 64,078 samples divided in 4 different datasets.

Dataset	Alias	Files	Samples	Samples per file	Variants
Rare Disease GRCh38	RD38	16,591	33,180	2.00	437,740,498
Cancer Germline GRCh38	CG38	9,167	9,167	1.00	286,136,051
Cancer Somatic GRCh38	CS38	9,589	9,589	1.00	398,402,166
Rare Disease GRCh37	RD37	5,329	12,142	2.28	298,763,059
Total		40,676	64,078		1,421,041,774