Here you can find a full report on loading about 64,000 samples into the Genomics England Research environment.
The platform used for this case study consists of a Hadoop cluster of 35 nodes (5 master + 30 worker) and an LSF queue system:
Node | #nodes | cores | memory (GB) |
---|---|---|---|
LSF queue node for load | 10 | 12 | 364 |
Hadoop master nodes | 5 | 28 | 216 |
Hadoop worker nodes | 30 | 28 | 216 |
The data for this case study comprises a total of 64,078 samples divided into four datasets:
Dataset | Alias | Files | Samples | Samples per file | Variants |
---|---|---|---|---|---|
Rare Disease GRCh38 | RD38 | 16,591 | 33,180 | 2.00 | 437,740,498 |
Cancer Germline GRCh38 | CG38 | 9,167 | 9,167 | 1.00 | 286,136,051 |
Cancer Somatic GRCh38 | CS38 | 9,589 | 9,589 | 1.00 | 398,402,166 |
Rare Disease GRCh37 | RD37 | 5,329 | 12,142 | 2.28 | 298,763,059 |
Total | | 40,676 | 64,078 | | 1,421,041,774 |
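As a quick sanity check, the totals and the samples-per-file ratios in the table above can be recomputed from the per-dataset rows. The snippet below is only an illustrative sketch: the names `datasets` and `summarize` are hypothetical and not part of any tool used in this case study, and the figures are copied by hand from the table.

```python
# Per-dataset figures copied from the table above: (files, samples, variants).
# The dictionary name and layout are illustrative assumptions.
datasets = {
    "RD38": (16_591, 33_180, 437_740_498),
    "CG38": (9_167, 9_167, 286_136_051),
    "CS38": (9_589, 9_589, 398_402_166),
    "RD37": (5_329, 12_142, 298_763_059),
}


def summarize(table):
    """Return total files, samples, variants, and per-dataset samples-per-file ratios."""
    total_files = sum(files for files, _, _ in table.values())
    total_samples = sum(samples for _, samples, _ in table.values())
    total_variants = sum(variants for _, _, variants in table.values())
    # Samples per file, rounded to two decimals as in the table.
    ratios = {
        alias: round(samples / files, 2)
        for alias, (files, samples, _) in table.items()
    }
    return total_files, total_samples, total_variants, ratios


files, samples, variants, ratios = summarize(datasets)
print(files, samples, variants)
print(ratios)
```

The recomputed sums should match the Total row, and the ratios should match the "Samples per file" column.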
Table of Contents: