Here you can find a full report about loading 62,000 samples for the Genomics England Research environment.
62,000 genomes organised in...
- GRCh37 Germline - LOADING
  - RD37 (~5,000 multisample VCF files)
- GRCh38 Germline - LOADED, STATS and ANNOTATED
  - RD38 (~16,000 multisample VCF files)
  - CG38 (~10,000 VCF files)
- GRCh38 Somatic - LOADED, STATS and ANNOTATED
  - CS38 (~10,000 VCF files)
Platform
The platform used for this case study consists of a Hadoop cluster of 35 nodes (5 master + 30 worker) and an LSF queue system:
Node | #nodes | cores | memory (GB) |
---|---|---|---|
LSF queue node for load | 10 | 12 | 364 |
Hadoop master nodes | 5 | 28 | 216 |
Hadoop worker nodes | 30 | 28 | 216 |
Data
The data for this case study comprises a total of 64,078 samples divided into 4 datasets.
Dataset | Alias | Files | Samples | Samples per file | Variants |
---|---|---|---|---|---|
Rare Disease GRCh38 | RD38 | 16,591 | 33,180 | 2.00 | 437,740,498 |
Cancer Germline GRCh38 | CG38 | 9,167 | 9,167 | 1.00 | 286,136,051 |
Cancer Somatic GRCh38 | CS38 | 9,589 | 9,589 | 1.00 | 398,402,166 |
Rare Disease GRCh37 | RD37 | 5,329 | 12,142 | 2.28 | 298,763,059 |
Total | | 40,676 | 64,078 | | 1,421,041,774 |
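As a quick sanity check on the table above, the following sketch recomputes the "Samples per file" column and the Total row from the per-dataset figures. The names `DATASETS`, `samples_per_file` and `totals` are hypothetical helpers introduced only for this illustration; they are not part of any tool used in the case study. Note that the variant total is a plain sum of the per-dataset counts, so a variant present in several datasets is counted once per dataset.

```python
# Figures copied from the dataset table: alias -> (files, samples, variants).
DATASETS = {
    "RD38": (16_591, 33_180, 437_740_498),
    "CG38": (9_167, 9_167, 286_136_051),
    "CS38": (9_589, 9_589, 398_402_166),
    "RD37": (5_329, 12_142, 298_763_059),
}


def samples_per_file(files: int, samples: int) -> float:
    """Average number of samples per VCF file, rounded to two decimals."""
    return round(samples / files, 2)


def totals(datasets: dict) -> tuple:
    """Column-wise sums of files, samples and variants across all datasets."""
    files = sum(f for f, _, _ in datasets.values())
    samples = sum(s for _, s, _ in datasets.values())
    variants = sum(v for _, _, v in datasets.values())
    return files, samples, variants


if __name__ == "__main__":
    for alias, (files, samples, _) in DATASETS.items():
        print(f"{alias}: {samples_per_file(files, samples):.2f} samples per file")
    print("Totals (files, samples, variants):", totals(DATASETS))
```

The recomputed sums can be compared against the Total row, and the per-dataset ratios against the "Samples per file" column, whenever a dataset is added or its counts are updated.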
Loading Data
Query Performance
Common Queries
Clinical Queries