Here you can find a full report on loading over 64,000 samples into the Genomics England Research Environment.

Platform

The platform used for this case study consists of a Hadoop cluster of 35 nodes (5 master + 30 worker) and an LSF queue system:

Node                    | # nodes | Cores | Memory (GB)
LSF queue node for load | 10      | 12    | 364
Hadoop master nodes     | 5       | 28    | 216
Hadoop worker nodes     | 30      | 28    | 216

Data

The data for this case study comprises a total of 64,078 samples divided into 4 datasets.

Dataset                | Alias | Files  | File type         | Samples | Samples per file | Variants
Rare Disease GRCh37    | RD37  | 5,329  | Multi sample VCF  | 12,142  | 2.28             | 298,763,059
Rare Disease GRCh38    | RD38  | 16,591 | Multi sample VCF  | 33,180  | 2.00             | 437,740,498
Cancer Germline GRCh38 | CG38  | 9,167  | Single sample VCF | 9,167   | 1.00             | 286,136,051
Cancer Somatic GRCh38  | CS38  | 9,589  | Somatic VCF       | 9,589   | 1.00             | 398,402,166
Total                  |       | 40,676 |                   | 64,078  |                  | 1,421,041,774


Each dataset is loaded into OpenCGA as a separate study.
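
As a sketch of what that looks like in practice, the snippet below indexes one VCF into its study through the OpenCGA command-line client. The exact `opencga.sh` subcommand and flags are assumptions here and differ between OpenCGA releases, and the file path is hypothetical; check the OpenCGA CLI documentation for your version.

# Minimal sketch: index one VCF file into an OpenCGA study.
# NOTE: the exact `opencga.sh` subcommand and flags are assumptions and
# vary between OpenCGA releases; treat this as illustrative only.
import subprocess

# Study aliases taken from the dataset table above.
STUDIES = ["RD37", "RD38", "CG38", "CS38"]

def index_vcf(study: str, vcf_path: str) -> None:
    """Index a single VCF into the variant storage of `study`."""
    subprocess.run(
        ["opencga.sh", "variant", "index",
         "--study", study,
         "--file", vcf_path],
        check=True,  # raise if the load fails
    )

if __name__ == "__main__":
    index_vcf("CG38", "/data/vcf/cg38/sample_0001.vcf.gz")  # hypothetical path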

Loading Data

The data ingestion runs on the LSF nodes, which connect directly to the Hadoop platform. This configuration allows multiple files to be loaded at the same time.

Having 10 LSF nodes in the queue, each of them loading up to 6 files at a time, results in 60 files being loaded concurrently.
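
The sketch below shows one way a single LSF node could keep 6 loads in flight; ten nodes each running this script give the 60-way concurrency described above. The file list is a placeholder, and the load command is the same assumed `opencga.sh` invocation as in the previous sketch.

# Per-node concurrency sketch: one LSF job runs this script and keeps up
# to 6 indexing processes alive at once. Ten such nodes give 60
# concurrent loads cluster-wide.
from concurrent.futures import ThreadPoolExecutor
import subprocess

FILES_PER_NODE = 6  # concurrent loads on one LSF node

def load_file(vcf_path: str) -> int:
    """Run one OpenCGA indexing process and return its exit code."""
    result = subprocess.run(
        ["opencga.sh", "variant", "index",   # assumed CLI syntax, see above
         "--study", "RD38", "--file", vcf_path])
    return result.returncode

def load_batch(vcf_paths: list[str]) -> None:
    # Threads are enough here: each worker just blocks on a subprocess.
    with ThreadPoolExecutor(max_workers=FILES_PER_NODE) as pool:
        for path, code in zip(vcf_paths, pool.map(load_file, vcf_paths)):
            status = "ok" if code == 0 else f"failed ({code})"
            print(f"{path}: {status}")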

Multi sample files

The files from the Rare Disease studies (RD38 & RD37) contain more than one sample per file, with an average of 2 samples per file.


Concurrent load files         | 60
Average files loaded per hour | 125.72
Load time per file            | 00:28:38
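
The throughput figure is consistent with the concurrency and per-file load time above; a quick back-of-envelope check:

# Sanity check: 60 files in flight, 28 min 38 s per file.
concurrent_files = 60
load_seconds = 28 * 60 + 38                # 00:28:38 -> 1,718 s
files_per_hour = concurrent_files * 3600 / load_seconds
print(f"{files_per_hour:.2f} files/hour")  # ~125.73, close to the 125.72 above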

Single sample files

The files from the Cancer Germline study (CG38) contain one sample per file. Compared with the Rare Disease files, these are smaller in size, so the load is faster: roughly twice the throughput.

Concurrent load files         | 60
Average files loaded per hour | 242.05
Load time per file            | 00:14:52
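
The same arithmetic holds here, and it also gives a rough estimate of the wall time needed for the whole CG38 dataset. This estimate is not a figure reported in the study, just the file count from the dataset table divided by the measured rate:

# Same check for single-sample files, plus a rough wall-time estimate.
concurrent_files = 60
load_seconds = 14 * 60 + 52                # 00:14:52 -> 892 s
files_per_hour = concurrent_files * 3600 / load_seconds
print(f"{files_per_hour:.2f} files/hour")  # ~242.15, close to the 242.05 above

cg38_files = 9_167                         # from the dataset table
print(f"~{cg38_files / files_per_hour:.0f} h to load CG38")  # ~38 hours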

Query Performance

Common Queries


Clinical Queries


