One of the aims of The 100,000 Genomes Project from Genomics England is to enable new medical research. Researchers will study how best to use genomics in healthcare and how best to interpret the data to help patients. The causes, diagnosis and treatment of disease will also be investigated. This is currently the largest national sequencing project of its kind in the world.

To provide researchers with the required research environment, the data was indexed in OpenCGA using the Hadoop big data platform.

Here you can find a full report about loading 64,078 samples for the Genomics England Research environment.


The platform used for this case study consists of a Hadoop cluster of 35 nodes (5 master + 30 worker) and an LSF queue system:

Node                      #nodes  cores  memory (GB)
LSF queue node for load       10     12          364
Hadoop master nodes            5     28          216
Hadoop worker nodes           30     28          216


The data for this case study contains a total of 64,078 samples divided into 4 datasets.

Dataset                 Alias  Files   File type          Samples  Samples per file  Variants
Rare Disease GRCh37     RD37    5,329  Multi sample VCF    12,142              2.28  298,763,059
Rare Disease GRCh38     RD38   16,591  Multi sample VCF    33,180              2.00  437,740,498
Cancer Germline GRCh38  CG38    9,167  Single sample VCF    9,167              1.00  286,136,051
Cancer Somatic GRCh38   CS38    9,589  Somatic VCF          9,589              1.00  398,402,166
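As a quick consistency check on the table above, the samples-per-file ratios and the 64,078 total can be recomputed from the Files and Samples columns. This is only a minimal sketch; the variable names are illustrative and are not part of OpenCGA.

```python
# Files and samples per dataset, copied from the table above.
datasets = {
    "RD37": {"files": 5_329, "samples": 12_142},
    "RD38": {"files": 16_591, "samples": 33_180},
    "CG38": {"files": 9_167, "samples": 9_167},
    "CS38": {"files": 9_589, "samples": 9_589},
}

# Samples per file, rounded to two decimals as in the table.
samples_per_file = {
    alias: round(d["samples"] / d["files"], 2) for alias, d in datasets.items()
}

# Totals across the four datasets.
total_samples = sum(d["samples"] for d in datasets.values())
total_files = sum(d["files"] for d in datasets.values())

print(samples_per_file)
print(total_samples, total_files)
```

The sample total matches the 64,078 figure quoted in the text.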

Each dataset is loaded in OpenCGA as a study. The studies are grouped into 3 projects according to the type of data (germline / somatic) and the assembly (GRCh37 / GRCh38).
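The grouping of four studies into three projects can be sketched as follows. The report does not list which study belongs to which project, so the assignment below is an assumption inferred from the dataset names (both Rare Disease and Cancer Germline treated as germline data); the tuple layout is purely illustrative.

```python
from collections import defaultdict

# (study alias, data type, assembly).
# Assumption: Rare Disease and Cancer Germline are both germline data.
studies = [
    ("RD37", "germline", "GRCh37"),
    ("RD38", "germline", "GRCh38"),
    ("CG38", "germline", "GRCh38"),
    ("CS38", "somatic", "GRCh38"),
]

# One project per distinct (data type, assembly) combination.
projects = defaultdict(list)
for alias, data_type, assembly in studies:
    projects[(data_type, assembly)].append(alias)

for key, aliases in projects.items():
    print(key, aliases)
```

Under this assumption RD38 and CG38 share a project, which yields the 3 projects mentioned above.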

Loading Data

The data ingestion is executed in the LSF nodes, connected directly to the Hadoop platform. This configuration allows us to run multiple files at the same time.

With 10 LSF nodes in the queue, each loading up to 6 files at the same time, up to 60 files are loaded concurrently.

Multi sample files

The files from the Rare Disease studies (RD38 and RD37) contain more than one sample per file: on average, about 2 samples per file.

Concurrent files loaded          60
Average files loaded per hour    125.72
Load time per file               00:28:38

Single sample files

The files from the Cancer Germline study (CG38) contain one sample per file. These files are smaller than the Rare Disease files, so they load faster.

Concurrent files loaded          60
Average files loaded per hour    242.05
Load time per file               00:14:52
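The figures in the two load tables appear consistent with a steady-state relation: load time per file is roughly the number of concurrent files divided by the files loaded per hour. The report does not state this relation; it is an inference from the numbers, and the function name below is hypothetical.

```python
from datetime import timedelta

# Concurrency as described in the Loading Data section.
LSF_NODES = 10
FILES_PER_NODE = 6
concurrent_files = LSF_NODES * FILES_PER_NODE


def load_time_per_file(concurrent: int, files_per_hour: float) -> timedelta:
    """Average time one file spends loading, assuming a steady-state queue."""
    seconds = concurrent / files_per_hour * 3600
    # Truncate to whole seconds to match the hh:mm:ss format of the tables.
    return timedelta(seconds=int(seconds))


multi_sample = load_time_per_file(concurrent_files, 125.72)
single_sample = load_time_per_file(concurrent_files, 242.05)

print(multi_sample)
print(single_sample)
```

Both results agree with the reported load times per file (00:28:38 and 00:14:52).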

Query Performance

We distinguish two types of queries: general and clinical.

General Queries

These queries filter only by variant annotation and cohort stats. They return aggregated data only, without sample genotypes.

Filter                                                                Results  Total Results  Time
consequence type = LoF + missense_variant
consequence type = LoF + missense_variant, biotype = protein_coding
panel with 200 genes


Clinical Queries

Clinical queries, or sample queries, restrict the results to variants found in a specific set of samples. These queries can use all the filters from the general queries. The result includes a ReportedEvent for each variant, which identifies possible conditions associated with the variant.

Filter                                                                                                     Results  Total Results  Time
Segregation mode = biallelic, filter = PASS
Segregation mode = biallelic, filter = PASS
De novo variants, filter = PASS, consequence type = LoF + missense_variant
Compound Heterozygous, filter = PASS, biotype = protein_coding, consequence type = LoF + missense_variant
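The segregation filters can be illustrated with a simplified genotype check on a proband and both parents. This is a sketch under stated assumptions, not the OpenCGA implementation: "biallelic" is read here as proband homozygous for the alternate allele with both parents heterozygous, "de novo" as an alternate allele in the proband that is absent from both parents, and all names and the record layout are hypothetical. Compound heterozygous is omitted because it needs pairs of variants in the same gene.

```python
def alleles(genotype):
    """Split a VCF-style genotype such as '0/1' or '0|1' into its alleles."""
    return genotype.replace("|", "/").split("/")


def is_hom_ref(genotype):
    return all(a == "0" for a in alleles(genotype))


def is_het(genotype):
    # Simplified: exactly one reference allele plus one other called allele.
    a = alleles(genotype)
    return "." not in a and len(set(a)) == 2 and "0" in a


def is_hom_alt(genotype):
    a = alleles(genotype)
    return len(set(a)) == 1 and a[0] not in ("0", ".")


def is_biallelic(proband, father, mother):
    # Assumption: recessive-like pattern, one alternate allele from each parent.
    return is_hom_alt(proband) and is_het(father) and is_het(mother)


def is_de_novo(proband, father, mother):
    carries_alt = is_het(proband) or is_hom_alt(proband)
    return carries_alt and is_hom_ref(father) and is_hom_ref(mother)


MODES = {"biallelic": is_biallelic, "de_novo": is_de_novo}


def clinical_match(record, mode):
    """Apply filter = PASS and then the chosen segregation mode."""
    if record["filter"] != "PASS":
        return False
    return MODES[mode](record["proband"], record["father"], record["mother"])


recessive = {"filter": "PASS", "proband": "1/1", "father": "0/1", "mother": "0|1"}
new_in_child = {"filter": "PASS", "proband": "0/1", "father": "0/0", "mother": "0/0"}

print(clinical_match(recessive, "biallelic"))
print(clinical_match(new_in_child, "de_novo"))
```

Records with a missing parental genotype (for example `./.`) match neither mode in this sketch; a real pipeline would need an explicit policy for them.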
