One of the goals of The 100,000 Genomes Project from Genomics England is to enable new medical research. Researchers will study how best to use genomics in healthcare and how best to interpret the data to help patients. The causes, diagnosis and treatment of disease will also be investigated. This is currently the largest national sequencing project of its kind in the world.

To achieve this goal, Genomics England set up a Research Environment for researchers and installed OpenCGA, CellBase and IVA from OpenCB. We have loaded 64,078 whole genomes into OpenCGA. In total, more than 1 billion unique variants have been indexed in the OpenCGA Variant Storage, and all metadata and clinical data for samples and patients have been loaded into OpenCGA Catalog. Variants were annotated using CellBase, and the IVA front-end was installed to analyse and visualise the data. Here you can find a full report on the loading and analysis of the 64,078 genomes.

Genomic and Clinical Data

The variants and clinical data of 64,078 participant genomes have been loaded and indexed in OpenCGA. In total they represent more than 30,000 compressed VCF files, accounting for about 40TB of disk space. The data is divided into four datasets depending on the genome assembly and the type of study (germline or somatic), and has been organised in OpenCGA into three Projects and four Studies:

| Project | Study ID and Name | VCF Files | VCF File Type | Samples | Samples per file | Variants |
| --- | --- | --- | --- | --- | --- | --- |
| GRCh37 Germline | RD37: Rare Disease GRCh37 | 5,329 | Multi sample | 12,142 | 2.28 | 298,763,059 |
| GRCh38 Germline | RD38: Rare Disease GRCh38 | 16,591 | Multi sample | 33,180 | 2.00 | 437,740,498 |
| GRCh38 Germline | CG38: Cancer Germline GRCh38 | 9,167 | Single sample | 9,167 | 1.00 | 286,136,051 |
| GRCh38 Somatic | CS38: Cancer Somatic GRCh38 | 9,589 | Somatic | 9,589 | 1.00 | 398,402,166 |
| Total | | 40,676 | | 64,078 | | 1,421,041,774 |

Each dataset is loaded in OpenCGA as a different Study, grouped in 3 different Projects, depending on the type of data (germline / somatic), and assembly (GRCh37 / GRCh38).
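The totals and the samples-per-file ratios in the table can be cross-checked with a few lines of Python. This is only a sketch: the figures are copied from the table above, and the tuple layout and the `summarise` helper are illustrative names, not part of OpenCGA.

```python
# Per-study figures copied from the table above:
# (project, study, vcf_files, samples, variants)
STUDIES = [
    ("GRCh37 Germline", "RD37", 5_329, 12_142, 298_763_059),
    ("GRCh38 Germline", "RD38", 16_591, 33_180, 437_740_498),
    ("GRCh38 Germline", "CG38", 9_167, 9_167, 286_136_051),
    ("GRCh38 Somatic", "CS38", 9_589, 9_589, 398_402_166),
]


def summarise(studies):
    """Return overall totals and the samples-per-file ratio of each study."""
    totals = {
        "projects": len({row[0] for row in studies}),
        "studies": len(studies),
        "vcf_files": sum(row[2] for row in studies),
        "samples": sum(row[3] for row in studies),
        # Sum of the per-study variant counts, as in the table's Total row.
        "variants": sum(row[4] for row in studies),
    }
    ratios = {row[1]: round(row[3] / row[2], 2) for row in studies}
    return totals, ratios


totals, ratios = summarise(STUDIES)
print(totals)
print(ratios)
```

The sums reproduce the Total row (40,676 files, 64,078 samples, 1,421,041,774 variants), and the four studies fall into three distinct projects.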

Platform

The platform consists of a Hadoop cluster of about 30 nodes running Hortonworks 2.6.5 (with HBase 1.1.2) and an LSF queue for loading all the VCF files. The following table gives more detail:

| Node | Nodes | Cores | Memory (GB) | Storage (TB) |
| --- | --- | --- | --- | --- |
| Hadoop Master | 5 | 28 | 216 | 7.2 (6x1.2) |
| Hadoop Worker | 30 | 28 | 216 | 7.2 (6x1.2) |
| LSF Loading Queue | 10 | 12 | 364 | Isilon storage |


OpenCGA 1.4.2 with the Hadoop Variant Storage has been used.

Loading Data

Data ingestion runs on the LSF nodes, which are connected directly to the Hadoop platform. This configuration allows us to load multiple files at the same time.

Having 10 worker nodes in the queue, each of them loading up to 6 files at the same time, results in 60 files being loaded concurrently.

Multi sample files

The files from the Rare Disease studies (RD38 and RD37) contain more than one sample per file, with an average of about 2 samples per file.


| Metric | Value |
| --- | --- |
| Concurrent files loaded | 60 |
| Average files loaded per hour | 125.72 |
| Load time per file | 00:28:38 |

Single sample files

The files from the Cancer Germline study (CG38) contain one sample per file. These files are smaller than the Rare Disease files, so they load faster, taking roughly half the time per file.

| Metric | Value |
| --- | --- |
| Concurrent files loaded | 60 |
| Average files loaded per hour | 242.05 |
| Load time per file | 00:14:52 |

Query Performance

We distinguish two types of queries: General and Clinical.

General Queries

These queries filter only by variant annotation and cohort stats. They include only aggregated data and do not return sample genotypes.

| Filter | Results | Total Results | Time |
| --- | --- | --- | --- |
| consequence type = LoF + missense_variant | 10 | 3,704,626 | 0.189s |
| consequence type = LoF + missense_variant; biotype = protein_coding | 10 | 3,576,472 | 0.260s |
| panel with 200 genes | 10 | 3,882,902 | 0.299s |
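As an illustration of what a general query might look like over REST, the sketch below only builds the request URL and sends nothing. Everything specific in it is an assumption to be checked against the installed OpenCGA version: the host name is hypothetical, the endpoint path `analysis/variant/query` and the parameter names `study`, `ct`, `biotype` and `limit` are assumed, and the consequence types passed in are placeholders rather than the LoF definition used for the measurements above.

```python
from urllib.parse import urlencode

# Hypothetical base URL; the real host is not given in this report.
BASE_URL = "https://opencga.example.org/opencga/webservices/rest/v1"


def variant_query_url(study, consequence_types, biotype=None, limit=10):
    """Build a variant query URL (endpoint and parameter names are assumed)."""
    params = {
        "study": study,
        "ct": ",".join(consequence_types),  # assumed comma-separated list
        "limit": limit,                     # number of results to return
    }
    if biotype:
        params["biotype"] = biotype
    return f"{BASE_URL}/analysis/variant/query?{urlencode(params)}"


# Placeholder consequence types, for illustration only.
url = variant_query_url(
    "RD38", ["stop_gained", "missense_variant"], biotype="protein_coding"
)
print(url)
```

Keeping `limit` small, as in the Results column above, means only the first page of variants is returned while the total count is still reported.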

Clinical Queries

Clinical queries, or sample queries, are constrained to return variants from a specific set of samples. These queries can use all the filters from the general queries. The result includes a ReportedEvent for each variant, which indicates possible conditions associated with the variant.

| Filter | Results | Total Results | Time |
| --- | --- | --- | --- |
| Segregation mode = biallelic; filter = PASS | 10 | 211,787 | 0.420s |
| Segregation mode = biallelic; filter = PASS | 2,000 | 211,787 | 1.079s |
| De novo variants; filter = PASS; consequence type = LoF + missense_variant | 24 | 24 | 0.680s |
| Compound Heterozygous; filter = PASS; biotype = protein_coding; consequence type = LoF + missense_variant | 717 | 717 | 10.995s |
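The two biallelic measurements differ only in the number of results returned (10 versus 2000), so they give a rough idea of what returning more variants costs. The sketch below assumes that the time grows linearly with the number of returned results. That is an assumption made for illustration, not something measured in this report, and `marginal_seconds_per_result` is an illustrative helper name.

```python
def marginal_seconds_per_result(small, large):
    """Slope of a straight line through two (results, seconds) measurements."""
    (n_small, t_small), (n_large, t_large) = small, large
    return (t_large - t_small) / (n_large - n_small)


# (results returned, query time in seconds) for the two biallelic queries
SMALL = (10, 0.420)
LARGE = (2000, 1.079)

per_result = marginal_seconds_per_result(SMALL, LARGE)
# Estimated fixed cost of the query, independent of how many rows come back.
fixed = SMALL[1] - SMALL[0] * per_result

print(f"about {per_result * 1000:.2f} ms per additional result")
print(f"about {fixed:.2f} s fixed cost per query")
```

Under that assumption, most of the time for a 10-result query is fixed cost, and each additional returned variant adds well under a millisecond.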
