Genomic and Clinical Data

Genomic variants from 4,700 genomes were loaded and indexed in OpenCGA. In total we loaded almost 5,000 VCF files, accounting for about 20 TB of compressed disk space.

Platform

For this proof of concept (PoC) we used the development version of OpenCGA, v2.0.0-beta, with the Hadoop Variant Storage Engine, which uses Apache HBase as its back-end. We also used CellBase 4.6 for variant annotation.

For the platform we used a 10-node Azure HDInsight 3.6 cluster with Data Lake Storage Gen2. HDInsight 3.6 uses Hortonworks HDP 2.6.5 (with Hadoop 2.7.3 and HBase 1.1.2). We used Azure Batch to load all the VCF files concurrently; the files had previously been copied to an NFS server. The details are:

Node Type          Nodes  Azure Type        Cores  Memory (GB)  Storage
Hadoop Master      3      Standard_D12_V2   4      28           Data Lake Gen2
Hadoop Worker      10     Standard_DS13_V2  8      56           Data Lake Gen2
Azure Batch Queue  10     Standard_D4s_v3   4      16           NFS Server
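
To get a feel for the aggregate capacity of each node group, the table can be totalled as in the minimal sketch below. It assumes that the Cores and Memory (GB) columns are per-node values, which the table does not state explicitly; the NodeGroup class and the totals helper are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeGroup:
    name: str
    nodes: int
    cores: int      # cores per node (assumed to be a per-node value)
    memory_gb: int  # memory per node in GB (assumed to be a per-node value)


# Values copied from the platform table.
GROUPS = [
    NodeGroup("Hadoop Master", 3, 4, 28),
    NodeGroup("Hadoop Worker", 10, 8, 56),
    NodeGroup("Azure Batch Queue", 10, 4, 16),
]


def totals(group: NodeGroup) -> tuple[int, int]:
    """Return (total cores, total memory in GB) for one node group."""
    return group.nodes * group.cores, group.nodes * group.memory_gb


for group in GROUPS:
    cores, memory = totals(group)
    print(f"{group.name}: {cores} cores, {memory} GB")
```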


We evaluated the new HDInsight 4 but, after finding a few issues, we decided to use the more stable HDInsight 3.6 (HDI3.6) over Data Lake Gen2 (DL2); we will refer to this setup as HDI3.6+DL2. We worked with Azure engineers to debug and fix these issues during the PoC, but unfortunately we did not have time to repeat the benchmark.


As you will see below in the Analysis Benchmark section, once we completed the PoC we repeated some tests with 20 worker nodes to study the performance improvement.


Table size

Table           Compression  Size (TB)
Variants table  GZ           2.9
Variants table  SNAPPY       4.7
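
The relative difference between the two compression codecs follows directly from the sizes above. The sketch below is only this arithmetic; the variable names are illustrative.

```python
# Sizes of the variants table, in TB, as reported in the table above.
GZ_TB = 2.9
SNAPPY_TB = 4.7

# Fraction of the SNAPPY size that the GZ table occupies.
ratio = GZ_TB / SNAPPY_TB

# Space saved by GZ relative to SNAPPY, as a percentage.
saving_pct = (1 - ratio) * 100

print(f"GZ/SNAPPY size ratio: {ratio:.2f}")
print(f"GZ saves about {saving_pct:.1f}% of the space used with SNAPPY")
```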


Genomics Data Loading and Indexing


Number of loaded files over time. Several sections with different performance can be distinguished.

The most representative section is the last one, where we upgraded the input disk to speed up reading. On average, with the improved disk and processing up to 20 files simultaneously, we obtained these numbers:



Step       Time      Time/nodes
Transform  00:29:36  00:01:28
Load       00:46:19  00:02:19
Total      01:15:55  00:03:48


Index speed:

  • 15.8 files/h
  • 379.4 files/day
  • 79.0 GB/h
  • 1.85 TB/day
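
These figures are consistent with dividing the elapsed time per file by the number of files processed concurrently. The sketch below reproduces that arithmetic under two assumptions that the text does not state explicitly: that the Time/nodes column is the elapsed time divided by 20 (the maximum number of simultaneous files), and that the daily volume uses binary units (1 TB = 1024 GB). The helper hms_to_seconds is a hypothetical name.

```python
def hms_to_seconds(hms: str) -> int:
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


CONCURRENT_FILES = 20  # assumption: divisor behind the Time/nodes column
GB_PER_HOUR = 79.0     # reported index speed, taken from the list above

transform_s = hms_to_seconds("00:29:36")
load_s = hms_to_seconds("00:46:19")
total_s = transform_s + load_s

# Effective time per file when 20 files are processed in parallel.
effective_s_per_file = total_s / CONCURRENT_FILES

files_per_hour = 3600 / effective_s_per_file
files_per_day = files_per_hour * 24

# Assumption: binary units, 1 TB = 1024 GB.
tb_per_day = GB_PER_HOUR * 24 / 1024

print(f"{files_per_hour:.1f} files/h, {files_per_day:.1f} files/day")
print(f"{tb_per_day:.2f} TB/day")
```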




#Files   Day Hour
10 2019-07-10 19
4 2019-07-10 20
15 2019-07-10 21
30 2019-07-10 22
26 2019-07-10 23
28 2019-07-11 00
25 2019-07-11 01
23 2019-07-11 02
25 2019-07-11 03
28 2019-07-11 04
30 2019-07-11 05
30 2019-07-11 06
31 2019-07-11 07
30 2019-07-11 08
28 2019-07-11 09
28 2019-07-11 10
30 2019-07-11 11
32 2019-07-11 12
27 2019-07-11 13
27 2019-07-11 14
45 2019-07-11 15
14 2019-07-11 16
29 2019-07-11 17
35 2019-07-11 18
39 2019-07-11 19
11 2019-07-11 20
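
The hourly counts above can be summarised with a few lines of code. This is a minimal sketch that only parses the rows as listed; it makes no assumption about which pipeline stage the counts refer to, and the variable names are illustrative.

```python
# Rows copied from the listing above: "#Files Day Hour".
HOURLY_LOG = """\
10 2019-07-10 19
4 2019-07-10 20
15 2019-07-10 21
30 2019-07-10 22
26 2019-07-10 23
28 2019-07-11 00
25 2019-07-11 01
23 2019-07-11 02
25 2019-07-11 03
28 2019-07-11 04
30 2019-07-11 05
30 2019-07-11 06
31 2019-07-11 07
30 2019-07-11 08
28 2019-07-11 09
28 2019-07-11 10
30 2019-07-11 11
32 2019-07-11 12
27 2019-07-11 13
27 2019-07-11 14
45 2019-07-11 15
14 2019-07-11 16
29 2019-07-11 17
35 2019-07-11 18
39 2019-07-11 19
11 2019-07-11 20
"""

# Each row becomes (file count, day, hour).
rows = [(int(n), day, hour)
        for n, day, hour in (line.split() for line in HOURLY_LOG.splitlines())]

total_files = sum(n for n, _, _ in rows)
hours_covered = len(rows)
mean_files_per_hour = total_files / hours_covered
peak = max(rows, key=lambda row: row[0])

print(f"{total_files} files over {hours_covered} hourly buckets")
print(f"mean {mean_files_per_hour:.1f} files per bucket")
print(f"peak {peak[0]} files on {peak[1]} at {peak[2]}:00")
```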


Operations

First batch of 700 files

74,096,015 variants

Aggregate

Prepare:    529.303s [ 00:08:49 ]

Aggregate: 9591.626s [ 02:39:52 ]

Write:     7012.733s [ 01:56:53 ]  -> Size : 59.5 GiB

Stats

1352.675s [ 00:22:33 ]

Annotate

Prepare:       722.327s [ 00:12:02 ]

Annot:       50384.383s [ 13:59:44 ]

Load:       28204.666s [ 07:44:04 ]

SampleIndex:  12403.542s [ 03:26:44 ]

Secondary index (Solr)

.....

Analysis Benchmark

Query and Aggregation Stats


Stats


GWAS


Clinical Analysis




