Genomic and Clinical Data

Genomic variants of 4,700 genomes were loaded and indexed in OpenCGA. In total we loaded almost 5,000 VCF files accounting for about 20TB of compressed disk space.

Platform

For this proof of concept (PoC) we used the development version OpenCGA v2.0.0-beta using the Hadoop Variant Storage Engine that uses Apache HBase as back-end. We also used CellBase 4.6 for the variant annotation.

For the platform we used a 10-nodes Azure HDInsight 3.6 cluster using Data Lake Storage Gen2. HDInsight 3.6 uses Hortonworks HDP 2.6.5 (with Hadoop 2.7.3 and HBase 1.1.2) and we used Azure Batch for loading concurrently all the VCF files which had been copied previously to a NFS server, you can see details here:

Node Type	Nodes	Azure Type	Cores	Memory (GB)	Storage
Hadoop Master	3	Standard_D12_V2	4	28	Data Lake Gen2
Hadoop Worker	10	Standard_DS13_V2	8	56	Data Lake Gen2
Azure Batch Queue	10	Standard_D4s_v3	4	16	NFS Server

We evaluated the new HDInsight 4 but after finding few issues and we decided to use the more stable HDInsight 3.6 (HDI3.6) over Data Lake Gen2 (DL2), we will refer to this as HDI3.6+DL2. We worked with Azure engineers to debug and fix these issues during the PoC, unfortunately we did no have time to repeat the benchmark.

As you will below in the analysis benchmark, once we completed the PoC we repeated some tests with 20 working nodes to study the performance improvement.

Table size

Table	Compression	Size (TB)
Variants table	GZ	2.9
Variants table	SNAPPY	4.7

Genomics Data Loading and Indexing

Number of loaded files across time. We can differentiate some sections with different performance.

The more representative section is the last one, where we upgraded the input disk to speed up the reading. In average, with the improved disk, processing up to 20 files simultaneously we have these numbers:

	Time	Time/nodes
Transform	00:29:36	00:01:28
Load	00:46:19	00:02:19
Total	1:15:55	00:03:48

Index speed:

15.8 files/h
379.4 files/day
79.0 GB/h
1.85 TB/day

#Files   Day Hour
10 2019-07-10 19
4 2019-07-10 20
15 2019-07-10 21
30 2019-07-10 22
26 2019-07-10 23
28 2019-07-11 00
25 2019-07-11 01
23 2019-07-11 02
25 2019-07-11 03
28 2019-07-11 04
30 2019-07-11 05
30 2019-07-11 06
31 2019-07-11 07
30 2019-07-11 08
28 2019-07-11 09
28 2019-07-11 10
30 2019-07-11 11
32 2019-07-11 12
27 2019-07-11 13
27 2019-07-11 14
45 2019-07-11 15
14 2019-07-11 16
29 2019-07-11 17
35 2019-07-11 18
39 2019-07-11 19
11 2019-07-11 20

Operations

First batch of 700 files

74.096.015 variants

Aggregate

Prepare: 529.303s [ 00:08:49 ]

Aggregate: 9591.626s [ 02:39:52 ]

Write: 7012.733s [ 01:56:53 ] -> Size : 59.5 GiB

Stats

1352.675s [ 00:22:33 ]

Annotate

Prepare: 722.327s [ 00:12:02 ]

Annot: 50384.383s [ 13:59:44 ]

Load: 28204.666s [ 07:44:04 ]

SampleIndex: 12403.542s [ 03:26:44 ]

Secondary index (Solr)

.....

Analysis Benchmark

Query and Aggregation Stats

Stats

GWAS

Clinical Analysis

Table of Contents:

Page tree

Microsoft Azure

Genomic and Clinical Data

Platform

Table size

Genomics Data Loading and Indexing

Index speed:

Operations

First batch of 700 files

Aggregate

Stats

Annotate

Secondary index (Solr)

Analysis Benchmark

Query and Aggregation Stats

Stats

GWAS

Clinical Analysis