OpenCGA loads a huge number of variant files (terabytes to petabytes in size) into storage, and this is one of its most critical and time-consuming tasks. OpenCGA aims to minimise the time between a file becoming available and users being able to query it. The latest version of OpenCGA (1.4.0 rc3) lets users run indexing on the "Azure Batch Service", which can scale out to distribute the load and index variant files in parallel. This significantly reduces variant loading time and improves OpenCGA performance.

OpenCGA uses an ARM template to automatically deploy a pool with a preconfigured OpenCGA Docker image. The pool is "AutoScale" enabled, so it scales as the number of variant index jobs grows. The ARM scripts also populate the following section of "configuration.yml", which enables the OpenCGA daemon to submit jobs (Azure tasks) to the Azure Batch Service. Once an OpenCGA index job is created, the OpenCGA daemon prepares it and then submits it as an Azure task to the Azure Batch Service for execution.

Configuration 

Code Block
titleconfiguration.yml
....
execution:
  mode: AZURE
  ...
  options:
    #Azure Batch Service information
    batchAccount : "batchAccount"
    batchKey : "batchKey"
    batchUri : "https://batchservice.uksouth.batch.azure.com"
    batchPoolId : "poolId"
    dockerImageName : "openCGADockerImageName" # preconfigured docker image
    dockerArgs : "dockerRunOptions"   # e.g. mount points
....
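
As a minimal sketch (not part of OpenCGA), the check below shows how a deployment script might verify that the Azure options above are all filled in before the daemon is started. It assumes the `execution` section has already been parsed into a Python dictionary; the function name `missing_batch_options` and the dictionary shape are hypothetical, and only the option names come from the configuration shown above.

```python
# Option names taken from the configuration.yml example above.
REQUIRED_OPTIONS = (
    "batchAccount",
    "batchKey",
    "batchUri",
    "batchPoolId",
    "dockerImageName",
    "dockerArgs",
)


def missing_batch_options(execution):
    """Return the required Azure Batch option names that are absent or empty.

    `execution` is the parsed `execution` section as a dict
    (assumed shape: {"mode": ..., "options": {...}}).
    """
    options = execution.get("options") or {}
    return [
        name
        for name in REQUIRED_OPTIONS
        if not str(options.get(name) or "").strip()
    ]


if __name__ == "__main__":
    example = {
        "mode": "AZURE",
        "options": {
            "batchAccount": "batchAccount",
            "batchKey": "batchKey",
            "batchUri": "https://batchservice.uksouth.batch.azure.com",
            "batchPoolId": "poolId",
            "dockerImageName": "openCGADockerImageName",
            # dockerArgs is deliberately left out to show the check.
        },
    }
    print(missing_batch_options(example))
```

An empty list means every option has a value; the sketch does not check that the values are valid for a real Batch account.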

...

Once a user creates an OpenCGA variant indexing job, it is stored in the OpenCGA catalog. For example, the following commands link a file in the catalog and then create an index pipeline, which is stored internally as a catalog job:

Code Block
titleOpenCGA Variant Index Job Creation
./opencga.sh files link -i ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.variantFile.vcf.gz -s myStudy
./opencga.sh variant index --file ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.variantFile.vcf.gz --calculate-stats --annotate -o tmp

...

Code Block
titleOpenCGA Daemon
/opt/opencga/bin$ ./opencga-admin.sh catalog daemon --start <<< admin_password


Once the daemon is running, it fetches the available jobs from the catalog, prepares them and then submits each catalog job as an "Azure Task" to the batch pool specified in "configuration.yml". A typical Azure task command looks like this:

Code Block
titleAzure Batch Service Task Command
/opt/opencga/bin/opencga-analysis.sh variant index --outdir /opt/opencga/sessions/jobs/J_2510 --merge ADVANCED -DaggregatedType=NONE -DcalculateStats=true --annotate -Dstdin=false -Dstdout=false --file ALL2.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypesvariantFile.vcf.gz -DoverwriteAnnotations=false -Dsid=eyJkjGciOiJIUzI1NiJ9.eyJzdWIiOiJ0ZXN0IiwiYXVkIjoiT3BlbkNHQSB1c2VycyIsImlhdCI6MTU0Njk0OTM3NiwiZXhwIjoxNTQ2OTUyOTc2fQ.oRKCs0uUANRwZOp12NxbY3st6MVe9K1Wp3eMH1Bgjdc -Dinclude.extra-fields=all -DpostLoad.check.skip=false -Dload.split-data=false -Dexclude.genotypes=false --path tmp: ...
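
To make the shape of this command easier to see, here is an illustrative sketch of how such a command line could be assembled from a job's parameters. It is not the OpenCGA daemon's actual code: the function `build_index_command` and its arguments are hypothetical, and only the script path, the `--outdir`, `--file` and `--annotate` flags and the `-Dname=value` convention are taken from the example above.

```python
import shlex


def build_index_command(out_dir, file_name, params, annotate=True):
    """Assemble a variant index command line like the one shown above.

    `params` maps -D option names to values, e.g. {"calculateStats": "true"}.
    """
    args = [
        "/opt/opencga/bin/opencga-analysis.sh", "variant", "index",
        "--outdir", out_dir,
        "--file", file_name,
    ]
    if annotate:
        args.append("--annotate")
    # Each entry becomes a -Dname=value argument, as in the example task.
    args.extend(f"-D{name}={value}" for name, value in params.items())
    # shlex.join quotes arguments so the string is safe to hand to a shell.
    return shlex.join(args)


if __name__ == "__main__":
    print(build_index_command(
        "/opt/opencga/sessions/jobs/J_2510",
        "sample.vcf.gz",  # hypothetical file name
        {"calculateStats": "true", "overwriteAnnotations": "false"},
    ))
```

Building the command as a list and quoting it once at the end avoids problems with file names that contain spaces or shell metacharacters.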


On startup, the Docker container mounts the locations listed in the "dockerArgs" parameter of the "configuration.yml" file, e.g. "/opt/opencga/conf", "/opt/opencga/sessions" and the "storage location" (where variant files are stored), and applies any other run-time options. The container therefore has access to the shared configuration, session and storage locations, and starts indexing the variant file into storage (HBase|MongoDB) as described in the index pipeline.
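
As a rough illustration of what a "dockerArgs" value could contain, the sketch below builds a string of bind-mount options from a list of shared directories. It rests on two assumptions that the page does not confirm: that "dockerArgs" accepts standard `docker run` options such as `-v`, and that each host path is mounted at the same path inside the container. The helper name `docker_mount_args` is hypothetical.

```python
import shlex


def docker_mount_args(paths):
    """Build `docker run` volume options mounting each path at the same location."""
    args = []
    for path in paths:
        # -v HOST_PATH:CONTAINER_PATH is the standard docker bind-mount option.
        args.extend(["-v", f"{path}:{path}"])
    return shlex.join(args)


if __name__ == "__main__":
    # Shared locations named in the paragraph above.
    print(docker_mount_args(["/opt/opencga/conf", "/opt/opencga/sessions"]))
```

The resulting string would then be placed in the "dockerArgs" option, together with a mount for the storage location used in a given deployment.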