Overview

There are two ways of querying variants in OpenCGA from the Command Line Interface (CLI):

  • opencga.sh: the user command line. It works remotely (outside the OpenCGA cluster) by querying the REST or gRPC services, and it can also query Catalog data.
  • opencga-analysis.sh: a private, internal command line. It is not intended for end users and only works inside the OpenCGA cluster.

Although both command lines provide similar functionality, users are expected to use opencga.sh. Both can be found in the bin folder of the OpenCGA installation directory.

Using opencga.sh

This command line allows you to query by:

  • genomic regions and feature IDs such as gene and SNP
  • variant annotation such as consequence types, conservation scores, PolyPhen, SIFT or population frequencies
  • sample genotypes
  • variant stats in the study
  • some basic aggregations such as ranks, group-by or counts

All these filters can be combined. Some query modifiers are also implemented:

  • skip and limit
  • count: this can be added to all CLIs and returns the number of results

From the $OPENCGA_HOME folder you can see all the parameters by executing:

./bin/opencga.sh variants query -h

NOTE: for security reasons, you need to log in to OpenCGA before using this CLI in a standard OpenCGA installation. This guarantees that you only access the data you have permission for. To log in, execute:

./bin/opencga.sh users login -u USER -p PASSWORD

A session token will be stored in your home directory and used internally by OpenCGA Storage.

Design considerations

There are some design decisions you must be aware of:

  1. The comma character ',' is used in different places in the CLI and can behave in two different ways. When the comma enumerates values of the same query field, such as regions, genes or SO terms, it behaves as a logical OR, as in region 1:1800000-1900000,1:2000000-2100000. When the comma separates different query fields, as in "sift<0.2,polyphen<0.5", it behaves as a logical AND.

  2. Regardless of where region, gene or SNP IDs appear in the CLI, they always behave as a logical OR. For instance, in the next command the region and gene parameters act as a logical OR:

./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --region 1:1800000-1900000,1:2000000-2100000 --gene BRCA2

  3. For all the other CLI parameters a logical AND is executed, so the next query only returns variants in the specified region with a SIFT score below 0.2 AND a PolyPhen score below 0.5:

./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --region 22:15000000-20000000 --protein-substitution "sift<0.2,polyphen<0.5" --return-study STUDY_ID
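
The two comma behaviours can be made explicit when a command is assembled in a script. What follows is a minimal sketch under stated assumptions, not part of OpenCGA: join_by is a hypothetical helper, DATABASE_NAME and STUDY_ID are placeholders, and the command is only printed, not executed. It reuses only the flags shown above.

```shell
# join_by is a hypothetical helper (not part of OpenCGA): it joins its
# remaining arguments using the separator given as the first argument.
join_by() {
  sep="$1"
  shift
  out=""
  for item in "$@"; do
    out="${out:+$out$sep}$item"
  done
  printf '%s' "$out"
}

# Values of the same query field: the comma behaves as a logical OR.
region_arg="$(join_by , "1:1800000-1900000" "1:2000000-2100000")"

# Different query fields: the comma behaves as a logical AND.
filter_arg="$(join_by , "sift<0.2" "polyphen<0.5")"

# Print the command instead of running it.
printf '%s\n' "./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --region $region_arg --protein-substitution \"$filter_arg\" --return-study STUDY_ID"
```

Keeping the OR list and the AND list in separate variables makes it harder to mix the two meanings of ',' by accident.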

Example queries

Using variant attributes

To fetch variants for a specific region:

./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --return-study STUDY_ID --region 22:15000000-20000000

and for several regions, separate them with ',':

./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --return-study STUDY_ID --region 1:1800000-1900000,1:2000000-2100000

You can also add a list of genes:

./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --return-study STUDY_ID --region 1:1800000-1900000,1:2000000-2100000 --gene BRCA2,TP53

Note: remember that regions and genes are always combined with a logical OR.

To restrict the results to SNVs, INDELs or SVs, use the --type parameter:

./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --return-study STUDY_ID --region 1:1800000-1900000,1:2000000-2100000 --type INDEL
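
Since --count can be added to any query, one way to compare variant types in the same regions is to loop over them. This is only a sketch: type_count_commands is a hypothetical helper, the value names SNV and SV are assumed from the text above (only INDEL appears in an actual command), and the commands are printed rather than executed.

```shell
# Hypothetical helper, not part of OpenCGA: prints one count command
# per variant type for the same pair of regions.
type_count_commands() {
  # SNV and SV are assumed value names; INDEL is the one shown in this page.
  for variant_type in SNV INDEL SV; do
    printf '%s\n' "./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --return-study STUDY_ID --region 1:1800000-1900000,1:2000000-2100000 --type $variant_type --count"
  done
}

type_count_commands
```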

Using variant annotation info

To query by SIFT or PolyPhen2, use --protein-substitution:

./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --return-study STUDY_ID --region 22:15000000-20000000 --protein-substitution "sift<0.2"

or both; remember that here the ',' acts as a logical AND:

./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --return-study STUDY_ID --region 22:15000000-20000000 --protein-substitution "sift<0.2,polyphen<0.5"

To only count the number of variants, remember that you can always add --count:

./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --return-study STUDY_ID --region 22:1500000-2000000 --protein-substitution "sift<0.2" --count

To query using Consequence Type terms from the Sequence Ontology (SO), you can use the terms listed at http://www.ensembl.org/info/genome/variation/predicted_data.html. Use commas to add several terms:

./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --return-study STUDY_ID --region 21:9411443-19411443 --consequence-type SO:0001623,SO:0001624 --count

You can always combine parameters with a logical AND, so the next query returns variants annotated with those SO terms in the specified region:

./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --return-study STUDY_ID --region 21:9411443-19411443 --consequence-type SO:0001623,SO:0001624 --count

To query using conservation scores you can use --conservation. The next query uses both PhastCons and PhyloP separated by ','; since they are different query fields, they act as a logical AND:

./bin/opencga-storage.sh fetch-variants --database opencga_test_demo --return-study 2 --region 1 --conservation "phastCons<0.1,phylop<0.2" --count

You can also query using population frequencies from the 1000 Genomes Project, EVS and ExAC with the --population-freqs parameter:

./bin/opencga-storage.sh fetch-variants --database opencga_test_demo --return-study 2 --output-format json --population-freqs "1000GENOMES_phase_1:EUR<0.01" --count

or several populations together, separated by commas. Since they are different populations and therefore different query fields, this is a logical AND:

./bin/opencga-storage.sh fetch-variants --database opencga_test_demo --return-study 2 --output-format json --population-freqs "1000GENOMES_phase_1:EUR<0.01,1000GENOMES_phase_1:AFR<0.01" --count
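
When the same frequency threshold applies to several populations, the filter string can be generated instead of typed by hand. This is a minimal sketch: pop_filter is a hypothetical helper, not an OpenCGA command, and it only reproduces the "STUDY:POPULATION<THRESHOLD" syntax used in the examples above.

```shell
# Hypothetical helper, not part of OpenCGA.
# Usage: pop_filter STUDY THRESHOLD POPULATION...
# Builds "STUDY:POP1<T,STUDY:POP2<T"; the comma acts as a logical AND.
pop_filter() {
  study="$1"
  threshold="$2"
  shift 2
  out=""
  for pop in "$@"; do
    out="${out:+$out,}$study:$pop<$threshold"
  done
  printf '%s' "$out"
}

pop_filter 1000GENOMES_phase_1 0.01 EUR AFR
# prints: 1000GENOMES_phase_1:EUR<0.01,1000GENOMES_phase_1:AFR<0.01
```

The result can be passed in double quotes as the value of --population-freqs, so that the shell does not interpret '<' as a redirection.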

Sample genotype

To query by specific sample genotypes you can use the --sample-genotype parameter. You must separate samples with ';' and the accepted genotypes for each sample with ','. This executes an AND between samples and an OR between the genotypes, so in:

./bin/opencga-storage.sh fetch-variants --database opencga_test_demo --return-study 2 --sample-genotype "15:0/0;20:0/1,1/1" --limit 15

variants that are 0/0 for sample 15 and 0/1 or 1/1 for sample 20 are returned. (Note: in a few days sample names will be allowed.)
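
The two separators are easy to confuse, so it can help to build the value from its parts. This is a sketch under stated assumptions: join_by and sample_filter are hypothetical helpers, not part of OpenCGA, and they only reproduce the "SAMPLE:GT,GT;SAMPLE:GT" syntax described above.

```shell
# Hypothetical helpers, not part of OpenCGA.
# join_by SEP ITEM... joins the items with the given separator.
join_by() {
  sep="$1"
  shift
  out=""
  for item in "$@"; do
    out="${out:+$out$sep}$item"
  done
  printf '%s' "$out"
}

# sample_filter SAMPLE GT... builds "SAMPLE:GT1,GT2".
# Genotypes of one sample are joined with ',' (logical OR).
sample_filter() {
  sample="$1"
  shift
  printf '%s:%s' "$sample" "$(join_by , "$@")"
}

# Samples are joined with ';' (logical AND).
genotype_arg="$(join_by ';' "$(sample_filter 15 0/0)" "$(sample_filter 20 0/1 1/1)")"
printf '%s\n' "$genotype_arg"
# prints: 15:0/0;20:0/1,1/1
```

The value must be quoted when passed to --sample-genotype, otherwise the shell treats ';' as a command separator.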

Building more complex queries

You can combine all the parameters above to execute more complex queries:

./bin/opencga-storage.sh fetch-variants --database opencga_test_demo --return-study 2 --region 1:50000-3000000 --sample-genotype "15:0/0;20:0/1,1/1" --protein-substitution "sift<0.2,polyphen<0.5" --conservation "phastCons<0.1"
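
Long combined queries are easier to read and review when each filter is added on its own line. The sketch below assembles the same query one filter at a time; build_query is a hypothetical helper, not part of OpenCGA, and it prints the arguments one per line instead of executing the command.

```shell
# Hypothetical helper, not part of OpenCGA: builds the argument list
# step by step in the function's positional parameters.
build_query() {
  set -- ./bin/opencga-storage.sh fetch-variants \
    --database opencga_test_demo --return-study 2
  set -- "$@" --region 1:50000-3000000
  set -- "$@" --sample-genotype "15:0/0;20:0/1,1/1"
  set -- "$@" --protein-substitution "sift<0.2,polyphen<0.5"
  set -- "$@" --conservation "phastCons<0.1"
  # Print one argument per line; replace this line with "$@" to run the query.
  printf '%s\n' "$@"
}

build_query
```

Every filter added this way is combined with the others as a logical AND, following the design considerations described earlier.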

Some aggregations and rankings

To group variants by gene or consequence type you can use the --group-by parameter:

./bin/opencga-storage.sh fetch-variants --database opencga_test_demo --return-study 2 --region 1:1245816-3245819 --group-by gene

You can also rank genes or consequence types using --rank:

./bin/opencga-storage.sh fetch-variants --database opencga_test_demo --return-study 2 --region 1:1245816-3245819 --rank gene

