Overview
There are two ways of querying variants in OpenCGA using the Command Line Interface (CLI):
- opencga.sh: this is the user command line. It works remotely (outside of the OpenCGA cluster) by querying the REST or gRPC services. It can also query Catalog data.
- opencga-analysis.sh: a private and internal command line. It is not intended to be used by users and it only works inside the OpenCGA cluster.
Although both command lines provide similar functionality, users are expected to use opencga.sh. Both can be found in the _bin_ folder of the OpenCGA installation directory.
Using opencga.sh
This command line lets you query by:
- genomic regions and feature IDs such as genes and SNPs
- variant annotation such as consequence types, conservation scores, PolyPhen, SIFT or population frequencies
- sample genotypes
- variant stats in the study
- some basic aggregations such as ranks, group-by or counts
All these filters can be combined. Some query modifiers are also implemented:
- skip and limit
- count: this can be added to any command line and returns the number of results
From the $OPENCGA_HOME folder you can execute the following to see all the parameters:
./bin/opencga.sh variants query -h
NOTE: for security reasons, you need to log in to OpenCGA before using this CLI in a standard OpenCGA installation. This guarantees that you only access the data you have permission for. To log in, execute:
./bin/opencga.sh users login -u USER -p PASSWORD
A session token will be stored in your home directory and used internally by OpenCGA Storage.
Design considerations
There are some design decisions you must be aware of:
- The comma character ',' is used in different places in the CLI and can behave in two different ways. When the comma enumerates values of a query field, such as regions, genes or SO terms, it behaves as a logical OR, as in region 1:1800000-1900000,1:2000000-2100000. When the comma separates query fields, as in "sift<0.2,polyphen<0.5", it acts as a logical AND.
- Regardless of where regions, genes or SNP IDs appear in the CLI, they always behave as a logical OR. For instance, in the following command line the region and gene parameters act as a logical OR:
./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --region 1:1800000-1900000,1:2000000-2100000 --gene BRCA2
- For all other CLI parameters a logical AND is executed, so the following query only returns variants in the specified region with a SIFT score below 0.2 AND a PolyPhen score below 0.5:
./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --region 22:15000000-20000000 --protein-substitution "sift<0.2,polyphen<0.5" --return-study STUDY_ID
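The two comma behaviours described above can be made explicit when building a command line from a script. The following is a minimal sketch in POSIX shell: join_by is a hypothetical helper (not part of OpenCGA), and the command is only printed, not executed, so the script runs without an OpenCGA installation.

```shell
#!/bin/sh
# join_by SEP ITEM...: print the items joined by SEP.
# Hypothetical helper for illustration; it is not provided by OpenCGA.
join_by() {
    sep=$1
    shift
    [ "$#" -gt 0 ] || return 0   # nothing to join
    out=$1
    shift
    for item in "$@"; do
        out="${out}${sep}${item}"
    done
    printf '%s\n' "$out"
}

# Values of the same field: the comma enumerates alternatives (logical OR).
regions=$(join_by , 1:1800000-1900000 1:2000000-2100000)

# Different fields inside one parameter: the comma combines them (logical AND).
scores=$(join_by , 'sift<0.2' 'polyphen<0.5')

# Dry run: print the command line instead of executing it.
printf '%s\n' "./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --region $regions --protein-substitution \"$scores\" --return-study STUDY_ID"
```

Read the printed command as: variants in either of the two regions (OR) that also satisfy both score conditions (AND).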
Example queries
Using variant attributes
To fetch variants for a specific region:
./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --return-study STUDY_ID --region 22:15000000-20000000
and for several regions separating them by ',':
./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --return-study STUDY_ID --region 1:1800000-1900000,1:2000000-2100000
you can also add a list of genes:
./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --return-study STUDY_ID --region 1:1800000-1900000,1:2000000-2100000 --gene BRCA2,TP53
Note: remember all regions and genes are always a logical OR.
If you only want SNVs, INDELs or SVs you can use the --type parameter:
./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --return-study STUDY_ID --region 1:1800000-1900000,1:2000000-2100000 --type INDEL
Using variant annotation info
To query by SIFT or PolyPhen2 you use --protein-substitution:
./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --return-study STUDY_ID --region 22:15000000-20000000 --protein-substitution "sift<0.2"
or using both, remember that here the ',' acts as a logical AND:
./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --return-study STUDY_ID --region 22:15000000-20000000 --protein-substitution "sift<0.2,polyphen<0.5"
To only count the number of variants remember you can always add --count:
./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --return-study STUDY_ID --region 22:1500000-2000000 --protein-substitution "sift<0.2" --count
To query using Consequence Type terms from the Sequence Ontology (SO), you can use the terms listed at http://www.ensembl.org/info/genome/variation/predicted_data.html, separating several terms with commas:
./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --return-study STUDY_ID --region 21:9411443-19411443 --consequence-type SO:0001623,SO:0001624 --count
You can always combine parameters in a logical AND, so the next query returns variants annotated with those SO terms in the specified region:
./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --return-study STUDY_ID --region 21:9411443-19411443 --consequence-type SO:0001623,SO:0001624 --count
To query using conservation scores you can use --conservation. The next query uses both PhastCons and PhyloP separated by ','; since they are different query fields, they act as a logical AND:
./bin/opencga-storage.sh fetch-variants --database opencga_test_demo --return-study 2 --region 1 --conservation "phastCons<0.1,phylop<0.2" --count
You can also query using population frequencies from the 1000 Genomes Project, EVS and ExAC using the --population-freqs parameter:
./bin/opencga-storage.sh fetch-variants --database opencga_test_demo --return-study 2 --output-format json --population-freqs "1000GENOMES_phase_1:EUR<0.01" --count
or several populations together separated by comma, since they are different populations and query fields this is a logical AND:
./bin/opencga-storage.sh fetch-variants --database opencga_test_demo --return-study 2 --output-format json --population-freqs "1000GENOMES_phase_1:EUR<0.01,1000GENOMES_phase_1:AFR<0.01" --count
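When the same frequency threshold applies to several populations, the filter string can be generated rather than typed by hand. This is a minimal sketch, assuming the STUDY:POPULATION<VALUE syntax shown in the examples above; pop_freq_filter is a hypothetical helper, and the command is only printed, not executed.

```shell
#!/bin/sh
# pop_freq_filter STUDY MAX_FREQ POPULATION...
# Prints one "STUDY:POPULATION<MAX_FREQ" clause per population, joined by ','.
# Hypothetical helper for illustration; it expects at least two arguments.
pop_freq_filter() {
    study=$1
    max_freq=$2
    shift 2
    out=''
    for pop in "$@"; do
        clause="${study}:${pop}<${max_freq}"
        if [ -z "$out" ]; then
            out=$clause
        else
            out="${out},${clause}"
        fi
    done
    printf '%s\n' "$out"
}

filter=$(pop_freq_filter 1000GENOMES_phase_1 0.01 EUR AFR)

# Dry run: print the command line instead of executing it.
printf '%s\n' "./bin/opencga-storage.sh fetch-variants --database opencga_test_demo --return-study 2 --output-format json --population-freqs \"$filter\" --count"
```

Because each clause names a different population, the commas in the generated filter act as a logical AND.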
Sample genotype
To query by specific sample genotypes you can use the --sample-genotype parameter. You must separate samples by ';' and the accepted genotypes for each sample by ','. This executes an AND between samples and an OR between the genotypes of each sample, so in:
./bin/opencga-storage.sh fetch-variants --database opencga_test_demo --return-study 2 --sample-genotype "15:0/0;20:0/1,1/1" --limit 15
variants which are 0/0 for sample 15 and 0/1 or 1/1 for sample 20 are returned. (Note: in a few days sample names will be allowed.)
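To double-check how a --sample-genotype value will be interpreted, the separators can be spelled out. The sketch below is a hypothetical POSIX shell helper (explain_genotypes is not part of OpenCGA) that only restates the rule above: ';' becomes AND between samples and ',' becomes OR between genotypes.

```shell
#!/bin/sh
# explain_genotypes FILTER: print one line per sample clause of a
# --sample-genotype value, with the implied logical operators written out.
# Hypothetical helper for illustration only.
explain_genotypes() {
    old_ifs=$IFS
    IFS=';'
    set -f              # disable globbing while the filter is split on ';'
    set -- $1           # one positional parameter per sample clause
    set +f              # note: re-enables globbing unconditionally
    IFS=$old_ifs
    prefix=''
    for clause in "$@"; do
        sample=${clause%%:*}
        # Genotypes of one sample are alternatives, so ',' reads as OR.
        alternatives=$(printf '%s' "${clause#*:}" | sed 's/,/ OR /g')
        printf '%ssample %s: %s\n' "$prefix" "$sample" "$alternatives"
        prefix='AND '
    done
}

explain_genotypes '15:0/0;20:0/1,1/1'
# prints:
# sample 15: 0/0
# AND sample 20: 0/1 OR 1/1
```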
Building more complex queries
You can combine all the parameters above to execute more complex queries:
./bin/opencga-storage.sh fetch-variants --database opencga_test_demo --return-study 2 --region 1:50000-3000000 --sample-genotype "15:0/0;20:0/1,1/1" --protein-substitution "sift<0.2,polyphen<0.5" --conservation "phastCons<0.1"
Some aggregations and rankings
To group variants by gene or consequence type you can use the --group-by parameter:
./bin/opencga-storage.sh fetch-variants --database opencga_test_demo --return-study 2 --region 1:1245816-3245819 --group-by gene
You can also rank genes or consequence types using --rank:
./bin/opencga-storage.sh fetch-variants --database opencga_test_demo --return-study 2 --region 1:1245816-3245819 --rank gene