The command line implements many filters that allow powerful and highly flexible queries, including genomic regions, feature IDs (e.g. gene and SNP IDs), consequence types, conservation scores, PolyPhen, SIFT and population frequencies, as well as some basic aggregations such as ranks, group-by or counts. All these filters can be combined. Some query modifiers are also implemented: include, exclude, skip, limit and count, which can be added to most queries.
You can execute opencga.sh to see all the parameters. Please note that the opencga.sh script is located in the opencga/bin directory of the installation directory. Integrated help is available with the -h (or --help) parameter.
Design considerations
There are some design decisions you must be aware of:
- The comma character ',' is used in different places in the CLI and always behaves as a logical OR, for example in the region list 1:1800000-1900000,1:2000000-2100000 or in "sift<0.2,polyphen<0.5". The semicolon ';', where allowed, behaves as a logical AND.
- Regardless of where region, gene or SNP ID parameters appear in the CLI, they are always combined with a logical OR. For instance, in the next command the region and gene parameters act as a logical OR:
./opencga.sh variant query --region 1:1849612-1850388,1:2049808-2050192 --gene BRCA2 --study GONL --exclude studies --of json_pretty
- All other CLI parameters are combined with a logical AND, so the next query returns only variants that lie in the specified region AND satisfy the --protein-substitution filter. Within that filter, the ',' between the SIFT and PolyPhen comparisons is itself a logical OR:
./opencga.sh variant query --region 22:17464756-17479892 --protein-substitution "sift<=0.5,polyphen>=0.1" --study reference_grch38:1kG_phase3 --limit 10 --exclude studies
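The separator rules above can be sketched with two small shell helpers. This is a minimal illustration and not part of OpenCGA: join_with, or_filter and and_filter are hypothetical names, and the snippet only builds filter strings without calling opencga.sh.

```shell
# Hypothetical helpers (not part of OpenCGA) that build filter values.

# join_with SEP VALUE...: print the values joined by the single character SEP.
# The body runs in a subshell so the IFS change does not leak to the caller.
join_with() (
  IFS="$1"
  shift
  printf '%s\n' "$*"
)

# ',' behaves as a logical OR in the CLI; ';' (where allowed) as a logical AND.
or_filter()  { join_with ',' "$@"; }
and_filter() { join_with ';' "$@"; }

or_filter 'sift<0.2' 'polyphen<0.5'    # prints: sift<0.2,polyphen<0.5
and_filter 'sift<0.2' 'polyphen<0.5'   # prints: sift<0.2;polyphen<0.5
```

Always quote the resulting value when passing it to opencga.sh, since '<', '>' and ';' are shell metacharacters.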
Example queries
Using variant attributes
To fetch variants for a specific region:
./opencga.sh variant query --studies STUDY --region CHR:START-END
For example, to fetch variants from the 1000 Genomes Project in region 22:15000000-20000000:
./opencga.sh variant query --study reference_grch37:1kG_phase3 --region 22:15000000-20000000 --limit 3 --exclude studies
Please note: the number of variants in the region may be huge (hundreds of thousands in this example). The total number of variants returned has been limited to 3 by using the --limit parameter. Also, to make the query more efficient, all study metadata, which in turn contains the sample metadata for all 1kG phase 3 samples, is excluded from the result by using --exclude studies.
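Because a region can hold far more variants than a single call should return, results can be fetched page by page by combining the limit and skip query modifiers. The following is a minimal dry-run sketch that only prints the commands. The flag spelling --skip is an assumption (this page names the skip modifier but does not show its flag; check ./opencga.sh variant query -h), and page_commands is a hypothetical helper name.

```shell
# page_commands PAGE_SIZE PAGES: print one query command per page (dry run).
# ASSUMPTION: the skip modifier is spelled --skip; verify with the integrated help.
# The body runs in a subshell so its variables do not leak to the caller.
page_commands() (
  page_size=$1
  pages=$2
  i=0
  while [ "$i" -lt "$pages" ]; do
    echo "./opencga.sh variant query --study reference_grch37:1kG_phase3 --region 22:15000000-20000000 --skip $((i * page_size)) --limit $page_size --exclude studies"
    i=$((i + 1))
  done
)

# Print the commands for the first two pages of three variants each.
page_commands 3 2
```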
To fetch variants from several regions, separate them with ',':
./opencga.sh variant query --study reference_grch37:1kG_phase3 --region 1:1800000-1900000,1:2000000-2100000 --limit 3 --exclude studies
You can also add a list of genes:
./opencga.sh variant query --study reference_grch37:1kG_phase3 --region 1:1800000-1900000,1:2000000-2100000 --gene BRCA2,TP53 --limit 3 --exclude studies
Note: remember that all regions and genes are always combined with a logical OR.
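When the regions of interest live in a file, the comma-separated list can be generated rather than typed. The sketch below assumes a hypothetical whitespace-separated file with CHR START END columns that are already in the coordinates --region expects; regions_from_file is an illustrative name, and the final command is printed rather than executed.

```shell
# regions_from_file FILE: turn "CHR START END" rows into CHR:START-END,...
# Lines starting with '#' and lines with fewer than three fields are skipped.
regions_from_file() {
  awk '!/^#/ && NF >= 3 { printf "%s%s:%s-%s", sep, $1, $2, $3; sep = "," }
       END { print "" }' "$1"
}

# Demo with a throwaway file holding the two regions used above.
tmp=$(mktemp)
printf '# chr start end\n1 1800000 1900000\n1 2000000 2100000\n' > "$tmp"
regions=$(regions_from_file "$tmp")
rm -f "$tmp"

# Dry run: print the command instead of executing it.
echo "./opencga.sh variant query --study reference_grch37:1kG_phase3 --region $regions --gene BRCA2,TP53 --limit 3 --exclude studies"
```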
To restrict the results to SNVs, INDELs or SVs, use the --type parameter:
./opencga.sh variant query --study reference_grch37:1kG_phase3 --region 1:1800000-1900000,1:2000000-2100000 --limit 3 --exclude studies --type INDEL
Using variant annotation info
To query by SIFT or PolyPhen2 scores you can use the --protein-substitution parameter:
./opencga.sh variant query --study reference_grch37:1kG_phase3 --region 22:17464756-17479892 --protein-substitution "sift<0.5" --limit 3 --exclude studies
or using both:
./opencga.sh variant query --region 22:17464756-17479892 --protein-substitution "sift<=0.5,polyphen>=0.1" --study reference_grch38:1kG_phase3 --limit 10 --exclude studies
To count the matching variants instead of returning them, remember you can always add --count:
./opencga.sh variant query --study reference_grch37:1kG_phase3 --region 22:17464756-17479892 --protein-substitution "sift>0.5" --count
To query using consequence type terms from the Sequence Ontology (SO), you can use the terms listed at http://www.ensembl.org/info/genome/variation/predicted_data.html. Separate multiple terms with commas:
./opencga.sh variant query --study reference_grch37:1kG_phase3 --region 21:15888971-15889629 --consequence-type missense_variant,stop_gained --count
To query using conservation scores you can use --conservation. Multiple comparisons may be combined by using either ',' or ';' as separators. Comparisons separated by ',' perform a logical OR. Comparisons separated by ';' perform a logical AND. Complex logical operations combining ',' and ';' in a single query are not currently allowed. The next query combines PhastCons, PhyloP and GERP comparisons; since they are separated by ',', they act as a logical OR:
./opencga.sh variant query --study reference_grch37:1kG_phase3 --region 21:15888971-15889629 --conservation "phastCons>0.5,phylop<0.1,gerp>0.1" --count
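Since ',' and ';' cannot be mixed in a single filter value, a wrapper script can reject such values before calling the CLI. This is a minimal sketch under that rule: check_filter is a hypothetical helper, not an OpenCGA command, and the query itself is only printed.

```shell
# check_filter VALUE: succeed unless VALUE mixes ',' (OR) and ';' (AND).
check_filter() {
  case "$1" in
    *","*";"* | *";"*","*)
      echo "cannot mix ',' and ';' in: $1" >&2
      return 1
      ;;
    *)
      return 0
      ;;
  esac
}

filter='phastCons>0.5,phylop<0.1,gerp>0.1'
if check_filter "$filter"; then
  # Dry run: print the command instead of executing it.
  echo "./opencga.sh variant query --study reference_grch37:1kG_phase3 --region 21:15888971-15889629 --conservation \"$filter\" --count"
fi
```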
You can also query using population frequencies from the 1000 Genomes Project, EVS and ExAC with the --alt-population-frequency parameter:
./opencga.sh variant query --study reference_grch37:1kG_phase3 --region 21:15888971-15889629 --alt-population-frequency "1kG_phase3:EUR<0.01" --count
You can also filter on several populations at once, separated by ',' or ';'. The next query separates them with ',', so the two comparisons are combined with a logical OR:
./opencga.sh variant query --study reference_grch37:1kG_phase3 --region 21:15888971-15889629 --alt-population-frequency "1kG_phase3:EUR<0.01,1kG_phase3:AFR<0.01" --count
Sample genotype
To query by specific sample genotypes you can use the --genotype parameter. You must separate samples with ';', and the accepted genotypes for each sample with ','. This executes an AND between samples and an OR between the genotypes of each sample, so in:
./opencga.sh variant query --study reference_grch37:1kG_phase3 --genotype "NA19030:0|1,1|0,1|1;NA19043:0|1,1|0,1|1" --limit 3 --exclude studies
variants which are present in both samples NA19030 and NA19043 are returned (the number of returned variants is limited to 3 in this case).
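For more than a couple of samples, the --genotype value can be assembled programmatically. The sketch below applies the same accepted genotypes to every listed sample; carriers_filter is a hypothetical helper name, and the snippet only builds and prints the string.

```shell
# carriers_filter GENOTYPES SAMPLE...: print SAMPLE:GENOTYPES clauses joined
# by ';' (AND between samples). GENOTYPES is a ','-separated list (OR).
# The body runs in a subshell so its variables do not leak to the caller.
carriers_filter() (
  gts=$1
  shift
  out=''
  for sample in "$@"; do
    # Prefix a ';' only when a clause has already been added.
    out="${out:+$out;}$sample:$gts"
  done
  printf '%s\n' "$out"
)

carriers_filter '0|1,1|0,1|1' NA19030 NA19043
# prints: NA19030:0|1,1|0,1|1;NA19043:0|1,1|0,1|1
```

The value must stay quoted on the command line, because both '|' and ';' are shell metacharacters.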
Building more complex queries
You can combine all the parameters above to execute more complex queries:
./opencga.sh variant query --study reference_grch37:1kG_phase3 --genotype "NA19030:0|1,1|0,1|1;NA19043:0|0" --limit 3 --exclude studies,annotation.geneTraitAssociation --conservation "phastCons<1"
Some aggregations and rankings
To group variants by gene or consequence type you can use the --group-by parameter:
./opencga.sh variant query --study reference_grch37:1kG_phase3 --region 21:15888971-15889629 --group-by gene --include annotation.consequenceTypes --log-level debug --limit 10
You can also rank genes or consequence types using --rank:
./opencga.sh variant query --study reference_grch37:1kG_phase3 --region 21:15888971-15889629 --rank gene --include annotation.consequenceTypes --limit 10