Querying Variant Data

Overview

The main goal of indexing variant data into OpenCGA Storage is to be able to query and extract this data efficiently. There are several ways to access the data (via CLI, RESTful, Java API, Python API, ...) and multiple implementations of the VariantStorageManager (OpenCGA Storage MongoDB, OpenCGA Storage Hadoop, ...).

All these layers and implementations use the same specification, defined in this document.

A small set of READ-ONLY methods is defined to provide all the required functionality.

Query	Returns all variants that match a given query.
Count	Counts the results of a given query.
GroupBy & Rank	Groups variants by some field and, optionally, ranks the groups by number of variants.
Frequency	Groups variants by region and counts them. Useful for plotting histograms.

Query filters

A filter is a <key>, <value> pair, where the keys are predefined and the values are defined by the user, following a specific format. The next sections enumerate all these keys, explaining their effect and the required format of the value.

Some general rules apply in every case:

Returned variants must match all the filters, with the exception of the positional filters: a variant only needs to match at least one positional filter (if any are given).
When a filter accepts a list of values, they can be separated with:
- Comma , : Which will define an OR operation between the separated elements
- Semicolon ; : Which will define an AND operation between the separated elements

So, the query chromosome: 2,3 will return all variants in chromosomes 2 and 3, but chromosome: 2;3 will return an empty result, because no variant is in chromosomes 2 and 3 at the same time.
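
The separator rules can be illustrated with a small Python sketch. It is not part of OpenCGA: the function names are hypothetical, and rejecting values that mix both separators is an assumption made only to keep the example simple.

```python
def parse_list_value(value: str):
    """Split a list-valued filter into (operator, items).

    A comma means OR and a semicolon means AND. Mixing both separators
    in one value is rejected here; that restriction is an assumption of
    this sketch, not something the specification states.
    """
    has_or = "," in value
    has_and = ";" in value
    if has_or and has_and:
        raise ValueError(f"cannot mix ',' and ';' in {value!r}")
    separator, operator = (";", "AND") if has_and else (",", "OR")
    items = [item.strip() for item in value.split(separator) if item.strip()]
    return operator, items


def matches(field_values: set, value: str) -> bool:
    """Check the values a variant has for one field against a filter value."""
    operator, items = parse_list_value(value)
    if operator == "AND":
        return all(item in field_values for item in items)
    return any(item in field_values for item in items)
```

A variant located in chromosome 2 has the single chromosome value `{"2"}`, so `matches({"2"}, "2,3")` succeeds while `matches({"2"}, "2;3")` cannot succeed for any variant.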

General filters

These general filters match fields from the VCF input files.

Key	Description	Format
ids	Matches with the ID field	List of values
region	Matches with the chromosome and start position	List of `<chromosome>:<start>-<end>`
chromosome	Matches with the chromosome	List of values
type	Matches with the type of the variant	List of values. Accepted values: [SNV, MNV, INDEL, SV, CNV]
reference	Matches with the reference	List of values
alternate	Matches with the alternate	List of values
studies	Matches with variants that are in the specified studies	List of values. Accepts negations.
files
genotype		`{samp_1}:{gt_1}(,{gt_n});`. Example: HG0097:0/0;HG0098:0/1,1/1
qual	PENDING
filter	PENDING
info	PENDING
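
As an illustration of the region and genotype formats in the table above, the following Python sketch parses both into plain data structures. The function and class names are hypothetical and the validation is deliberately minimal; the actual OpenCGA parsers may accept more variations.

```python
import re
from typing import Dict, List, NamedTuple


class Region(NamedTuple):
    chromosome: str
    start: int
    end: int


# <chromosome>:<start>-<end>, e.g. "1:1000-2000"
_REGION_RE = re.compile(r"^([^:]+):(\d+)-(\d+)$")


def parse_region(text: str) -> Region:
    """Parse a single region written as <chromosome>:<start>-<end>."""
    match = _REGION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"malformed region: {text!r}")
    chromosome, start, end = match.groups()
    return Region(chromosome, int(start), int(end))


def parse_genotype_filter(text: str) -> Dict[str, List[str]]:
    """Parse '{sample}:{gt}(,{gt});...' into {sample: [genotypes]}.

    Samples are separated by ';' and the genotypes accepted for each
    sample are separated by ','.
    """
    result: Dict[str, List[str]] = {}
    for entry in text.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # tolerate a trailing ';'
        sample, separator, genotypes = entry.partition(":")
        if not separator or not sample or not genotypes:
            raise ValueError(f"malformed genotype entry: {entry!r}")
        result[sample] = [gt.strip() for gt in genotypes.split(",") if gt.strip()]
    return result
```

With the example value from the table, `parse_genotype_filter("HG0097:0/0;HG0098:0/1,1/1")` yields one genotype for HG0097 and two alternatives for HG0098.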

Modifiers:

Key	Description	Format
limit
skip
sort
include
exclude
returnedStudies
returnedFiles
returnedSamples
unknownGenotype

Statistics filters

Apart from the data provided in the files, some statistics are calculated from the genotypes or, if the input was an aggregated file, parsed from the INFO column.

These filters relate to the statistics of a specific study and cohort. Accordingly, the format is the same for each filter: <study>:<cohort><comparator><value>, where the available comparators are: <, <=, >, >=, = and !=.

Key	Description
maf	Minor Allele Frequency
mgf	Minor Genotype Frequency
missingAlleles	Number of missing alleles
missingGenotypes	Number of missing genotypes
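
The `<study>:<cohort><comparator><value>` format can be sketched in Python as follows. The helper names, and the study and cohort names used in the usage note, are hypothetical; the sketch also assumes that the value is always numeric.

```python
import operator
import re

# The six comparators listed for statistics filters.
_COMPARATORS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "=": operator.eq,
    "!=": operator.ne,
}

# Two-character comparators come first so that "<=" is not read as "<".
_STATS_RE = re.compile(r"^([^:]+):([^<>=!]+)(<=|>=|!=|<|>|=)(.+)$")


def parse_stats_filter(text: str):
    """Split '<study>:<cohort><comparator><value>' into its four parts."""
    match = _STATS_RE.match(text.strip())
    if match is None:
        raise ValueError(f"malformed statistics filter: {text!r}")
    study, cohort, comparator, value = match.groups()
    return study, cohort, comparator, float(value)


def passes(stat_value: float, text: str) -> bool:
    """Apply the comparator of a statistics filter to a computed statistic."""
    _, _, comparator, threshold = parse_stats_filter(text)
    return _COMPARATORS[comparator](stat_value, threshold)
```

For example, a maf filter written as `study1:ALL<0.05` would keep a variant whose minor allele frequency in cohort ALL of study1 is 0.01, and discard one whose frequency is 0.1.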

Annotation filters

Key	Description	Format	Example
gene	List of genes	List of values. Accepts negations.
annotationExists		`true`/`false`
annot-ct	Consequence type SO term list.		SO:0000045,SO:0000046
annot-xref	External references
annot-biotype	List of biotypes
polyphen	Polyphen, protein substitution score.	`[<|>|<=|>=]{number}` or `[~=|=|]{description}`	<=0.9 , =benign
sift	Sift, protein substitution score.	`[<|>|<=|>=]{number}` or `[~=|=|]{description}`	>0.1 , ~=tolerant
protein_substitution	Protein substitution score	`{protein_score}[<|>|<=|>=]{number}` or `{protein_score}[~=|=|]{description}`	polyphen>0.1 , sift=tolerant
conservation	Conservation score. Phylop, phastCons or gerp.	`{conservation_score}[<|>|<=|>=]{number}`	phastCons>0.5 , phylop<0.1 , gerp>0.1
alternate_frequency	Alternate Population Frequency	`{study}:{population}[<|>|<=|>=]{number}`	1000GENOMES_phase_3:AFR>0.2
reference_frequency	Reference Population Frequency	`{study}:{population}[<|>|<=|>=]{number}`	ESP_6500:AA<0.2
annot-population-maf	Population minor allele frequency	`{study}:{population}[<|>|<=|>=]{number}`	EXAC:AES>=0.6
annot-transcription-flags	List of transcript annotation flags		CCDS, basic, cds_end_NF, mRNA_end_NF, cds_start_NF, mRNA_start_NF, seleno
annot-gene-trait-id	List of gene trait association ids		umls:C0007222 , OMIM:269600
annot-gene-trait-name	List of gene trait association names		Cardiovascular Diseases
annot-hpo	List of HPO terms.		HP:0000545
annot-go	List of GO (Gene Ontology) terms.		GO:0002020,GO:0006508
annot-protein-keywords	List of protein variant annotation keywords
annot-drug	List of drug names
annot-functional-score	Functional score, like cadd	`{functional_score}[<|>|<=|>=]{number}`	cadd_scaled>5.2 , cadd_raw<=0.3
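
The score formats in the table above share one shape: an optional source name, an operator, and either a number or a description. The Python sketch below parses that shape. All function names are hypothetical, and treating a description with no operator as an exact match (`=`) is an assumption of this sketch.

```python
import re

# Numeric form: one of <, >, <=, >= followed by a number.
_NUMERIC_RE = re.compile(r"^(<=|>=|<|>)\s*(-?\d+(?:\.\d+)?)$")
# Description form: optional ~= or = followed by free text.
_DESCRIPTION_RE = re.compile(r"^(~=|=)?\s*(.+)$")
# Keyed form: a score name, an operator, then the rest of the value.
_KEYED_RE = re.compile(r"^([A-Za-z_]\w*?)(<=|>=|~=|<|>|=)(.+)$")


def parse_score_value(text: str):
    """Parse '<=0.9' into ('<=', 0.9) or '=benign' into ('=', 'benign')."""
    text = text.strip()
    match = _NUMERIC_RE.match(text)
    if match:
        return match.group(1), float(match.group(2))
    match = _DESCRIPTION_RE.match(text)
    if match is None:
        raise ValueError("empty score value")
    # Assumption: a description without an operator means exact match.
    return match.group(1) or "=", match.group(2)


def parse_score_filter(text: str):
    """Parse a comma-separated list such as '<=0.9 , =benign'."""
    return [parse_score_value(part) for part in text.split(",") if part.strip()]


def parse_keyed_score(text: str):
    """Parse 'cadd_scaled>5.2' into ('cadd_scaled', '>', 5.2)."""
    match = _KEYED_RE.match(text.strip())
    if match is None:
        raise ValueError(f"malformed score filter: {text!r}")
    source = match.group(1)
    comparator, value = parse_score_value(match.group(2) + match.group(3))
    return source, comparator, value
```

`parse_score_filter` fits the polyphen and sift keys, while `parse_keyed_score` fits keys whose values name the score first, such as protein_substitution, conservation and annot-functional-score.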

GroupBy and Rank

Histogram
