
Overview

The main goal of indexing variant data into OpenCGA Storage is to be able to query and extract this data efficiently. There are several ways to access the data (via CLI, RESTful, Java API, Python API, ...) and multiple implementations of the VariantStorageManager (OpenCGA Storage MongoDB, OpenCGA Storage Hadoop, ...).

All these layers and implementations use the same specification, defined in this document.

A small set of READ-ONLY methods provides all the required functionality.

- Query: Returns all variants that match a given query.
- Count: Counts the results of a given query.
- GroupBy & Rank: Groups variants by some field and, optionally, ranks the groups by number of variants.
- Frequency: Groups variants by region and counts them. Useful for plotting histograms.
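
As a rough illustration of what the Frequency method computes, the sketch below bins variants by start position and counts them per bin. The function name `frequency`, the `interval` parameter and the tuple-based input are assumptions made for this example; they are not part of the OpenCGA API.

```python
from collections import Counter


def frequency(variants, interval):
    """Count variants per (chromosome, bin start) bucket.

    `variants` is an iterable of (chromosome, start) pairs and `interval`
    is the bin width in base pairs. Both are hypothetical inputs.
    """
    counts = Counter()
    for chromosome, start in variants:
        # Round the start position down to the beginning of its bin.
        bin_start = (start // interval) * interval
        counts[(chromosome, bin_start)] += 1
    return dict(counts)


variants = [("1", 150), ("1", 980), ("1", 1200), ("2", 30)]
histogram = frequency(variants, interval=1000)
```

Each key of `histogram` identifies one bar of the histogram, and its value is the height of that bar.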

Query filters

A filter is a <key>, <value> pair, where the keys are predefined and the values are defined by the user in a specific format. The following sections enumerate all these keys, explaining their effect and the required format of the value.

There are some general rules that are applied for every case:

Returned variants must match all the filters, except for the positional filters: a variant only needs to match at least one of the positional filters (if any).
When a filter accepts a list of values, they can be separated with:
- Comma , : Which will define an OR operation between the separated elements
- Semicolon ; : Which will define an AND operation between the separated elements

So, the query chromosome: 2,3 will return all variants in chromosomes 2 and 3, but chromosome: 2;3 will return an empty result, because there are no variants in chromosomes 2 and 3 at the same time.
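
The separator rules above can be sketched as a small parser. This is a minimal illustration, not the OpenCGA implementation: the function names are hypothetical, and rejecting values that mix both separators is an assumption of the sketch (structured values such as the genotype format are out of scope here).

```python
def split_filter_value(value):
    """Split a list-valued filter into (operator, values).

    A comma means OR and a semicolon means AND. Mixing both separators
    in one plain value list is rejected; that restriction is an
    assumption of this sketch.
    """
    has_or = "," in value
    has_and = ";" in value
    if has_or and has_and:
        raise ValueError("cannot mix ',' and ';' in one value: %r" % value)
    if has_and:
        return "AND", [v.strip() for v in value.split(";")]
    return "OR", [v.strip() for v in value.split(",")]


def matches(variant_value, filter_value):
    """Check a single-valued field (such as chromosome) against a filter."""
    operator, values = split_filter_value(filter_value)
    hits = [variant_value == v for v in values]
    return all(hits) if operator == "AND" else any(hits)
```

With these definitions, a variant in chromosome 2 matches `2,3` but not `2;3`, which mirrors the empty result described above.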

General filters

These general filters match fields from the VCF input files.

All these parameters are positive filters: the output contains the variants that match them.

Key	Description	Format
id	Matches with the ID field	List of values
region	Matches with the chromosome and start position	List of <chromosome>:<start>-<end>
chromosome	Matches with the chromosome	List of values
type	Matches with the type of the variant	List of values. Accepted values: [SNV, MNV, INDEL, SV, CNV]
reference	Matches with the reference	List of values
alternate	Matches with the alternate	List of values
study	Matches with variants that are in the specified studies	List of values. Accepts negations.
file		
genotype	Samples with specific genotypes.	{samp_1}:{gt_1}(,{gt_n}), with samples separated by ; (e.g. HG0097:0/0;HG0098:0/1,1/1)
sample	Filter variants where ALL the provided samples are mutated (HET or HOM_ALT)	List of samples.
filter	Specify the FILTER for any of the files. If the "file" filter is provided, will match the file and the filter.	List of values.
qual	(PENDING) Specify the QUAL for any of the files. If the "file" filter is provided, will match the file and the qual.	
info	(PENDING)	
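
To make the region and genotype formats concrete, here is a minimal parsing sketch. The function names are hypothetical, and the sketch only handles the full `<chromosome>:<start>-<end>` form; whether shorter region forms are accepted is not stated in this document.

```python
import re

REGION_RE = re.compile(r"^([^:]+):(\d+)-(\d+)$")


def parse_region(region):
    """Parse '<chromosome>:<start>-<end>' into (chromosome, start, end)."""
    match = REGION_RE.match(region)
    if match is None:
        raise ValueError("invalid region: %r" % region)
    chromosome, start, end = match.groups()
    return chromosome, int(start), int(end)


def parse_genotype_filter(value):
    """Parse '{sample}:{gt}(,{gt})' entries separated by ';'.

    Returns a dict mapping each sample name to its accepted genotypes.
    """
    result = {}
    for entry in value.split(";"):
        if not entry:
            continue  # tolerate a trailing ';'
        sample, _, genotypes = entry.partition(":")
        if not sample or not genotypes:
            raise ValueError("invalid genotype entry: %r" % entry)
        result[sample] = genotypes.split(",")
    return result
```

For the example in the table, `parse_genotype_filter("HG0097:0/0;HG0098:0/1,1/1")` yields one genotype for HG0097 and two alternatives for HG0098.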

Statistics filters

Apart from the data provided in the files, some statistics are calculated from the genotypes, or parsed from the INFO column if the input was an aggregated file.

These filters refer to the statistics of a specific study and cohort. Therefore, the format is the same for every filter: <study>:<cohort><comparator><value>, where the available comparators are: <, <=, >, >=, = and !=.

Key	Description
maf	Minor Allele Frequency
mgf	Minor Genotype Frequency
missingAlleles	Number of missing alleles
missingGenotypes	Number of missing genotypes
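
The `<study>:<cohort><comparator><value>` format can be parsed and evaluated as in the sketch below. The function names, and the study and cohort names `study1` and `ALL` used in the usage note, are hypothetical; numeric values are assumed.

```python
import operator
import re

COMPARATORS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "=": operator.eq,
    "!=": operator.ne,
}

# Two-character comparators are listed first so '<=' is not read as '<'.
STATS_RE = re.compile(r"^([^:]+):([^<>=!]+)(<=|>=|!=|<|>|=)(.+)$")


def parse_stats_filter(value):
    """Split a statistics filter into (study, cohort, comparator, number)."""
    match = STATS_RE.match(value)
    if match is None:
        raise ValueError("invalid statistics filter: %r" % value)
    study, cohort, comparator, number = match.groups()
    return study, cohort, comparator, float(number)


def passes(stat, value):
    """Check a computed statistic (e.g. a MAF) against a filter value."""
    _, _, comparator, threshold = parse_stats_filter(value)
    return COMPARATORS[comparator](stat, threshold)
```

For example, `passes(0.01, "study1:ALL<0.05")` checks whether a MAF of 0.01 in cohort ALL of study1 is below 0.05.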

Annotation filters

Key	Description	Format	Example
gene	List of genes	List of values. Accepts negations.
annotationExists		`true`/`false`
annot-ct	Consequence type SO term list.		SO:0000045,SO:0000046
annot-xref	External references
annot-biotype	List of biotypes
polyphen	Polyphen, protein substitution score.	`[<|>|<=|>=]{number}` or `[~=|=|]{description}`	<=0.9 , =benign
sift	Sift, protein substitution score.	`[<|>|<=|>=]{number}` or `[~=|=|]{description}`	>0.1 , ~=tolerant
proteinSubstitution	Protein substitution score	`{protein_score}[<|>|<=|>=]{number}` or `{protein_score}[~=|=|]{description}`	polyphen>0.1 , sift=tolerant
conservation	Conservation score. Phylop, phastCons or gerp.	`{conservation_score}[<|>|<=|>=]{number}`	phastCons>0.5 , phylop<0.1 , gerp>0.1
populationFrequencyAlt	Alternate Population Frequency	`{study}:{population}[<|>|<=|>=]{number}`	1000GENOMES_phase_3:AFR>0.2
populationFrequencyRef	Reference Population Frequency	`{study}:{population}[<|>|<=|>=]{number}`	ESP_6500:AA<0.2
annot-population-maf	Population minor allele frequency	`{study}:{population}[<|>|<=|>=]{number}`	EXAC:AES>=0.6
annot-transcription-flags	List of transcript annotation flags		CCDS, basic, cds_end_NF, mRNA_end_NF, cds_start_NF, mRNA_start_NF, seleno
annot-gene-trait-id	List of gene trait association ids		umls:C0007222 , OMIM:269600
annot-gene-trait-name	List of gene trait association names		Cardiovascular Diseases
annot-hpo	List of HPO terms.		HP:0000545
annot-go	List of GO (Gene Ontology) terms.		GO:0002020,GO:0006508
annot-protein-keywords	List of protein variant annotation keywords
annot-drug	List of drug names
annot-functional-score	Functional score, like cadd	`{functional_score}[<|>|<=|>=]{number}`	cadd_scaled>5.2 , cadd_raw<=0.3
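
The score-based formats in the table (protein substitution, conservation, functional score) share one shape: a score name, an operator, and a number or a description. The sketch below parses that shape; the function names are hypothetical and the handling of whitespace around commas is an assumption based on the examples in the table.

```python
import re

# Two-character operators are listed first so '<=' is not read as '<'.
SCORE_RE = re.compile(r"^(\w+)(<=|>=|~=|<|>|=)(.+)$")
NUMERIC_OPS = {"<", "<=", ">", ">="}


def parse_score_filter(value):
    """Parse '{score}{operator}{number or description}'."""
    match = SCORE_RE.match(value.strip())
    if match is None:
        raise ValueError("invalid score filter: %r" % value)
    source, op, operand = match.groups()
    if op in NUMERIC_OPS:
        # Ordering operators compare against a number.
        return source, op, float(operand)
    # '=' and '~=' compare against a textual description.
    return source, op, operand


def parse_score_list(value):
    """Parse a comma-separated list of score filters."""
    return [parse_score_filter(item) for item in value.split(",")]
```

For instance, `parse_score_list("polyphen>0.1 , sift=tolerant")` produces one numeric condition on polyphen and one description condition on sift.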

Modifiers

Modifiers control which variants, and which of their fields, are returned.

Key	Description	Format	Example
limit	Number of elements to return.	number	100
skip	Number of elements to skip.	number	100
sort	Sort variants by position	true,false	true
include	Fields from the Variant's model to be included in the response. See Variant Fields.	List of fields	chromosome,start,reference,alternate
exclude	Fields from the Variant's model to be excluded in the response. Ignored if "include" is present. See Variant Fields.	List of fields	type,studies.stats,annotation.geneDrugInteraction
summary	Selects a small number of fields to return. Ignored if "include" or "exclude" are present. See Variant Fields.	true,false	true
includeFormat	List of FORMAT names from Samples Data to include in the output. Accepts "all" and "none".		AD,DP
includeGenotype	Include genotypes, apart from the other formats defined with includeFormat. If "GT" is not provided in includeFormat or this parameter is false, genotypes won't be returned.	true,false
includeStudy	List of studies to be included in the result. Accepts "all" and "none".
includeFile	List of files to be included in the result. Accepts "all" and "none".
includeSample	List of samples to be included in the result. Accepts "all" and "none".
unknownGenotype	Returned genotype for unknown genotypes. Common values: [0/0, 0|0, ./.]
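
The precedence between include, exclude and summary, and the use of limit and skip for paging, can be sketched as follows. Both helper functions and the plain-dict representation of the options are assumptions made for illustration.

```python
def resolve_projection(options):
    """Return which field selector applies: include, then exclude, then summary."""
    include = options.get("include")
    exclude = options.get("exclude")
    if include:
        # "exclude" and "summary" are ignored when "include" is present.
        return "include", include.split(",")
    if exclude:
        # "summary" is ignored when "exclude" is present.
        return "exclude", exclude.split(",")
    if str(options.get("summary", "false")).lower() == "true":
        return "summary", []
    return "all", []


def page_options(page, page_size):
    """Build limit/skip values for a zero-based page number."""
    return {"limit": page_size, "skip": page * page_size}
```

For example, `page_options(2, 100)` asks for the third block of 100 variants by skipping the first 200.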

Variant Fields

The parameters include and exclude accept a list of Variant Fields. This is the list of all accepted values. Short aliases for some fields are listed after the field name, separated by |.

id
chromosome
start
end
reference
alternate
length
type
studies
- studies.samplesData | samples | samplesData
- studies.files | files
- studies.stats | stats
- studies.secondaryAlternates
- studies.studyId
annotation
- annotation.ancestralAllele
- annotation.id
- annotation.xrefs
- annotation.hgvs
- annotation.displayConsequenceType
- annotation.consequenceTypes
- annotation.populationFrequencies
- annotation.minorAllele
- annotation.minorAlleleFreq
- annotation.conservation
- annotation.geneExpression
- annotation.geneTraitAssociation
- annotation.geneDrugInteraction
- annotation.variantTraitAssociation
- annotation.functionalScore
- annotation.additionalAttributes
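
The short aliases listed above can be expanded to full field names before a query is sent. The sketch below is one hypothetical way to do it; the alias table is taken from the list above, while the function name and the de-duplication step are assumptions.

```python
# Aliases taken from the field list above.
FIELD_ALIASES = {
    "samples": "studies.samplesData",
    "samplesData": "studies.samplesData",
    "files": "studies.files",
    "stats": "studies.stats",
}


def normalize_fields(value):
    """Expand aliases in a comma-separated field list, dropping duplicates."""
    fields = []
    for name in value.split(","):
        name = name.strip()
        if not name:
            continue
        full_name = FIELD_ALIASES.get(name, name)
        if full_name not in fields:
            fields.append(full_name)
    return fields
```

The result keeps the order in which fields were first mentioned, so it can be joined with commas and passed as the include or exclude value.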

GroupBy and Rank

Histogram

