Overview

The main goal of indexing variant data into OpenCGA Storage is to be able to query and extract this data efficiently. There are several ways to access the data (CLI, RESTful web services, Java API, Python API, ...) and multiple implementations of the VariantStorageManager (OpenCGA Storage MongoDB, OpenCGA Storage Hadoop, ...).

All these layers and implementations use the same specification, defined in this document.

A small set of READ-ONLY methods provides all the required functionality.

  • Query: Returns all variants that match a given query.
  • Count: Counts the results of a given query.
  • GroupBy & Rank: Groups variants by some field and, optionally, ranks the groups by number of variants.
  • Frequency: Groups variants by region and counts them. Useful to plot histograms.
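To make the intent of the four methods concrete, here is a minimal in-memory sketch in Python. It is not the OpenCGA implementation: the variant dictionaries, the function names and the `interval` window size are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical in-memory variants; real data would come from OpenCGA Storage.
variants = [
    {"chromosome": "1", "start": 1200, "type": "SNV"},
    {"chromosome": "1", "start": 1800, "type": "INDEL"},
    {"chromosome": "1", "start": 5100, "type": "SNV"},
    {"chromosome": "2", "start": 300, "type": "SNV"},
]


def query(variants, predicate):
    """Query: return every variant that satisfies the predicate."""
    return [v for v in variants if predicate(v)]


def count(variants, predicate):
    """Count: number of variants that satisfy the predicate."""
    return len(query(variants, predicate))


def group_by(variants, field, rank=False):
    """GroupBy: count variants per value of `field`.

    With rank=True the groups are ordered by number of variants, descending.
    """
    counts = Counter(v[field] for v in variants)
    if rank:
        return counts.most_common()
    return sorted(counts.items())


def frequency(variants, chromosome, interval):
    """Frequency: count variants per fixed-size window of one chromosome.

    The window size `interval` is an assumption of this sketch.
    """
    counts = Counter(
        (v["start"] // interval) * interval
        for v in variants
        if v["chromosome"] == chromosome
    )
    return sorted(counts.items())
```

The output of `frequency` is a list of (window start, count) pairs, which is the shape a histogram plot would typically consume.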

Query filters

A filter is a <key>, <value> pair, where the keys are predefined and the values are defined by the user, following a specific format. The next sections enumerate all these keys, explaining their effect and the required format of the value.

There are some general rules that are applied for every case:

  1. Returned variants must match all the filters, except for the positional filters: a variant only needs to match at least one positional filter (if any are given).

  2. When a filter accepts a list of values, they can be separated with:

    • Comma , : Defines an OR operation between the separated elements
    • Semicolon ; : Defines an AND operation between the separated elements

So, the query chromosome: 2,3 will return all variants in chromosome 2 or chromosome 3, but chromosome: 2;3 will return an empty result, because no variant is in chromosomes 2 and 3 at the same time.
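The separator rule can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and rejecting values that mix both separators is an assumption of this sketch (some filters, such as genotype, combine them with their own format).

```python
def parse_list(value):
    """Split a filter value into (operator, elements).

    A comma means OR and a semicolon means AND. A value without any
    separator is treated as an OR over a single element.
    """
    has_or = "," in value
    has_and = ";" in value
    if has_or and has_and:
        # Assumption: plain list filters do not mix both separators.
        raise ValueError("mixed ',' and ';' in: " + value)
    if has_and:
        return "AND", [element.strip() for element in value.split(";")]
    return "OR", [element.strip() for element in value.split(",")]


def matches(field_value, filter_value):
    """Check a single-valued field (e.g. chromosome) against a list filter."""
    operator, elements = parse_list(filter_value)
    hits = [field_value == element for element in elements]
    return all(hits) if operator == "AND" else any(hits)
```

With this sketch, `matches("2", "2,3")` is true, while `matches("2", "2;3")` is false for any single chromosome value, which mirrors the empty result described above.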

General filters

These general filters match fields from the VCF input files.

All these parameters are positive filters: the output will contain the variants that match them.

  • ids: Matches the ID field.
    • Format: List of values.
  • region: Matches the chromosome and start position.
    • Format: List of <chromosome>:<start>-<end>
  • chromosome: Matches the chromosome.
    • Format: List of values.
  • type: Matches the type of the variant.
    • Format: List of values. Accepted values: [SNV, MNV, INDEL, SV, CNV]
  • reference: Matches the reference.
    • Format: List of values.
  • alternate: Matches the alternate.
    • Format: List of values.
  • studies: Matches variants that are in the specified studies.
    • Format: List of values. Accepts negations.
  • files
  • genotype: Samples with specific genotypes.
    • Format: {samp_1}:{gt_1}(,{gt_n}); e.g. HG0097:0/0;HG0098:0/1,1/1
  • samples: Filter variants where ALL the provided samples are mutated (HET or HOM_ALT).
    • Format: List of samples.
  • filter: Specify the FILTER for any of the files. If the "files" filter is provided, it will match the file and the filter.
    • Format: List of values.
  • qual: PENDING
  • info: PENDING
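As an illustration of the two structured formats above (region and genotype), the following Python sketch parses them into plain data structures. The function names are hypothetical, and treating the region bounds as inclusive is an assumption; the source only states that the region filter matches the chromosome and start position.

```python
import re

# <chromosome>:<start>-<end>
REGION_RE = re.compile(r"^(?P<chromosome>[^:]+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_region(region):
    """Parse '<chromosome>:<start>-<end>' into (chromosome, start, end)."""
    match = REGION_RE.match(region.strip())
    if match is None:
        raise ValueError("malformed region: " + region)
    return match["chromosome"], int(match["start"]), int(match["end"])


def in_any_region(chromosome, start, regions):
    """Positional check: the variant must fall in at least one region.

    `regions` is a comma-separated list; bounds are assumed inclusive.
    """
    for region in regions.split(","):
        chrom, low, high = parse_region(region)
        if chrom == chromosome and low <= start <= high:
            return True
    return False


def parse_genotype_filter(value):
    """Parse 'HG0097:0/0;HG0098:0/1,1/1' into a sample -> genotypes mapping."""
    result = {}
    for entry in value.split(";"):
        sample, _, genotypes = entry.partition(":")
        if not sample or not genotypes:
            raise ValueError("malformed genotype entry: " + entry)
        result[sample] = genotypes.split(",")
    return result
```

For the example value from the genotype entry, `parse_genotype_filter` yields one list of accepted genotypes per sample, which keeps the "any of these genotypes for this sample, and all samples" structure explicit.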



Statistics filters

Apart from the data provided in the files, there are some statistics calculated from the genotypes, or parsed from the INFO column if the input was an aggregated file.

These filters refer to the statistics of a specific study and cohort. For that reason, the format is the same for every filter: <study>:<cohort><comparator><value>, where the available comparators are: <, <=, >, >=, = and !=.

  • maf: Minor Allele Frequency
  • mgf: Minor Genotype Frequency
  • missingAlleles: Number of missing alleles
  • missingGenotypes: Number of missing genotypes
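The <study>:<cohort><comparator><value> format can be parsed with a single regular expression. The sketch below is an illustration only: the function names and the sample study and cohort names are hypothetical, and it assumes the value is always numeric.

```python
import operator
import re

# The six comparators listed in the specification.
COMPARATORS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "=": operator.eq,
    "!=": operator.ne,
}

# Two-character comparators come first so "<=" is not read as "<".
STATS_FILTER_RE = re.compile(
    r"^(?P<study>[^:]+):(?P<cohort>[^<>=!]+)"
    r"(?P<comparator><=|>=|!=|<|>|=)(?P<value>.+)$"
)


def parse_stats_filter(text):
    """Parse '<study>:<cohort><comparator><value>' into a 4-tuple."""
    match = STATS_FILTER_RE.match(text.strip())
    if match is None:
        raise ValueError("malformed statistics filter: " + text)
    return (
        match["study"],
        match["cohort"],
        match["comparator"],
        float(match["value"]),  # assumption: the value is numeric
    )


def stats_filter_matches(text, observed):
    """Compare the statistic stored for that study and cohort to the filter."""
    _, _, comparator, threshold = parse_stats_filter(text)
    return COMPARATORS[comparator](observed, threshold)
```

For example, a maf value of `study1:ALL<0.05` (hypothetical study and cohort names) would keep a variant whose minor allele frequency in that cohort is 0.01 and discard one with 0.2.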
Annotation filters

  • gene: List of genes.
    • Format: List of values. Accepts negations.
  • annotationExists
    • Format: true/false
  • annot-ct: Consequence type SO term list.
    • Example: SO:0000045,SO:0000046
  • annot-xref: External references.
  • annot-biotype: List of biotypes.
  • polyphen: Polyphen, protein substitution score.
    • Format: [<|>|<=|>=]{number} or [~=|=|]{description}
    • Example: <=0.9 , =benign
  • sift: Sift, protein substitution score.
    • Format: [<|>|<=|>=]{number} or [~=|=|]{description}
    • Example: >0.1 , ~=tolerant
  • protein_substitution: Protein substitution score.
    • Format: {protein_score}[<|>|<=|>=]{number} or {protein_score}[~=|=|]{description}
    • Example: polyphen>0.1 , sift=tolerant
  • conservation: Conservation score. Phylop, phastCons or gerp.
    • Format: {conservation_score}[<|>|<=|>=]{number}
    • Example: phastCons>0.5 , phylop<0.1 , gerp>0.1
  • alternate_frequency: Alternate Population Frequency.
    • Format: {study}:{population}[<|>|<=|>=]{number}
    • Example: 1000GENOMES_phase_3:AFR>0.2
  • reference_frequency: Reference Population Frequency.
    • Format: {study}:{population}[<|>|<=|>=]{number}
    • Example: ESP_6500:AA<0.2
  • annot-population-maf: Population minor allele frequency.
    • Format: {study}:{population}[<|>|<=|>=]{number}
    • Example: EXAC:AES>=0.6
  • annot-transcription-flags: List of transcript annotation flags.
    • Example: CCDS, basic, cds_end_NF, mRNA_end_NF, cds_start_NF, mRNA_start_NF, seleno
  • annot-gene-trait-id: List of gene trait association ids.
    • Example: umls:C0007222 , OMIM:269600
  • annot-gene-trait-name: List of gene trait association names.
    • Example: Cardiovascular Diseases
  • annot-hpo: List of HPO terms.
    • Example: HP:0000545
  • annot-go: List of GO (Gene Ontology) terms.
    • Example: GO:0002020,GO:0006508
  • annot-protein-keywords: List of protein variant annotation keywords.
  • annot-drug: List of drug names.
  • annot-functional-score: Functional score, like cadd.
    • Format: {functional_score}[<|>|<=|>=]{number}
    • Example: cadd_scaled>5.2 , cadd_raw<=0.3
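The score filters above share one shape: an optional score name, an operator, and either a number or a description. A minimal Python sketch of a parser for that shape follows. The function names are hypothetical, the sketch only parses and does not evaluate the filter (the exact meaning of ~= is not defined in this document), and it does not handle a bare description without an operator.

```python
import re

NUMERIC_OPERATORS = {"<", ">", "<=", ">="}

# Optional score name (e.g. polyphen, cadd_scaled), then operator, then operand.
SCORE_FILTER_RE = re.compile(
    r"^(?P<source>[A-Za-z_]\w*)?(?P<operator><=|>=|~=|<|>|=)(?P<operand>.+)$"
)


def parse_score_filter(text):
    """Parse one score filter into (source, operator, operand).

    `source` is None when the key already names the score, as in the
    polyphen and sift filters. Numeric operators get a float operand;
    '=' and '~=' keep the description as a string.
    """
    match = SCORE_FILTER_RE.match(text.strip())
    if match is None:
        raise ValueError("malformed score filter: " + text)
    operator = match["operator"]
    operand = match["operand"].strip()
    if operator in NUMERIC_OPERATORS:
        operand = float(operand)
    return match["source"], operator, operand


def parse_score_filters(value):
    """Parse a comma-separated list such as 'polyphen>0.1 , sift=tolerant'."""
    return [parse_score_filter(part) for part in value.split(",")]
```

Keeping the operand typed (float for numeric comparisons, string for descriptions) lets a storage implementation decide separately how to translate each case into its own query language.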
Modifiers

Modifiers do not filter variants; they control how the results are returned (paging, sorting, and which fields are included).

  • limit: Number of elements to return.
    • Format: number
    • Example: 100
  • skip: Number of elements to skip.
    • Format: number
    • Example: 100
  • sort: Sort variants by position.
    • Format: true,false
    • Example: true
  • include: Fields from the Variant's model to be included in the response. See Variant Fields.
    • Format: List of fields
    • Example: chromosome,start,reference,alternate
  • exclude: Fields from the Variant's model to be excluded from the response. Ignored if "include" is present. See Variant Fields.
    • Format: List of fields
    • Example: type,studies.stats,annotation.geneDrugInteraction
  • summary: Selects a small set of fields to return. Ignored if "include" or "exclude" are present. See Variant Fields.
    • Format: true,false
  • include-format: List of FORMAT names from Samples Data to include in the output. Accepts "all" and "none".
    • Example: AD,DP
  • include-genotype: Include genotypes, apart from other formats defined with include-format. If "GT" is not provided in "include-format" or this parameter is false, genotypes won't be returned.
    • Format: true,false
  • returnedStudies
  • returnedFiles
  • returnedSamples
  • unknownGenotype: Returned genotype for unknown genotypes. Common values: [0/0, 0|0, ./.]
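The precedence between include, exclude and summary, together with skip and limit, can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, only top-level fields are handled, and the set of fields returned by summary is not defined in this document, so `summary_fields` below is a placeholder.

```python
# Top-level fields from the Variant Fields list.
TOP_LEVEL_FIELDS = [
    "id", "chromosome", "start", "end", "reference",
    "alternate", "length", "type", "studies", "annotation",
]


def resolve_fields(all_fields, include=None, exclude=None, summary=False,
                   summary_fields=("chromosome", "start", "reference", "alternate")):
    """Apply the documented precedence: include > exclude > summary.

    `summary_fields` is a placeholder; the real summary set is not
    specified here.
    """
    if include:
        wanted = set(include.split(","))
        return [field for field in all_fields if field in wanted]
    if exclude:
        unwanted = set(exclude.split(","))
        return [field for field in all_fields if field not in unwanted]
    if summary:
        return [field for field in all_fields if field in summary_fields]
    return list(all_fields)


def page(results, skip=0, limit=None):
    """Skip the first `skip` elements, then return at most `limit` elements."""
    end = None if limit is None else skip + limit
    return results[skip:end]
```

For instance, `resolve_fields(TOP_LEVEL_FIELDS, include="chromosome,start", exclude="type")` ignores the exclude list entirely, as the exclude entry above requires.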

Variant Fields

The include and exclude parameters accept a list of Variant Fields. The accepted values are listed below. Short aliases for some fields are shown after the field name, separated by |.

  • id
  • chromosome
  • start
  • end
  • reference
  • alternate
  • length
  • type
  • studies
    • studies.samplesData | samples | samplesData
    • studies.files | files
    • studies.stats | stats
    • studies.secondaryAlternates
    • studies.studyId

  • annotation
    • annotation.ancestralAllele
    • annotation.id
    • annotation.xrefs
    • annotation.hgvs
    • annotation.displayConsequenceType
    • annotation.consequenceTypes
    • annotation.populationFrequencies
    • annotation.minorAllele
    • annotation.minorAlleleFreq
    • annotation.conservation
    • annotation.geneExpression
    • annotation.geneTraitAssociation
    • annotation.geneDrugInteraction
    • annotation.variantTraitAssociation
    • annotation.functionalScore
    • annotation.additionalAttributes
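Since include and exclude accept both full field names and their short aliases, an implementation will probably normalize them first. The following Python sketch shows one way to do that; the function name is hypothetical, and only the aliases listed above are covered.

```python
# Alias -> canonical field name, taken from the list above.
FIELD_ALIASES = {
    "samples": "studies.samplesData",
    "samplesData": "studies.samplesData",
    "files": "studies.files",
    "stats": "studies.stats",
}


def normalize_fields(value):
    """Turn a comma-separated field list into canonical names.

    Aliases are expanded, surrounding whitespace is dropped, and
    duplicates are removed while keeping the first occurrence order.
    """
    normalized = []
    for name in value.split(","):
        name = name.strip()
        if not name:
            continue
        canonical = FIELD_ALIASES.get(name, name)
        if canonical not in normalized:
            normalized.append(canonical)
    return normalized
```

After normalization, `samples,stats,chromosome` and `studies.samplesData,studies.stats,chromosome` describe the same projection.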


GroupBy and Rank


Histogram
