- Created by Jacobo Coll, last modified on Jan 13, 2017
Overview
The main goal of indexing variant data into OpenCGA Storage is to be able to query and extract this data efficiently. There are several ways to access the data (CLI, RESTful, Java API, Python API, ...) and multiple implementations of the VariantStorageManager (OpenCGA Storage MongoDB, OpenCGA Storage Hadoop, ...).
All these layers and implementations use the same specification, which is defined in this document.
A small set of READ-ONLY methods is defined to provide all the required functionality:
- Query: Returns all variants that match a given query.
- Count: Counts the results of a given query.
- GroupBy & Rank: Groups variants by some field and, optionally, ranks the groups by number of variants.
- Frequency: Groups variants by region and counts them. Useful for plotting histograms.
Query filters
A filter is a <key>, <value> pair, where the keys are predefined and the values are defined by the user, following a specific format. The next sections enumerate all these keys, explaining their effect and the required format of the value.
There are some general rules that apply in every case:
Returned variants must match all the filters, except for the positional filters: a variant only needs to match at least one of the positional filters (if any).
When a filter accepts a list of values, they can be separated with:
- Comma (,): defines an OR operation between the separated elements
- Semicolon (;): defines an AND operation between the separated elements
So, the query chromosome: 2,3 will return all variants in either chromosome 2 or chromosome 3, but chromosome: 2;3 will return an empty result, because no variant is in chromosomes 2 and 3 at the same time.
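The separator rule above can be illustrated with a minimal Python sketch. The function names `split_filter_values` and `matches` are hypothetical and not part of OpenCGA; the sketch only covers single-level lists, and rejecting a mix of both separators is an assumption made here for simplicity (nested formats such as the genotype filter use both).

```python
def split_filter_values(value):
    """Split a filter value into (operator, elements).

    Returns ("OR", [...]) for comma-separated values, ("AND", [...]) for
    semicolon-separated values, and (None, [value]) for a single value.
    """
    has_or = "," in value
    has_and = ";" in value
    if has_or and has_and:
        # Assumption of this sketch: single-level lists only.
        raise ValueError("cannot mix ',' and ';' in one value: %r" % value)
    if has_and:
        return "AND", value.split(";")
    if has_or:
        return "OR", value.split(",")
    return None, [value]


def matches(variant_chromosome, value):
    """Check one variant's chromosome against a chromosome filter value."""
    operator_name, elements = split_filter_values(value)
    checks = [variant_chromosome == element for element in elements]
    # AND requires every element to match; OR (or a single value) needs one.
    return all(checks) if operator_name == "AND" else any(checks)
```

With this sketch, `matches("2", "2,3")` is true, while `matches("2", "2;3")` is false for any single chromosome, which mirrors the empty result described above.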
General filters
These general filters match fields from the VCF input files.
Key | Description | Format |
---|---|---|
ids | Matches with the ID field | List of values |
region | Matches with the chromosome and start position | List of <chromosome>:<start>-<end> |
chromosome | Matches with the chromosome | List of values |
type | Matches with the type of the variant | List of values. Accepted values: [SNV, MNV, INDEL, SV, CNV] |
reference | Matches with the reference | List of values |
alternate | Matches with the alternate | List of values |
studies | Matches with variants that are in the specified studies | List of values. Accept negations. |
files | ||
genotype | | {samp_1}:{gt_1}(,{gt_n}); Example: HG0097:0/0;HG0098:0/1,1/1 |
qual | PENDING | |
filter | PENDING | |
info | PENDING |
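As an illustration of the value formats in the table, the following Python sketch parses a single region (<chromosome>:<start>-<end>) and a genotype filter ({samp_1}:{gt_1}(,{gt_n});). The helper names are hypothetical, and the sketch assumes that chromosome names contain no colon and that a trailing semicolon in the genotype filter is tolerated; neither assumption is stated by the specification.

```python
import re

# Assumption: the chromosome name itself contains no ':' character.
REGION_PATTERN = re.compile(r"^([^:]+):(\d+)-(\d+)$")


def parse_region(text):
    """Parse '<chromosome>:<start>-<end>' into (chromosome, start, end)."""
    match = REGION_PATTERN.match(text.strip())
    if match is None:
        raise ValueError("malformed region: %r" % text)
    chromosome, start, end = match.groups()
    return chromosome, int(start), int(end)


def parse_genotype_filter(text):
    """Parse 'SAMPLE:gt(,gt)*;SAMPLE:gt...' into {sample: [genotypes]}."""
    genotypes_by_sample = {}
    for entry in text.split(";"):
        if not entry:
            continue  # tolerate a trailing ';' (assumption of this sketch)
        sample, separator, genotypes = entry.partition(":")
        if not separator or not sample or not genotypes:
            raise ValueError("malformed genotype entry: %r" % entry)
        genotypes_by_sample[sample] = genotypes.split(",")
    return genotypes_by_sample
```

For the example value in the table, `parse_genotype_filter("HG0097:0/0;HG0098:0/1,1/1")` yields one genotype for HG0097 and two alternatives for HG0098.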
Modifiers:
Key | Description | Format |
---|---|---|
limit | ||
skip | ||
sort | ||
include | ||
exclude | ||
returnedStudies | ||
returnedFiles | ||
returnedSamples | ||
unknownGenotype |
Statistics filters
Apart from the data provided in the files, some statistics are calculated from the genotypes, or parsed from the INFO column if the input was an aggregated file.
These filters relate to the statistics of a specific study and cohort. The format is therefore the same for every filter: <study>:<cohort><comparator><value>, where the available comparators are <, <=, >, >=, = and !=.
Key | Description |
---|---|
maf | Minor Allele Frequency |
mgf | Minor Genotype Frequency |
missingAlleles | Number of missing alleles |
missingGenotypes | Number of missing genotypes |
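The <study>:<cohort><comparator><value> format can be parsed and evaluated as in the Python sketch below. The names `parse_stats_filter` and `passes`, the study name `study1` and the cohort name `ALL` are all hypothetical. The sketch also assumes that the value is numeric and that cohort names contain none of the comparator characters.

```python
import operator
import re

# The six comparators listed in the specification.
COMPARATORS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "=": operator.eq,
    "!=": operator.ne,
}

# Two-character comparators come first so '<=' is not read as '<' plus '=5'.
STATS_PATTERN = re.compile(r"^([^:]+):([^<>=!]+)(<=|>=|!=|<|>|=)(.+)$")


def parse_stats_filter(text):
    """Parse '<study>:<cohort><comparator><value>' into its four parts."""
    match = STATS_PATTERN.match(text)
    if match is None:
        raise ValueError("malformed statistics filter: %r" % text)
    study, cohort, comparator, value = match.groups()
    return study, cohort, comparator, float(value)


def passes(stat_value, text):
    """Check a precomputed statistic (e.g. a MAF) against a filter value."""
    _, _, comparator, threshold = parse_stats_filter(text)
    return COMPARATORS[comparator](stat_value, threshold)
```

For example, a maf filter with the value `study1:ALL<=0.05` would keep a variant whose minor allele frequency in cohort ALL of study1 is 0.01 and discard one whose frequency is 0.2.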
Annotation filters
Key | Description | Format | Example |
---|---|---|---|
gene | List of genes | List of values. Accept negations. | |
annotationExists | | true / false | |
annot-ct | Consequence type SO term list. | | SO:0000045,SO:0000046 |
annot-xref | External references | ||
annot-biotype | List of biotypes | ||
polyphen | Polyphen, protein substitution score. | [<|>|<=|>=]{number} or [~=|=|]{description} | <=0.9 , =benign |
sift | Sift, protein substitution score. | [<|>|<=|>=]{number} or [~=|=|]{description} | >0.1 , ~=tolerant |
protein_substitution | Protein substitution score | {protein_score}[<|>|<=|>=]{number} or {protein_score}[~=|=|]{description} | polyphen>0.1 , sift=tolerant |
conservation | Conservation score. Phylop, phastCons or gerp. | {conservation_score}[<|>|<=|>=]{number} | phastCons>0.5 , phylop<0.1 , gerp>0.1 |
alternate_frequency | Alternate Population Frequency | {study}:{population}[<|>|<=|>=]{number} | 1000GENOMES_phase_3:AFR>0.2 |
reference_frequency | Reference Population Frequency | {study}:{population}[<|>|<=|>=]{number} | ESP_6500:AA<0.2 |
annot-population-maf | Population minor allele frequency | {study}:{population}[<|>|<=|>=]{number} | EXAC:AES>=0.6 |
annot-transcription-flags | List of transcript annotation flags | CCDS, basic, cds_end_NF, mRNA_end_NF, cds_start_NF, mRNA_start_NF, seleno | |
annot-gene-trait-id | List of gene trait association ids | | umls:C0007222 , OMIM:269600 |
annot-gene-trait-name | List of gene trait association names | | Cardiovascular Diseases |
annot-hpo | List of HPO terms. | | HP:0000545 |
annot-go | List of GO (Gene Ontology) terms. | | GO:0002020,GO:0006508 |
annot-protein-keywords | List of protein variant annotation keywords | ||
annot-drug | List of drug names | ||
annot-functional-score | Functional score, like cadd | {functional_score}[<|>|<=|>=]{number} | cadd_scaled>5.2 , cadd_raw<=0.3 |
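Since every filter is a plain key/value pair, a query that mixes general and annotation filters can be modelled as a simple map. The Python sketch below uses a hypothetical helper, `build_query`, to show this; it is not an OpenCGA API. It joins list values with a comma (OR) and leaves already formatted strings untouched. The keys and example values are taken from the tables in this document.

```python
def build_query(filters):
    """Return a dict of filter key -> value string.

    List or tuple values are joined with ',' (OR semantics); any other
    value is converted to a string unchanged.
    """
    query = {}
    for key, value in filters.items():
        if isinstance(value, (list, tuple)):
            value = ",".join(str(item) for item in value)
        query[key] = str(value)
    return query


query = build_query({
    "chromosome": ["2", "3"],
    "type": "SNV",
    "annot-ct": ["SO:0000045", "SO:0000046"],
    "sift": ">0.1",
    "alternate_frequency": "1000GENOMES_phase_3:AFR>0.2",
})
```

A variant is returned only if it matches every key in the resulting map, subject to the positional-filter rule described earlier.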
GroupBy and Rank
Histogram