Overview
The main goal of indexing variant data into OpenCGA Storage is to be able to query and extract this data efficiently. There are several ways to access the data (CLI, RESTful web services, Java API, Python API, ...) and multiple implementations of the VariantStorageManager (OpenCGA Storage MongoDB, OpenCGA Storage Hadoop, ...).
All these layers and implementations use the same specification, which is defined in this document.
A small set of READ-ONLY methods provides all the required functionality:
- Query: Return all variants that match a given query
- Count: Count the results of a given query
- Aggregation Stats: Group variants by some field. For example, you can count variants by region with this query.
Query Parameters
A filter is a <key>, <value> pair, where the keys are predefined and the values are defined by the user, using a specific format. The next sections enumerate all these keys, explaining their effect and the required format of the value.
Some general rules apply in every case:
Returned variants must match all the filters, except for the positional filters. If the query contains positional filters, a variant only needs to match at least one of them.
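The matching rule above can be sketched as a small predicate. This is only an illustration: the function name `variant_matches` and the idea of passing already evaluated boolean results are assumptions made for the example, not part of OpenCGA.

```python
def variant_matches(positional_results, other_results):
    """Combine per-filter results following the general matching rule.

    positional_results: one boolean per positional filter in the query.
    other_results: one boolean per non-positional filter in the query.
    Both names are hypothetical and exist only for this sketch.
    """
    # Positional filters are OR-ed together; if the query has none,
    # they impose no constraint at all.
    positional_ok = any(positional_results) if positional_results else True
    # Every other filter must match.
    return positional_ok and all(other_results)
```

For instance, `variant_matches([False, True], [True, True])` accepts the variant, while `variant_matches([False, False], [True])` rejects it.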
When a filter accepts a list of values, they can be separated with:
- Comma (,): defines an OR operation between the separated elements
- Semicolon (;): defines an AND operation between the separated elements
So, the query chromosome: 2,3 will return all variants in chromosome 2 or in chromosome 3, but chromosome: 2;3 will return an empty result, because no variant is in chromosomes 2 and 3 at the same time.
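As a rough illustration of the separator rule, the following sketch splits a plain list value into its logical operator and elements. The helper name `split_filter_values` is hypothetical, and the sketch deliberately rejects values that mix both separators (such as nested genotype filters) instead of guessing their precedence.

```python
def split_filter_values(value):
    """Split a list value into (operator, elements).

    ',' means OR and ';' means AND. A value without separators is
    returned with operator None. Hypothetical helper for illustration.
    """
    has_or = "," in value
    has_and = ";" in value
    if has_or and has_and:
        # Nested lists need a filter-specific parser; out of scope here.
        raise ValueError("mixed ',' and ';' are not handled by this sketch")
    if has_and:
        return "AND", value.split(";")
    if has_or:
        return "OR", value.split(",")
    return None, [value]
```

With this helper, `split_filter_values("2,3")` yields an OR over `["2", "3"]` and `split_filter_values("2;3")` yields an AND over the same elements.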
Genomic Parameters
These general filters match against fields from the VCF input files.
All these parameters are positive filters: the output will contain the variants that match them.
Parameter | Description | Example |
---|---|---|
id | List of IDs, these can be rs IDs (dbSNP) or variants in the format chrom:start:ref:alt | rs116600158 19:7177679:C:T |
region | List of regions, these can be just a single chromosome name or regions in the format <chromosome>:<start>-<end> | chr22 3:100000-200000 |
type | List of types, accepted values are SNV, MNV, INDEL, SV, CNV, INSERTION, DELETION | SNV,INDEL |
project | Project [user@]project where project can be either the ID or the alias | |
study | Filter variants from the given studies, these can be either the numeric ID or the alias with the format user@project:study | |
sample | Filter variants where the samples contain the variant (HET or HOM_ALT). Accepts AND ( ; ) and OR ( , ) operators. This will automatically set 'includeSample' parameter when not provided | HG0097,HG00978 |
sampleAnnotation | Selects some samples using metadata information from Catalog. | age>20;phenotype=hpo:123,hpo:456;name=smith |
genotype | Samples with a specific genotype: {samp_1}:{gt_1}(,{gt_n})*(;{samp_n}:{gt_1}(,{gt_n})*)* Unphased genotypes (e.g. 0/1, 1/1) will also include phased genotypes (e.g. 0|1, 1|0, 1|1), but not vice versa. Genotype aliases accepted: HOM_REF, HOM_ALT, HET, HET_REF, HET_ALT and MISS. This will automatically set 'includeSample' parameter when not provided | HG0097:0/0;HG0098:0/1,1/1 HG0097:HOM_REF;HG0098:HET_REF,HOM_ALT |
format | Filter by any FORMAT field from samples. [{sample}:]{key}{op}{value}[,;]* If no sample is specified, will use all samples from "sample" or "genotype" filter. Many FORMAT fields can be combined. | DP>200 HG0097:DP>200,HG0098:DP<10 HG0097:DP>200;GT=1/1,0/1,HG0098:DP<10 |
file | Filter variants from the files specified. This will set includeFile parameter when not provided | |
info | Filter by INFO attributes from file. [{file}:]{key}{op}{value}[,;]* If no file is specified, will use all files from "file" filter. Many INFO fields can be combined. | AN>200 file_1.vcf:AN>200;file_2.vcf:AN<10 file_1.vcf:AN>200;DB=true;file_2.vcf:AN<10 |
filter | Specify the FILTER for any of the files. If 'file' filter is provided, will match the file and the filter. | PASS,LowGQX |
qual | Specify the QUAL for any of the files. If 'file' filter is provided, will match the file and the qual. | >123.4 |
cohort | Select variants with calculated stats for the selected cohorts | |
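To make the genotype value format more concrete, here is a minimal sketch that parses a value of the form {samp_1}:{gt_1}(,{gt_n})*(;{samp_n}:{gt_1}(,{gt_n})*)* into a dictionary. The function name `parse_genotype_filter` is an assumption for this example, and the sketch does not validate genotypes or aliases.

```python
def parse_genotype_filter(value):
    """Parse a genotype filter value into {sample: [genotypes]}.

    Samples are separated by ';' and each sample lists its accepted
    genotypes (or aliases such as HOM_REF) separated by ','.
    Hypothetical helper, not an OpenCGA function.
    """
    result = {}
    for part in value.split(";"):
        # Split only on the first ':' so the genotype list stays intact.
        sample, sep, genotypes = part.partition(":")
        if not sep or not sample or not genotypes:
            raise ValueError(f"malformed genotype filter: {part!r}")
        result[sample] = genotypes.split(",")
    return result
```

Applied to the example value `HG0097:0/0;HG0098:0/1,1/1`, the sketch maps HG0097 to the single genotype 0/0 and HG0098 to the two genotypes 0/1 and 1/1.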
Sample and Family Parameters
Parameter | Description | Example |
---|---|---|
family | Filter variants where any of the samples from the given family contains the variant (HET or HOM_ALT) | |
familyMembers | Subset of the members of a given family | |
familyDisorder | Specify the disorder to use for the family segregation | |
familySegregation | Filter by mode of inheritance from a given family. Accepted values: [ monoallelic, monoallelicIncompletePenetrance, biallelic, biallelicIncompletePenetrance, XlinkedBiallelic, XlinkedMonoallelic, Ylinked, MendelianError, DeNovo, CompoundHeterozygous ] | |
Cohort Stats Parameters
Apart from the data provided in the files, some statistics are calculated from the genotypes, or parsed from the INFO column if the input was an aggregated file.
These filters refer to the statistics of a specific study and cohort, so the format is the same for each filter: <study>:<cohort><comparator><value>, where the available comparators are <, <=, >, >=, = and !=.
Parameter | Description | Example |
---|---|---|
cohortStatsRef | Reference Allele Frequency: [{study}:]{cohort}[<|>|<=|>=]{number}. | ALL>0.6 |
cohortStatsAlt | Alternate Allele Frequency: [{study}:]{cohort}[<|>|<=|>=]{number}. | ALL<=0.4 |
cohortStatsMaf | Minor Allele Frequency: [{study}:]{cohort}[<|>|<=|>=]{number}. | study:ALL<0.01 |
cohortStatsMgf | Minor Genotype Frequency: [{study}:]{cohort}[<|>|<=|>=]{number}. | COH1<0.1,COH2<0.3 |
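A cohort stats value can be broken down into its parts with a regular expression. The sketch below assumes that study and cohort names contain none of the characters `:`, `<`, `>`, `=` or `!`, and that the number is a plain decimal; the name `parse_cohort_stats_filter` is hypothetical.

```python
import re

# Optional "study:" prefix, cohort name, comparator, decimal number.
# Two-character comparators come first so "<=" is not read as "<".
_COHORT_FILTER = re.compile(
    r"^(?:(?P<study>[^:<>=!]+):)?"
    r"(?P<cohort>[^:<>=!]+)"
    r"(?P<op><=|>=|!=|<|>|=)"
    r"(?P<value>\d+(?:\.\d+)?)$"
)


def parse_cohort_stats_filter(expression):
    """Return (study, cohort, comparator, value); study may be None."""
    match = _COHORT_FILTER.match(expression)
    if match is None:
        raise ValueError(f"malformed cohort stats filter: {expression!r}")
    return (match["study"], match["cohort"], match["op"], float(match["value"]))
```

A list value such as `COH1<0.1,COH2<0.3` would first be split on the comma, and each element parsed separately.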
Variant Annotation Parameters
Key | Description | Format | Example |
---|---|---|---|
gene | List of genes | List of values. Accepts negations. | |
annotationExists | | true/false | |
ct | Consequence type SO term list. | | SO:0000045,SO:0000046 |
xref | External references | | |
biotype | List of biotypes | | |
proteinSubstitution | Protein substitution score | {protein_score}[<|>|<=|>=]{number} or {protein_score}[~=|=]{description} | polyphen>0.1 , sift=tolerant |
conservation | Conservation score. Phylop, phastCons or gerp. | {conservation_score}[<|>|<=|>=]{number} | phastCons>0.5 , phylop<0.1 , gerp>0.1 |
populationFrequencyAlt | Alternate Population Frequency | {study}:{population}[<|>|<=|>=]{number} | 1000GENOMES_phase_3:AFR>0.2 |
populationFrequencyRef | Reference Population Frequency | {study}:{population}[<|>|<=|>=]{number} | ESP_6500:AA<0.2 |
populationFrequencyMaf | Population minor allele frequency | {study}:{population}[<|>|<=|>=]{number} | EXAC:AES>=0.6 |
transcriptionFlags | List of transcript annotation flags | CCDS, basic, cds_end_NF, mRNA_end_NF, cds_start_NF, mRNA_start_NF, seleno | |
geneTraitId | List of gene trait association ids | | umls:C0007222 , OMIM:269600 |
geneTraitName | List of gene trait association names | | Cardiovascular Diseases |
trait | List of traits, based on ClinVar, HPO, COSMIC | | |
hpo | List of HPO terms. | | HP:0000545 |
go | List of GO (Gene Ontology) terms. | | GO:0002020,GO:0006508 |
expression | List of tissues of interest | | |
proteinKeywords | List of protein variant annotation keywords | | |
drug | List of drug names | | |
functionalScore | Functional score, like cadd | {functional_score}[<|>|<=|>=]{number} | cadd_scaled>5.2 , cadd_raw<=0.3 |
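The score filters in the table (proteinSubstitution, conservation, functionalScore) share the shape {score}{op}{value}, where the value is a number for ordering comparators and a description otherwise. The following sketch shows one way to tell the two cases apart. The function name `parse_score_filter` and the assumption that score names consist of letters, digits and underscores are made for this example only.

```python
import re

# Score name, comparator, then the rest of the expression as the value.
_SCORE_FILTER = re.compile(
    r"^(?P<source>[A-Za-z_][A-Za-z0-9_]*)(?P<op><=|>=|~=|<|>|=)(?P<value>.+)$"
)
_NUMERIC_OPS = {"<", "<=", ">", ">="}


def parse_score_filter(expression):
    """Return (score_name, comparator, value).

    The value is converted to float for ordering comparators and kept
    as a string (a description such as "tolerant") for '=' and '~='.
    """
    match = _SCORE_FILTER.match(expression.strip())
    if match is None:
        raise ValueError(f"malformed score filter: {expression!r}")
    source, op, value = match["source"], match["op"], match["value"]
    if op in _NUMERIC_OPS:
        return source, op, float(value)
    return source, op, value
```

For example, `polyphen>0.1` is read as a numeric threshold on polyphen, while `sift=tolerant` is read as a match on the description "tolerant".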
Query Options
Modifiers applied to the variants to return.
Key | Description | Format | Example |
---|---|---|---|
limit | Number of elements to return. | number | 100 |
skip | Number of elements to skip. | number | 100 |
sort | Sort variants by position | true,false | true |
include | Fields from the Variant's model to be included in the response. See Variant Fields. | List of fields | chromosome,start,reference,alternate |
exclude | Fields from the Variant's model to be excluded in the response. Ignored if "include" is present. See Variant Fields. | List of fields | type,studies.stats,annotation.geneDrugInteraction |
summary | Selects a small number of fields to return. Ignored if "include" or "exclude" are present. See Variant Fields. | true,false | true |
includeFormat | List of FORMAT names from Samples Data to include in the output. Accepts "all" and "none". | | AD,DP |
includeGenotype | Include genotypes, apart from other formats defined with includeFormat. If "GT" is not provided in includeFormat or this parameter is false, genotypes won't be returned. | true,false | |
includeStudy | List of studies to be included in the result. Accepts "all" and "none". | | |
includeFile | List of files to be included in the result. Accepts "all" and "none". | | |
includeSample | List of samples to be included in the result. Accepts "all" and "none". | | |
unknownGenotype | Returned genotype for unknown genotypes. Common values: [0/0, 0|0, ./.] | | |
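As an illustration of how filters and query options fit together, the sketch below merges a set of filters with the limit, skip and sort options for a given page and encodes them as a URL query string using the Python standard library. The function names and the page-based interface are assumptions for this example. How the resulting string is attached to a particular REST endpoint is not covered here.

```python
from urllib.parse import parse_qsl, urlencode


def build_query_string(filters, page=0, page_size=100, sort=True):
    """Encode filters plus paging options as a URL query string.

    filters: dict of filter key to value, using the formats above.
    page and page_size are hypothetical conveniences that are
    translated into the 'skip' and 'limit' options.
    """
    params = dict(filters)
    params["limit"] = page_size
    params["skip"] = page * page_size
    params["sort"] = "true" if sort else "false"
    return urlencode(params)


def parse_query_string(query_string):
    """Decode a query string back into a dict, for inspection."""
    return dict(parse_qsl(query_string))
```

Calling `build_query_string({"region": "3:100000-200000", "type": "SNV,INDEL"}, page=2, page_size=50)` produces a string that asks for 50 variants after skipping the first 100, sorted by position.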
Variant Fields
The parameters include and exclude accept a list of Variant Fields. This is the list of all accepted values. Short aliases for some fields are listed next to the field name.
- id
- chromosome
- start
- end
- reference
- alternate
- length
- type
- studies
- studies.samplesData (aliases: samples, samplesData)
- studies.files (alias: files)
- studies.stats (alias: stats)
- studies.secondaryAlternates
- studies.studyId
- annotation
- annotation.ancestralAllele
- annotation.id
- annotation.xrefs
- annotation.hgvs
- annotation.displayConsequenceType
- annotation.consequenceTypes
- annotation.populationFrequencies
- annotation.minorAllele
- annotation.minorAlleleFreq
- annotation.conservation
- annotation.geneExpression
- annotation.geneTraitAssociation
- annotation.geneDrugInteraction
- annotation.variantTraitAssociation
- annotation.functionalScore
- annotation.additionalAttributes
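The precedence between include, exclude and summary, together with the field aliases listed above, can be sketched as follows. The function name `resolve_projection`, the returned tuple shape and the "all" fallback when no option is given are assumptions for this example, not documented OpenCGA behaviour.

```python
# Short aliases taken from the field list above.
FIELD_ALIASES = {
    "samples": "studies.samplesData",
    "samplesData": "studies.samplesData",
    "files": "studies.files",
    "stats": "studies.stats",
}


def resolve_projection(include=None, exclude=None, summary=False):
    """Decide which projection option takes effect.

    include wins over exclude, and both win over summary.
    Returns (mode, fields) with aliases expanded to full field names.
    """
    def expand(fields):
        names = (name.strip() for name in fields.split(","))
        return [FIELD_ALIASES.get(name, name) for name in names]

    if include:
        return "include", expand(include)
    if exclude:
        return "exclude", expand(exclude)
    if summary:
        return "summary", []
    # Assumed fallback for this sketch: no projection option given.
    return "all", []
```

For example, passing both `include="chromosome,start"` and `exclude="type"` resolves to the include list, and `exclude="type,stats"` expands the alias stats to studies.stats.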