The OpenCGA Variant Storage Engine supports basic operations to work with variant datasets.


Indexing variants does not apply any modification to the generic pipeline. The input file format is VCF, including variations such as gVCF and aggregated VCFs.


Files are converted to Biodata models. The metadata and the data are serialized into two separate files. The metadata is stored in a file named <inputFileName>.file.json.gz, which serializes in JSON a single instance of the Biodata model VariantSource; this mainly contains the header and some general stats. Along with this file, the actual variant data is stored in a file named <inputFileName>.variants.avro.gz, as a set of variant records described by the Biodata model Variant.

VCF files are read using the HTSJDK library, which provides syntactic validation of the data. Further validation, such as detection of duplicate or overlapping variants, will be added.

By default, malformed variants are skipped and written to a third, optional file named <inputFileName>.malformed.txt. If the transform step generates this file, the input file should be curated to repair those records. Otherwise, the malformed variants will remain skipped.
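The naming convention of the transform outputs can be sketched as a small helper. The function name `transform_outputs` and the dictionary keys are hypothetical; only the three file suffixes are taken from the text above, and the sketch does not say where OpenCGA places the files.

```python
from pathlib import Path


def transform_outputs(input_file: str) -> dict:
    """Return the file names the transform step is described as producing.

    Hypothetical helper: only the three suffixes come from the documentation.
    """
    name = Path(input_file).name
    return {
        "metadata": f"{name}.file.json.gz",      # one VariantSource, as JSON
        "variants": f"{name}.variants.avro.gz",  # Variant records, as Avro
        "malformed": f"{name}.malformed.txt",    # optional, only if needed
    }


print(transform_outputs("/data/sample1/my.vcf.gz")["variants"])
```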

All the variants in the transform step will be normalized as defined here: Variant Normalization. This helps to unify the representation of variants, since the VCF specification allows multiple ways of referring to the same variant and leaves some ambiguities.
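To illustrate why normalization is needed, the following sketch trims the bases shared by the reference and alternate alleles. It is a generic trimming routine written for this example, not the algorithm described in Variant Normalization; the function name `trim_alleles` is hypothetical, and steps such as left alignment against a reference genome are not covered.

```python
def trim_alleles(pos: int, ref: str, alt: str):
    """Strip bases shared by REF and ALT, adjusting the 1-based position."""
    # Remove the common suffix first; the position is unaffected.
    while ref and alt and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Then remove the common prefix, moving the start to the right.
    while ref and alt and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# Two VCF spellings of the same substitution collapse to one representation.
print(trim_alleles(100, "C", "T"))
print(trim_alleles(98, "GACAT", "GATAT"))
```

Both calls describe the same C to T change at position 100, so after trimming they can be compared and merged as a single variant.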


Loading variants from multiple files into a single database will effectively merge them. In most scenarios, with good normalization, merging variants is straightforward. In other scenarios, with multiple alternates or overlapping variants, the merge requires more logic. More information at Merge Mode.
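The straightforward case can be sketched as grouping calls by a normalized variant key. This is a simplified illustration, not the storage engine's implementation: the function name `merge_files` and the input layout are assumptions, and it deliberately ignores multiple alternates and overlapping variants. Samples that do not carry a variant are simply left absent here.

```python
from collections import defaultdict


def merge_files(files: dict) -> dict:
    """Merge per-file variant calls into one record per variant.

    `files` maps a file name to {variant_key: {sample: genotype}}, where
    variant_key is a (chromosome, position, reference, alternate) tuple.
    """
    merged = defaultdict(dict)
    for calls in files.values():
        for key, genotypes in calls.items():
            merged[key].update(genotypes)
    return dict(merged)


files = {
    "a.vcf.gz": {("22", 16050115, "G", "A"): {"s1": "0/1"}},
    "b.vcf.gz": {("22", 16050115, "G", "A"): {"s2": "1/1"},
                 ("22", 16050213, "C", "T"): {"s2": "0/1"}},
}
print(merge_files(files))
```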

Details about load are dependent on the implementation.


  • You cannot load two files with the same sample in the same study. See OpenCGA#158.
    There is an exception to this limitation for scenarios where the variants were split into multiple files (by chromosome, by type, ...). In this case, you can use the parameter --load-split-data. See OpenCGA#696.
  • You cannot index two files with the same name (e.g. /data/sample1/my.vcf.gz and /data/sample2/my.vcf.gz) in the same study. This limitation should not be a problem in any real scenario, where every VCF file usually has a different name. If two files have the same name, the most likely situation is that they contain the same samples, which is already forbidden by the previous limitation.
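Both limitations can be expressed as a pre-flight check before indexing. The sketch below is illustrative only: `check_new_file` and its arguments are hypothetical, and the `load_split_data` flag merely mirrors the --load-split-data parameter mentioned above rather than reproducing its actual behaviour.

```python
from pathlib import Path


def check_new_file(study_files: dict, path: str, samples: set,
                   load_split_data: bool = False) -> None:
    """Raise ValueError if indexing `path` would break either limitation.

    `study_files` maps already indexed file paths to their sample sets.
    """
    for other_path, other_samples in study_files.items():
        if Path(other_path).name == Path(path).name:
            raise ValueError(f"file name already indexed: {Path(path).name}")
        shared = samples & other_samples
        if shared and not load_split_data:
            raise ValueError(f"samples already loaded: {sorted(shared)}")


indexed = {"/data/sample1/chr1.vcf.gz": {"s1"}}
# Same sample split by chromosome: accepted only with the split-data flag.
check_new_file(indexed, "/data/sample1/chr2.vcf.gz", {"s1"}, load_split_data=True)
```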


As part of the enrichment step, some extra information can be added to the variants database as Annotations. This VariantAnnotation can be fetched from CellBase or read from a local file provided by the user. The model of the variant annotation is defined in the Biodata project, in variantAnnotation.avdl.


Variant Storage Engine can make use of different annotators to produce the annotation for the variants.

The annotator can be modified at the annotating step, and the default value is defined in the storage-configuration.yml file:

  • annotator: "cellbase_rest"

Before version v1.3.0, the parameter "annotationSource" should be used instead of "annotator". See OpenCGA#747.

CellBase Annotator

CellBase Variant Annotation


CellBase REST Annotator

This is the default annotator for OpenCGA. This Annotator connects to a CellBase installation using the REST API.

This is an example of cellbase annotation using a REST call:

CellBase Direct Annotator

The CellBaseDirectAnnotator connects directly to the CellBase database. This requires a local installation of CellBase, which takes some resources, but it speeds up the annotation step by removing network time.



  • annotator.cellbase.exclude: "expression,hgvs,repeats,cytoband"
  • annotator.cellbase.use_cache: true
  • annotator.cellbase.imprecise_variants: false  # Imprecise variants supported by cellbase (REST only)

Custom annotator


Custom annotation

The VariantAnnotation model includes a field for adding extra annotation attributes. This field is intended to contain custom annotation provided by the end user.

Additional attributes can be grouped by source. Each source will contain a set of key-value attributes creating this structure:

Code Block
VariantAnnotation = {
  // ...
  "additionalAttributes" : {
    "<source1>" : {
      "attribute" : { "<key>" : "<value>" }
    },
    "<source2>" : {
      "attribute" : { "<key>" : "<value>" }
    }
  }
}

OpenCGA Storage is able to load this custom annotation from 3 different formats: GFF, BED and VCF. When loading the new annotation data, the user has to provide a name for the new custom annotation. Because the structure of these file formats is slightly different, the information loaded won't be the same.

GFF and BED files describe features within a region, providing a chromosome, start and end. All the variants between the start and end will be annotated with the information.
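The region-based behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not OpenCGA code: the function names, the dictionary layout of a variant and the source name "myBed" are hypothetical. The coordinate conversion relies on BED intervals being 0-based and half-open, while variant positions here are 1-based.

```python
def annotate_by_region(variants, regions, source):
    """Attach region attributes to every variant that starts inside a region.

    variants: list of dicts with "chromosome", "start" and "annotation" keys.
    regions:  list of (chromosome, start, end, attributes), 1-based inclusive.
    """
    for variant in variants:
        for chrom, start, end, attributes in regions:
            if variant["chromosome"] == chrom and start <= variant["start"] <= end:
                extra = variant["annotation"].setdefault("additionalAttributes", {})
                extra.setdefault(source, {"attribute": {}})["attribute"].update(attributes)
    return variants


def bed_to_region(chrom, bed_start, bed_end, attributes):
    """Convert a 0-based, half-open BED interval to 1-based inclusive."""
    return chrom, bed_start + 1, bed_end, attributes


variants = [
    {"chromosome": "22", "start": 16050115, "annotation": {}},
    {"chromosome": "22", "start": 16060000, "annotation": {}},
]
regions = [bed_to_region("22", 16050000, 16051000, {"name": "regionA"})]
print(annotate_by_region(variants, regions, "myBed"))
```

Only the first variant falls inside the region, so only its annotation gains an entry under the custom source name.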

Calculate Statistics

Pre-calculated stats are useful for filtering variants. There are two types of statistics: per-variant statistics and global statistics. Per-variant statistics are stored in the variants database, within the StudyEntry. Global statistics are stored in Catalog.

Variant Stats (intra variant)
These stats are calculated for each variant. They are intra-study, calculated within a given cohort.


Cohorts are defined as an arbitrary group of samples. Cohorts can be defined in Catalog, either by selecting samples one by one or by selecting all samples that share some attributes, like population or phenotype.

If a cohort is modified after calculating the statistics, the existing statistics become INVALID.

By default, each study defines the cohort ALL, which contains all the samples loaded in the study. Every time new samples are loaded in the study, this cohort is modified, and the statistics have to be recomputed.
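The invalidation rule can be pictured with a toy model. Only the INVALID status comes from the text above; the class, its methods and the "NONE" and "READY" labels are hypothetical names chosen for this sketch.

```python
class Cohort:
    """Toy cohort whose stats status tracks changes to its samples."""

    def __init__(self, name, samples=()):
        self.name = name
        self.samples = set(samples)
        self.stats_status = "NONE"  # hypothetical label: nothing computed yet

    def calculate_stats(self):
        self.stats_status = "READY"  # hypothetical label: stats are current

    def add_samples(self, new_samples):
        before = len(self.samples)
        self.samples.update(new_samples)
        # Only a real change to a cohort with computed stats invalidates them.
        if len(self.samples) != before and self.stats_status == "READY":
            self.stats_status = "INVALID"


all_cohort = Cohort("ALL", {"s1", "s2"})
all_cohort.calculate_stats()
all_cohort.add_samples({"s3"})  # loading a new file adds samples to ALL
print(all_cohort.stats_status)
```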

Stats models

  • Variant Stats (intra variant)
    These stats are calculated for each variant, and for a set of samples (cohort).

    Code Block
    	// Total number of alleles in called genotypes. Does not include missing alleles
    	int alleleCount
    	// Number of reference alleles found in this variant
    	int refAlleleCount
    	// Number of main alternate alleles found in this variant. Does not include secondary alternates
    	int altAlleleCount
    	// Reference allele frequency calculated from refAlleleCount and alleleCount, in the range (0,1)
    	float refAlleleFreq
    	// Alternate allele frequency calculated from altAlleleCount and alleleCount, in the range (0,1)
    	float altAlleleFreq
    	// Count for each genotype found
    	map<int> genotypeCount
    	// Genotype frequency for each genotype found
    	map<float> genotypeFreq
    	// Number of missing alleles
    	int missingAlleleCount
    	// Number of missing genotypes
    	int missingGenotypeCount
    	// Minor allele frequency
    	float maf
    	// Minor genotype frequency
    	float mgf
    	// Allele with minor frequency
    	string mafAllele
    	// Genotype with minor frequency
    	string mgfGenotype

  • Variant Global Stats (inter variant)
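As an illustration of the per-variant fields listed under Variant Stats, the sketch below derives a subset of them from the genotypes of one cohort. It is not the OpenCGA implementation: the function name is hypothetical, allele "1" is assumed to be the main alternate, and the minor allele frequency is simplified to the smaller of the reference and main alternate frequencies.

```python
import re
from collections import Counter


def variant_stats(genotypes):
    """Compute a subset of the per-variant stats for one cohort.

    genotypes: VCF-style strings such as "0/1", "1|1" or "./.".
    """
    alleles = Counter()
    for gt in genotypes:
        alleles.update(re.split(r"[/|]", gt))
    missing = alleles.pop(".", 0)
    allele_count = sum(alleles.values())       # called alleles only
    ref, alt = alleles["0"], alleles["1"]      # "1" is the main alternate
    ref_freq = ref / allele_count if allele_count else 0.0
    alt_freq = alt / allele_count if allele_count else 0.0
    return {
        "alleleCount": allele_count,
        "refAlleleCount": ref,
        "altAlleleCount": alt,
        "refAlleleFreq": ref_freq,
        "altAlleleFreq": alt_freq,
        "genotypeCount": dict(Counter(genotypes)),
        "missingAlleleCount": missing,
        "maf": min(ref_freq, alt_freq),        # simplified, see lead-in
    }


print(variant_stats(["0/0", "0/1", "0/0", "./."]))
```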

Aggregated statistics

Status: PENDING

Usually, public studies do not provide sample data. In these situations it is not possible to calculate the statistics. Instead, the statistics can be extracted from the INFO column. Unfortunately, there is no standard way of defining multi-cohort statistics in the VCF format, so OpenCGA recognizes three different formats for representing statistics.
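The general idea of reading statistics from the INFO column can be sketched with the AC and AN keys reserved by the VCF specification. This example does not reproduce any of the three formats that OpenCGA recognizes, which are not described here; the function names are hypothetical.

```python
def parse_info(info: str) -> dict:
    """Split a VCF INFO string into a dict; flag entries map to True."""
    fields = {}
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        fields[key] = value if sep else True
    return fields


def allele_frequencies(info: str) -> list:
    """Derive one frequency per ALT allele from the AC and AN entries."""
    fields = parse_info(info)
    total = int(fields["AN"])
    return [int(count) / total for count in fields["AC"].split(",")]


print(allele_frequencies("AC=2,5;AN=1000;DB"))
```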



Export / Query and variant filter

The main goal of indexing variant data into OpenCGA Storage is to be able to query and extract this data in an efficient way. This operation, executed via gRPC or with a direct connection, allows exporting a large quantity of variants from a database. It can work together with Import, be used only to provide input data to external analysis, or generate reports.

See Querying Variant Data to see all the possible filters over variants.

When exporting variants, some metadata files are generated, containing information regarding the studies, files and samples from the exported data.

There are multiple possible output formats:

  • VCF
  • JSON
  • AVRO

Export frequencies (statistics)

Export frequencies (statistics) is a special case of export. Instead of exporting full variants, only the variant cohort statistics are exported.

To export variant frequencies, use the command variant export-frequencies in the command line.

Code Block
variant export-frequencies -s <study> --output-format <vcf|tsv|cellbase|json>

As for variants export, there are multiple possible output formats:

  • VCF : Standard VCF format without samples information, with the stats as values in the INFO column.

    Code Block
    ##FILTER=<ID=.,Description="No FILTER info">
    ##FILTER=<ID=PASS,Description="Valid variant">
    ##INFO=<ID=AC,Number=A,Type=Integer,Description="Total number of alternate alleles in called genotypes, for each ALT allele, in the same order as listed">
    ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, calculated from AC and AN, in the range (0,1), in the same order as listed">
    ##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
    ##INFO=<ID=AFK_AF,Number=A,Type=Float,Description="Allele frequency in the AFK cohort calculated from AC and AN, in the range (0,1), in the same order as listed">
    #CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO
    22    16050115    .    G    A    .    PASS    AC=1;AF=0.001;AN=1000;AFK_AF=0.002008
    22    16050213    .    C    T    .    PASS    AC=1;AF=0.001;AN=1000;AFK_AF=0
    22    16050319    .    C    T    .    PASS    AC=1;AF=0.001;AN=1000;AFK_AF=0
    22    16050607    .    G    A    .    PASS    AC=2;AF=0.002;AN=1000;AFK_AF=0.004016

  • TSV (Tab Separated Values). Simple format with each cohort in one column.

    Code Block
    #CHR    POS    REF    ALT    ALL_AN    ALL_AC    ALL_AF    ALL_HET    ALL_HOM
    22    16050213    C    T    1000    1    0.001    0.002    0.0
    22    16050607    G    A    1000    2    0.002    0.004    0.0
    22    16050740    A    -    1000    1    0.001    0.002    0.0
    22    16050840    C    G    1000    13    0.013    0.026    0.0
    22    16051075    G    A    1000    2    0.002    0.004    0.0
    22    16051249    T    C    1000    91    0.091    0.162    0.01
    22    16051453    A    C    998    74    0.074    0.144    0.004
    22    16051453    A    G    926    2    0.002    0.144    0.004
    22    16051723    A    -    1000    12    0.012    0.024    0.0
    22    16051816    T    G    1000    2    0.002    0.004    0.0

  • JSON. Variant model just with minimal information and statistics.

    Code Block

  • Population Frequencies (Cellbase mode). Specific JSON format for import into Cellbase variation. It is a Variant model with a VariantAnnotation containing PopulationFrequencies.

    Code Block
    PopulationFrequencies / Cellbase
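Of the output formats listed above, the TSV one is the simplest to consume downstream. The following sketch reads such a file with the standard csv module and keeps the variants above a frequency threshold. The column names are taken from the TSV example; the function name, the output keys and the 0.05 threshold are assumptions made for illustration.

```python
import csv
import io


def read_frequencies(tsv_text: str) -> list:
    """Parse a frequencies TSV export into one dict per variant."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = []
    for row in reader:
        rows.append({
            "chromosome": row["#CHR"],
            "position": int(row["POS"]),
            "reference": row["REF"],
            "alternate": row["ALT"],
            "ALL_AF": float(row["ALL_AF"]),
        })
    return rows


sample = (
    "#CHR\tPOS\tREF\tALT\tALL_AN\tALL_AC\tALL_AF\tALL_HET\tALL_HOM\n"
    "22\t16050213\tC\tT\t1000\t1\t0.001\t0.002\t0.0\n"
    "22\t16051249\tT\tC\t1000\t91\t0.091\t0.162\t0.01\n"
)
common = [r for r in read_frequencies(sample) if r["ALL_AF"] >= 0.05]
print(common)
```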



