Overview

The genomic variant data model plays a central role not only in OpenCGA but across the whole OpenCB suite. It provides a generic way of representing any variant together with any associated information of interest. OpenCGA uses it heavily when loading VCF files and when exporting query results. The model is implemented in the OpenCB Biodata project, which allows other OpenCB projects, such as CellBase, to use it.

Goals

The main goals of the variant data model are:

  • To be able to represent any type of variant (SNV, INDEL) or structural variant (INSERTION, DELETION, CNV, TRANSLOCATION, ...), including phased variants and non-diploid organisms.
  • To provide a file-format-agnostic way of storing genomic variant data from VCF, gVCF, microarrays, ...
  • To allow bioinformaticians to add valuable, rich annotations for researchers and clinicians.

Main Features

Some of the main features of the variant data model include:

Design

A high-level representation of the variant looks like this:


  • id (String): The variant ID
  • names (List<String>): Other names used for this genomic variation
  • chromosome (String): Chromosome where the genomic variation occurred
  • start (int): Normalized position where the genomic variation starts
  • end (int): Normalized position where the genomic variation ends
  • length (int): Length of the genomic variation, which depends on the variation type
  • type (VariantType): Type of variation: single nucleotide, indel or structural variation
  • strand (String): Reference strand for this variant
  • reference (String): Reference allele
  • alternate (String): Alternate allele
  • studies (List<StudyEntry>): Information specific to each study the variant was read from, such as samples or statistics
      • studyId (String): Unique identifier of the study
      • secondaryAlternates: Alternate alleles that appear along with a variant alternate
      • files (List<FileEntry>): List of files from the study where the variant was present
          • fileId: Unique identifier of the source file
          • data: File-related data, which depends on the format of the file the variant was initially read from
          • call
              • variantId: Original call position for the variant, if the file was normalized
              • alleleIndex: Index of this alternate allele in the original multi-allelic variant call from which it was decomposed
      • sampleDataKeys (List<String>): Data keys for each sample data
      • samples (List<SampleEntry>): Genotypes and other sample-related information. Each position corresponds to one sample. Each sample holds a list of values in the same order as the sampleDataKeys array, and the length of that list must match the length of sampleDataKeys.
          • sampleId: Unique sample identifier
          • fileIndex: Relative position of the file within the StudyEntry
          • data: Sample data
      • stats (List<VariantStats>): Statistics of the genomic variation, such as its allele and genotype counts or its minor allele frequency, grouped by cohort id
          • cohortId: Cohort identifier
          • alleleCount: Total number of alleles in called genotypes. Does not include missing alleles
          • refAlleleCount: Number of reference alleles found in this variant
          • altAlleleCount: Number of main alternate alleles found in this variant. Does not include secondary alternates
          • refAlleleFreq: Reference allele frequency calculated from refAlleleCount and alleleCount, in the range [0, 1]
          • altAlleleFreq: Alternate allele frequency calculated from altAlleleCount and alleleCount, in the range [0, 1]
          • missingAlleleCount: Number of missing alleles
          • missingGenotypeCount: Number of missing genotypes
          • genotypeCount: Count for each genotype found
          • genotypeFreq: Genotype frequency for each genotype found
          • filterCount: Number of samples with a non-missing genotype for each filter
          • filterFreq: Frequency of each filter: its count divided by the number of non-missing samples
          • qualityAvg: The weighted average of the quality, computed only for non-missing samples
          • maf: Minor allele frequency
          • mgf: Minor genotype frequency
          • mafAllele: Allele with minor frequency
          • mgfGenotype: Genotype with minor frequency
      • scores
      • issues
  • annotation
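
The relationship between sampleDataKeys, each sample's data list, and the allele counts in stats can be illustrated with a short Python sketch. This is not the Biodata implementation: the function names `sample_data_as_dict` and `allele_stats` are hypothetical, and the genotype handling assumes the usual VCF conventions ("/" or "|" as separators, "0" for the reference allele, "1" for the main alternate, "." for a missing allele). The input values are copied from the JSON example on this page.

```python
from collections import Counter


def sample_data_as_dict(sample_data_keys, sample):
    """Pair each value in a sample's data list with its key, by position."""
    if len(sample["data"]) != len(sample_data_keys):
        raise ValueError("sample data length must match sampleDataKeys")
    return dict(zip(sample_data_keys, sample["data"]))


def allele_stats(genotypes):
    """Count alleles for one cohort from genotype strings such as "0/1".

    Assumption: "0" is the reference allele, "1" the main alternate and
    "." a missing allele. Any other index is treated as a secondary
    alternate and only contributes to alleleCount.
    """
    counts = Counter()
    for genotype in genotypes:
        for allele in genotype.replace("|", "/").split("/"):
            counts[allele] += 1
    missing = counts.pop(".", 0)        # missing alleles are counted apart
    allele_count = sum(counts.values())
    ref, alt = counts["0"], counts["1"]
    return {
        "alleleCount": allele_count,
        "refAlleleCount": ref,
        "altAlleleCount": alt,
        "missingAlleleCount": missing,
        "refAlleleFreq": ref / allele_count if allele_count else 0.0,
        "altAlleleFreq": alt / allele_count if allele_count else 0.0,
    }


# Values taken from the JSON example on this page.
keys = ["GT", "AD", "DP", "GQ", "PL"]
samples = [
    {"sampleId": "", "fileIndex": 0, "data": ["1/1", "2,171", "173", "99", "2218,228,0"]},
    {"sampleId": "", "fileIndex": 0, "data": ["1/1", "0,33", "34", "60", "508,60,0"]},
    {"sampleId": "", "fileIndex": 0, "data": ["1/1", "0,61", "63", "93", "777,93,0"]},
    {"sampleId": "", "fileIndex": 0, "data": ["1/1", "0,61", "61", "96", "790,96,0"]},
]

decoded = [sample_data_as_dict(keys, s) for s in samples]
stats = allele_stats(d["GT"] for d in decoded)

print(decoded[0]["DP"])                               # 173
print(stats["alleleCount"], stats["altAlleleFreq"])   # 8 1.0
```

Under these assumptions, the four 1/1 genotypes give an alleleCount of 8, an altAlleleCount of 8 and an altAlleleFreq of 1.0, which agrees with the "ALL" cohort in the example.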




{
    "id": "1:69511:A:G",
    "names": ["rs75062661"],
    "chromosome": "1",
    "start": 69511,
    "end": 69511,
    "strand": "+",
    "length": 1,
    "type": "SNV",
    "reference": "A",
    "alternate": "G",
    "studies": [
        {
            "studyId": "demo@family:corpasome",
            "files": [
                {
                    "fileId": "quartet.variants.annotated.vcf.gz",
                    "call": {},
                    "data": {
                        "ABHom": "0.982",
                        "AC": "8",
                        "AF": "1.00",
                        "AN": "8",
                        "BaseQRankSum": "2.089",
                        "DB": "true",
                        "DP": "331",
                        "Dels": "0.00",
                        "EFF": "NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|Aca/Gca|T141A|305|OR4F5||CODING|NM_001005484.1|1|1)",
                        "FILTER": "VQSRTrancheSNP99.90to100.00",
                        "FS": "8.817",
                        "HaplotypeScore": "2.4399",
                        "MLEAC": "8",
                        "MLEAF": "1.00",
                        "MQ": "15.47",
                        "MQ0": "145",
                        "MQRankSum": "-0.047",
                        "OND": "0.018",
                        "QD": "12.97",
                        "QUAL": "4293.01",
                        "ReadPosRankSum": "1.662",
                        "SB": "-1.450e+03",
                        "VCF_ID": "rs75062661",
                        "VQSLOD": "-14.4975",
                        "culprit": "MQ",
                        "set": "FilteredInAll"
                    }
                }
            ],
            "secondaryAlternates": [],
            "sampleDataKeys": ["GT", "AD", "DP", "GQ", "PL"],
            "samples": [
                {
                    "sampleId": "",
                    "fileIndex": 0,
                    "data": ["1/1", "2,171", "173", "99", "2218,228,0"]
                },
                {
                    "sampleId": "",
                    "fileIndex": 0,
                    "data": ["1/1", "0,33", "34", "60", "508,60,0"]
                },
                {
                    "sampleId": "",
                    "fileIndex": 0,
                    "data": ["1/1", "0,61", "63", "93", "777,93,0"]
                },
                {
                    "sampleId": "",
                    "fileIndex": 0,
                    "data": ["1/1", "0,61", "61", "96", "790,96,0"]
                }
            ],
            "issues": [],
            "scores": [],
            "stats": {
                "ALL": {
                    "alleleCount": 8,
                    "altAlleleCount": 8,
                    "altAlleleFreq": 1.0,
                    "filterCount": {
                        "PASS": 0,
                        "VQSRTrancheSNP99.90to100.00": 1
                    },
                    "filterFreq": {
                        "PASS": 0.0,
                        "VQSRTrancheSNP99.90to100.00": 1.0
                    },
                    "genotypeCount": {"0/0": 0, "0/1": 0, "1/1": 4},
                    "genotypeFreq": {"0/0": 0.0, "0/1": 0.0, "1/1": 1.0},
                    "maf": 0.0,
                    "mafAllele": "A",
                    "mgf": 0.0,
                    "mgfGenotype": "0/0",
                    "missingAlleleCount": 0,
                    "missingGenotypeCount": 0,
                    "qualityAvg": 4293.01,
                    "refAlleleCount": 0,
                    "refAlleleFreq": 0.0
                }
            }
        }
    ],

              "annotation": {
                "additionalAttributes": {
                    "opencga": {
                        "attribute": {
                            "annotationId": "CURRENT",
                            "release": "1"
                        }
                    }
                },
                "alternate": "G",
                "chromosome": "1",
                "consequenceTypes": [
                    {
                        "biotype": "protein_coding",
                        "cdnaPosition": 421,
                        "cdsPosition": 421,
                        "codon": "Aca/Gca",
                        "ensemblGeneId": "ENSG00000186092",
                        "ensemblTranscriptId": "ENST00000335137",
                        "exonOverlap": [
                            {"number": "1/1", "percentage": 0.108932465}
                        ],
                        "geneName": "OR4F5",
                        "proteinVariantAnnotation": {
                            "alternate": "ALA",
                            "features": [
                                {
                                    "description": "GPCR, rhodopsin-like, 7TM",
                                    "end": 280,
                                    "id": "IPR017452",
                                    "start": 34
                                },
                                {
                                    "end": 182,
                                    "start": 90,
                                    "type": "disulfide bond"
                                },
                                {
                                    "description": "Helical; Name=4",
                                    "end": 151,
                                    "start": 133,
                                    "type": "transmembrane region"
                                },
                                {
                                    "description": "Olfactory receptor 4F5",
                                    "end": 305,
                                    "id": "PRO_0000150547",
                                    "start": 1,
                                    "type": "chain"
                                }
                            ],
                            "keywords": [
                                "Cell membrane",
                                "Complete proteome",
                                "Disulfide bond",
                                "G-protein coupled receptor",
                                "Membrane",
                                "Olfaction",
                                "Receptor",
                                "Reference proteome",
                                "Sensory transduction",
                                "Transducer",
                                "Transmembrane",
                                "Transmembrane helix"
                            ],
                            "position": 141,
                            "reference": "THR",
                            "substitutionScores": [
                                {"description": "tolerated", "score": 0.63, "source": "sift"},
                                {"description": "benign", "score": 0.003, "source": "polyphen"}
                            ],
                            "uniprotAccession": "Q8NH21"
                        },
                        "sequenceOntologyTerms": [
                            {"accession": "SO:0001583", "name": "missense_variant"}
                        ],
                        "strand": "+",
                        "transcriptAnnotationFlags": ["CCDS", "basic"]
                    },
                    {
                        "sequenceOntologyTerms": [
                            {"accession": "SO:0001566", "name": "regulatory_region_variant"}
                        ]
                    }
                ],
                "conservation": [
                    {"score": 1.149999976158142, "source": "gerp"},
                    {"score": 0.1289999932050705, "source": "phastCons"},
                    {"score": -0.527999997138977, "source": "phylop"}
                ],
                "cytoband": [
                    {"chromosome": "1", "end": 2300000, "name": "p36.33", "stain": "gneg", "start": 1}
                ],
                "displayConsequenceType": "missense_variant",
                "functionalScore": [
                    {"score": -0.7899999618530273, "source": "cadd_raw"},
                    {"score": 0.03999999910593033, "source": "cadd_scaled"}
                ],
                "geneDrugInteraction": [],
                "geneTraitAssociation": [],
                "hgvs": ["ENST00000335137(ENSG00000186092):c.421A>G"],
                "id": "rs2691305",
                             "populationFrequencies": [{"altAllele": "G",
                                                        "altAlleleFreq": 0.95061594,
                                                        "altHomGenotypeFreq": 0.93263996,
                                                        "hetGenotypeFreq": 0.03595196,
                                                        "population": "ALL",
                                                        "refAllele": "A",
                                                        "refAlleleFreq": 0.049384065,
                                                        "refHomGenotypeFreq": 0.031408086,
                                                        "study": "GNOMAD_EXOMES"},
                                                       {"altAllele": "G",
                                                        "altAlleleFreq": 0.9499386,
                                                        "altHomGenotypeFreq": 0.92997545,
                                                        "hetGenotypeFreq": 0.03992629,
                                                        "population": "OTH",
                                                        "refAllele": "A",
                                                        "refAlleleFreq": 0.050061423,
                                                        "refHomGenotypeFreq": 0.03009828,
                                                        "study": "GNOMAD_EXOMES"},
                                                       {"altAllele": "G",
                                                        "altAlleleFreq": 0.999461,
                                                        "altHomGenotypeFreq": 0.99892205,
                                                        "hetGenotypeFreq": 0.0010779734,
                                                        "population": "EAS",
                                                        "refAllele": "A",
                                                        "refAlleleFreq": 0.0005389867,
                                                        "refHomGenotypeFreq": 0.0,
                                                        "study": "GNOMAD_EXOMES"},
                                                       {"altAllele": "G",
                                                        "altAlleleFreq": 0.95083994,
                                                        "altHomGenotypeFreq": 0.9305369,
                                                        "hetGenotypeFreq": 0.040606,
                                                        "population": "AMR",
                                                        "refAllele": "A",
                                                        "refAlleleFreq": 0.049160052,
                                                        "refHomGenotypeFreq": 0.028857054,
                                                        "study": "GNOMAD_EXOMES"},
                                                       {"altAllele": "G",
                                                        "altAlleleFreq": 0.97795016,
                                                        "altHomGenotypeFreq": 0.9710086,
                                                        "hetGenotypeFreq": 0.013883217,
                                                        "population": "ASJ",
                                                        "refAllele": "A",
                                                        "refAlleleFreq": 0.022049816,
                                                        "refHomGenotypeFreq": 0.015108207,
                                                        "study": "GNOMAD_EXOMES"},
                                                       {"altAllele": "G",
                                                        "altAlleleFreq": 0.99145377,
                                                        "altHomGenotypeFreq": 0.98848504,
                                                        "hetGenotypeFreq": 0.0059373877,
                                                        "population": "FIN",
                                                        "refAllele": "A",
                                                        "refAlleleFreq": 0.00854624,
                                                        "refHomGenotypeFreq": 0.005577546,
                                                        "study": "GNOMAD_EXOMES"},
                                                       {"altAllele": "G",
                                                        "altAlleleFreq": 0.9727796,
                                                        "altHomGenotypeFreq": 0.96255124,
                                                        "hetGenotypeFreq": 0.020456737,
                                                        "population": "NFE",
                                                        "refAllele": "A",
                                                        "refAlleleFreq": 0.027220415,
                                                        "refHomGenotypeFreq": 0.016992046,
                                                        "study": "GNOMAD_EXOMES"},
                                                       {"altAllele": "G",
                                                        "altAlleleFreq": 0.6074365,
                                                        "altHomGenotypeFreq": 0.47664425,
                                                        "hetGenotypeFreq": 0.26158446,
                                                        "population": "AFR",
                                                        "refAllele": "A",
                                                        "refAlleleFreq": 0.39256352,
                                                        "refHomGenotypeFreq": 0.2617713,
                                                        "study": "GNOMAD_EXOMES"},
                                                       {"altAllele": "G",
                                                        "altAlleleFreq": 0.95853204,
                                                        "altHomGenotypeFreq": 0.94338477,
                                                        "hetGenotypeFreq": 0.03029453,
                                                        "population": "MALE",
                                                        "refAllele": "A",
                                                        "refAlleleFreq": 0.041467976,
                                                        "refHomGenotypeFreq": 0.02632071,
                                                        "study": "GNOMAD_EXOMES"},
                                                       {"altAllele": "G",
                                                        "altAlleleFreq": 0.94091445,
                                                        "altHomGenotypeFreq": 0.91947174,
                                                        "hetGenotypeFreq": 0.04288538,
                                                        "population": "FEMALE",
                                                        "refAllele": "A",
                                                        "refAlleleFreq": 0.05908557,
                                                        "refHomGenotypeFreq": 0.03764288,
                                                        "study": "GNOMAD_EXOMES"},
                                                       {"altAllele": "G",
                                                        "altAlleleFreq": 0.84222084,
                                                        "altHomGenotypeFreq": 0.77478045,
                                                        "hetGenotypeFreq": 0.1348808,
                                                        "population": "ALL",
                                                        "refAllele": "A",
                                                        "refAlleleFreq": 0.15777917,
                                                        "refHomGenotypeFreq": 0.090338774,
                                                        "study": "GNOMAD_GENOMES"},
                                                       {"altAllele": "G",
                                                        "altAlleleFreq": 0.9404255,
                                                        "altHomGenotypeFreq": 0.9191489,
                                                        "hetGenotypeFreq": 0.04255319,
                                                        "population": "OTH",
                                                        "refAllele": "A",
                                                        "refAlleleFreq": 0.05957447,
                                                        "refHomGenotypeFreq": 0.038297873,
                                                        "study": "GNOMAD_GENOMES"},
                                                       {"altAllele": "G",
                                                        "altAlleleFreq": 1.0,
                                                        "altHomGenotypeFreq": 1.0,
                                                        "hetGenotypeFreq": 0.0,
                                                        "population": "EAS",
                                                        "refAllele": "A",
                                                        "refAlleleFreq": 0.0,
                                                        "refHomGenotypeFreq": 0.0,
                                                        "study": "GNOMAD_GENOMES"},
                                                       {"altAllele": "G",
                                                        "altAlleleFreq": 0.9410377,
                                                        "altHomGenotypeFreq": 0.9103774,
                                                        "hetGenotypeFreq": 0.061320756,
                                                        "population": "AMR",
                                                        "refAllele": "A",
                                                        "refAlleleFreq": 0.058962263,
                                                        "refHomGenotypeFreq": 0.028301887,
                                                        "study": "GNOMAD_GENOMES"},
                                                       {"altAllele": "G",
                                                        "altAlleleFreq": 0.9672131,
                                                        "altHomGenotypeFreq": 0.9508197,
                                                        "hetGenotypeFreq": 0.032786883,
                                                        "population": "ASJ",
                                                        "refAllele": "A",
                                                        "refAlleleFreq": 0.032786883,
                                                        "refHomGenotypeFreq": 0.016393442,
                                                        "study": "GNOMAD_GENOMES"},
                                                       {"altAllele": "G",
                                                        "altAlleleFreq": 0.9918478,
                                                        "altHomGenotypeFreq": 0.98913044,
                                                        "hetGenotypeFreq": 0.0054347827,
                                                        "population": "FIN",
                                                        "refAllele": "A",
                                                        "refAlleleFreq": 0.008152174,
                                                        "refHomGenotypeFreq": 0.0054347827,
                                                        "study": "GNOMAD_GENOMES"},
                                                       {"altAllele": "G",
                                                        "altAlleleFreq": 0.9637507,
                                                        "altHomGenotypeFreq": 0.94847214,
                                                        "hetGenotypeFreq": 0.03055722,
                                                        "population": "NFE",
                                                        "refAllele": "A",
                                                        "refAlleleFreq": 0.03624925,
                                                        "refHomGenotypeFreq": 0.02097064,
                                                        "study": "GNOMAD_GENOMES"},
                                                       {"altAllele": "G",
                                                        "altAlleleFreq": 0.5886525,
                                                        "altHomGenotypeFreq": 0.41246733,
                                                        "hetGenotypeFreq": 0.3523703,
                                                        "population": "AFR",
                                                        "refAllele": "A",
                                                        "refAlleleFreq": 0.4113475,
                                                        "refHomGenotypeFreq": 0.23516238,
                                                        "study": "GNOMAD_GENOMES"},
                                                       {"altAllele": "G",
                                                        "altAlleleFreq": 0.8381471,
                                                        "altHomGenotypeFreq": 0.7682737,
                                                        "hetGenotypeFreq": 0.13974673,
                                                        "population": "MALE",
                                                        "refAllele": "A",
                                                        "refAlleleFreq": 0.16185293,
                                                        "refHomGenotypeFreq": 0.09197956,
                                                        "study": "GNOMAD_GENOMES"},
                                                       {"altAllele": "G",
                                                        "altAlleleFreq": 0.84750646,
                                                        "altHomGenotypeFreq": 0.78322285,
                                                        "hetGenotypeFreq": 0.12856731,
                                                        "population": "FEMALE",
                                                        "refAllele": "A",
                                                        "refAlleleFreq": 0.1524935,
                                                        "refHomGenotypeFreq": 0.08820986,
                                                        "study": "GNOMAD_GENOMES"}],
                             "reference": "A",
                             "repeat": [{"chromosome": "1",
                                         "copyNumber": 2.0,
                                         "end": 87112,
                                         "id": "9119",
                                         "percentageMatch": 0.992904,
                                         "source": "genomicSuperDup",
                                         "start": 10001},
                                        {"chromosome": "1",
                                         "copyNumber": 2.0,
                                         "end": 87112,
                                         "id": "14903",
                                         "percentageMatch": 0.995437,
                                         "source": "genomicSuperDup",
                                         "start": 18393}],
                             "start": 69511,
                             "traitAssociation": [{"additionalProperties": [{"name": "mutationSomaticStatus_in_source_file",
                                                                             "value": "Confirmed somatic variant"}],
                                                   "alleleOrigin": [],
                                                   "bibliography": [],
                                                   "ethnicity": "Z",
                                                   "genomicFeatures": [{"featureType": "gene",
                                                                        "xrefs": {"symbol": "OR4F5"}},
                                                                       {"featureType": "gene",
                                                                        "xrefs": {"symbol": "8301"}}],
                                                   "heritableTraits": [],
                                                   "id": "COSM4144171",
                                                   "somaticInformation": {"histologySubtype": "neoplasm",
                                                                          "primaryHistology": "other",
                                                                          "primarySite": "thyroid",
                                                                          "sampleSource": "",
                                                                          "tumourOrigin": ""},
                                                   "source": {"name": "cosmic"},
                                                   "submissions": []}],
                             "variantTraitAssociation": {"clinvar": [],
                                                         "cosmic": [{"geneName": "OR4F5",
                                                                     "histologySubtype": "neoplasm",
                                                                     "mutationId": "COSM4144171",
                                                                     "mutationSomaticStatus": "Confirmed somatic variant",
                                                                     "primaryHistology": "other",
                                                                     "primarySite": "thyroid",
                                                                     "sampleSource": "",
                                                                     "siteSubtype": "",
                                                                     "tumourOrigin": ""}]}}}
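
The population frequency records in the example above can be read with a few lines of Python. The following is a minimal sketch, not part of OpenCGA or Biodata: the helper names `index_by_population` and `is_consistent` are hypothetical, and it assumes the records have already been parsed into plain dictionaries with the field names shown in the example. It indexes the records by study and population, and checks that the allele and genotype frequencies of each record are internally consistent.

```python
import math

# Two records copied from the example above (study GNOMAD_GENOMES).
records = [
    {"altAllele": "G", "altAlleleFreq": 0.9672131,
     "altHomGenotypeFreq": 0.9508197, "hetGenotypeFreq": 0.032786883,
     "population": "ASJ", "refAllele": "A",
     "refAlleleFreq": 0.032786883, "refHomGenotypeFreq": 0.016393442,
     "study": "GNOMAD_GENOMES"},
    {"altAllele": "G", "altAlleleFreq": 0.9918478,
     "altHomGenotypeFreq": 0.98913044, "hetGenotypeFreq": 0.0054347827,
     "population": "FIN", "refAllele": "A",
     "refAlleleFreq": 0.008152174, "refHomGenotypeFreq": 0.0054347827,
     "study": "GNOMAD_GENOMES"},
]


def index_by_population(freqs):
    """Hypothetical helper: map (study, population) to its frequency record."""
    return {(r["study"], r["population"]): r for r in freqs}


def is_consistent(record, tol=1e-6):
    """Hypothetical helper: check that the frequencies in one record agree.

    For a biallelic site the two allele frequencies should sum to 1, the
    three genotype frequencies should sum to 1, and the alternate allele
    frequency should equal altHom + het / 2.
    """
    allele_total = record["refAlleleFreq"] + record["altAlleleFreq"]
    genotype_total = (record["refHomGenotypeFreq"]
                      + record["hetGenotypeFreq"]
                      + record["altHomGenotypeFreq"])
    alt_from_genotypes = (record["altHomGenotypeFreq"]
                          + record["hetGenotypeFreq"] / 2)
    return (math.isclose(allele_total, 1.0, abs_tol=tol)
            and math.isclose(genotype_total, 1.0, abs_tol=tol)
            and math.isclose(alt_from_genotypes, record["altAlleleFreq"],
                             abs_tol=tol))


by_population = index_by_population(records)
finnish = by_population[("GNOMAD_GENOMES", "FIN")]
all_consistent = all(is_consistent(r) for r in records)
```

The tolerance is an assumption chosen to absorb the single-precision rounding visible in the example values; a stricter or looser bound may suit other data sources.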



Implementation

The variant data model is implemented in the OpenCB Biodata project, which allows the rest of the OpenCB projects, such as CellBase and Oskar, to use it.

