Overview

The genomic variant data model plays a crucial role not only in OpenCGA but across the whole OpenCB suite. It provides a generic way of representing any variant together with any other information of interest associated with it. OpenCGA uses the model heavily when loading VCF files and when exporting query results. The model is implemented in the OpenCB Biodata project, which allows other OpenCB projects such as CellBase to use it.

Goals

Main goals of variant data model include:

  • To be able to represent any type of variant (SNV, INDEL) or structural variant (INSERTION, DELETION, CNV, TRANSLOCATION, ...), including phased variants and non-diploid organisms.
  • To provide a file-format-agnostic solution for storing genomic variant data from VCF, gVCF, microarrays, ...
  • To allow bioinformaticians to add valuable and rich annotations for researchers and clinicians.

Main Features

Some of the main features of the variant data model include:

Design

A high level representation of the variant looks like this:



id

String

The variant ID

names

List<String>

Other names used for this genomic variation

chromosome

String

Chromosome where the genomic variation occurred

start

int

Normalized position where the genomic variation starts

end

int

Normalized position where the genomic variation ends

reference

String

Reference allele

alternate

String

Alternate allele

length

int

Length of the genomic variation, which depends on the variation type

type

VariantType

Type of variation: single nucleotide, indel or structural variation.

SNV (SO:0001483)
SNP (SO:0000694)
MNV (SO:0002007)
MNP (SO:0001013)
INDEL (SO:1000032)
INSERTION (SO:0000667)
DELETION (SO:0000159)
TRANSLOCATION (SO:0000199)
INVERSION (SO:1000036)
CNV (SO:0001019)
DUPLICATION (SO:1000035)
BREAKEND
SYMBOLIC


strand

String

Reference strand for this variant

sv

StructuralVariation

Information regarding Structural Variants

ciStartLeft

int

Confidence interval around START for imprecise variants - left

ciStartRight

int

Confidence interval around START for imprecise variants - right

ciEndLeft

int

Confidence interval around END for imprecise variants - left

ciEndRight

int

Confidence interval around END for imprecise variants - right

copyNumber

int

Number of copies for CNV variants

leftSvInsSeq

String

Left inserted sequence for long INSERTIONS

rightSvInsSeq

String

Right inserted sequence for long INSERTIONS

type

StructuralVariantType

Structural variation type

COPY_NUMBER_GAIN (SO:0001742)
COPY_NUMBER_LOSS (SO:0001743)
TANDEM_DUPLICATION (SO:1000173)


breakend

Breakend


mate

BreakendMate


chromosome

position

ciPositionLeft

Confidence interval around BREAKEND position - left

ciPositionRight

Confidence interval around BREAKEND position - right


orientation

BreakendOrientation


SE

Start - End

t[p[  piece extending to the right of p is joined after t

SS

Start - Start

t]p]  reverse comp piece extending left of p is joined after t

ES

End - Start

]p]t  piece extending to the left of p is joined before t

EE

End - End

[p[t reverse comp piece extending right of p is joined before t


insSeq

String

Sequence inserted between the two breakends
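
The four orientation codes above follow the bracket notation that VCF uses for breakend ALT alleles. As a rough illustration, the sketch below maps such an ALT string to one of SE, SS, ES or EE. The function name `parse_breakend`, the `ParsedBreakend` tuple and the regular expression are hypothetical helpers written for this page, not part of the Biodata API.

```python
import re
from typing import NamedTuple


class ParsedBreakend(NamedTuple):
    """Hypothetical container; field names are illustrative only."""
    orientation: str      # one of SE, SS, ES, EE
    mate_chromosome: str  # chromosome of the mate breakend (the "p" part)
    mate_position: int    # position of the mate breakend (the "p" part)
    local_bases: str      # the "t" part of the ALT string


# Layouts: t[p[  t]p]  ]p]t  [p[t   where p is "chromosome:position".
_BND = re.compile(
    r"^(?P<pre>[A-Za-z.]*)(?P<b>[\[\]])(?P<chrom>.+):(?P<pos>\d+)(?P=b)(?P<post>[A-Za-z.]*)$"
)


def parse_breakend(alt: str) -> ParsedBreakend:
    match = _BND.match(alt)
    if match is None:
        raise ValueError(f"not a breakend ALT string: {alt!r}")
    pre, post, bracket = match["pre"], match["post"], match["b"]
    # The local bases "t" must sit on exactly one side of the brackets.
    if bool(pre) == bool(post):
        raise ValueError(f"ambiguous breakend ALT string: {alt!r}")
    if pre:
        # t[p[ -> SE, t]p] -> SS
        orientation = "SE" if bracket == "[" else "SS"
    else:
        # ]p]t -> ES, [p[t -> EE
        orientation = "ES" if bracket == "]" else "EE"
    return ParsedBreakend(orientation, match["chrom"], int(match["pos"]), pre or post)
```

For instance, `parse_breakend("G]17:198982]")` yields orientation `SS` with mate `17:198982`, matching the `t]p]` row above.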



studies

List<StudyEntry>

Information specific to each study the variant was read from, such as samples or statistics

studyId

String

Unique identifier of the study

secondaryAlternates

List<AlternateCoordinate>

Alternate alleles that appear along with a variant alternate

chromosome

String

Chromosome where the genomic variation occurred

start

int

First position (1-based) of the alternate

end

int

End position (1-based) of the alternate

reference

String

Reference allele

alternate

String

Alternate allele

type

VariantType

Type of variation: single nucleotide, indel or structural variation


files

List<FileEntry>

List of files from the study where the variant was present

fileId

String

Unique identifier of the source file

data

Map<String>

File-related data, which depends on the format of the file the variant was initially read from.

call

OriginalCall



variantId

Original call position for the variant, if the file was normalized.

alleleIndex

Index of the alternate allele in the original multi-allelic variant call from which this variant was decomposed



sampleDataKeys

List<String>

Data keys for each sample data

samples

List<SampleEntry>

Genotypes and other sample-related information. Each position corresponds to one sample. Each entry holds a list of values in the same order as the sampleDataKeys array, and the length of each list must equal the length of sampleDataKeys.

sampleId

String

Unique sample identifier

fileIndex

int

Relative position of the file within the StudyEntry

data

List<String>

Sample data
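
Because sample values are stored positionally, a reader has to pair each sample's data list with sampleDataKeys to recover named fields. The sketch below shows one way to do this on the JSON form of a StudyEntry; `sample_data_as_dicts` is a hypothetical helper name, and the input is assumed to be a plain dictionary shaped like the fields described above.

```python
from typing import Dict, List


def sample_data_as_dicts(study: dict) -> List[Dict[str, str]]:
    """Pair each sample's positional data with the study's sampleDataKeys."""
    keys = study["sampleDataKeys"]
    rows = []
    for sample in study["samples"]:
        values = sample["data"]
        # The model requires one value per key, in the same order.
        if len(values) != len(keys):
            raise ValueError(
                f"sample {sample.get('sampleId')!r} has {len(values)} values "
                f"but there are {len(keys)} sampleDataKeys"
            )
        rows.append(dict(zip(keys, values)))
    return rows


study = {
    "sampleDataKeys": ["GT", "AD", "DP", "GQ", "PL"],
    "samples": [
        {"sampleId": "", "fileIndex": 0, "data": ["1/1", "2,171", "173", "99", "2218,228,0"]},
        {"sampleId": "", "fileIndex": 0, "data": ["1/1", "0,33", "34", "60", "508,60,0"]},
    ],
}
rows = sample_data_as_dicts(study)
```

With this input, `rows[0]["GT"]` is `"1/1"` and `rows[1]["DP"]` is `"34"`.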


stats

List<VariantStats>

Statistics of the genomic variation, such as its allele and genotype counts or its minor allele frequency, grouped by cohort id.

cohortId

String

Cohort identifier

alleleCount

int

Total number of alleles in called genotypes. Does not include missing alleles

refAlleleCount

int

Number of reference alleles found in this variant

altAlleleCount

int

Number of main alternate alleles found in this variant. Does not include secondary alternates

refAlleleFreq

float

Reference allele frequency calculated from refAlleleCount and alleleCount, in the range [0,1]

altAlleleFreq

float

Alternate allele frequency calculated from altAlleleCount and alleleCount, in the range [0,1]

missingAlleleCount

int

Number of missing alleles

missingGenotypeCount

int

Number of missing genotypes

genotypeCount

Map<int>

Count for each genotype found

genotypeFreq

Map<float>

Genotype frequency for each genotype found

filterCount

Map<int>

Number of samples with non-missing genotype with that specific filter

filterFreq

Map<float>

Frequency of each filter. Count divided by the number of non-missing samples

qualityAvg

float

The weighted average of the Quality, computed only for non-missing samples

maf

float

Minor allele frequency

mgf

float

Minor genotype frequency

mafAllele

String

Allele with minor frequency

mgfGenotype

String

Genotype with minor frequency
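
To make the relationship between these fields concrete, the sketch below derives a subset of them from a list of genotype strings. It is a simplified illustration, not the Biodata implementation. It assumes allele "0" is the reference and "1" the main alternate, treats a genotype as missing only when every allele is ".", and takes maf as the smaller of the reference and alternate frequencies. The function name `cohort_stats` is hypothetical.

```python
import re
from collections import Counter
from typing import Dict, Iterable


def cohort_stats(genotypes: Iterable[str], reference: str, alternate: str) -> Dict[str, object]:
    allele_tally: Counter = Counter()
    genotype_count: Counter = Counter()
    missing_alleles = 0
    missing_genotypes = 0

    for gt in genotypes:
        alleles = re.split(r"[/|]", gt)          # unphased "/" or phased "|"
        called = [a for a in alleles if a != "."]
        missing_alleles += len(alleles) - len(called)
        if called:
            genotype_count[gt] += 1
        else:
            missing_genotypes += 1               # assumption: all alleles missing
        allele_tally.update(called)

    allele_count = sum(allele_tally.values())    # excludes missing alleles
    ref_count = allele_tally["0"]
    alt_count = allele_tally["1"]                # secondary alternates not counted
    ref_freq = ref_count / allele_count if allele_count else 0.0
    alt_freq = alt_count / allele_count if allele_count else 0.0
    called_genotypes = sum(genotype_count.values())

    return {
        "alleleCount": allele_count,
        "refAlleleCount": ref_count,
        "altAlleleCount": alt_count,
        "refAlleleFreq": ref_freq,
        "altAlleleFreq": alt_freq,
        "missingAlleleCount": missing_alleles,
        "missingGenotypeCount": missing_genotypes,
        "genotypeCount": dict(genotype_count),
        "genotypeFreq": {g: n / called_genotypes for g, n in genotype_count.items()},
        "maf": min(ref_freq, alt_freq),
        "mafAllele": reference if ref_freq <= alt_freq else alternate,
    }
```

Calling `cohort_stats(["1/1"] * 4, "A", "G")` gives an alleleCount of 8, an altAlleleFreq of 1.0 and a maf of 0.0 with mafAllele "A".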


scores

List<VariantScore>


id

Variant score ID

cohort1

Main cohort used for calculating the score

cohort2

Optional secondary cohort used for calculating the score

score

Score value

pValue

Score p value


issues

List<IssueType>


type

IssueType


DUPLICATION
DISCREPANCY
MENDELIAN_ERROR
DE_NOVO


sample

SampleEntry




annotation
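
As a compact summary of the top-level fields, the sketch below models a simple variant as a Python dataclass. It is illustrative only: the class is not the Biodata `Variant` class, it covers only a few fields, and the `chromosome:start:reference:alternate` layout of `id` is inferred from the example on this page (`1:69511:A:G`), so it may not hold for symbolic or structural variants.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Variant:
    """Hypothetical, reduced view of the top-level variant fields."""
    chromosome: str
    start: int
    end: int
    reference: str
    alternate: str
    type: str = "SNV"
    strand: str = "+"
    names: List[str] = field(default_factory=list)

    @property
    def id(self) -> str:
        # Assumed layout, taken from the example id "1:69511:A:G".
        return f"{self.chromosome}:{self.start}:{self.reference}:{self.alternate}"


variant = Variant("1", 69511, 69511, "A", "G", names=["rs75062661"])
```

Here `variant.id` evaluates to `"1:69511:A:G"`.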




Variant model example
{
    "id": "1:69511:A:G",
    "names": ["rs75062661"],
    "chromosome": "1",
    "start": 69511,
    "end": 69511,
    "strand": "+",
    "length": 1,
    "type": "SNV",
    "reference": "A",
    "alternate": "G",
    "studies": [
        {
            "studyId": "demo@family:corpasome",
            "files": [
                {
                    "fileId": "quartet.variants.annotated.vcf.gz",
                    "call": {},
                    "data": {
                        "ABHom": "0.982",
                        "AC": "8",
                        "AF": "1.00",
                        "AN": "8",
                        "BaseQRankSum": "2.089",
                        "DB": "true",
                        "DP": "331",
                        "Dels": "0.00",
                        "EFF": "NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|Aca/Gca|T141A|305|OR4F5||CODING|NM_001005484.1|1|1)",
                        "FILTER": "VQSRTrancheSNP99.90to100.00",
                        "FS": "8.817",
                        "HaplotypeScore": "2.4399",
                        "MLEAC": "8",
                        "MLEAF": "1.00",
                        "MQ": "15.47",
                        "MQ0": "145",
                        "MQRankSum": "-0.047",
                        "OND": "0.018",
                        "QD": "12.97",
                        "QUAL": "4293.01",
                        "ReadPosRankSum": "1.662",
                        "SB": "-1.450e+03",
                        "VCF_ID": "rs75062661",
                        "VQSLOD": "-14.4975",
                        "culprit": "MQ",
                        "set": "FilteredInAll"
                    }
                }
            ],
            "secondaryAlternates": [],
            "sampleDataKeys": ["GT", "AD", "DP", "GQ", "PL"],
            "samples": [
                {
                    "sampleId": "",
                    "fileIndex": 0,
                    "data": ["1/1", "2,171", "173", "99", "2218,228,0"]
                },
                {
                    "sampleId": "",
                    "fileIndex": 0,
                    "data": ["1/1", "0,33", "34", "60", "508,60,0"]
                },
                {
                    "sampleId": "",
                    "fileIndex": 0,
                    "data": ["1/1", "0,61", "63", "93", "777,93,0"]
                },
                {
                    "sampleId": "",
                    "fileIndex": 0,
                    "data": ["1/1", "0,61", "61", "96", "790,96,0"]
                }
            ],
            "issues": [],
            "scores": [],
            "stats": {
                "ALL": {
                    "alleleCount": 8,
                    "altAlleleCount": 8,
                    "altAlleleFreq": 1.0,
                    "filterCount": {"PASS": 0, "VQSRTrancheSNP99.90to100.00": 1},
                    "filterFreq": {"PASS": 0.0, "VQSRTrancheSNP99.90to100.00": 1.0},
                    "genotypeCount": {"0/0": 0, "0/1": 0, "1/1": 4},
                    "genotypeFreq": {"0/0": 0.0, "0/1": 0.0, "1/1": 1.0},
                    "maf": 0.0,
                    "mafAllele": "A",
                    "mgf": 0.0,
                    "mgfGenotype": "0/0",
                    "missingAlleleCount": 0,
                    "missingGenotypeCount": 0,
                    "qualityAvg": 4293.01,
                    "refAlleleCount": 0,
                    "refAlleleFreq": 0.0
                }
            }
        }
    ],

              "annotation": {
                "additionalAttributes": {
                    "opencga": {
                        "attribute": {
                            "annotationId": "CURRENT",
                            "release": "1"
                        }
                    }
                },
                "alternate": "G",
                "chromosome": "1",
                "consequenceTypes": [
                    {
                        "biotype": "protein_coding",
                        "cdnaPosition": 421,
                        "cdsPosition": 421,
                        "codon": "Aca/Gca",
                        "ensemblGeneId": "ENSG00000186092",
                        "ensemblTranscriptId": "ENST00000335137",
                        "exonOverlap": [
                            {"number": "1/1", "percentage": 0.108932465}
                        ],
                        "geneName": "OR4F5",
                        "proteinVariantAnnotation": {
                            "alternate": "ALA",
                            "features": [
                                {"description": "GPCR, rhodopsin-like, 7TM", "end": 280, "id": "IPR017452", "start": 34},
                                {"end": 182, "start": 90, "type": "disulfide bond"},
                                {"description": "Helical; Name=4", "end": 151, "start": 133, "type": "transmembrane region"},
                                {"description": "Olfactory receptor 4F5", "end": 305, "id": "PRO_0000150547", "start": 1, "type": "chain"}
                            ],
                            "keywords": ["Cell membrane", "Complete proteome", "Disulfide bond", "G-protein coupled receptor", "Membrane", "Olfaction", "Receptor", "Reference proteome", "Sensory transduction", "Transducer", "Transmembrane", "Transmembrane helix"],
                            "position": 141,
                            "reference": "THR",
                            "substitutionScores": [
                                {"description": "tolerated", "score": 0.63, "source": "sift"},
                                {"description": "benign", "score": 0.003, "source": "polyphen"}
                            ],
                            "uniprotAccession": "Q8NH21"
                        },
                        "sequenceOntologyTerms": [
                            {"accession": "SO:0001583", "name": "missense_variant"}
                        ],
                        "strand": "+",
                        "transcriptAnnotationFlags": ["CCDS", "basic"]
                    },
                    {
                        "sequenceOntologyTerms": [
                            {"accession": "SO:0001566", "name": "regulatory_region_variant"}
                        ]
                    }
                ],
                             "conservation": [{"score": 1.149999976158142,
                                               "source": "gerp"},
                                              {"score": 0.1289999932050705,
                                               "source": "phastCons"},
                                              {"score": -0.527999997138977,
                                               "source": "phylop"}],
                             "cytoband": [{"chromosome": "1",
                                           "end": 2300000,
                                           "name": "p36.33",
                                           "stain": "gneg",
                                           "start": 1}],
                             "displayConsequenceType": "missense_variant",
                             "functionalScore": [{"score": -0.7899999618530273,
                                                  "source": "cadd_raw"},
                                                 {"score": 0.03999999910593033,
                                                  "source": "cadd_scaled"}],
                             "geneDrugInteraction": [],
                             "geneTraitAssociation": [],
                             "hgvs": ["ENST00000335137(ENSG00000186092):c.421A>G"],
                             "id": "rs2691305",
                             "populationFrequencies": [{"altAllele": "G",
                                                        "altAlleleFreq": 0.95061594,
                                                        "altHomGenotypeFreq": 0.93263996,
                                                        "hetGenotypeFreq": 0.03595196,
                                                        "population": "ALL",
                                                        "refAllele": "A",
                                                        "refAlleleFreq": 0.049384065,
                                                        "refHomGenotypeFreq": 0.031408086,
                                                        "study": "GNOMAD_EXOMES"},
                                                       {"altAllele": "G",
                                                        "altAlleleFreq": 0.9499386,
                                                        "altHomGenotypeFreq": 0.92997545,
                                                        "hetGenotypeFreq": 0.03992629,
                                                        "population": "OTH",
                                                        "refAllele": "A",
                                                        "refAlleleFreq": 0.050061423,
                                                        "refHomGenotypeFreq": 0.03009828,
                                                        "study": "GNOMAD_EXOMES"},
                                                       {"altAllele": "G",
                                                        "altAlleleFreq": 0.999461,
                                                        "altHomGenotypeFreq": 0.99892205,
                                                        "hetGenotypeFreq": 0.0010779734,
                                                        "population": "EAS",
                                                        "refAllele": "A",
                                                        "refAlleleFreq": 0.0005389867,
                                                        "refHomGenotypeFreq": 0.0,
                                                        "study": "GNOMAD_EXOMES"},
                                                       {"altAllele": "G",
                                                        "altAlleleFreq": 0.95083994,
                                                        "altHomGenotypeFreq": 0.9305369,
                                                        "hetGenotypeFreq": 0.040606,
                                                        "population": "AMR",
                                                        "refAllele": "A",
                                                        "refAlleleFreq": 0.049160052,
                                                        "refHomGenotypeFreq": 0.028857054,
                                                        "study": "GNOMAD_EXOMES"},
                                                       {"altAllele": "G",
                                                        "altAlleleFreq": 0.97795016,
                                                        "altHomGenotypeFreq": 0.9710086,
                                                        "hetGenotypeFreq": 0.013883217,
                                                        "population": "ASJ",
                                                        "refAllele": "A",
                                                        "refAlleleFreq": 0.022049816,
                                                        "refHomGenotypeFreq": 0.015108207,
                                                        "study": "GNOMAD_EXOMES"},
                                                       {"altAllele": "G",
                                                        "altAlleleFreq": 0.99145377,
                                                        "altHomGenotypeFreq": 0.98848504,
                                                        "hetGenotypeFreq": 0.0059373877,
                                                        "population": "FIN",
                                                        "refAllele": "A",
                                                        "refAlleleFreq": 0.00854624,
                                                        "refHomGenotypeFreq": 0.005577546,
                                                        "study": "GNOMAD_EXOMES"},
                                                       {"altAllele": "G",
                                                        "altAlleleFreq": 0.9727796,
                                                        "altHomGenotypeFreq": 0.96255124,
                                                        "hetGenotypeFreq": 0.020456737,
                                                        "population": "NFE",
                                                        "refAllele": "A",
                                                        "refAlleleFreq": 0.027220415,
                                                        "refHomGenotypeFreq": 0.016992046,
                                                        "study": "GNOMAD_EXOMES"},
                                                       {"altAllele": "G",
                                                        "altAlleleFreq": 0.6074365,
                                                        "altHomGenotypeFreq": 0.47664425,
                                                        "hetGenotypeFreq": 0.26158446,
                                                        "population": "AFR",
                                                        "refAllele": "A",
                                                        "refAlleleFreq": 0.39256352,
                                                        "refHomGenotypeFreq": 0.2617713,
                                                        "study": "GNOMAD_EXOMES"},
                                                       {"altAllele": "G",
                                                        "altAlleleFreq": 0.95853204,
                                                        "altHomGenotypeFreq": 0.94338477,
                                                        "hetGenotypeFreq": 0.03029453,
                                                        "population": "MALE",
                                                        "refAllele": "A",
                                                        "refAlleleFreq": 0.041467976,
                                                        "refHomGenotypeFreq": 0.02632071,
                                                        "study": "GNOMAD_EXOMES"},
                                                       {"altAllele": "G",
                                                        "altAlleleFreq": 0.94091445,
                                                        "altHomGenotypeFreq": 0.91947174,
                                                        "hetGenotypeFreq": 0.04288538,
                                                        "population": "FEMALE",
                                                        "refAllele": "A",
                                                        "refAlleleFreq": 0.05908557,
                                                        "refHomGenotypeFreq": 0.03764288,
                                                        "study": "GNOMAD_EXOMES"},
                                                       {"altAllele": "G",
                                                        "altAlleleFreq": 0.84222084,
                                                        "altHomGenotypeFreq": 0.77478045,
                                                        "hetGenotypeFreq": 0.1348808,
                                                        "population": "ALL",
                                                        "refAllele": "A",
                                                        "refAlleleFreq": 0.15777917,
                                                        "refHomGenotypeFreq": 0.090338774,
                                                        "study": "GNOMAD_GENOMES"},
                                                       {"altAllele": "G",
                                                        "altAlleleFreq": 0.9404255,
                                                        "altHomGenotypeFreq": 0.9191489,
                                                        "hetGenotypeFreq": 0.04255319,
                                                        "population": "OTH",
                                                        "refAllele": "A",
                                                        "refAlleleFreq": 0.05957447,
                                                        "refHomGenotypeFreq": 0.038297873,
                                                        "study": "GNOMAD_GENOMES"},
                                                       {"altAllele": "G",
                                                        "altAlleleFreq": 1.0,
                                                        "altHomGenotypeFreq": 1.0,
                                                        "hetGenotypeFreq": 0.0,
                                                        "population": "EAS",
                                                        "refAllele": "A",
                                                        "refAlleleFreq": 0.0,
                                                        "refHomGenotypeFreq": 0.0,
                                                        "study": "GNOMAD_GENOMES"},
                                                       {"altAllele": "G",
                                                        "altAlleleFreq": 0.9410377,
                                                        "altHomGenotypeFreq": 0.9103774,
                                                        "hetGenotypeFreq": 0.061320756,
                                                        "population": "AMR",
                                                        "refAllele": "A",
                                                        "refAlleleFreq": 0.058962263,
                                                        "refHomGenotypeFreq": 0.028301887,
                                                        "study": "GNOMAD_GENOMES"},
                                                       {"altAllele": "G",
                                                        "altAlleleFreq": 0.9672131,
                                                        "altHomGenotypeFreq": 0.9508197,
                                                        "hetGenotypeFreq": 0.032786883,
                                                        "population": "ASJ",
                                                        "refAllele": "A",
                                                        "refAlleleFreq": 0.032786883,
                                                        "refHomGenotypeFreq": 0.016393442,
                                                        "study": "GNOMAD_GENOMES"},
                                                       {"altAllele": "G",
                                                        "altAlleleFreq": 0.9918478,
                                                        "altHomGenotypeFreq": 0.98913044,
                                                        "hetGenotypeFreq": 0.0054347827,
                                                        "population": "FIN",
                                                        "refAllele": "A",
                                                        "refAlleleFreq": 0.008152174,
                                                        "refHomGenotypeFreq": 0.0054347827,
                                                        "study": "GNOMAD_GENOMES"},
                                                       {"altAllele": "G",
                                                        "altAlleleFreq": 0.9637507,
                                                        "altHomGenotypeFreq": 0.94847214,
                                                        "hetGenotypeFreq": 0.03055722,
                                                        "population": "NFE",
                                                        "refAllele": "A",
                                                        "refAlleleFreq": 0.03624925,
                                                        "refHomGenotypeFreq": 0.02097064,
                                                        "study": "GNOMAD_GENOMES"},
                                                       {"altAllele": "G",
                                                        "altAlleleFreq": 0.5886525,
                                                        "altHomGenotypeFreq": 0.41246733,
                                                        "hetGenotypeFreq": 0.3523703,
                                                        "population": "AFR",
                                                        "refAllele": "A",
                                                        "refAlleleFreq": 0.4113475,
                                                        "refHomGenotypeFreq": 0.23516238,
                                                        "study": "GNOMAD_GENOMES"},
                                                       {"altAllele": "G",
                                                        "altAlleleFreq": 0.8381471,
                                                        "altHomGenotypeFreq": 0.7682737,
                                                        "hetGenotypeFreq": 0.13974673,
                                                        "population": "MALE",
                                                        "refAllele": "A",
                                                        "refAlleleFreq": 0.16185293,
                                                        "refHomGenotypeFreq": 0.09197956,
                                                        "study": "GNOMAD_GENOMES"},
                                                       {"altAllele": "G",
                                                        "altAlleleFreq": 0.84750646,
                                                        "altHomGenotypeFreq": 0.78322285,
                                                        "hetGenotypeFreq": 0.12856731,
                                                        "population": "FEMALE",
                                                        "refAllele": "A",
                                                        "refAlleleFreq": 0.1524935,
                                                        "refHomGenotypeFreq": 0.08820986,
                                                        "study": "GNOMAD_GENOMES"}],
                             "reference": "A",
                             "repeat": [{"chromosome": "1",
                                         "copyNumber": 2.0,
                                         "end": 87112,
                                         "id": "9119",
                                         "percentageMatch": 0.992904,
                                         "source": "genomicSuperDup",
                                         "start": 10001},
                                        {"chromosome": "1",
                                         "copyNumber": 2.0,
                                         "end": 87112,
                                         "id": "14903",
                                         "percentageMatch": 0.995437,
                                         "source": "genomicSuperDup",
                                         "start": 18393}],
                             "start": 69511,
                             "traitAssociation": [{"additionalProperties": [{"name": "mutationSomaticStatus_in_source_file",
                                                                             "value": "Confirmed "
                                                                                      "somatic "
                                                                                      "variant"}],
                                                   "alleleOrigin": [],
                                                   "bibliography": [],
                                                   "ethnicity": "Z",
                                                   "genomicFeatures": [{"featureType": "gene",
                                                                        "xrefs": {"symbol": "OR4F5"}},
                                                                       {"featureType": "gene",
                                                                        "xrefs": {"symbol": "8301"}}],
                                                   "heritableTraits": [],
                                                   "id": "COSM4144171",
                                                   "somaticInformation": {"histologySubtype": "neoplasm",
                                                                          "primaryHistology": "other",
                                                                          "primarySite": "thyroid",
                                                                          "sampleSource": "",
                                                                          "tumourOrigin": ""},
                                                   "source": {"name": "cosmic"},
                                                   "submissions": []}],
                             "variantTraitAssociation": {"clinvar": [],
                                                         "cosmic": [{"geneName": "OR4F5",
                                                                     "histologySubtype": "neoplasm",
                                                                     "mutationId": "COSM4144171",
                                                                     "mutationSomaticStatus": "Confirmed "
                                                                                              "somatic "
                                                                                              "variant",
                                                                     "primaryHistology": "other",
                                                                     "primarySite": "thyroid",
                                                                     "sampleSource": "",
                                                                     "siteSubtype": "",
                                                                     "tumourOrigin": ""}]}}}

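As an illustration of how the population frequency entries in the example above can be consumed, here is a minimal Python sketch that collects the alternate allele frequency per population for one study. The entry fields (study, population, altAlleleFreq) are the ones shown in the example; the enclosing key name "populationFrequencies" and the helper name alt_allele_freqs are assumptions made for this sketch, not part of any OpenCB client API.

```python
# Sketch only: the key "populationFrequencies" and the function name are
# assumptions; the per-entry fields are those shown in the example above.

def alt_allele_freqs(annotation, study):
    """Return {population: altAlleleFreq} for entries of the given study."""
    return {
        entry["population"]: entry["altAlleleFreq"]
        for entry in annotation.get("populationFrequencies", [])
        if entry.get("study") == study
    }

# Two entries copied from the example above, trimmed to the fields used here.
annotation = {
    "populationFrequencies": [
        {"study": "GNOMAD_GENOMES", "population": "FIN",
         "altAllele": "G", "altAlleleFreq": 0.9918478},
        {"study": "GNOMAD_GENOMES", "population": "AFR",
         "altAllele": "G", "altAlleleFreq": 0.5886525},
    ]
}

freqs = alt_allele_freqs(annotation, "GNOMAD_GENOMES")

# Populations whose alternate allele frequency is below an arbitrary cutoff.
below_cutoff = [pop for pop, freq in freqs.items() if freq < 0.6]
```

Keeping the result as a plain dictionary keyed by population makes it easy to compare the same variant across studies by calling the helper once per study name.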


Implementation

The variant data model is implemented in the OpenCB Biodata project, which allows the rest of the OpenCB projects, such as CellBase and Oskar, to use it.

