- Created by Nacho Medina, last modified on Mar 25, 2020
Overview
The genomic variant data model plays a crucial role not only in OpenCGA but across the whole OpenCB suite. It provides a generic way of representing any variant together with any other information of interest associated with it. OpenCGA uses it heavily when loading VCF files and when exporting query results. The model is implemented in the OpenCB Biodata project, which allows other OpenCB projects such as CellBase to use it.
Goals
The main goals of the variant data model are:
- To be able to represent any type of variant (SNV, INDEL) or structural variant (INSERTION, DELETION, CNV, TRANSLOCATION, ...), including phased variants and variants in non-diploid organisms.
- To provide a file-format-agnostic way of storing genomic variant data from VCF, gVCF, microarrays, ...
- To allow bioinformaticians to add valuable and rich annotations for researchers and clinicians.
Main Features
Some of the main features of the variant data model include:
Design
A high-level representation of the variant looks like this:
Field | Type | Description
id | String | Unique variant ID, built from the chromosome, position, reference and alternate alleles in the format chrom:pos:ref:alt
names | List<String> | Other IDs found for this genomic variant across all indexed VCF files
chromosome | String | Chromosome where the genomic variant is located
start | int | 1-based position where the genomic variant starts. For variants coming from VCF files this position is likely to be normalised; in that case, the original call in the file is stored in studies.files.call (see below)
end | int | 1-based position where the genomic variant ends. For variants coming from VCF files this position is likely to be normalised; in that case, the original call in the file is stored in studies.files.call (see below)
reference | String | Reference allele. For variants coming from VCF files this allele is likely to be normalised; in that case, the original call in the file is stored in studies.files.call (see below)
alternate | String | Alternate allele. For variants coming from VCF files this allele is likely to be normalised; in that case, the original call in the file is stored in studies.files.call (see below)
strand | String | Reference strand for this variant; by default all variants are represented on the positive strand
length | int | Length of the genomic variation, which depends on the variant type
type | VariantType | Type of variant, one of the accepted types and their Sequence Ontology (SO) terms
sv | StructuralVariation | Specific information for structural variants
studies | List<StudyEntry> | Information specific to each study the variant was read from, such as samples or statistics
annotation | VariantAnnotation | Variant annotation object; this is a large data model and is documented independently
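The id format described above (chrom:pos:ref:alt) can be illustrated with a minimal Python sketch. The names VariantKey, build_variant_id and parse_variant_id are hypothetical helpers written for this page, not part of Biodata or OpenCGA, and the sketch assumes that none of the four components contains a colon, which may not hold for some structural variants.

```python
from typing import NamedTuple


class VariantKey(NamedTuple):
    """The four components that make up a variant ID (hypothetical helper type)."""
    chromosome: str
    start: int
    reference: str
    alternate: str


def build_variant_id(chromosome: str, start: int, reference: str, alternate: str) -> str:
    """Join the components in the chrom:pos:ref:alt format."""
    return f"{chromosome}:{start}:{reference}:{alternate}"


def parse_variant_id(variant_id: str) -> VariantKey:
    """Split a chrom:pos:ref:alt ID back into its components.

    Assumption: no component contains ':'; anything else is rejected.
    """
    parts = variant_id.split(":")
    if len(parts) != 4:
        raise ValueError(f"expected chrom:pos:ref:alt, got {variant_id!r}")
    chromosome, start, reference, alternate = parts
    return VariantKey(chromosome, int(start), reference, alternate)


# Round trip using the ID of the example variant on this page.
key = parse_variant_id("1:69511:A:G")
rebuilt = build_variant_id(*key)
```

Keeping the position as an integer after parsing makes it directly comparable with the start field of the model.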
In the next section you can find the variant annotation schema.
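Before moving on to the annotation, it may help to see how the per-sample data inside a study entry can be read. In the example record on this page, each study carries a sampleDataKeys list and each sample carries a positional data list, so the two have to be paired by index. The sketch below is only an illustration under that assumption; sample_records is a hypothetical helper, and the small study dictionary is cut down from the example record.

```python
from typing import Dict, List


def sample_records(study: dict) -> List[Dict[str, str]]:
    """Pair each sample's positional values with the study's sampleDataKeys."""
    keys = study["sampleDataKeys"]
    records = []
    for sample in study["samples"]:
        values = sample["data"]
        if len(values) != len(keys):
            # Guard against misaligned entries instead of silently truncating.
            raise ValueError("sample data does not match sampleDataKeys")
        records.append(dict(zip(keys, values)))
    return records


# Cut-down study entry using values from the example record on this page.
study = {
    "sampleDataKeys": ["GT", "AD", "DP", "GQ", "PL"],
    "samples": [
        {"sampleId": "", "fileIndex": 0, "data": ["1/1", "2,171", "173", "99", "2218,228,0"]},
        {"sampleId": "", "fileIndex": 0, "data": ["1/1", "0,33", "34", "60", "508,60,0"]},
    ],
}

records = sample_records(study)
genotypes = [record["GT"] for record in records]
```

Storing the keys once per study rather than once per sample keeps the record compact when a study holds many samples.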
Variant Annotation
Field | Type | Description
id | String | Unique variant ID, built from the chromosome, position, reference and alternate alleles in the format chrom:pos:ref:alt
chromosome | String | Chromosome where the genomic variant is located
start | int | 1-based position where the genomic variant starts. For variants coming from VCF files this position is likely to be normalised
end | int | 1-based position where the genomic variant ends. For variants coming from VCF files this position is likely to be normalised
reference | String | Reference allele. For variants coming from VCF files this allele is likely to be normalised
alternate | String | Alternate allele. For variants coming from VCF files this allele is likely to be normalised
ancestralAllele | String |
xrefs | List<Xref> |
hgvs | List<String> |
displayConsequenceType | String |
consequenceTypes | List<ConsequenceType> |
populationFrequencies | List<PopulationFrequency> |
minorAllele | String |
minorAlleleFreq | float |
conservation | List<Score> |
geneTraitAssociation | List<GeneTraitAssociation> |
geneDrugInteraction | List<GeneDrugInteraction> |
traitAssociation | List<EvidenceEntry> |
functionalScore | List<Score> |
cytoband | List<Cytoband> |
repeat | List<Repeat> |
drugs | List<Drug> |
additionalAttributes | Map<String, AdditionalAttribute> |
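As a rough illustration of how the list-valued annotation fields can be consumed, the Python sketch below looks up one population frequency and collects the Sequence Ontology term names. The helper names alt_allele_freq and so_term_names are hypothetical, and the nested keys (study, population, altAlleleFreq, sequenceOntologyTerms, name) are taken from the example record on this page rather than from a formal schema.

```python
from typing import List, Optional


def alt_allele_freq(annotation: dict, study: str, population: str) -> Optional[float]:
    """Return altAlleleFreq for one study/population pair, or None if absent."""
    for entry in annotation.get("populationFrequencies") or []:
        if entry.get("study") == study and entry.get("population") == population:
            return entry.get("altAlleleFreq")
    return None


def so_term_names(annotation: dict) -> List[str]:
    """Collect distinct Sequence Ontology term names in order of appearance."""
    names: List[str] = []
    for consequence in annotation.get("consequenceTypes") or []:
        for term in consequence.get("sequenceOntologyTerms") or []:
            if term["name"] not in names:
                names.append(term["name"])
    return names


# Cut-down annotation using values from the example record on this page.
annotation = {
    "consequenceTypes": [
        {"sequenceOntologyTerms": [{"accession": "SO:0001583", "name": "missense_variant"}]},
        {"sequenceOntologyTerms": [{"accession": "SO:0001566", "name": "regulatory_region_variant"}]},
    ],
    "populationFrequencies": [
        {"study": "GNOMAD_GENOMES", "population": "AFR", "altAlleleFreq": 0.5886525},
        {"study": "GNOMAD_GENOMES", "population": "EAS", "altAlleleFreq": 1.0},
    ],
}

afr_freq = alt_allele_freq(annotation, "GNOMAD_GENOMES", "AFR")
terms = so_term_names(annotation)
```

Both helpers tolerate missing or null fields, since any of these lists may be absent for a given variant.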
Example
You can see a complete example here:
{
  "id": "1:69511:A:G",
  "names": ["rs75062661"],
  "chromosome": "1",
  "start": 69511,
  "end": 69511,
  "strand": "+",
  "length": 1,
  "type": "SNV",
  "reference": "A",
  "alternate": "G",
  "studies": [
    {
      "studyId": "demo@family:corpasome",
      "secondaryAlternates": [],
      "files": [
        {
          "fileId": "quartet.variants.annotated.vcf.gz",
          "call": {},
          "data": {
            "ABHom": "0.982",
            "AC": "8",
            "AF": "1.00",
            "AN": "8",
            "BaseQRankSum": "2.089",
            "DB": "true",
            "DP": "331",
            "Dels": "0.00",
            "EFF": "NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|Aca/Gca|T141A|305|OR4F5||CODING|NM_001005484.1|1|1)",
            "FILTER": "VQSRTrancheSNP99.90to100.00",
            "FS": "8.817",
            "HaplotypeScore": "2.4399",
            "MLEAC": "8",
            "MLEAF": "1.00",
            "MQ": "15.47",
            "MQ0": "145",
            "MQRankSum": "-0.047",
            "OND": "0.018",
            "QD": "12.97",
            "QUAL": "4293.01",
            "ReadPosRankSum": "1.662",
            "SB": "-1.450e+03",
            "VCF_ID": "rs75062661",
            "VQSLOD": "-14.4975",
            "culprit": "MQ",
            "set": "FilteredInAll"
          }
        }
      ],
      "sampleDataKeys": ["GT", "AD", "DP", "GQ", "PL"],
      "samples": [
        { "sampleId": "", "fileIndex": 0, "data": ["1/1", "2,171", "173", "99", "2218,228,0"] },
        { "sampleId": "", "fileIndex": 0, "data": ["1/1", "0,33", "34", "60", "508,60,0"] },
        { "sampleId": "", "fileIndex": 0, "data": ["1/1", "0,61", "63", "93", "777,93,0"] },
        { "sampleId": "", "fileIndex": 0, "data": ["1/1", "0,61", "61", "96", "790,96,0"] }
      ],
      "stats": [
        {
          "cohortId": "ALL",
          "alleleCount": 8,
          "altAlleleCount": 8,
          "altAlleleFreq": 1.0,
          "filterCount": { "PASS": 0, "VQSRTrancheSNP99.90to100.00": 1 },
          "filterFreq": { "PASS": 0.0, "VQSRTrancheSNP99.90to100.00": 1.0 },
          "genotypeCount": { "0/0": 0, "0/1": 0, "1/1": 4 },
          "genotypeFreq": { "0/0": 0.0, "0/1": 0.0, "1/1": 1.0 },
          "maf": 0.0,
          "mafAllele": "A",
          "mgf": 0.0,
          "mgfGenotype": "0/0",
          "missingAlleleCount": 0,
          "missingGenotypeCount": 0,
          "qualityAvg": 4293.01,
          "refAlleleCount": 0,
          "refAlleleFreq": 0.0
        }
      ],
      "scores": [],
      "issues": []
    }
  ],
  "annotation": {
    "id": "rs2691305",
    "chromosome": "1",
    "start": 69511,
    "reference": "A",
    "alternate": "G",
    "hgvs": ["ENST00000335137(ENSG00000186092):c.421A>G"],
    "displayConsequenceType": "missense_variant",
    "consequenceTypes": [
      {
        "geneName": "OR4F5",
        "ensemblGeneId": "ENSG00000186092",
        "ensemblTranscriptId": "ENST00000335137",
        "biotype": "protein_coding",
        "cdnaPosition": 421,
        "cdsPosition": 421,
        "codon": "Aca/Gca",
        "exonOverlap": [ { "number": "1/1", "percentage": 0.108932465 } ],
        "proteinVariantAnnotation": {
          "alternate": "ALA",
          "features": [
            { "description": "GPCR, rhodopsin-like, 7TM", "end": 280, "id": "IPR017452", "start": 34 },
            { "end": 182, "start": 90, "type": "disulfide bond" },
            { "description": "Helical; Name=4", "end": 151, "start": 133, "type": "transmembrane region" },
            { "description": "Olfactory receptor 4F5", "end": 305, "id": "PRO_0000150547", "start": 1, "type": "chain" }
          ],
          "keywords": ["Cell membrane", "Complete proteome", "Disulfide bond", "G-protein coupled receptor", "Membrane", "Olfaction", "Receptor", "Reference proteome", "Sensory transduction", "Transducer", "Transmembrane", "Transmembrane helix"],
          "position": 141,
          "reference": "THR",
          "substitutionScores": [
            { "description": "tolerated", "score": 0.63, "source": "sift" },
            { "description": "benign", "score": 0.003, "source": "polyphen" }
          ],
          "uniprotAccession": "Q8NH21"
        },
        "sequenceOntologyTerms": [ { "accession": "SO:0001583", "name": "missense_variant" } ],
        "strand": "+",
        "transcriptAnnotationFlags": ["CCDS", "basic"]
      },
      {
        "sequenceOntologyTerms": [ { "accession": "SO:0001566", "name": "regulatory_region_variant" } ]
      }
    ],
    "conservation": [
      { "score": 1.149999976158142, "source": "gerp" },
      { "score": 0.1289999932050705, "source": "phastCons" },
      { "score": -0.527999997138977, "source": "phylop" }
    ],
    "cytoband": [
      { "chromosome": "1", "end": 2300000, "name": "p36.33", "stain": "gneg", "start": 1 }
    ],
    "functionalScore": [
      { "score": -0.7899999618530273, "source": "cadd_raw" },
      { "score": 0.03999999910593033, "source": "cadd_scaled" }
    ],
    "geneDrugInteraction": [],
    "geneTraitAssociation": [],
    "populationFrequencies": [
      { "altAllele": "G", "altAlleleFreq": 0.95061594, "altHomGenotypeFreq": 0.93263996, "hetGenotypeFreq": 0.03595196, "population": "ALL", "refAllele": "A", "refAlleleFreq": 0.049384065, "refHomGenotypeFreq": 0.031408086, "study": "GNOMAD_EXOMES" },
      { "altAllele": "G", "altAlleleFreq": 0.9499386, "altHomGenotypeFreq": 0.92997545, "hetGenotypeFreq": 0.03992629, "population": "OTH", "refAllele": "A", "refAlleleFreq": 0.050061423, "refHomGenotypeFreq": 0.03009828, "study": "GNOMAD_EXOMES" },
      { "altAllele": "G", "altAlleleFreq": 0.999461, "altHomGenotypeFreq": 0.99892205, "hetGenotypeFreq": 0.0010779734, "population": "EAS", "refAllele": "A", "refAlleleFreq": 0.0005389867, "refHomGenotypeFreq": 0.0, "study": "GNOMAD_EXOMES" },
      { "altAllele": "G", "altAlleleFreq": 0.95083994, "altHomGenotypeFreq": 0.9305369, "hetGenotypeFreq": 0.040606, "population": "AMR", "refAllele": "A", "refAlleleFreq": 0.049160052, "refHomGenotypeFreq": 0.028857054, "study": "GNOMAD_EXOMES" },
      { "altAllele": "G", "altAlleleFreq": 0.97795016, "altHomGenotypeFreq": 0.9710086, "hetGenotypeFreq": 0.013883217, "population": "ASJ", "refAllele": "A", "refAlleleFreq": 0.022049816, "refHomGenotypeFreq": 0.015108207, "study": "GNOMAD_EXOMES" },
      { "altAllele": "G", "altAlleleFreq": 0.99145377, "altHomGenotypeFreq": 0.98848504, "hetGenotypeFreq": 0.0059373877, "population": "FIN", "refAllele": "A", "refAlleleFreq": 0.00854624, "refHomGenotypeFreq": 0.005577546, "study": "GNOMAD_EXOMES" },
      { "altAllele": "G", "altAlleleFreq": 0.9727796, "altHomGenotypeFreq": 0.96255124, "hetGenotypeFreq": 0.020456737, "population": "NFE", "refAllele": "A", "refAlleleFreq": 0.027220415, "refHomGenotypeFreq": 0.016992046, "study": "GNOMAD_EXOMES" },
      { "altAllele": "G", "altAlleleFreq": 0.6074365, "altHomGenotypeFreq": 0.47664425, "hetGenotypeFreq": 0.26158446, "population": "AFR", "refAllele": "A", "refAlleleFreq": 0.39256352, "refHomGenotypeFreq": 0.2617713, "study": "GNOMAD_EXOMES" },
      { "altAllele": "G", "altAlleleFreq": 0.84222084, "altHomGenotypeFreq": 0.77478045, "hetGenotypeFreq": 0.1348808, "population": "ALL", "refAllele": "A", "refAlleleFreq": 0.15777917, "refHomGenotypeFreq": 0.090338774, "study": "GNOMAD_GENOMES" },
      { "altAllele": "G", "altAlleleFreq": 0.9404255, "altHomGenotypeFreq": 0.9191489, "hetGenotypeFreq": 0.04255319, "population": "OTH", "refAllele": "A", "refAlleleFreq": 0.05957447, "refHomGenotypeFreq": 0.038297873, "study": "GNOMAD_GENOMES" },
      { "altAllele": "G", "altAlleleFreq": 1.0, "altHomGenotypeFreq": 1.0, "hetGenotypeFreq": 0.0, "population": "EAS", "refAllele": "A", "refAlleleFreq": 0.0, "refHomGenotypeFreq": 0.0, "study": "GNOMAD_GENOMES" },
      { "altAllele": "G", "altAlleleFreq": 0.9410377, "altHomGenotypeFreq": 0.9103774, "hetGenotypeFreq": 0.061320756, "population": "AMR", "refAllele": "A", "refAlleleFreq": 0.058962263, "refHomGenotypeFreq": 0.028301887, "study": "GNOMAD_GENOMES" },
      { "altAllele": "G", "altAlleleFreq": 0.9672131, "altHomGenotypeFreq": 0.9508197, "hetGenotypeFreq": 0.032786883, "population": "ASJ", "refAllele": "A", "refAlleleFreq": 0.032786883, "refHomGenotypeFreq": 0.016393442, "study": "GNOMAD_GENOMES" },
      { "altAllele": "G", "altAlleleFreq": 0.9918478, "altHomGenotypeFreq": 0.98913044, "hetGenotypeFreq": 0.0054347827, "population": "FIN", "refAllele": "A", "refAlleleFreq": 0.008152174, "refHomGenotypeFreq": 0.0054347827, "study": "GNOMAD_GENOMES" },
      { "altAllele": "G", "altAlleleFreq": 0.9637507, "altHomGenotypeFreq": 0.94847214, "hetGenotypeFreq": 0.03055722, "population": "NFE", "refAllele": "A", "refAlleleFreq": 0.03624925, "refHomGenotypeFreq": 0.02097064, "study": "GNOMAD_GENOMES" },
      { "altAllele": "G", "altAlleleFreq": 0.5886525, "altHomGenotypeFreq": 0.41246733, "hetGenotypeFreq": 0.3523703, "population": "AFR", "refAllele": "A", "refAlleleFreq": 0.4113475, "refHomGenotypeFreq": 0.23516238, "study": "GNOMAD_GENOMES" }
    ],
    "repeat": [
      { "chromosome": "1", "copyNumber": 2.0, "end": 87112, "id": "9119", "percentageMatch": 0.992904, "source": "genomicSuperDup", "start": 10001 },
      { "chromosome": "1", "copyNumber": 2.0, "end": 87112, "id": "14903", "percentageMatch": 0.995437, "source": "genomicSuperDup", "start": 18393 }
    ],
    "traitAssociation": [
      {
        "additionalProperties": [ { "name": "mutationSomaticStatus_in_source_file", "value": "Confirmed somatic variant" } ],
        "alleleOrigin": [],
        "bibliography": [],
        "ethnicity": "Z",
        "genomicFeatures": [
          { "featureType": "gene", "xrefs": { "symbol": "OR4F5" } },
          { "featureType": "gene", "xrefs": { "symbol": "8301" } }
        ],
        "heritableTraits": [],
        "id": "COSM4144171",
        "somaticInformation": { "histologySubtype": "neoplasm", "primaryHistology": "other", "primarySite": "thyroid", "sampleSource": "", "tumourOrigin": "" },
        "source": { "name": "cosmic" },
        "submissions": []
      }
    ],
    "additionalAttributes": {
      "opencga": {
        "attribute": { "annotationId": "CURRENT", "release": "1" }
      }
    }
  }
}
Implementation
The variant data model is implemented in the OpenCB Biodata project, which allows other OpenCB projects such as CellBase and Oskar to use it.