Data organisation

OpenCGA uses a two-level structure to organise datasets. Briefly, these levels are Projects and Studies, and they are used to organise HGVA data and metadata:

Project is the top level and can contain one or more studies. Projects are specific to one species and assembly. All studies in a project are stored and indexed together in the same database and therefore share the variant annotation.
Study represents a particular dataset, which can contain sample metadata, cohorts and all the genomic variants. For example, the 1000 Genomes Project is defined as a study in OpenCGA and belongs to the Reference GRCh37 project. Likewise, the Genome of the Netherlands and the Exome Aggregation Consortium are two different studies. You can also define cohorts in a study; a cohort is simply a set of samples defined within that study. For example, populations and super-populations within the 1000 Genomes Project are defined as cohorts, so EUR, AMR and GBR are examples of cohorts.
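The hierarchy described above can be pictured with a minimal Python sketch. This is only an illustration of the Project, Study and Cohort relationship, not the OpenCGA Catalog data model: the class names, fields and sample identifiers are assumptions made for the example, and variants are deliberately left out.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Cohort:
    """A named set of sample identifiers defined within one study."""
    name: str
    samples: Set[str] = field(default_factory=set)


@dataclass
class Study:
    """A dataset with its cohorts (variants are not modelled here)."""
    name: str
    alias: str
    cohorts: Dict[str, Cohort] = field(default_factory=dict)

    def add_cohort(self, cohort: Cohort) -> None:
        self.cohorts[cohort.name] = cohort


@dataclass
class Project:
    """Top level: one species and assembly shared by all its studies."""
    name: str
    alias: str
    species: str
    assembly: str
    studies: Dict[str, Study] = field(default_factory=dict)

    def add_study(self, study: Study) -> None:
        self.studies[study.alias] = study


# Example taken from the text; sample IDs are hypothetical placeholders.
project = Project("Reference GRCh37", "reference_grch37", "Homo sapiens", "GRCh37")
study = Study("1000 Genomes Project GRCh37", "1kG_phase3")
study.add_cohort(Cohort("EUR", {"sample_1", "sample_2"}))
study.add_cohort(Cohort("GBR", {"sample_1"}))
project.add_study(study)
```

Keeping species and assembly on the project rather than on each study mirrors the statement that all studies in a project are indexed together and share the same variant annotation.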

You can find more information about data organisation in OpenCGA Catalog Data Management. Projects and Studies each have a unique alias to ease their use from the command line and the REST API; for details on how to query data programmatically, see RESTful Web Services and Clients. The next section gives the full list and organisation of Projects and Studies (datasets) in HGVA.

Datasets

In this section you can find all datasets loaded in HGVA and how they are organised into Projects and Studies (see previous section).

Project name (alias)	Study name	Study alias	HGVA v1 (Dec. 2016)	HGVA v2 (Jan. 2018)
Reference GRCh37 (reference_grch37)	1000 Genomes Project GRCh37	1kG_phase3	Phase 3 2016-05	Phase 3 2016-05
Reference GRCh37 (reference_grch37)	Exome Sequencing Project (ESP6500)	ESP6500	2016-05	2016-05
Reference GRCh37 (reference_grch37)	Exome Aggregation Consortium (ExAC)	EXAC	0.3.1 2016-05	0.3.1 2016-05
Reference GRCh37 (reference_grch37)	Genome of the Netherlands (GoNL)	GONL	Release 5 2016-05	Release 5 2016-05
Reference GRCh37 (reference_grch37)	UK10K Project	UK10k	2016-05	2016-05
Reference GRCh37 (reference_grch37)	DiscovEHR	DISCOVEHR	-
Reference GRCh37 (reference_grch37)	Genome Aggregation Database (gnomAD Exomes)	GNOMAD_EXOMES	-
Reference GRCh37 (reference_grch37)	Genome Aggregation Database (gnomAD Genomes)	GNOMAD_GENOMES	-
Reference GRCh37 (reference_grch37)	Spanish Medical Genome Project (MGP)	MGP	2016-12	2016-12
Reference GRCh38 (reference_grch38)	1000 Genomes Project GRCh38	1kG_phase3	Phase 3 2016-10	Phase 3 2016-10
Reference GRCh38 (reference_grch38)	ESP6500	ESP6500	-
Reference GRCh38 (reference_grch38)	UK10K Project (*)	UK10K	-
Reference GRCh38 (reference_grch38)	DiscovEHR (*)	DISCOVEHR	-
Reference GRCh38 (reference_grch38)	Genome Aggregation Database (gnomAD Exomes) (*)	GNOMAD_EXOMES	-
Reference GRCh38 (reference_grch38)	Genome Aggregation Database (gnomAD Genomes) (*)	GNOMAD_GENOMES	-
Cancer GRCh37 (cancer_grch37)	QIMR Berghofer Melanoma	QIMR_Berghofer_Melanoma	2016-12	2016-12
Cancer GRCh37 (cancer_grch37)	Chronic Myeloid Leukemia - Russian Academy of Medical Sciences	RAMS_CML	2016-12	2016-12
Platinum (platinum)	Illumina Platinum	illumina_platinum	2015-08	To be decided

(*) Liftover carried out by Genomics England (GEL)
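Some study aliases listed in this section, such as 1kG_phase3, appear under more than one project, so a study alias is easiest to resolve together with its project alias. The following sketch records the project and study aliases from this section in a plain dictionary and looks them up. The helper names and the `project:study` string are illustrative assumptions, not a confirmed OpenCGA identifier syntax.

```python
from typing import Dict, List

# Project alias -> study aliases, as listed in this section.
HGVA_STUDIES: Dict[str, List[str]] = {
    "reference_grch37": [
        "1kG_phase3", "ESP6500", "EXAC", "GONL", "UK10k",
        "DISCOVEHR", "GNOMAD_EXOMES", "GNOMAD_GENOMES", "MGP",
    ],
    "reference_grch38": [
        "1kG_phase3", "ESP6500", "UK10K",
        "DISCOVEHR", "GNOMAD_EXOMES", "GNOMAD_GENOMES",
    ],
    "cancer_grch37": ["QIMR_Berghofer_Melanoma", "RAMS_CML"],
    "platinum": ["illumina_platinum"],
}


def projects_with_study(study_alias: str) -> List[str]:
    """Return every project alias that contains the given study alias."""
    return [p for p, studies in HGVA_STUDIES.items() if study_alias in studies]


def qualified_id(project_alias: str, study_alias: str) -> str:
    """Build a hypothetical 'project:study' label after validating both aliases."""
    if study_alias not in HGVA_STUDIES.get(project_alias, []):
        raise ValueError(f"{study_alias!r} is not a study of {project_alias!r}")
    return f"{project_alias}:{study_alias}"


print(projects_with_study("1kG_phase3"))
print(qualified_id("platinum", "illumina_platinum"))
```

The lookup is case-sensitive and keeps the aliases exactly as they are written in the table.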

Variant Annotation

Variant annotation was carried out by the CellBase project. Please check the CellBase documentation for details on additional data sources: Data sources and species
