Info

OpenCGA v1.3.0 will allow users to define internal releases and versions of Catalog data.

Overview

OpenCGA v1.2.0 added support for creating internal data releases. The new release field is now included in most of the data models (#616) to indicate the release in which the data was first created and registered in the database. Version 1.3.0 adds a new feature that allows creating different versions of the data (#684). This feature was added to the Sample, Individual and Family data models.

The new release field, in conjunction with data versioning, allows users to run queries such as:

Fetch data inserted in a specific release or within a range of releases.
Fetch all the versions (whole history) of an entry*.
Fetch a specific version of an entry*.
Look for historic data from older releases*.

* The only supported entries at the moment are Sample, Individual and Family.

Releases

The release is managed by the Project entry, and all the data within a Project will be associated with the same release number.

Many research institutions need to create deliverables from time to time that contain everything done up to a given point. Since version 1.2.0, a new release field is present in most of the data models (#616). All the data (samples, files, individuals, studies...) makes sense within a Project context, which manages the current release of the data. Therefore, Projects do not have a release field, but rather a release counter field showing the current release of the data being ingested at the moment.

Project

Every time a user creates a new project, the project will be created with the release counter field set to 1. Only the owner of the project will be able to increase the release counter, using a new RESTful web service (/{version}/projects/{project}/increlease) added for this purpose.
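As an illustration, the path of this web service could be assembled as shown below. This is only a sketch: the base URL, the v1 version segment and the project identifier are hypothetical placeholders. The HTTP method and the authentication required by the endpoint are not specified here, so the example builds the URL without sending any request.

```python
from urllib.parse import quote


def increlease_url(base_url: str, version: str, project: str) -> str:
    """Build the URL of the increlease web service for one project."""
    # Path template taken from the text: /{version}/projects/{project}/increlease
    path = f"/{quote(version, safe='')}/projects/{quote(project, safe='')}/increlease"
    return base_url.rstrip("/") + path


# Hypothetical host, version segment and project identifier.
print(increlease_url("https://example.org/rest", "v1", "myproject"))
```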

Other entries

Entries contained in a Project, such as Study, Sample, File..., will be assigned a release number matching the current value of the release counter of the project that contains them. This release number can never be modified, as it simply reflects the moment (release) in which the data was added.

Querying data by release

Now that a new release tag is present in every entry, it becomes easy to query data from a specific release or from a range of releases. All the /xxx/search RESTful web services were updated to include a new release query parameter. Some example queries can be found below:

Query for data created in release 2: release=2
Query for data created before release 4: release<4
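The meaning of these two filters can be sketched with a small in-memory model. The code below is not how the server evaluates the parameter, and it makes no claim about how the filter is encoded in a request URL. The matches_release helper and the sample records are hypothetical, and only the two operators shown in the examples are handled.

```python
import operator
import re

# Comparison operators that appear in the examples above ("=" and "<").
_OPS = {"=": operator.eq, "<": operator.lt}


def matches_release(entry_release: int, expression: str) -> bool:
    """Return True if an entry's release satisfies e.g. 'release=2' or 'release<4'."""
    m = re.fullmatch(r"release\s*(=|<)\s*(\d+)", expression)
    if m is None:
        raise ValueError(f"unsupported expression: {expression!r}")
    op, value = m.groups()
    return _OPS[op](entry_release, int(value))


# Hypothetical entries, each tagged with the release in which it was created.
samples = [
    {"id": "s1", "release": 1},
    {"id": "s2", "release": 2},
    {"id": "s3", "release": 4},
]

print([s["id"] for s in samples if matches_release(s["release"], "release=2")])  # ['s2']
print([s["id"] for s in samples if matches_release(s["release"], "release<4")])  # ['s1', 's2']
```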

Versioning

Supporting different versions of the metadata is quite useful, especially when producing clinical analysis reports that are based on specific data. Because data may change over time, being able to easily fetch the old data supporting an old report makes a lot of sense. OpenCGA v1.3.0 (#684) added version support for the Sample, Individual and Family data models. At the moment, we only support versioning in these entries because they will most probably contain all the clinical information. However, version support might be extended to other entries in the future.

Usage

Sample, Individual and Family now have a new version attribute in their data models. This new field is an automatically managed number that is increased on an update only if the user decides to create a new version of the data.

Update web services

Sample, Individual and Family update web services now have a new query parameter called incVersion. This new parameter is a Boolean indicating whether the version of the entry being updated should be increased (creating a new version of the entry containing the changes) or not. If set to false, the data in the entry will be overwritten by the new values sent during the update (default behaviour). However, if it is set to true, a new entry will be stored in the database containing the latest data available plus the changes requested by the user.

In addition, the Individual and Family update web services each have another query parameter, called updateSampleVersion and updateIndividualVersion respectively. Bearing in mind the hierarchy of the data models, these new Boolean parameters are used to update the version of the references. For example, consider an Individual in the database that contains two samples but points to old versions of them. If we want the Individual to be updated to point to the latest Sample versions, we would set the updateSampleVersion parameter to true.

These parameters can be used in any combination, or all at the same time. For example, it is possible to call the individual/update web service passing some fields to be updated and setting both updateSampleVersion and incVersion to true. In that case, a new version of the individual will be stored in the database, containing the changes requested by the user and with the sample references updated to point to their latest versions.
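The behaviour of these update parameters can be sketched with a toy in-memory store. This is only a model of the semantics described above, not OpenCGA code: the VersionedStore class, the update_individual function and the layout of the sample references are all hypothetical.

```python
from copy import deepcopy


class VersionedStore:
    """Toy in-memory store that keeps every version of each entry."""

    def __init__(self):
        self._history = {}  # entry id -> list of versions, oldest first

    def create(self, entry_id, data):
        self._history[entry_id] = [{**deepcopy(data), "version": 1}]

    def latest(self, entry_id):
        return self._history[entry_id][-1]

    def history(self, entry_id):
        return list(self._history[entry_id])

    def update(self, entry_id, changes, inc_version=False):
        current = self.latest(entry_id)
        if inc_version:
            # Keep the old version and append a copy that carries the changes.
            new = {**deepcopy(current), **changes, "version": current["version"] + 1}
            self._history[entry_id].append(new)
        else:
            # Default behaviour: overwrite the latest version in place.
            current.update(changes)
        return self.latest(entry_id)


def update_individual(individuals, samples, ind_id, changes,
                      inc_version=False, update_sample_version=False):
    """Update an individual, optionally re-pointing it to the latest sample versions."""
    changes = dict(changes)
    if update_sample_version:
        refs = individuals.latest(ind_id)["samples"]
        changes["samples"] = [
            {"id": r["id"], "version": samples.latest(r["id"])["version"]} for r in refs
        ]
    return individuals.update(ind_id, changes, inc_version=inc_version)


samples = VersionedStore()
individuals = VersionedStore()
samples.create("s1", {"description": "blood"})
individuals.create("ind1", {"name": "John", "samples": [{"id": "s1", "version": 1}]})
samples.update("s1", {"description": "blood, relabelled"}, inc_version=True)  # s1 is now at version 2

result = update_individual(individuals, samples, "ind1", {"name": "John D."},
                           inc_version=True, update_sample_version=True)
print(result["version"], result["samples"])  # 2 [{'id': 's1', 'version': 2}]
print(len(individuals.history("ind1")))      # 2
```

With inc_version left at its default of False, the same call would modify version 1 of the individual in place and the history would still contain a single entry.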

Info web services

Sample, Individual and Family info web services now contain two new query parameters: a numeric one called version and a Boolean one called allVersions. If the user does not pass either of these parameters, the info web service will work as before, returning only the latest version of the data being fetched. If allVersions is set to true, it will return the whole history of the entry, that is, a list containing all the different versions of the entry being requested. The user can also request a specific version of the entry using the version parameter.
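The three cases can be summarised in a short sketch. The info function below is a hypothetical model that operates on a plain list of versions, not the web service itself. The text does not say what happens when both parameters are passed together, so giving all_versions precedence is an assumption made for this example only.

```python
def info(history, version=None, all_versions=False):
    """Select versions of one entry; history is a list of versions, oldest first."""
    if all_versions:
        # Whole history of the entry (assumed to take precedence over version).
        return list(history)
    if version is not None:
        # One specific version; the list is empty if that version does not exist.
        return [v for v in history if v["version"] == version]
    # Default: only the latest version.
    return [history[-1]]


# Hypothetical history of a sample with three versions.
history = [
    {"id": "s1", "version": 1, "description": "first"},
    {"id": "s1", "version": 2, "description": "second"},
    {"id": "s1", "version": 3, "description": "third"},
]

print([v["version"] for v in info(history)])                     # [3]
print([v["version"] for v in info(history, version=2)])          # [2]
print([v["version"] for v in info(history, all_versions=True)])  # [1, 2, 3]
```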

Search web services

Sample, Individual and Family search web services now contain a new query parameter called snapshot, in addition to the release parameter added in OpenCGA v1.2.0. Although the two might seem very similar, their meanings differ and the results obtained may be quite different.

If the release parameter is used, the query will return the latest version of the entries that were created in that release number (or range).

If the snapshot parameter is used, the query will return the latest version of the entries as of that release. This is best understood with an example. Imagine we are currently in release 3 and our sample has 3 versions (one new version per release). If we specify snapshot=2, the sample information fetched corresponds to version 2 of the data and not the very latest one.

If we don't specify the snapshot parameter, we will always get the latest version available in the database that matches the specified criteria. The release and snapshot parameters can also be used together, which allows queries such as "give me the latest snapshot available in release 2 of the samples that were created in release 1".
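The difference between the two parameters can be sketched with a small in-memory model. This is not the server's implementation. In particular, the version_release field (the release in which a given version was stored) is a hypothetical name introduced here to make the snapshot filter expressible, and the release filter is modelled as an exact match only.

```python
def search(versions, release=None, snapshot=None):
    """Return one version per entry, following the semantics described above.

    Each item of versions is a dict with the keys id, version, release (the
    release in which the entry was first created) and version_release
    (hypothetical: the release in which this particular version was stored).
    """
    best = {}
    for v in versions:
        if release is not None and v["release"] != release:
            continue  # the entry was not created in the requested release
        if snapshot is not None and v["version_release"] > snapshot:
            continue  # this version did not exist yet at that snapshot
        if v["id"] not in best or v["version"] > best[v["id"]]["version"]:
            best[v["id"]] = v
    return sorted(best.values(), key=lambda v: v["id"])


# s1 was created in release 1 and got one new version per release;
# s2 was created in release 3.
versions = [
    {"id": "s1", "version": 1, "release": 1, "version_release": 1},
    {"id": "s1", "version": 2, "release": 1, "version_release": 2},
    {"id": "s1", "version": 3, "release": 1, "version_release": 3},
    {"id": "s2", "version": 1, "release": 3, "version_release": 3},
]


def summary(results):
    return [(v["id"], v["version"]) for v in results]


print(summary(search(versions)))                         # [('s1', 3), ('s2', 1)]
print(summary(search(versions, snapshot=2)))             # [('s1', 2)]
print(summary(search(versions, release=3)))              # [('s2', 1)]
print(summary(search(versions, release=1, snapshot=2)))  # [('s1', 2)]
```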

Export and import

Export data to a different database

Not yet implemented

Available for next 1.3.0 release

The release concept is normally associated with the concept of deliverables. OpenCGA can be used by a group of users who ingest new data and update it over time to meet some kind of deadline. When the deadline arrives, the owner of the project will need to increase the release counter so that newly ingested data is associated with a new release number, in line with the requirements for the next deadline.

Once a release is finished, that data could be made available to other kinds of users who only need read-only access to it as is. One of the things OpenCGA will offer in the next 1.3.0 release is the option to export old releases of data to a read-only database, so other researchers can access that data without interfering with work that might still be in progress in the source database for the next release.

Export

The export option will export complete projects up to a specified release number. This means that if the release counter of the project is 4 and the user wants to export up to release 3, the project itself and all the studies, samples, files... created during releases 1, 2 and 3 will be exported.

When exporting a project, permissions and groups associated with the studies are never exported. This information will be lost in the exported file(s). Only the data itself and the cross-references are exported.
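Since the export feature is not implemented yet, the following is only a sketch of the selection rule described above, not of the future tool. The select_for_export function and the permissions and groups field names are hypothetical.

```python
# Hypothetical names of the access-control fields that are never exported.
ACCESS_CONTROL_FIELDS = ("permissions", "groups")


def select_for_export(entries, up_to_release):
    """Keep entries created up to the given release, without access-control data."""
    exported = []
    for entry in entries:
        if entry["release"] <= up_to_release:
            # Copy the data and cross-references, dropping permissions and groups.
            exported.append(
                {k: v for k, v in entry.items() if k not in ACCESS_CONTROL_FIELDS}
            )
    return exported


# Hypothetical entries of a project whose release counter is 4.
entries = [
    {"id": "study1", "release": 1, "groups": ["@members"]},
    {"id": "sample1", "release": 2, "permissions": ["VIEW"]},
    {"id": "sample2", "release": 3},
    {"id": "sample3", "release": 4},
]

print(select_for_export(entries, up_to_release=3))
```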

Import

Importing data from another OpenCGA installation is much trickier than just exporting it. For this reason, some restrictions need to be satisfied to guarantee that everything works properly.

The very first time data is imported into another database, that database should NOT contain any project, study... However, it may already contain users.
Imports can be incremental.

To finish

Allowed operations over source database: any

Allowed operations over target database: Login, create groups, create users, assign permissions!
