- Created by Nacho Medina, last modified by Joaquín Tárraga Giménez on Jan 07, 2020
OpenCGA Alignment Engine provides a solution to store and process sequence alignment data from Next-Generation Sequencing (NGS) projects. The Alignment Engine supports the most common alignment file formats (SAM, BAM and CRAM), and takes the alignment data model specification from GA4GH and the implementation from OpenCB GA4GH. See a full description at Alignment Data Model.
Main features
We do not define or endorse any dedicated format for unaligned sequence data. Instead, we recommend storing such data in one of the alignment formats (SAM, BAM or CRAM) with the unmapped flag set. For completeness, however, we list the most common formats below with external links. Genomic variants provide a source of data for analysis and visualization in compatible viewers such as GenomeMaps. Fast reading and filtering of variants speeds up analysis and gives faster and more accurate results.
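The recommendation to store unaligned reads with the unmapped flag set can be illustrated with a minimal sketch. In the SAM format, the FLAG column is the second tab-separated field and bit 0x4 marks a segment as unmapped. The class name `SamFlagSketch` and the sample records are hypothetical and are not part of OpenCGA; this only shows how such a record could be recognised.

```java
// Hypothetical helper, not an OpenCGA class: checks the SAM "unmapped" flag bit.
final class SamFlagSketch {
    // SAM FLAG bit 0x4: the segment is unmapped.
    static final int UNMAPPED = 0x4;

    // Expects one SAM alignment record (not a '@' header line).
    static boolean isUnmapped(String samRecord) {
        String[] fields = samRecord.split("\t");
        if (fields.length < 11) {
            // A SAM alignment record has 11 mandatory fields.
            throw new IllegalArgumentException("Not a SAM alignment record");
        }
        int flag = Integer.parseInt(fields[1]);
        return (flag & UNMAPPED) != 0;
    }

    public static void main(String[] args) {
        // Made-up records for illustration only.
        String unaligned = "read1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII";
        String aligned = "read2\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII";
        System.out.println(isUnmapped(unaligned));
        System.out.println(isUnmapped(aligned));
    }
}
```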
An increasing number of biological formats related to a common NGS pipeline are supported by OpenCGA. Among these formats, we focus on Genomic Variants because of their complexity and analysis capabilities.
Operations
An extensive list of operations can be executed with the Variant Storage Engine. These operations are:
- Variant Index Pipeline
- Sample Aggregation
- Variant Annotation
- Variant Stats Calculation
- Secondary Index
- Export / Import
Study oriented
The OpenCGA Variant Storage creates an independent database for each project. This database, like the projects in Catalog, is divided into studies. This design distributes the data into independent studies and allows queries across multiple studies. It also reduces disk space consumption and the time required to generate the variant annotation, because the same variant annotation is shared across the whole database.
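The study-oriented layout can be sketched as a toy in-memory model: one store per project, each variant tagged with the studies that contain it, and a single annotation kept per variant regardless of how many studies share it. Everything here (the class `ProjectVariantStoreSketch`, its methods, the study names and variant identifiers) is hypothetical and only illustrates the idea; it is not how the OpenCGA storage engine is implemented.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model of one project database divided into studies.
final class ProjectVariantStoreSketch {
    // Variant id -> names of the studies in which the variant was loaded.
    private final Map<String, Set<String>> studiesByVariant = new HashMap<>();
    // Variant id -> annotation, stored once for the whole project database.
    private final Map<String, String> annotationByVariant = new HashMap<>();

    void load(String study, String variantId) {
        studiesByVariant.computeIfAbsent(variantId, k -> new TreeSet<>()).add(study);
    }

    void annotate(String variantId, String annotation) {
        annotationByVariant.put(variantId, annotation);
    }

    // Returns the sorted ids of variants present in at least one of the given studies.
    List<String> query(Collection<String> studies) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : studiesByVariant.entrySet()) {
            if (!Collections.disjoint(entry.getValue(), studies)) {
                hits.add(entry.getKey());
            }
        }
        Collections.sort(hits);
        return hits;
    }

    // Number of stored annotations (one per variant, not one per study).
    int annotationCount() {
        return annotationByVariant.size();
    }
}
```

In this sketch a variant loaded in two studies still has a single annotation entry, which is the source of the disk and time savings described above.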
We believe that it is important to keep the databases mostly unaware of the format in which the data was originally stored. A reference to this format will only be stored for specific purposes involving file transfers.
Data models for variants and alignments have been designed and implemented in Java. They explicitly specify the most commonly used fields, and at the same time provide mechanisms for preserving all the information of a certain format. For instance, the fields specified for a variant would be (among others) chromosome, position, reference and alternatives; if a VCF file is being stored, then columns such as INFO are also saved in a key-value data structure.
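As a rough illustration of this design, the following sketch keeps the common fields explicit and preserves a VCF INFO column as key-value pairs. The class `VariantSketch`, its field names and the `parseInfo` helper are hypothetical; they are not the OpenCB Biodata or GA4GH classes, and the sample values are made up.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the OpenCB Biodata Variant class.
final class VariantSketch {
    // Explicit, commonly used fields.
    final String chromosome;
    final int position;
    final String reference;
    final List<String> alternates;
    // Format-specific information (for example the VCF INFO column), kept as key-value pairs.
    final Map<String, String> attributes = new LinkedHashMap<>();

    VariantSketch(String chromosome, int position, String reference, List<String> alternates) {
        this.chromosome = chromosome;
        this.position = position;
        this.reference = reference;
        this.alternates = alternates;
    }

    // Splits an INFO string such as "DP=14;AF=0.5;DB" into key-value pairs.
    // Keys without a value (VCF flags) are stored with the value "true".
    static Map<String, String> parseInfo(String info) {
        Map<String, String> result = new LinkedHashMap<>();
        if (info == null || info.isEmpty() || info.equals(".")) {
            return result; // "." means the INFO column is empty
        }
        for (String entry : info.split(";")) {
            if (entry.isEmpty()) {
                continue;
            }
            int eq = entry.indexOf('=');
            if (eq < 0) {
                result.put(entry, "true");
            } else {
                result.put(entry.substring(0, eq), entry.substring(eq + 1));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        VariantSketch variant = new VariantSketch("1", 12345, "A", Arrays.asList("G"));
        variant.attributes.putAll(parseInfo("DP=14;AF=0.5;DB"));
        System.out.println(variant.chromosome + ":" + variant.position + " " + variant.attributes);
    }
}
```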
OpenCGA imports different data models from OpenCB Biodata and GA4GH, such as the Variant and Alignment data models, while others, such as the Catalog Data Models, have been developed in OpenCGA itself. In next sections you will find
Catalog
Catalog models all the information about users, projects, studies, files, jobs, samples and clinical data, among others. It has been developed internally in the OpenCGA Catalog component; you can find more detailed information at Catalog > Catalog Data Models.
Storage Alignment
OpenCGA takes the Alignment data model specification from GA4GH and the implementation from OpenCB GA4GH. See a full description at Alignment Data Model.
    [develop]$ ./build/bin/opencga.sh alignments

    Usage:   opencga.sh alignments <subcommand> [options]

    Subcommands:
        index            Index alignment file
        query            Search over indexed alignments
        stats-run        Compute stats for a given alignment file
        stats-info       Compute stats for a given alignment file
        stats-query      Fetch alignment files according to their stats
        coverage-run     Compute coverage for a given alignemnt file
        coverage-query   Query the coverage of an alignment file for regions or genes
        coverage-ratio   Compute coverage ratio from file #1 vs file #2, (e.g. somatic vs germline)
        bwa              BWA is a software package for mapping low-divergent sequences against a large reference genome.
        samtools         Samtools is a program for interacting with high-throughput sequencing data in SAM, BAM and CRAM formats.
        deeptools        Deeptools is a suite of python tools particularly developed for the efficient analysis of high-throughput sequencing data, such as ChIP-seq, RNA-seq or MNase-seq.
        fastqc           A quality control tool for high throughput sequence data.