OpenCGA provides a set of analyses to compute basic statistics for a given variant dataset. To get richer statistics, the variant data should include annotation and pedigree information (samples, phenotypes, ...).
OpenCGA computes three types of statistics, which are described in the following sections.
Summary (or global) stats provide an overview of the variant dataset. They include:
- The total number of variants.
- The total number of samples.
- The number of variants per chromosome.
- The number of variants per consequence type.
- The number of variants per biotype.
- The number of variants per type (SNV, INDEL, ...).
- The number of variants per genotype.
- The Ts/Tv ratio, or transition-to-transversion ratio.
- A heterozygosity score.
- A missingness score.
- A list of the most affected genes.
- The indel length.
- A list of HPO and genes for loss of function (LoF) variants.
- A list of the most frequent variant traits.
- The number of Mendelian errors per type of error.
- Relatedness scores (IBD/IBS scores).
Summary statistics are stored in a JSON format file.
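As an illustration only, the following minimal sketch computes a few of the summary statistics listed above (variants per chromosome, variants per type, and the Ts/Tv ratio) and serializes them as JSON. The input layout and the output field names (`num_variants`, `chromosome_count`, `type_count`, `ts_tv_ratio`) are assumptions made for this example; they are not the schema of the file that OpenCGA writes.

```python
import json
from collections import Counter

# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def summarize(variants):
    """Compute a few summary stats.

    `variants` is a list of dicts with the hypothetical keys
    'chrom', 'ref' and 'alt' (not an OpenCGA data model).
    """
    per_chrom = Counter(v["chrom"] for v in variants)
    per_type = Counter()
    ts = tv = 0
    for v in variants:
        ref, alt = v["ref"], v["alt"]
        if len(ref) == 1 and len(alt) == 1:
            per_type["SNV"] += 1
            if (ref, alt) in TRANSITIONS:
                ts += 1
            else:
                tv += 1
        else:
            # Simplification: anything that is not a single-base change
            # is counted as INDEL in this sketch.
            per_type["INDEL"] += 1
    return {
        "num_variants": len(variants),
        "chromosome_count": dict(per_chrom),
        "type_count": dict(per_type),
        "ts_tv_ratio": ts / tv if tv else None,
    }


variants = [
    {"chrom": "1", "ref": "A", "alt": "G"},   # transition
    {"chrom": "1", "ref": "C", "alt": "T"},   # transition
    {"chrom": "2", "ref": "A", "alt": "C"},   # transversion
    {"chrom": "2", "ref": "AT", "alt": "A"},  # deletion
]
summary = summarize(variants)
print(json.dumps(summary, indent=2))
```

With this toy input there are two transitions and one transversion, so the sketch reports a Ts/Tv ratio of 2.0.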
Pre-calculated stats are useful for filtering variants. These stats are intra-study and are calculated within a given cohort.
A cohort is an arbitrary group of samples. Cohorts can be defined in Catalog, either by selecting samples one by one or by selecting all samples that share some attribute, such as population or phenotype.
If a cohort is modified after its statistics have been calculated, the existing statistics become INVALID.
By default, each study defines a cohort named ALL that contains all the samples loaded in the study. Every time new samples are loaded into the study, this cohort is modified and its statistics have to be recomputed.
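The invalidation rule can be pictured with a small sketch. The `Cohort` class below and its `NONE` and `READY` states are hypothetical and only model the behaviour described here; the only state named on this page is INVALID, and this is not the Catalog API.

```python
class Cohort:
    """Toy model of a cohort whose stats are invalidated on modification."""

    def __init__(self, name, samples=()):
        self.name = name
        self.samples = set(samples)
        # Hypothetical states for this sketch: NONE, READY, INVALID.
        self.stats_status = "NONE"

    def mark_stats_ready(self):
        """Record that stats were calculated for the current sample set."""
        self.stats_status = "READY"

    def add_samples(self, new_samples):
        """Add samples; previously calculated stats become INVALID."""
        before = len(self.samples)
        self.samples.update(new_samples)
        changed = len(self.samples) != before
        if changed and self.stats_status == "READY":
            self.stats_status = "INVALID"


# The default cohort of a study, holding every loaded sample.
all_cohort = Cohort("ALL", ["s1", "s2"])
all_cohort.mark_stats_ready()

# Loading a new sample into the study modifies ALL.
all_cohort.add_samples(["s3"])
print(all_cohort.stats_status)
```

After the new sample is added the status is `INVALID`, which signals that the statistics for ALL have to be recomputed.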
There are two types of cohort statistics: per-variant statistics and global statistics. Variant statistics are stored in the variants database, within the StudyEntry. Global statistics are stored in Catalog.
Variant Stats (intra variant)
These stats are calculated for each variant, and for a set of samples (cohort).
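A minimal sketch of what "per variant, per cohort" means in practice: given the genotypes of one variant, only the samples that belong to the cohort contribute to the counts. The function name, the input layout, and the output keys are assumptions for this example and do not reflect the OpenCGA data model.

```python
import re
from collections import Counter


def variant_stats(genotypes, cohort):
    """Compute simple stats for one variant over the samples of a cohort.

    `genotypes` maps sample id -> VCF-style genotype string ("0/1", "1|1",
    "./."). `cohort` is an iterable of sample ids. Both are hypothetical
    inputs used only for illustration.
    """
    genotype_count = Counter()
    allele_count = Counter()
    missing = 0
    for sample in cohort:
        alleles = re.split(r"[/|]", genotypes[sample])
        if "." in alleles:
            # Simplification: a genotype with any missing allele is
            # treated as fully missing.
            missing += 1
            continue
        allele_count.update(alleles)
        # Ignore phasing so that "1|0" and "0/1" are counted together.
        genotype_count["/".join(sorted(alleles, key=int))] += 1
    total = sum(allele_count.values())
    alt_freq = allele_count["1"] / total if total else None
    return {
        "genotype_count": dict(genotype_count),
        "allele_count": dict(allele_count),
        "alt_allele_freq": alt_freq,
        "missing_genotypes": missing,
    }


genotypes = {"s1": "0/0", "s2": "0/1", "s3": "1|1", "s4": "./.", "s5": "0/1"}
cohort = ["s1", "s2", "s3", "s4"]  # s5 is not part of this cohort
stats = variant_stats(genotypes, cohort)
print(stats)
```

Because `s5` is outside the cohort and `s4` has no genotype call, the alternate allele frequency is computed from the six alleles of `s1`, `s2` and `s3` only.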
Public studies usually do not provide sample data. In that situation it is not possible to calculate the statistics; instead, they can be extracted from the INFO column. Unfortunately, the VCF format has no standard way of defining multi-cohort statistics, so OpenCGA recognizes three different formats for representing them:
- BASIC mode
- EVS mode
- EXAC mode
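To give an idea of what extracting statistics from the INFO column involves, the sketch below parses an INFO string and derives allele frequencies from the `AC` and `AN` keys, which are reserved INFO keys in the VCF specification. Which tags each of the three modes actually reads is not described on this page, so the cohort-prefixed tags (`EUR_AC`, `EUR_AN`) and the helper names are purely hypothetical.

```python
def parse_info(info):
    """Parse a VCF INFO string into a dict.

    Key=value entries become lists of strings (values are comma-separated);
    flag entries without a value are stored as True.
    """
    fields = {}
    if info in ("", "."):
        return fields
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        fields[key] = value.split(",") if sep else True
    return fields


def allele_frequencies(info, ac_tag="AC", an_tag="AN"):
    """Return one frequency per alternate allele, or None if unavailable.

    `ac_tag` and `an_tag` allow a per-cohort tag pair to be selected; the
    tag naming scheme is an assumption of this sketch.
    """
    fields = parse_info(info)
    ac, an = fields.get(ac_tag), fields.get(an_tag)
    if not isinstance(ac, list) or not isinstance(an, list):
        return None
    total = int(an[0])
    if total == 0:
        return None
    return [int(count) / total for count in ac]


info = "AC=1,2;AN=4;DB;EUR_AC=1;EUR_AN=2"
print(allele_frequencies(info))                      # whole-study tags
print(allele_frequencies(info, "EUR_AC", "EUR_AN"))  # hypothetical cohort tags
```

Passing the tag names as parameters keeps the parsing logic independent of any particular naming convention, which is the core difficulty when no standard for multi-cohort statistics exists.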