Variant Stats

Pre-calculated stats are useful for filtering variants. These stats are intra-study: they are calculated within a given cohort.

Cohorts

A cohort is an arbitrary group of samples. Cohorts can be defined in Catalog, either by selecting samples one by one or by selecting all samples that share some attribute, such as population or phenotype.

If a cohort is modified after its statistics have been calculated, the existing statistics become INVALID.

By default, each study defines the cohort ALL, which contains all the samples loaded in the study. Every time new samples are loaded into the study, this cohort is modified and its statistics have to be recomputed.
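The invalidation rule above can be illustrated with a small sketch. This is not OpenCGA code: the `Cohort` class, its method names, and the `NONE` and `READY` status values are assumptions made for illustration; only the `INVALID` status and the `ALL` cohort name come from the text.

```python
class Cohort:
    """Hypothetical cohort model: a named set of samples plus a stats status."""

    def __init__(self, name, samples):
        self.name = name
        self.samples = set(samples)
        self.stats_status = "NONE"  # assumed initial status: nothing calculated yet

    def mark_stats_ready(self):
        """Record that statistics were calculated for the current sample set."""
        self.stats_status = "READY"  # assumed status name

    def add_samples(self, new_samples):
        """Add samples; previously calculated statistics become INVALID."""
        added = set(new_samples) - self.samples
        if added:
            self.samples |= added
            if self.stats_status == "READY":
                self.stats_status = "INVALID"


all_cohort = Cohort("ALL", ["sample1", "sample2"])
all_cohort.mark_stats_ready()
all_cohort.add_samples(["sample3"])  # loading a new sample modifies ALL
print(all_cohort.stats_status)
```

Adding a sample that is already in the cohort leaves the status untouched, since the cohort itself did not change.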

Stats models

There are two types of statistics: per-variant statistics and global statistics. Variant statistics are stored in the variants database, within the StudyEntry. Global statistics are stored in Catalog.

Variant Stats (intra variant)

OpenCGA provides a set of analyses to compute basic statistics for a given variant dataset. To get richer statistics, the variant data should include annotation and pedigree information (samples, phenotypes, ...).

OpenCGA computes four types of statistics:

Variant stats
Sample stats
Cohort stats
Family stats

The next sections describe these statistics.

Variant stats

Variant stats are calculated for each variant. In addition, you may specify a set of samples (a cohort) so that only those samples are taken into account.


VariantStats
	// Total number of alleles in called genotypes. Does not include missing alleles
	int alleleCount
	// Number of reference alleles found in this variant
	int refAlleleCount
	// Number of main alternate alleles found in this variant. Does not include secondary alternates
	int altAlleleCount
	// Reference allele frequency calculated from refAlleleCount and alleleCount, in the range (0,1)
	float refAlleleFreq
	// Alternate allele frequency calculated from altAlleleCount and alleleCount, in the range (0,1)
	float altAlleleFreq
	// Count for each genotype found
	map<int> genotypeCount
	// Genotype frequency for each genotype found
	map<float> genotypeFreq
	// Number of missing alleles
	int missingAlleleCount
	// Number of missing genotypes
	int missingGenotypeCount
	// Minor allele frequency
	float maf
	// Minor genotype frequency
	float mgf
	// Allele with minor frequency
	string mafAllele
	// Genotype with minor frequency
	string mgfGenotype
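
As an illustration of how the fields above relate to each other, here is a minimal Python sketch that derives them from a list of genotype strings. It is not the OpenCGA implementation. It assumes diploid-style genotype strings where `0` is the reference allele, `1` the main alternate, and `.` a missing allele; the function name `variant_stats` is hypothetical, and the minor allele is chosen only between the reference and the main alternate.

```python
from collections import Counter


def variant_stats(genotypes, ref, alt):
    """Compute per-variant stats from genotype strings such as "0/1" or "./."."""
    allele_counter = Counter()
    genotype_count = Counter()
    missing_genotype_count = 0
    for gt in genotypes:
        # Ignore phasing and allele order so "1|0" and "0/1" count together.
        alleles = sorted(gt.replace("|", "/").split("/"))
        allele_counter.update(alleles)
        if all(a == "." for a in alleles):
            missing_genotype_count += 1
        else:
            genotype_count["/".join(alleles)] += 1

    missing_allele_count = allele_counter["."]
    # Total called alleles; missing alleles are excluded.
    allele_count = sum(n for a, n in allele_counter.items() if a != ".")
    ref_count = allele_counter["0"]
    alt_count = allele_counter["1"]  # main alternate only
    ref_freq = ref_count / allele_count if allele_count else 0.0
    alt_freq = alt_count / allele_count if allele_count else 0.0

    called = sum(genotype_count.values())
    genotype_freq = {g: n / called for g, n in genotype_count.items()}

    # Simplification: minor allele picked between ref and main alt only.
    maf, maf_allele = (ref_freq, ref) if ref_freq < alt_freq else (alt_freq, alt)
    mgf_genotype = min(genotype_freq, key=genotype_freq.get) if genotype_freq else None
    mgf = genotype_freq[mgf_genotype] if mgf_genotype is not None else 0.0

    return {
        "alleleCount": allele_count,
        "refAlleleCount": ref_count,
        "altAlleleCount": alt_count,
        "refAlleleFreq": ref_freq,
        "altAlleleFreq": alt_freq,
        "genotypeCount": dict(genotype_count),
        "genotypeFreq": genotype_freq,
        "missingAlleleCount": missing_allele_count,
        "missingGenotypeCount": missing_genotype_count,
        "maf": maf,
        "mgf": mgf,
        "mafAllele": maf_allele,
        "mgfGenotype": mgf_genotype,
    }


stats = variant_stats(["0/0", "0/0", "0/1", "1/1", "./."], ref="A", alt="C")
print(stats["alleleCount"], stats["maf"], stats["mafAllele"])
```

To restrict the calculation to a cohort, pass only the genotypes of the samples in that cohort.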

Variant Global Stats (inter variant)

Status: PENDING

Aggregated statistics

Public studies usually do not provide sample data. In these situations it is not possible to calculate the statistics from genotypes. Instead, the statistics can be extracted from the INFO column. Unfortunately, there is no standard way to define multi-cohort statistics in the VCF format, so OpenCGA recognizes three different formats for representing statistics.
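
To make the idea of extracting statistics from the INFO column concrete, the following sketch reads allele counts from an INFO string. It relies only on the `AC` (alternate allele count) and `AN` (total number of called alleles) keys reserved by the VCF specification. Whether this matches the tags read by any of the modes listed below is an assumption, and the function names are hypothetical.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flag entries map to True."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def stats_from_info(info, alt_index=0):
    """Derive allele counts and frequencies from the AC and AN INFO keys."""
    fields = parse_info(info)
    allele_count = int(fields["AN"])
    alt_counts = [int(x) for x in fields["AC"].split(",")]
    alt_count = alt_counts[alt_index]
    # Reference alleles are whatever is not any alternate allele.
    ref_count = allele_count - sum(alt_counts)
    return {
        "alleleCount": allele_count,
        "refAlleleCount": ref_count,
        "altAlleleCount": alt_count,
        "refAlleleFreq": ref_count / allele_count if allele_count else 0.0,
        "altAlleleFreq": alt_count / allele_count if allele_count else 0.0,
    }


print(stats_from_info("AC=3,1;AN=10;DB"))
```

Genotype counts cannot be recovered from `AC` and `AN` alone, so genotype-level fields stay empty unless the file provides them under additional tags.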

BASIC mode

EVS mode

EXAC mode

Variant stats include the following values:

The total number of alleles (it does not include missing alleles)
The number of reference alleles found in this variant
The number of main alternate alleles found in this variant (it does not include secondary alternates)
The reference allele frequency, i.e., the quotient of the number of reference alleles divided by the total number of alleles.
The alternate allele frequency, i.e., the quotient of the number of alternate alleles divided by the total number of alleles.
The number of occurrences for each genotype
The frequency for each genotype
The number of missing alleles
The number of missing genotypes
The minor allele frequency (maf)
The minor genotype frequency (mgf)
The allele with the minor frequency
The genotype with the minor frequency

Sample stats

Sample stats are calculated for each sample. They include the following information:

The total number of variants.
The number of variants per chromosome.
The number of variants per consequence type.
The number of variants per biotype.
The number of variants per type (SNV, INDEL, ...).
The number of variants per genotype.
The transition-to-transversion ratio (ti/tv ratio).
A heterozygosity score.
A missingness score.
A list of the most affected genes.
The number of variants per indel length.
A list of HPO and genes for loss of function (LoF) variants.
A list of the most frequent variant traits.

Summary statistics are stored in a JSON format file.
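
A few of the sample stats listed above can be sketched as follows. This is an illustrative example, not OpenCGA's implementation: the input record layout, the function name `sample_stats`, and the output field names are all assumptions. For simplicity, every non-SNV variant is counted as an INDEL.

```python
import json
from collections import Counter

# Purine<->purine and pyrimidine<->pyrimidine changes are transitions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def sample_stats(variants):
    """Summarise one sample's variants.

    Each variant is a dict with the (hypothetical) keys
    "chromosome", "reference", "alternate" and "genotype".
    """
    chromosome_count = Counter()
    genotype_count = Counter()
    type_count = Counter()
    ti = tv = 0
    for v in variants:
        chromosome_count[v["chromosome"]] += 1
        genotype_count[v["genotype"]] += 1
        ref, alt = v["reference"], v["alternate"]
        if len(ref) == 1 and len(alt) == 1:
            type_count["SNV"] += 1
            if (ref, alt) in TRANSITIONS:
                ti += 1
            else:
                tv += 1
        else:
            type_count["INDEL"] += 1  # simplification: MNVs etc. land here too
    return {
        "variantCount": sum(chromosome_count.values()),
        "chromosomeCount": dict(chromosome_count),
        "typeCount": dict(type_count),
        "genotypeCount": dict(genotype_count),
        "tiTvRatio": ti / tv if tv else None,
    }


variants = [
    {"chromosome": "1", "reference": "A", "alternate": "G", "genotype": "0/1"},
    {"chromosome": "1", "reference": "C", "alternate": "A", "genotype": "1/1"},
    {"chromosome": "2", "reference": "AT", "alternate": "A", "genotype": "0/1"},
]
# Serialise the summary as JSON, mirroring how the stats are stored.
print(json.dumps(sample_stats(variants), indent=2))
```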
