Overview

A genomic variant is represented by a locus (chromosome + position), reference sequence and list of alternates.

Because of the VCF specification, it is common for the reference and alternate fields to contain extra bases that are not needed to represent the variant. It is completely valid to specify a variant as chr1:100:AC:AT, which is exactly the same variant as chr1:101:C:T.

The same genomic variant can therefore be represented in more than one way, so the representation must be normalized in order to determine whether two representations describe the same variant or different ones. Failing to do so will frequently result in inaccurate analyses.

Steps

Variant normalization addresses several aspects of the variant representation. Together, the steps below produce a fully normalized variant.

Chromosome naming

Because there is no standard for chromosome naming, it is common to see different names for the same chromosome, depending on the tools used, typically because a prefix has been added to the name. For example, the One Thousand Genomes Project uses chr[1-22,X,Y]. Since the field is already known to hold a chromosome, there is no need to add a prefix to each variant. The known chromosome prefixes are: chrom, chrm, chr and ch.

Multi-allelic split

Split multiallelic variants
Reorder genotypes and allele based fields (e.g. AD)

Input:

#CHROM POS     REF   ALT    FORMAT SAMPLE1       SAMPLE2        SAMPLE3
chr1   100     A     T,C    GT:AD  0/0:40,1,0    0/1:19,20,1    2/1:0,20,22

Result:

#CHROM POS     REF   ALT    FORMAT SAMPLE1       SAMPLE2        SAMPLE3
chr1   100     A     T,C    GT:AD  0/0:40,1,0    0/1:19,20,1    2/1:0,20,22
chr1   100     A     C,T    GT:AD  0/0:40,0,1    0/2:19,1,20    1/2:0,22,20
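The reordering of genotypes and allele-based fields can be sketched as follows. This is a minimal example, assuming that each split variant places one alternate first and keeps the others after it; the helper reorder_for_alternate and its signature are hypothetical and only handle the GT and AD fields of a single sample.

```python
def reorder_for_alternate(alts, main_index, genotype, allele_depths):
    """Rewrite one sample's GT and AD so that alts[main_index] becomes allele 1.

    Hypothetical helper: alts is the original ALT list, main_index is 0-based,
    genotype is a string such as "0/1", and allele_depths holds one value for
    the reference followed by one value per alternate.
    """
    # Old allele codes in their new order: reference, chosen alternate, the rest.
    order = [0, main_index + 1] + [i + 1 for i in range(len(alts)) if i != main_index]
    # Map each old allele code to its new code.
    code_map = {old: new for new, old in enumerate(order)}
    new_alts = [alts[old - 1] for old in order[1:]]
    # Keep the original separator so that phasing information is preserved.
    sep = "|" if "|" in genotype else "/"
    new_gt = sep.join(
        code if code == "." else str(code_map[int(code)])
        for code in genotype.split(sep)
    )
    # Allele-based fields follow the same permutation as the alleles.
    new_ad = [allele_depths[old] for old in order]
    return new_alts, new_gt, new_ad


# SAMPLE2 and SAMPLE3 from the example, with C promoted to first alternate.
print(reorder_for_alternate(["T", "C"], 1, "0/1", [19, 20, 1]))
# (['C', 'T'], '0/2', [19, 1, 20])
print(reorder_for_alternate(["T", "C"], 1, "2/1", [0, 20, 22]))
# (['C', 'T'], '1/2', [0, 22, 20])
```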

Reference/Alternate Trimming and left alignment

Reference and alternate trimming consists of removing the trailing (right trimming) and leading (left trimming) bases that are identical in both alleles.

Left aligning a variant means shifting the start position of that variant to the left, while keeping the allele lengths unchanged, until it is no longer possible to do so.

Right and Left trimming

Input:

#CHROM POS   REF  ALT
chr1   100   CTC  CCC

Result:

#CHROM POS   REF  ALT
chr1   101   T    C
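The trimming shown above can be sketched in a few lines. This is an illustrative helper (the name trim is hypothetical), assuming 1-based positions and a single reference/alternate pair; it trims on the right first and then on the left.

```python
def trim(pos, ref, alt):
    """Remove shared trailing bases, then shared leading bases.

    Positions are 1-based. An allele may end up empty.
    """
    # Right trimming: drop identical trailing bases; the start does not move.
    while ref and alt and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Left trimming: drop identical leading bases and move the start forward.
    while ref and alt and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


print(trim(100, "CTC", "CCC"))  # (101, 'T', 'C')
print(trim(100, "AC", "AT"))    # (101, 'C', 'T')
```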

Indels and empty alleles

Variants in OpenCB do not require any "context base", i.e. empty alleles are allowed for the reference or the alternate. Insertions and deletions are represented with an empty allele for the reference or the alternate, respectively.

Deletion

Deletion of one base T at position 101

#CHROM POS   REF  ALT
chr1   100   AT   A
chr1   101   TC   C

Both representations are normalized to the same variant:

#CHROM POS   REF  ALT
chr1   101   T    -

Insertion

Insertion of one C at position 201 (between 200 and 201)

#CHROM POS   REF  ALT
chr1   200   G    GC
chr1   201   A    CA

Both representations are normalized to the same variant:

#CHROM POS   REF  ALT
chr1   201   -    C

Ambiguous trimming and left alignment

When a deletion or insertion occurs in a region of repeated nucleotides, the trimming can be done in multiple ways, and the position of the INDEL is ambiguous. In this example there are four possible ways to normalize the variant:

#CHR   POS     REF     ALT
chr1   100     CTCTCA  CTCA

chr1   100     CT      -
chr1   101     TC      -
chr1   102     CT      -
chr1   103     TC      -

Left alignment is guaranteed by performing the right trimming first. This variant will be normalized as:

#CHR   POS     REF     ALT
chr1   100     CT      -
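To make the ambiguity concrete, the sketch below enumerates every placement of a deletion that turns the reference allele into the alternate allele. The helper deletion_placements is hypothetical and only covers deletions within the given alleles; it is meant to illustrate the example, not to describe how the normalizer is implemented.

```python
def deletion_placements(pos, ref, alt):
    """List every (position, deleted bases) that turns ref into alt.

    Hypothetical helper for deletions only: alt must be shorter than ref.
    """
    size = len(ref) - len(alt)
    placements = []
    for offset in range(len(ref) - size + 1):
        # Remove `size` bases starting at `offset` and compare with alt.
        if ref[:offset] + ref[offset + size:] == alt:
            placements.append((pos + offset, ref[offset:offset + size]))
    return placements


for placement in deletion_placements(100, "CTCTCA", "CTCA"):
    print(placement)
# (100, 'CT')
# (101, 'TC')
# (102, 'CT')
# (103, 'TC')
```

The first entry of the list is the leftmost placement, which is the one obtained when the right trimming is applied before the left trimming.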

Left alignment

A correct left alignment requires the whole reference genome, because the variant may need to be shifted beyond the bases contained in its own alleles.

The reference genome can be specified with the parameter referenceGenome. If not provided, the left-alignment may be incomplete.
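A minimal sketch of left alignment against a reference sequence is shown below. It assumes the INDEL has already been trimmed to a single non-empty allele, and that the reference is available as a plain string. The function left_align and its parameters are hypothetical and unrelated to the referenceGenome parameter itself.

```python
def left_align(pos, seq, reference, ref_start=1):
    """Shift an INDEL left along `reference` as far as possible.

    Hypothetical helper: `seq` is the inserted or deleted sequence (the
    non-empty allele), `pos` is its 1-based start, and `reference` is a plain
    string whose first base sits at position `ref_start`.
    """
    while pos > ref_start:
        # Base immediately to the left of the current start.
        previous_base = reference[pos - 1 - ref_start]
        # Shifting by one base is only equivalent when that base matches
        # the last base of the INDEL sequence.
        if previous_base != seq[-1]:
            break
        # Rotate the sequence: same length, start moved one base to the left.
        seq = previous_base + seq[:-1]
        pos -= 1
    return pos, seq


reference = "AACTCTCAGG"  # toy reference, positions 1 to 10
print(left_align(6, "TC", reference))  # (3, 'CT')
```

In the toy reference, deleting TC at position 6 and deleting CT at position 3 both produce AACTCAGG, so the two are the same variant and the leftmost one is kept.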

Genotype encoding

There are essentially two ways of representing the genotype alleles: with the allele sequence or with an allele code. In the second case, the allele code is used instead of the allele itself. A value of 0 represents the reference allele of the variant, and any other value is a 1-based index into the alternate alleles. A dot represents a missing value.

Using the encoded version makes it easy to determine whether a genotype is reference, homozygous or heterozygous.

Sort unphased genotype alleles
Codify alleles

Input:

chr1   100     A       T      A/A    T/A    A/T    T|A    T/.
chr1   200     A       G,C    1/0    1|0    2/1    1/1    2|0

Result:

chr1   100     A       T      0/0    0/1    0/1    1|0    ./1
chr1   200     A       G,C    0/1    1|0    1/2    1/1    2|0
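The two operations can be sketched together as follows. The helper encode_genotype is hypothetical. It assumes "/" separates unphased alleles and "|" separates phased ones, and it sorts a missing allele first, as in the ./1 result above.

```python
def encode_genotype(genotype, ref, alts):
    """Encode allele sequences as allele codes and sort unphased alleles."""
    alleles = [ref] + list(alts)
    sep = "|" if "|" in genotype else "/"
    codes = []
    for allele in genotype.split(sep):
        if allele == ".":
            codes.append(-1)  # missing value; -1 makes it sort first
        elif allele.isdigit():
            codes.append(int(allele))  # already an allele code
        else:
            codes.append(alleles.index(allele))  # 0 = reference, 1.. = alternates
    # Only unphased genotypes are sorted; phased ones keep their order.
    if sep == "/":
        codes.sort()
    return sep.join("." if code < 0 else str(code) for code in codes)


print(encode_genotype("T/A", "A", ["T"]))       # 0/1
print(encode_genotype("T|A", "A", ["T"]))       # 1|0
print(encode_genotype("T/.", "A", ["T"]))       # ./1
print(encode_genotype("2/1", "A", ["G", "C"]))  # 1/2
```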

Skip normalization

In certain scenarios, normalization may be undesirable. The process can be skipped in OpenCGA with the option normalizationSkip.

Having non-normalized variants is highly discouraged.
