Overview

A genomic variant is represented by a locus (chromosome and position), the reference allele and a list of alternate alleles. Genotypes are represented by the two alleles in the sample at the locus.

Because of the VCF specification, it is common for the reference and alternate fields to contain extra bases that are not needed to represent the variant. It is completely valid to specify a variant as chr1:100:AC:AT, which is exactly the same variant as chr1:101:C:T.

The same genomic variant can therefore be represented in more than one way, so the representation must be normalised in order to determine whether two representations describe the same variant or different variants. Failure to do so will frequently result in inaccurate analyses.

Different variant calling tools may use subtly different representations for the same biological sequence variant. If variants called from a sample are to be annotated, or variants from multiple samples are to be merged, it is important that variant calls are normalised to ensure consistent representation; see this vt article or GiaB article for more information. In some cases normalisation may also be useful to identify and remove spurious duplicates called within a call set from a single sample.

OpenCGA performs variant normalisation by default when genotypes are loaded into the database. The procedures implemented by OpenCGA v2.0 are described in this document. The approach is similar but not identical to other tools that perform variant normalisation such as bcftools, vt, GATK and vcflib. This means that the representation of variants normalised by OpenCGA may differ from those from other tools.

Normalisation Procedure in OpenCGA v2.0

The normalisation procedure implemented by OpenCGA has been designed to resolve ambiguous representations commonly found in VCF data. The OpenCGA variant data model is not constrained by the VCF file format specification. This allows OpenCGA to represent some genotypes that are difficult for VCF to represent.

The primary aim of OpenCGA normalisation is to standardise variant representation for storage and annotation within the OpenCGA database. A side effect of the ability to export VCF from OpenCGA is that the database can be used as a VCF normalisation and merging tool. If it is used in this way, users must be mindful of the limitations of VCF in correctly representing some variants.

Each step of the OpenCGA normalisation procedure is described below.

1. Rename chromosomes

There is no standard for chromosome naming, so it is common to see different names for the same chromosome depending on the variant calling workflow, usually because a prefix has been added to the name. For example, names of the form chr[1-22,X,Y] can be seen for the One Thousand Genomes Project. Since the value is already known to be a chromosome, the prefix is not needed for each variant. OpenCGA removes the known chromosome prefixes (chrom, chrm, chr and ch). For example, chr1 and chrX are renamed 1 and X respectively.

2. Encode genotypes

VCF allows two different ways of representing the genotype alleles; with or without explicit allele sequence. OpenCGA normalises to the latter, i.e. an allele code is used instead of the allele itself: A 0 value represents the reference allele, and any other value is a 1-based index into the alternate alleles. An example of mapping from explicit to coded genotype alleles is shown in the following table:

Input:

#CHR  POS  REF  ALT  S1   S2   S3   S4   S5
1     100  A    T    A/A  T/A  A/T  T|A  T/.

Result:

#CHR  POS  REF  ALT  S1   S2   S3   S4   S5
1     100  A    T    0/0  0/1  0/1  1|0  ./1
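The mapping from explicit to coded genotype alleles can be sketched as follows. This is a minimal illustration under stated assumptions, not the OpenCGA code: the function name is hypothetical, unphased genotypes are sorted in ascending order, and a missing allele is sorted first, which is what the example row suggests.

```python
import re


def encode_genotype(gt: str, ref: str, alts: list[str]) -> str:
    """Encode a genotype such as 'T/A' as allele codes such as '0/1'."""
    alleles = [ref] + alts           # code 0 is the reference allele
    phased = "|" in gt
    codes = []
    for allele in re.split(r"[/|]", gt):
        if allele == ".":
            codes.append(None)       # missing allele
        elif allele.isdigit():
            codes.append(int(allele))  # already an allele code
        else:
            codes.append(alleles.index(allele))
    if not phased:
        # Assumption: unphased alleles are sorted, missing values first.
        codes.sort(key=lambda c: -1 if c is None else c)
    separator = "|" if phased else "/"
    return separator.join("." if c is None else str(c) for c in codes)


for gt in ["A/A", "T/A", "A/T", "T|A", "T/."]:
    print(gt, "->", encode_genotype(gt, "A", ["T"]))
```

Phased genotypes keep their allele order, since swapping the alleles would change their meaning.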

3. Split Multi-allelic records

Multi-allelic VCF records are produced in two main scenarios:

Single-sample: one sample (or individual) is multi-allelic at one specific position, i.e. both chromosomes are mutated at the same position with different alleles.
Multi-sample: as a consequence of merging VCFs from different samples, i.e. different samples with different alleles come together in the same VCF record.

Consider this multi-sample VCF input record at chromosome 1 position 100. It lists four samples with the following genotypes: homozygous reference [AA/AA], heterozygous SNP [AA/AT], heterozygous insertion [AT/AAC] and heterozygous deletion [AA/A]:

#CHROM POS    REF    ALT       FORMAT  SAMPLE1       SAMPLE2        SAMPLE3        SAMPLE4
1      100    AA     AT,AAC,A  GT:AD   0/0:40,1,0,0  0/1:19,20,1,0  2/1:0,20,22,0  0/3:19,0,0,20

OpenCGA splits such a multi-allelic record to create one output record for each alternate allele. Note that the multi-allelic nature of each record is maintained, and genotypes and allele-based fields (e.g. AD) are reordered to match the new allele order. This is shown in the pseudo-VCF below:

#CHROM POS    REF    ALT       FORMAT  SAMPLE1       SAMPLE2        SAMPLE3        SAMPLE4
1      100    AA     AT,AAC,A  GT:AD   0/0:40,1,0,0  0/1:19,20,1,0  2/1:0,20,22,0  0/3:19,0,0,20
1      100    AA     AAC,AT,A  GT:AD   0/0:40,0,1,0  0/2:19,1,20,0  1/2:0,22,20,0  0/3:19,0,0,20
1      100    AA     A,AT,AAC  GT:AD   0/0:40,0,1,0  0/2:19,0,20,1  3/2:0,0,20,22  0/1:19,20,0,0
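The split can be sketched in Python as below. This is an illustrative sketch rather than the OpenCGA implementation. The function name and the (genotype, AD list) data layout are hypothetical, and it assumes that genotype codes are re-indexed to follow the new allele order in the same way that AD values are reordered.

```python
def split_multiallelic(alts, samples):
    """Return one (alts, samples) record per alternate allele.

    alts    -- list of alternate alleles, e.g. ["AT", "AAC", "A"]
    samples -- list of (genotype, AD list) pairs, AD being ref first, then alts
    """
    records = []
    for main in range(len(alts)):
        # The chosen alternate goes first; the others keep their original order.
        order = [main] + [i for i in range(len(alts)) if i != main]
        # Old allele code -> new allele code (0 is always the reference).
        remap = {0: 0}
        for new_idx, old_idx in enumerate(order):
            remap[old_idx + 1] = new_idx + 1
        new_samples = []
        for gt, ad in samples:
            sep = "|" if "|" in gt else "/"
            codes = [c if c == "." else str(remap[int(c)]) for c in gt.split(sep)]
            new_ad = [ad[0]] + [ad[i + 1] for i in order]
            new_samples.append((sep.join(codes), new_ad))
        records.append(([alts[i] for i in order], new_samples))
    return records


samples = [("0/0", [40, 1, 0, 0]), ("0/1", [19, 20, 1, 0]),
           ("2/1", [0, 20, 22, 0]), ("0/3", [19, 0, 0, 20])]
for new_alts, new_samples in split_multiallelic(["AT", "AAC", "A"], samples):
    print(",".join(new_alts), new_samples)
```

Keeping all alternates in every output record, rather than collapsing the others into a single placeholder, is what allows the allele depths of each sample to be preserved in full.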

4. Allele Trimming

Allele trimming consists of removing the leading bases (left trimming) and trailing bases (right trimming) that are identical in both reference and alternate alleles. Left trimming requires the variant position to be updated; for right trimming the variant position is unchanged. By convention alleles are "left aligned", i.e. the POS value is minimised: the start position of the variant is shifted to the left, keeping the same allele lengths, until it is no longer possible to do so. For correct left alignment the flanking sequence of the reference genome may be required. The reference genome can be specified with the parameter referenceGenome.

Simple trimming

The following table shows basic examples of left and right trimming in pseudo-VCF notation.

	Input (#CHROM POS REF ALT)	Result (#CHROM POS REF ALT)
Left trim	1 100 AA AC	1 101 A C
Right trim	1 100 AA CA	1 100 A C
Right and left trim	1 100 CTC CCC	1 101 T C
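A minimal Python sketch of trimming is given below. It is an illustration only, with a hypothetical function name. It trims on the right before trimming on the left, and it assumes that alleles may become empty strings, as OpenCB permits.

```python
def trim_alleles(pos: int, ref: str, alt: str):
    """Trim bases shared by ref and alt; return the new (pos, ref, alt)."""
    # Right trimming: drop shared trailing bases; POS is unchanged.
    while ref and alt and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Left trimming: drop shared leading bases; POS moves one base
    # to the right for every base removed.
    while ref and alt and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


print(trim_alleles(100, "AA", "AC"))    # (101, 'A', 'C')
print(trim_alleles(100, "AA", "CA"))    # (100, 'A', 'C')
print(trim_alleles(100, "CTC", "CCC"))  # (101, 'T', 'C')
```

Trimming on the right first means that, whenever a base could be removed from either end, the variant keeps the smaller POS value.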

Trimming InDels

Unlike VCF, variants in OpenCB do not require any "context base", i.e. empty alleles are allowed. Trimming can therefore result in an empty string for the reference or alternate allele: insertions are represented with an empty reference allele and deletions with an empty alternate allele. The following table shows two valid representations of a deletion of 'T' at position 101 and of the insertion of 'T' between positions 100 and 101. The table also shows how OpenCGA normalisation results in a unique variant for both the deletion and the insertion.

	Input (#CHROM POS REF ALT)	Result (#CHROM POS REF ALT)
Deletion	1 100 AT A or 1 101 TC C	1 101 T -
Insertion	1 100 A AT or 1 101 T TT	1 101 - T

Ambiguous trimming and left alignment

For a deletion or insertion in a region of repeated nucleotides, the trimming operation can be done in multiple ways, so the position of the INDEL is ambiguous. For the input below there are four possible ways to normalise the variant. OpenCGA ensures leftmost alignment by performing the right trimming first.

Input:

#CHR  POS  REF     ALT
1     100  CTCTCA  CTCA

Possible normalisations:

#CHR  POS  REF  ALT
1     100  CT   -
1     101  TC   -
1     102  CT   -
1     103  TC   -

OpenCGA result:

#CHR  POS  REF  ALT
1     100  CT   -

Left alignment

Correct left alignment requires the reference genome, because the bases flanking the variant must be known.

The reference genome can be specified with the parameter referenceGenome. If it is not provided, the left alignment may be incomplete.
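To illustrate why the flanking reference sequence matters, here is a hedged Python sketch of left alignment for an already trimmed insertion or deletion. It is a generic left-shifting technique, not necessarily the one OpenCGA uses. The function name, the plain string used as the reference sequence and the offset argument are all assumptions made for the example.

```python
def left_align(pos: int, ref: str, alt: str, reference: str, offset: int = 1):
    """Shift a trimmed indel as far left as the reference sequence allows.

    reference -- chromosome sequence as a string (assumption for this sketch)
    offset    -- genomic position of reference[0]; 1 for 1-based coordinates
    """
    if ref and alt:
        return pos, ref, alt          # not a pure insertion or deletion
    seq = ref or alt                  # the inserted or deleted bases
    # While the base just before the variant equals the last base of the
    # indel, rotate the indel sequence and move one position to the left.
    while seq and pos > offset and reference[pos - 1 - offset] == seq[-1]:
        seq = reference[pos - 1 - offset] + seq[:-1]
        pos -= 1
    return (pos, seq, "") if ref else (pos, "", seq)


# Positions 100 to 105 hold CTCTCA; everything before is G in this toy sequence.
toy_reference = "G" * 99 + "CTCTCA"
print(left_align(103, "TC", "", toy_reference))  # (100, 'CT', '')
```

Without the sequence to the left of the variant the loop cannot run, which is why left alignment may be incomplete when no reference genome is supplied.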

Genotype encoding

There are basically two ways of representing genotype alleles: with or without the allele sequence. In the second, an allele code is used instead of the allele itself. A 0 value represents the reference allele of the variant, and any other value is a 1-based index into the alternate alleles. A dot represents a missing value.

The encoded form makes it easy to determine whether a genotype is reference, homozygous or heterozygous.

Genotype normalisation involves two operations:

Sort unphased genotype alleles
Codify alleles

Input:

chr1   100     A       T      A/A    T/A    A/T    T|A    T/.
chr1   200     A       G,C    1/0    1|0    2/1    1/1    2|0

Result:

chr1   100     A       T      0/0    0/1    0/1    1|0    ./1
chr1   200     A       G,C    0/1    1|0    1/2    1/1    2|0
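As a small illustration of why the encoded form is convenient, the following Python sketch classifies an encoded genotype. The function name and the category labels are assumptions made for this example and are not part of OpenCGA.

```python
import re


def classify_genotype(gt: str) -> str:
    """Classify an encoded genotype such as '0/1' (hypothetical helper)."""
    codes = re.split(r"[/|]", gt)
    if "." in codes:
        return "missing"                  # at least one allele is unknown
    if all(code == "0" for code in codes):
        return "homozygous reference"
    if len(set(codes)) == 1:
        return "homozygous alternate"     # same non-reference allele twice
    return "heterozygous"


for gt in ["0/0", "0/1", "1|0", "1/1", "1/2", "./1"]:
    print(gt, "->", classify_genotype(gt))
```

No allele sequences are needed for this check, only the allele codes.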

Example

Variants are represented in OpenCGA as JSON objects, not as pseudo-VCF records. An example JSON representation of the four variants resulting from normalisation of the single VCF record in the second table is shown [here].

Identification of duplicate variants

Normalisation can reveal duplicated records within a single file or sample. When OpenCGA v2 encounters this condition during file indexing, both duplicates are discarded and an error is logged.
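The behaviour described above can be sketched in Python as follows. This is an illustration under assumptions, not the OpenCGA code: variants are modelled as hypothetical (chromosome, position, reference, alternate) tuples that have already been normalised, and the duplicated variants are returned to the caller instead of being logged.

```python
from collections import Counter


def discard_duplicates(variants):
    """Split normalised variants into those kept and those seen more than once."""
    counts = Counter(variants)
    # Every copy of a duplicated variant is dropped, not just the extras.
    kept = [v for v in variants if counts[v] == 1]
    duplicated = sorted(v for v, n in counts.items() if n > 1)
    return kept, duplicated


normalised = [("1", 101, "T", ""), ("1", 101, "T", ""), ("1", 200, "A", "C")]
kept, duplicated = discard_duplicates(normalised)
print(kept)        # [('1', 200, 'A', 'C')]
print(duplicated)  # [('1', 101, 'T', '')]
```

Comparing variants only after normalisation is what makes this check meaningful, since two different input representations can collapse into the same variant.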

Skip normalization

In certain scenarios the normalisation process may be undesirable. It can be skipped in OpenCGA with the option normalizationSkip.

Use of this option is strongly discouraged.

Multi-nucleotide variants

OpenCGA v2 normalisation is limited to SNPs and InDels. No attempt is made to normalise long, complex MNVs. These are loaded into the database unaltered.
