Structural variation (also genomic structural variation) is the variation in structure of an organism's chromosome. It consists of many kinds of variation in the genome of one species, such as deletions, duplications, copy-number variants, insertions, inversions and translocations. Typically, a structural variant affects a sequence of about 1 kb to 3 Mb, which is larger than a SNP and smaller than a chromosome abnormality (though the definitions overlap somewhat). The definition of structural variation does not imply anything about frequency or phenotypic effects. Many structural variants are associated with genetic diseases, but many are not.
Breakends are a complex feature of the VCF specification that allows arbitrary rearrangement events to be described.
An arbitrary rearrangement event can be summarized as a set of novel adjacencies. Each adjacency ties together 2 breakends. The two breakends at either end of a novel adjacency are called mates.
There is one line of VCF (i.e. one record) for each of the two breakends in a novel adjacency.
Although most of the information in the two mates is redundant, we want to keep all the information from both mates. Also, the annotation may differ between the mates.
We want to store all the information, and be able to query for any breakend of the adjacency.
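As a rough illustration of "query for any breakend of the adjacency", the sketch below indexes breakend records by ID and resolves the mate through the `MATEID` key that the VCF specification defines for the INFO column. The `BreakendRecord` class and the helper names are hypothetical stand-ins, not part of the OpenCB Variant model, and the record values are illustrative only.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Optional


@dataclass
class BreakendRecord:
    """Hypothetical, minimal stand-in for one VCF breakend line."""
    id: str
    chrom: str
    pos: int
    ref: str
    alt: str
    info: Dict[str, str] = field(default_factory=dict)


def index_by_id(records: Iterable[BreakendRecord]) -> Dict[str, BreakendRecord]:
    """Map record ID -> record, so either breakend can be looked up directly."""
    return {r.id: r for r in records}


def find_mate(record: BreakendRecord,
              index: Dict[str, BreakendRecord]) -> Optional[BreakendRecord]:
    """Return the record named in INFO/MATEID, or None if it is not available."""
    mate_id = record.info.get("MATEID")
    return index.get(mate_id) if mate_id is not None else None


# Two mates of one novel adjacency; each keeps its own full record.
records = [
    BreakendRecord("bnd_W", "2", 321681, "G", "G]17:198982]",
                   {"SVTYPE": "BND", "MATEID": "bnd_Y"}),
    BreakendRecord("bnd_Y", "17", 198982, "A", "A]2:321681]",
                   {"SVTYPE": "BND", "MATEID": "bnd_W"}),
]
index = index_by_id(records)
mate = find_mate(index["bnd_W"], index)
print(mate.chrom, mate.pos)
```

Keeping both records in the index, rather than collapsing them into one, preserves the per-mate annotation mentioned above.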
In most situations the variant caller will only be able to provide a pair of breakends forming a novel adjacency, without being able to relate it to any other pair of breakends. In simpler scenarios, however, the caller may be able to relate pairs of breakends that describe a more complex rearrangement event. In this case, an EVENT field will be added to the INFO column. Examples of such events are:
- Reciprocal Translocations
- Duplications (tandem, CNV,...)
Determining the type of event is out of scope and can be done as a later step. No extra processing will be done with the EVENT value.
Some information is specific to breakend variants. This information is available in the alternate allele of the VCF record, but it has to be parsed and placed in the proper fields of the Variant model.
The new information available is:
- breakend orientation
- breakend mate location
- breakend insertion sequence (if any)
There are 4 possible junction orientations, depending on how the breakend is connected to its mate. The VCF specification does not define names for these orientations, so we have to define our own.
An initial idea is to specify, for the BND and its mate, whether the junction is made at the right or at the left of the position.
|REF||ALT||Meaning||Orientation|
|s||t[p[||piece extending to the right of p is joined after t||RL (Right -> Left)|
|s||t]p]||reverse comp piece extending left of p is joined after t||RR (Right -> Right)|
|s||]p]t||piece extending to the left of p is joined before t||LR (Left -> Right)|
|s||[p[t||reverse comp piece extending right of p is joined before t||LL (Left -> Left)|
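The parsing step could look roughly like the sketch below, which extracts the orientation (using the RL/RR/LR/LL names proposed in the table), the mate location and the insertion sequence from a breakend alternate. The `Breakend` tuple and `parse_breakend_alt` are hypothetical names, not the actual OpenCB API, and the sketch assumes a single context base in the alternate.

```python
import re
from typing import NamedTuple

# Mate described after the local bases: t[p[ or t]p]
_AFTER = re.compile(
    r"^(?P<seq>[ACGTNacgtn]+)(?P<b>[\[\]])(?P<chrom>[^\[\]]+):(?P<pos>\d+)(?P=b)$")
# Mate described before the local bases: ]p]t or [p[t
_BEFORE = re.compile(
    r"^(?P<b>[\[\]])(?P<chrom>[^\[\]]+):(?P<pos>\d+)(?P=b)(?P<seq>[ACGTNacgtn]+)$")


class Breakend(NamedTuple):
    orientation: str   # RL, RR, LR or LL, as named in the table
    mate_chrom: str
    mate_pos: int
    inserted: str      # empty string when there is no insertion sequence


def parse_breakend_alt(alt: str) -> Breakend:
    """Split a breakend ALT into orientation, mate location and insertion."""
    m = _AFTER.match(alt)
    if m:
        orientation = "RL" if m["b"] == "[" else "RR"
        inserted = m["seq"][1:]    # assumption: the first base is the context base
    else:
        m = _BEFORE.match(alt)
        if m is None:
            raise ValueError(f"not a paired breakend alternate: {alt!r}")
        orientation = "LL" if m["b"] == "[" else "LR"
        inserted = m["seq"][:-1]   # assumption: the last base is the context base
    return Breakend(orientation, m["chrom"], int(m["pos"]), inserted)


print(parse_breakend_alt("A[chr10:300["))
print(parse_breakend_alt("ATT[11:200["))
```

The bracket direction and its position relative to the local bases are the only two things needed to pick one of the four names, which is why two patterns are enough.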
This is an example of how a breakend from a VCF would be represented in OpenCB:
Another point to deal with is normalization. Should we normalize these variants? If so, how?
Normalizing the previous variant is easy and would result in: 1:11:-:TT[11:200[, where the reference, in the variant model, is an empty string.
But if there is no insertion sequence, we may remove the context base from the alternate, losing the information about the junction orientation.
Given the variant chr9:100:A:A[chr10:300[, it can be normalized in different ways:
- Do not leave empty alleles for BND
- Leave empty reference allele, increase start, and remove context base from alternate, as the orientation information is available in "sv.breakend.orientation"
- Same as before, but leave a "-" to indicate the orientation of the junction
- Same as before, but leave a "." to indicate the orientation of the junction. See VCFv4.3 5.4.5 Telomeres
- Leave a dot "." to indicate the orientation of the junction.
  - increase the start position by 1
  - do not increase the start position
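To make the trade-off concrete, here is a minimal sketch of just one of the options listed above: empty reference allele, increased start, context base removed, with the orientation carried separately. It is not a recommendation among the options. `NormalizedBreakend` and `normalize_leading_context` are hypothetical names, and the sketch only covers alternates whose context base comes first (the `t[p[` and `t]p]` forms); the `]p]t` and `[p[t` forms are deliberately left out.

```python
from typing import NamedTuple


class NormalizedBreakend(NamedTuple):
    chrom: str
    start: int
    ref: str          # empty string after normalization
    alt: str
    orientation: str  # kept separately, since alt alone may no longer encode it


def normalize_leading_context(chrom: str, pos: int, ref: str, alt: str) -> NormalizedBreakend:
    """Drop the leading context base from ref and alt and shift the start."""
    brackets = [i for i in (alt.find("["), alt.find("]")) if i >= 0]
    if not brackets or min(brackets) == 0:
        raise ValueError("only the t[p[ and t]p] forms are handled in this sketch")
    cut = min(brackets)
    local, mate = alt[:cut], alt[cut:]
    if not local.startswith(ref):
        raise ValueError("alternate does not start with the reference context")
    orientation = "RL" if mate[0] == "[" else "RR"
    return NormalizedBreakend(chrom, pos + len(ref), "", local[len(ref):] + mate, orientation)


print(normalize_leading_context("chr9", 100, "A", "A[chr10:300["))
```

For chr9:100:A:A[chr10:300[ the normalized alternate becomes [chr10:300[, which on its own can no longer be told apart from the start of a [p[t alternate. That is the loss of orientation information described above, and the reason this sketch returns the orientation as a separate field.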