This operation performs an advanced merge of genome VCF (gVCF) files to produce a multi-sample dataset suitable for cohort analysis. Along with Annotation and Statistics, this is an optional enrichment operation over the database. [#713] [#757] [#877]
This operation is designed to find a proper value for the unknown genotype values, reading the reference blocks from the gVCF files.
Given a set of samples, the process iterates over all variants where some, but not all, samples have missing values (a missing value means the value is not present, which is not the same as the genotype ./.). A sample can have a missing value in three situations:
?/? and the rest of values with missing .
Executing this operation against all the samples in the database can be very expensive in terms of time and disk usage, because it fills all the gaps in the sparse matrix that the variants table represents. To avoid this, the operation skips the samples whose genotype is homozygous reference (HOM_REF, 0/0), and the files in which all samples are HOM_REF.
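The gap-filling idea described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the storage engine's implementation: the in-memory layout (`variants`, `ref_blocks`), the function names `lookup_block` and `fill_gaps`, and the linear block search are all hypothetical and chosen only for readability.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical in-memory model (NOT the real HBase schema):
#   variants:   {(chrom, pos): {sample: genotype}}; an absent sample key means "missing"
#   ref_blocks: {sample: [(chrom, start, end, genotype), ...]} taken from gVCF files
Variants = Dict[Tuple[str, int], Dict[str, str]]
RefBlocks = Dict[str, List[Tuple[str, int, int, str]]]

HOM_REF = "0/0"


def lookup_block(blocks: List[Tuple[str, int, int, str]],
                 chrom: str, pos: int) -> Optional[str]:
    """Return the genotype of the reference block covering pos, if any."""
    for b_chrom, start, end, genotype in blocks:
        if b_chrom == chrom and start <= pos <= end:
            return genotype
    return None


def fill_gaps(variants: Variants, ref_blocks: RefBlocks,
              samples: List[str]) -> int:
    """Fill missing genotypes in place and return how many cells were written."""
    filled = 0
    for (chrom, pos), calls in variants.items():
        missing = [s for s in samples if s not in calls]
        # Only variants where some, but not all, samples are missing.
        if not missing or len(missing) == len(samples):
            continue
        for sample in missing:
            genotype = lookup_block(ref_blocks.get(sample, []), chrom, pos)
            # Skip HOM_REF so the variants table stays sparse.
            if genotype is None or genotype == HOM_REF:
                continue
            calls[sample] = genotype
            filled += 1
    return filled
```

In this sketch a sample covered by a HOM_REF reference block stays absent from the row, which mirrors the cost-saving rule described above; any other genotype found in a reference block is written into the gap.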
This operation is only available in the HBase Storage Engine in Hadoop.
This operation is slightly different from the general aggregation. It is designed to work only with a family, and will write all the genotypes, even the HOM_REF, and the related sample data that validates the genotype.
In the next figure we can see an example of aggregating multiple variants, from different single-sample files.
Samples 1 and 2 have two overlapping variants. The variants from samples 3 and 4 are reference blocks from a gVCF.
There are some scenarios where the result of the aggregation operation is not obvious, and should be defined and handled carefully:
This scenario consists of having multiple overlapping positions in one variant. This may happen for several reasons:
<*> from the VCF spec v4.3 (known as <NON_REF> at GATK). The sample B will have the genotype <*>/<*> (i.e. 2/2 where <*> is the second alternate) for the deletion.
Overlap with a split multi-allelic variant
In this scenario, a variant from file A may overlap with many variants produced from the split of a multi-allelic variant from file B. The information in these split variants from B is the same (just rearranged), so we know exactly what is in this position. All the overlapping variants share the same FileEntry.call, which contains the original call of this variant. We should just take any of them.
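The "take any of them" rule can be illustrated with a short Python sketch. It assumes each overlapping entry is a plain dictionary with hypothetical `file`, `call` and `variant` keys, where `call` stands in for the FileEntry.call value; the function name `pick_one_per_call` and the example call strings are made up for illustration and do not reflect the real data model.

```python
from typing import Dict, List


def pick_one_per_call(overlapping: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep a single entry for each (file, original call) pair.

    Split variants that come from the same multi-allelic call carry the
    same information, so the first one seen is as good as any other.
    """
    chosen: Dict[tuple, Dict[str, str]] = {}
    for entry in overlapping:
        key = (entry["file"], entry["call"])
        chosen.setdefault(key, entry)  # keep the first entry, ignore the rest
    return list(chosen.values())


# Usage: two split variants from file B share one original call.
entries = [
    {"file": "A", "call": "call-a", "variant": "1:100:AT:A"},
    {"file": "B", "call": "call-b", "variant": "1:101:T:C"},
    {"file": "B", "call": "call-b", "variant": "1:101:T:G"},
]
selected = pick_one_per_call(entries)
```

Keying on both the file and the call keeps entries from different files apart even if their call strings happen to coincide.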