An index pipeline is the process of ingesting data into an OpenCGA Storage backend. We define a general pipeline that is used and extended for the supported bioformats, such as variants and alignments. This pipeline is extended with additional enrichment steps.
This concept is represented in Catalog to help track this status for the different files.
The indexing data pipeline consists of two steps: first, transform and validate the input raw data into an intermediate format; second, load it into the selected database. The input file format is VCF, accepting variations such as gVCF or aggregated VCFs.
Files are converted into Biodata models. The metadata and the data are serialized into two separate files. The metadata is stored in a file named <inputFileName>.file.json.gz, serializing in JSON a single instance of the biodata model VariantSource, which mainly contains the header and some general stats. Along with this file, the actual variants data is stored in a file named <inputFileName>.variants.avro.gz, with a set of variant records described as the biodata model Variant.
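For illustration, assuming an input file named platinum.vcf.gz and a literal substitution of <inputFileName>, the transform step produces:

    platinum.vcf.gz.file.json.gz        VariantSource metadata (header and general stats)
    platinum.vcf.gz.variants.avro.gz    variant records (biodata Variant models)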
VCF files are read using the HTSJDK library, which provides a syntactic validation of the data. Further validation actions will be taken, such as duplicate or overlapping variant detection.
By default, malformed variants are skipped and written into a third, optional file named <inputFileName>.malformed.txt. If the transform step generates this file, a curation process should be applied to repair the input. Otherwise, those variants are skipped.
All the variants in the transform step are normalized as defined in Variant Normalization. This helps to unify the variant representation, since the VCF specification allows multiple ways of referring to the same variant, with some ambiguities.
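As a made-up example, a multiallelic record is split into biallelic variants, and each allele is trimmed and left-aligned:

    VCF record:     1    1000    ATG    A,AG

    Normalized:     1:1001:TG:-    (deletion of TG)
                    1:1001:T:-     (deletion of T)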
Loading variants from multiple files into a single database will effectively merge them. In most scenarios, with a good normalization, merging variants is straightforward. But in some other scenarios, with multiple alternates or overlapping variants, a more complex merge is needed in order to create a consistent database. These situations can be solved when loading the file, by configuring the merge mode, or a posteriori in the aggregation operation.
The loading process depends on the implementation. Below you can find specific information for the two implemented back-ends.
- Reference genome in FASTA format used during the normalization step for a complete left alignment.
- Do not execute the normalization process. Warning: INDELs will be stored with the context base.
- Hint to indicate that the input file is in gVCF format.
- Indicate that the files to be loaded are part of a family. This will set loadHomRef to YES if it was in AUTO and execute 'family-index' afterwards.
- Load HOM_REF genotypes. (yes, no, auto). Default: auto
- Build sample index while loading. (yes, no, auto). Default: auto
- Indicate that the variants from a group of samples are split into multiple files, either by CHROMOSOME or by REGION. In either case, variants from different files must not overlap.
- Indicate the presence of multiple files for the same sample. Each file could be the result of a different vcf-caller or experiment over the same sample.
- Load archive data. (yes, no, auto). Default: auto
- Do not include the genotype information.
- Index including other sample data fields (i.e. FORMAT fields). Use "all", "none", or a CSV with the fields to load. Default: all
- Execute post-load checks over the database. Default: auto
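As a rough sketch, an indexing command combining some of these options might look as follows. Only --load-split-data appears elsewhere on this page; the command name and the remaining flags are illustrative and may differ between OpenCGA versions:

    # Hypothetical: index a dataset split by chromosome, loading HOM_REF genotypes.
    ./opencga.sh variant index --file chr1.vcf.gz --study my_study \
        --load-split-data CHROMOSOME \
        --load-hom-ref yes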
OpenCGA offers two different implementations of the StorageEngine, using two different backend databases, each of them with particular properties.
The MongoDB implementation stores all the variant information in one centralised collection, with some secondary helper collections. In order to merge new variants correctly with the already existing data, the engine uses a stage collection to keep track of the already loaded data. When loading multiple files at the same time, these files will first be written into this stage collection and then moved to the variants collection, all at the same time.
Using this stage collection, the engine is able to solve the complex merge situations when loading the file, without the need of an extra aggregation step. Therefore, this storage engine does not implement the aggregation operation. Depending on the level of detail required, the merge mode can be configured when loading the files.
For each variant that we load, we have to check whether it already exists in the database and, in that case, merge the new data with the existing variant. Otherwise, a new variant is created.
In basic mode, there will be unknown values for certain positions. We cannot determine whether the value was missing ( ./. ), reference ( 0/0 ), or a different genotype. The output value for unknown genotypes can be modified by the user when querying. By default, the missing genotype ( ./. ) is used.
In advanced mode, the variants gain a secondary alternate, and the field AD (Allele Depth) is rearranged in order to match the new allele order.
Loading new files will be much faster with the basic merge mode. This is because we no longer need to check whether the variant overlaps with any other already existing variant. We only need to know whether the variant exists in the database, whereas the overlap check takes a significant amount of time in advanced mode.
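For illustration, consider two single-sample files with different alternate alleles at the same position (sample names and counts are made up):

    File A:  1:1000:A:T    S1: GT=0/1, AD=10,5
    File B:  1:1000:A:G    S2: GT=0/1, AD=8,7

    Basic merge, variant 1:1000:A:T :
        S1: GT=0/1, AD=10,5
        S2: GT=./.               (unknown: could be 0/0, ./. or the overlapping A:G)

    Advanced merge, variant 1:1000:A:T with secondary alternate G (allele order REF,T,G):
        S1: GT=0/1, AD=10,5,0
        S2: GT=0/2, AD=8,0,7     (AD rearranged; unobserved alleles shown here as 0)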
The storage engine implementation for Hadoop is based on Apache HBase and Apache Phoenix. When loading a file, it will be stored (by default, entirely) in the archive table, and the variants (everything but the reference blocks) will be stored in the variants table, using a basic merge mode. Also, from each variant (unless otherwise specified), only samples with a genotype other than homozygous reference (HOM_REF, 0/0) will be loaded.
To obtain an advanced merge, including all the overlapping variants and the reference blocks, see the aggregation operation.
Most of the common queries will go to the variants table, but in case some extra information is required, the archive table can also be queried. There is also a third table that contains a secondary index for samples, to allow instant queries by genotype.
    <namespace>:<db-name>_variants
    <namespace>:<db-name>_archive_<study-id>
    <namespace>:<db-name>_sample_index_<study-id>
    <namespace>:<db-name>_meta
HBase natively supports multiple table compression algorithms. Compression algorithms can be configured for each of the tables. By default, SNAPPY compression is used.
Pre-splitting HBase tables is a common technique that reduces the number of splits and provides a better balance of the regions across the Hadoop cluster. We can configure the number of pre-splits for each of the tables.
    opencga.variant.table.presplit.size
        Pre-split size for the variant table.

    opencga.archive.table.presplit.size
        Pre-split size for the archive table.
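A minimal sketch with made-up values (the keys are from above; the numbers are illustrative and should be tuned to the cluster):

    opencga.variant.table.presplit.size = 100
    opencga.archive.table.presplit.size = 500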
In order to do an optimal pre-splitting, the storage engine needs to know an approximation of the number of files to be loaded. This number can be configured with:
With the Apache Solr secondary indexes we can query by any annotation field in HBase in subsecond time. But this does not help when querying by sample (or genotype).
Detailed information available here:
By default, the engine writes into the archive table all the information from the variants that are reference blocks with HOM_REF genotype. This information represents approximately between 66% and 90% of the original gVCF, so reducing this part can have a big impact on the final size of the archive table. This feature can help installations with tight disk resources, or simply when some of this information is not required at all for the analysis.
The fields to include can be configured using the following configuration parameter:
    opencga.archive.fields = QUAL,INFO:DP,FORMAT:GT,AD,DP
When loading multi-sample files, from each variant only samples with a genotype other than homozygous reference (HOM_REF, 0/0) will be loaded. We can specify to load all the data from the samples using the following parameter:
opencga.variant.table.load.reference
The default value is false. (#915)
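A minimal sketch enabling it (the key and the boolean type are taken from above; the value simply inverts the default):

    # Load HOM_REF sample data into the variants table as well.
    opencga.variant.table.load.reference = true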
The current implementation does not store reference calls in the variants table. This optimizes disk space and improves performance. The assumption is that when a sample genotype is not present, it was a reference call, since all the other genotypes, including missing, are stored.
The problem is that current variant callers are still far from perfect, and some variants with a reference call show very low coverage or quality scores. So, in some use cases, users might need to confirm that a reference call was good enough.
A simple solution for this is to treat low-quality reference calls as missing calls, so they are stored in the variants table in the same way as missing genotypes. By doing this, users will know that any reference call not present in the variants table has a good quality, and there is no need to fetch it.
Users can configure what counts as a low-quality reference block in the configuration file, for instance DP<5 AND GQ<20.
    opencga.archive.non-ref.filter

The filter is a list of <key><operator><value> conditions separated by ";".

Accepted keys:
    QUAL, FILTER, FORMAT:<format-key>, INFO:<info-key>

Accepted operators:
    >, <, >=, <=, =    for numerical values
    =                  for comma separated values
    !=                 only for FILTER keys

If a reference block does not have any of the required fields, it will pass the filter and will be treated as missing.
Examples:
    QUAL<5;FILTER=LowQual,LowGQ;FORMAT:DP<10
    FILTER!=PASS
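Following this grammar, the earlier DP<5 AND GQ<20 example could be written as shown below (assuming GQ is addressed as a FORMAT field and that ";" combines the two conditions):

    opencga.archive.non-ref.filter = FORMAT:DP<5;FORMAT:GQ<20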
See also: --load-split-data (OpenCGA#696).