An index pipeline is the process of ingesting data into an OpenCGA-Storage backend. We define a general pipeline that is used and extended for the supported bioformats like variants and alignments. This pipeline is extended by additional steps of enrichment.
This concept is represented in Catalog to help the tracking of this status in different files.
Indexing data pipeline consists in two steps, first transform and validate the input raw data into an intermediate format, and second, load it into the selected database. The input file format is VCF, accepting different variations like gVCF or aggregated VCFs
Files are converted Biodata models. The metadata and the data are serialized into two separated files. The metadata is stored into a file named <inputFileName>.file.json.gz serializing in json a single instance of the biodata model VariantSource, which mainly contains the header and some general stats. Along with this file, the real variants data is stored in a file named <inputFileName>.variants.avro.gz with a set of variant records described as the biodata model Variant.
VCF files are read using the library HTSJDK, which provides a syntactic validation of the data. Further actions on the validation will be taken, like duplicate or overlapping variants detection.
By default, malformed variants will be skipped and written into a third optional file named <inputFileName>.malformed.txt . If the transform step generates this file, a curation process should be taken to repair the file. Otherwise, the variants would be skipped.
All the variants in the transform step will be normalized as defined here: Variant Normalization. This will help to unify the variants representation, since the VCF specification allows multiple ways of referring to a variant and some ambiguities.
Loading variants from multiple files into a single database will effectively merge them. In most of the scenarios, with a good normalization, merging variants is strait forward. But in some other scenarios, with multiple alternates or overlapping variants, a more complex merge is needed in order to create a consistent database. This situations can be solved when loading the file configuring the merge mode, or a posteriori in the aggregation operation.
Loading process is dependent on the implementation. Here you can see some specific information for the two implemented back-ends.
The MongoDB implementation stores all the variant information in one centralised collection, with some secondary helper collections. In order to merge correctly new variants with the already existing data, the engine uses a stage collection to keep track of the already loaded data. In case of loading multiple files at the same time, this files will first be written into this stage collection, and then, moved to the variants collection, all at the same time.
Using this stage collection, the engine is able to solve the complex merge situations when loading the file, without the need of an extra aggregation step. Therefore, this storage engine does not implement the aggregation operation. Depending on the level of detail required, the merge mode can be configured when loading the files.
For each variant that we load we have to check if the it already exists in the database, and, in that case, merge the new data with the existing variant. Otherwise, create a new variant.
We may find two types of overlapping variants: Variants from different sources that are in the same position with same reference and alternate (same variant), and variants that are not same but their coordinates (over the reference genome) are overlapping.
At this point is when we define two modes of merging variants:
- Basic merge. Only merging variants from different sources that are the same.
- Advanced merge. In addition to the basic merge, add the overlapping variants as secondary alternates and rearrange genotypes.
It is expected to have more unknown values for basic merge than for advanced merge.
In the next figure we can see an example of merging multiple variants, from different single-sample files.
On the left, the input files. On the right we can see the merge result, depending on the merge mode, differences in red.
For basic mode, there will be unknown values for certain positions. We can not determine if the value was missing ( ./. ), reference ( 0/0 ), or a different genotype. The output value for unknown genotypes can be modified by the user when querying. By default, the missing genotype ( ./. ) will be used.
In the advanced mode, the variants have gained a secondary alternate, and the field AD (Allele Depth) has been rearranged in order to match with the new allele order.
Loading new files will be much faster with basic merge mode. Is is because we don't need now to check if the variant overlaps with any other already existing variant. We only need to know if the variant exists or not in the database, which takes a significant amount of time in advance mode.
How to configure the merge mode
The merge mode is defined when the first file is loaded, and can not be changed.
From the command line we should add --merge advanced or --merge basic
Hadoop - HBase
The storage engine implementation for Hadoop is based on Apache HBase and Apache Phoenix. When loading a file, it will be stored (by default, entirely) in the archive table, and the variants (everything but the reference blocks) will be stored in the variants table, using a basic merge mode. To obtain an advanced merge, including all the overlapping variants and the reference blocks, see the aggregation operation.
Most of the common queries will go to the variants table, but in case of requiring some extra information, the archive table can be also queried. There is also a third table that contains a secondary index for samples, to allow instant queries by genotype.
Table Naming Policy
- Variants table
- Archive table
- Sample index table
- Metadata table
Table compression algorithm
HBase supports multiple table compression algorithms natively. Compression algorithms can be configured for each of the tables. By default, SNAPPY compression is used.
Compression for the variant table.
Compression for the archive table.
Compression for the sample index table.
Pre-splitting HBase tables is a common technique that reduces the number of splits and provides a better balance of the regions across the Hadoop cluster. We can configure the number of pre-splits for each of the tables.
Pre-split size for the variant table.
Pre-split size for the archive table.
In order to do an optimal pre-splitting, the storage engine needs to know an approximation of the number of files to be loaded. This number can be configured with:
By default, 5000
Samples index table
With the Apache Solr secondary indexes we can query by any annotation field in HBase in subsecond time. But this can not help when querying by sample (or genotype).
Detailed information available here:
- Index Sample Genotypes in HBase (#838)
- Genotype index intersect in HBase (#862)
- Use variants SampleIndex when reading from MapReduce (#868)
Reduce archive data
By default, the engine writes in the archive table all the information from the variants that are reference blocks with HOM_REF genotype. This information represents, approximately between a 66% and a 90% of the original gVCF. So, reducing this part can have a big impact on the final size of the archive table. This feature can help to some installations with tied disk resources, or just because some information is not required at all for the analysis.
The fields to include can be configured using the following configuration parameter:
The default value is all. It can be configured with a list of fields to be included in the archive table.
- FORMAT to include all the fields from the sample format, or FORMAT:K1,K2,K3,... to include specific fields
- INFO to include all the fields from the file info, or INFO:K1,K2,K3,... to include specific fields
The above line will only include the QUAL, the DP from the file attributes, the GT, AD and DP from the sample data
- You can not load two files with the same sample in the same study. See OpenCGA#158.
There is an exception for this limitation for the scenarios where the variants were split in multiple files (by chromosome, by type, ...). In this case, you can use the parameter
- You can not index two files with the same name (e.g. /data/sample1/my.vcf.gz and /data/sample2/my.vcf.gz) in the same study. This limitation should not be a problem in any real scenario, where every VCF file usually has a different name. If two files have the same name, the most likely situation is that they contain the same samples, and this is already forbidden by the previous limitation.
Table of Contents:
- No labels