Sample Stats contain basic information about each indexed sample in a cohort.
Implementation
Sample Stats is implemented using Hadoop MapReduce over HBase.
Parameters
OpenCGA supports the following input parameters:
- Variant Query
- Sample list, cohort or query
Output
Files
Sample stats are calculated for each sample. They include the following information:
- The total number of variants.
- The number of variants per chromosome.
- The number of variants per consequence type.
- The number of variants per biotype.
- The number of variants per type (SNV, INDEL, ...).
- The number of variants per genotype.
- The transition-to-transversion ratio (ti/tv ratio).
- A heterozygosity score.
- A missingness score.
- A list of the most affected genes.
- The number of variants per indel length.
- A list of HPO and genes for loss of function (LoF) variants.
- A list of the most frequent variant traits.
Summary statistics are stored in a JSON file.
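As an illustration of how a few of these values could be derived, the following Python sketch computes per-sample counts, a ti/tv ratio, and simple heterozygosity and missingness scores, then prints them as JSON. It is not the OpenCGA MapReduce implementation. The record layout, the function name `sample_stats`, and the JSON field names are hypothetical, and the two score definitions (fraction of heterozygous calls among the sample's variants, fraction of records with a missing genotype) are assumptions made for the example.

```python
import json
from collections import Counter

# Purine<->purine and pyrimidine<->pyrimidine substitutions are transitions;
# every other single-nucleotide substitution is a transversion.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def sample_stats(records):
    """Compute a few summary statistics for one sample.

    `records` is a hypothetical list of dicts with the keys
    chromosome, ref, alt, type and genotype (e.g. "0/1", "1|1", "./.").
    """
    counts = {"chromosome": Counter(), "type": Counter(), "genotype": Counter()}
    total = ti = tv = het = missing = 0
    for rec in records:
        alleles = rec["genotype"].replace("|", "/").split("/")
        if "." in alleles:
            missing += 1  # genotype not called for this sample
            continue
        if set(alleles) == {"0"}:
            continue  # homozygous reference: the sample does not carry the variant
        total += 1
        counts["chromosome"][rec["chromosome"]] += 1
        counts["type"][rec["type"]] += 1
        counts["genotype"][rec["genotype"]] += 1
        if len(set(alleles)) > 1:
            het += 1
        if rec["type"] == "SNV":
            if (rec["ref"], rec["alt"]) in TRANSITIONS:
                ti += 1
            else:
                tv += 1
    return {
        "variantCount": total,
        "chromosomeCount": dict(counts["chromosome"]),
        "typeCount": dict(counts["type"]),
        "genotypeCount": dict(counts["genotype"]),
        "tiTvRatio": ti / tv if tv else None,
        # Assumed definitions, for illustration only.
        "heterozygosityScore": het / total if total else None,
        "missingnessScore": missing / len(records) if records else None,
    }


records = [
    {"chromosome": "1", "ref": "A", "alt": "G", "type": "SNV", "genotype": "0/1"},
    {"chromosome": "1", "ref": "C", "alt": "A", "type": "SNV", "genotype": "1/1"},
    {"chromosome": "2", "ref": "C", "alt": "T", "type": "SNV", "genotype": "0/1"},
    {"chromosome": "2", "ref": "AT", "alt": "A", "type": "INDEL", "genotype": "./."},
]
stats = sample_stats(records)
print(json.dumps(stats, indent=2))
```

In this toy input the sample carries three variants (two transitions and one transversion), and one record has a missing genotype.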
Index
Pre-computed stats are useful for filtering samples and can be indexed in OpenCGA Catalog using a predefined variable set.
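To show the kind of filtering that pre-computed stats make possible, here is a minimal Python sketch that selects samples by thresholds on two stored values. It works on a plain dictionary and does not call the OpenCGA Catalog API. The function `filter_samples`, the field names `tiTvRatio` and `missingnessScore`, and the thresholds are hypothetical.

```python
def filter_samples(stats_by_sample, min_ti_tv=None, max_missingness=None):
    """Return the sorted IDs of samples whose stats satisfy every given threshold.

    `stats_by_sample` is a hypothetical mapping of sample ID to a dict of
    pre-computed stats. A sample lacking a value that is being filtered on
    is excluded.
    """
    selected = []
    for sample_id, stats in stats_by_sample.items():
        ti_tv = stats.get("tiTvRatio")
        miss = stats.get("missingnessScore")
        if min_ti_tv is not None and (ti_tv is None or ti_tv < min_ti_tv):
            continue
        if max_missingness is not None and (miss is None or miss > max_missingness):
            continue
        selected.append(sample_id)
    return sorted(selected)


stats_by_sample = {
    "sampleA": {"tiTvRatio": 2.1, "missingnessScore": 0.01},
    "sampleB": {"tiTvRatio": 1.4, "missingnessScore": 0.02},
    "sampleC": {"tiTvRatio": 2.0, "missingnessScore": 0.20},
}
passing = filter_samples(stats_by_sample, min_ti_tv=1.8, max_missingness=0.05)
print(passing)  # ['sampleA']
```

Because the values are computed once and stored, a filter like this only reads the stored numbers and does not need to touch the variant data again.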
Useful Links