Stats in OpenCGA: syntax

In OpenCGA, stats refers to the arrangement of search results into categories based on indexed fields. Results are presented as a list of buckets, where each bucket consists of 1) a field value and 2) a count of how many matching documents have that value. In the literature, this concept is also known as facets or faceting; OpenCGA stats are in fact based on Solr faceted search. In addition, stats support:

Ranges, which allow users to count how many documents fall within each interval of a numeric field.
Aggregation functions such as average, maximum, minimum and percentiles.
Nested faceted search.
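
As a rough illustration of the bucket idea described above, the following sketch counts how many documents share each value of a field. The sample documents are made up for the example, and the counting is done in plain Python rather than by OpenCGA or Solr.

```python
from collections import Counter

# Made-up sample documents; only the "chromosome" field matters here.
documents = [
    {"chromosome": "1"},
    {"chromosome": "2"},
    {"chromosome": "1"},
]

# Each bucket pairs a field value with the number of matching documents.
buckets = Counter(doc["chromosome"] for doc in documents)

for value, count in buckets.most_common():
    print(value, count)
```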

The basic syntax for stats (or facets) is:

Basic specification

field_name[value1,value2,value3...]:limit

Parameters:

Parameter	Description
field_name	The field name to produce buckets from. Mandatory.
value1,value2,value3...	The values of the field to return counts for. They must be enclosed in square brackets. Optional.
limit	Number of buckets (counts) to show. Optional.

Users can request multiple stats by separating them with semicolons, e.g., chromosome[1,2];biotypes
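
The syntax above can be assembled programmatically. The sketch below is only a string builder: term_facet and join_facets are hypothetical helper names introduced for this example, not part of any OpenCGA client.

```python
def term_facet(field, values=None, limit=None):
    """Build one facet string of the form field_name[value1,value2,...]:limit."""
    facet = field
    if values:
        # Selected values go inside square brackets, separated by commas.
        facet += "[" + ",".join(str(v) for v in values) + "]"
    if limit is not None:
        # The optional limit follows a colon.
        facet += ":" + str(limit)
    return facet


def join_facets(*facets):
    """Combine several facet strings into one query, separated by semicolons."""
    return ";".join(facets)


query = join_facets(term_facet("chromosome", [1, 2]), term_facet("biotypes"))
print(query)
```

The resulting string reproduces the example above, chromosome[1,2];biotypes.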

Ranges

When asking for ranges, the result contains multiple buckets over a numeric field. You must specify the field name, the lower and upper bounds, and the step (bucket size).

Range specification

field_name[start..end]:step

Range parameters:

Parameter	Description
field_name	The numeric field name to produce range buckets from. Mandatory.
start	Lower bound of the ranges. Mandatory.
end	Upper bound of the ranges. Mandatory.
step	Size of each range bucket produced. Mandatory.

E.g.: gerp[0..5]:0.2
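
To see how many buckets a range specification implies, the sketch below parses the string and lists the lower bound of each bucket. It assumes each bucket starts at start plus a multiple of step and that the last bucket ends at end; the exact boundary handling is decided by the server, so treat this as an approximation. parse_range and bucket_starts are hypothetical helper names.

```python
import re
from decimal import Decimal

# field_name[start..end]:step, with plain decimal numbers.
NUMBER = r"-?\d+(?:\.\d+)?"
RANGE_RE = re.compile(rf"^(\w+)\[({NUMBER})\.\.({NUMBER})\]:({NUMBER})$")


def parse_range(spec):
    """Split a range facet string into field, start, end and step."""
    match = RANGE_RE.match(spec)
    if match is None:
        raise ValueError(f"not a range facet: {spec!r}")
    field, start, end, step = match.groups()
    # Decimal avoids floating-point drift when adding the step repeatedly.
    return field, Decimal(start), Decimal(end), Decimal(step)


def bucket_starts(start, end, step):
    """Return the lower bound of every bucket between start and end."""
    starts = []
    current = start
    while current < end:
        starts.append(current)
        current += step
    return starts


field, start, end, step = parse_range("gerp[0..5]:0.2")
starts = bucket_starts(start, end, step)
print(field, len(starts))
```

Under these assumptions, gerp[0..5]:0.2 yields 25 buckets, the first starting at 0 and the last at 4.8.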

Aggregation functions

Aggregation functions, also called facet functions, analytic functions, or metrics, compute a single value, such as an average or a count of unique values, over a domain (each facet bucket).

Aggregation specification

aggregation_function(field_name)

List of aggregation functions:

Aggregation function	Description	Example
avg	Average of numeric values	avg(gerp)
min	Minimum value	min(sift)
max	Maximum value	max(caddScaled)
unique	Number of unique values	unique(biotypes)
hll	Distributed cardinality estimate via the HyperLogLog algorithm	hll(type)
percentile	Percentile estimates via the t-digest algorithm. Calculates the 1st, 10th, 25th, 50th, 75th, 90th and 99th percentiles.	percentile(gerp)
sumsq	Sum of squares of field or function	sumsq(caddRaw)
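
The sketch below illustrates what some of these functions compute for the documents in a single bucket. It is a plain-Python approximation with made-up sample documents, not the way Solr evaluates them; hll and percentile are left out because they are estimates produced by dedicated algorithms. The aggregate helper is a hypothetical name.

```python
from statistics import mean

# Exact, in-memory counterparts of some aggregation functions.
AGGREGATIONS = {
    "avg": mean,
    "min": min,
    "max": max,
    "unique": lambda values: len(set(values)),
    "sumsq": lambda values: sum(v * v for v in values),
}


def aggregate(spec, documents):
    """Evaluate a spec such as avg(gerp) over a list of documents."""
    name, _, rest = spec.partition("(")
    field = rest.rstrip(")")
    # Documents that lack the field are skipped.
    values = [doc[field] for doc in documents if field in doc]
    return AGGREGATIONS[name](values)


# Made-up documents standing in for the contents of one bucket.
docs = [
    {"gerp": 1.0, "type": "SNV"},
    {"gerp": 3.0, "type": "INDEL"},
    {"gerp": 5.0, "type": "SNV"},
]

for spec in ("avg(gerp)", "max(gerp)", "unique(type)", "sumsq(gerp)"):
    print(spec, aggregate(spec, docs))
```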

Nested facets

Nested facets allow users to nest bucketing terms, ranges or aggregations. To specify nested facets, separate the levels with the >> symbol.

Some examples:

chromosome>>biotypes

chromosome[1,2,3,4]>>biotypes>>gerp[0..5]:0.25
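
One way to read a nested specification is as a chain of levels, outermost first. The sketch below splits a string on >> and nests the levels as dictionaries. The helper names and the "facet"/"nested" keys are chosen for this example only and do not reflect the structure of an OpenCGA response.

```python
def split_nested(spec):
    """Return the facet levels of a nested specification, outermost first."""
    return [level.strip() for level in spec.split(">>")]


def to_tree(levels):
    """Nest the levels so that each one contains the next level down."""
    tree = None
    # Build from the innermost level outwards.
    for level in reversed(levels):
        tree = {"facet": level, "nested": tree}
    return tree


levels = split_nested("chromosome[1,2,3,4]>>biotypes>>gerp[0..5]:0.25")
tree = to_tree(levels)
print(levels)
print(tree["nested"]["facet"])
```

Here the chromosome buckets are the outermost level, each one is subdivided by biotypes, and each of those is subdivided again into gerp ranges.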
