In OpenCGA, stats refers to the arrangement of search results into categories based on indexed fields. Results are presented as a list of buckets, where each bucket is composed of 1) a field value and 2) a count of the matching documents that have that value. In the literature, this concept is also known as facets or faceting. In fact, OpenCGA stats are based on Solr faceted search.
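In addition to counts per field value, stats allow users to query ranges, aggregation functions and nested facets, each described in this section.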
The basic syntax for stats (or facets) is:
field_name[value1,value2,value3...]:limit
Parameters:
Parameter | Description |
---|---|
field_name | The field name to produce buckets from. Mandatory. |
value1,value2,value3... | Values of the field to count, separated by commas and enclosed in square brackets. Optional. |
limit | Maximum number of buckets (field values with their counts) to return. Optional. |
E.g.: ...&fields=chromosome[1,2]
Users can query multiple stats in one request by separating them with semicolons.
E.g.: ...&fields=chromosome[1,2];types
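The syntax above can be assembled programmatically. The following is a minimal sketch in Python, not part of OpenCGA: the helper names `facet_term` and `fields_param` are hypothetical, and the sketch only builds the value of the `fields` parameter as a string; it does not send a request.

```python
from typing import Optional, Sequence


def facet_term(field_name: str,
               values: Optional[Sequence[str]] = None,
               limit: Optional[int] = None) -> str:
    """Build one expression of the form field_name[value1,value2,...]:limit.

    Hypothetical helper for illustration; only field_name is mandatory.
    """
    expr = field_name
    if values:
        # Optional list of values, enclosed in square brackets.
        expr += "[" + ",".join(str(v) for v in values) + "]"
    if limit is not None:
        # Optional number of buckets to return.
        expr += f":{limit}"
    return expr


def fields_param(*terms: str) -> str:
    """Join several stats expressions with semicolons."""
    return ";".join(terms)


print(fields_param(facet_term("chromosome", ["1", "2"]), facet_term("types")))
# prints: chromosome[1,2];types
```

Keeping each expression as a separate string until the final join makes it easy to add or drop individual stats without editing one long query string by hand.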
When asking for ranges, the result contains multiple buckets over a numeric field. You must specify the field name, the lower and upper bounds and the step or bucket size.
field_name[start..end]:step
Range parameters:
Parameter | Description |
---|---|
field_name | The numeric field name to produce range buckets from. Mandatory. |
start | Lower bound of the ranges. Mandatory. |
end | Upper bound of the ranges. Mandatory. |
step | Size of each range bucket produced. Mandatory. |
E.g.: ...&fields=gerp[0..10]:0.5
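To make the relationship between start, end and step concrete, here is a small Python sketch that lists the bucket boundaries a range expression describes. It is an illustration only: the function name `range_buckets` is hypothetical, and how the server treats the bounds (inclusive or exclusive, and what happens when the step does not divide the range evenly) is an assumption here, not something this page specifies.

```python
from decimal import Decimal


def range_buckets(start: str, end: str, step: str):
    """Return (lower, upper) pairs covering start..end in increments of step.

    Decimal avoids floating-point drift when adding steps such as 0.5 or 0.1.
    Assumption: a final bucket that would overshoot `end` is cut at `end`.
    """
    lo, hi, size = Decimal(start), Decimal(end), Decimal(step)
    if size <= 0 or hi <= lo:
        raise ValueError("step must be positive and end must exceed start")
    buckets = []
    while lo < hi:
        upper = min(lo + size, hi)
        buckets.append((lo, upper))
        lo = upper
    return buckets


# Boundaries described by gerp[0..10]:0.5
buckets = range_buckets("0", "10", "0.5")
print(len(buckets))  # prints: 20
```

Under these assumptions, `gerp[0..10]:0.5` describes 20 buckets, the first starting at 0 and the last ending at 10.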
Aggregation functions, also called facet functions, analytic functions, or metrics, compute a statistic over the documents in a domain (for example, each facet bucket).
aggregation_function(field_name)
List of aggregation functions:
Aggregation function | Description | Example |
---|---|---|
avg | Average of numeric values | avg(gerp) |
min | Minimum value | min(sift) |
max | Maximum value | max(caddScaled) |
unique | Number of unique values | unique(biotypes) |
hll | Distributed cardinality estimate via the HyperLogLog algorithm | hll(type) |
percentile | Percentile estimates via the t-digest algorithm. Calculates the 1st, 10th, 25th, 50th, 75th, 90th and 99th percentiles. | percentile(gerp) |
sumsq | Sum of squares of field or function | sumsq(caddRaw) |
E.g.: ...&fields=percentile(gerp);max(caddScaled)
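As an illustration of what the exact aggregation functions return, the Python sketch below computes avg, min, max, unique and sumsq over a small list of made-up scores. The values are invented for the example and the code runs locally; it does not call OpenCGA or Solr. hll and percentile are left out because they are estimates whose results depend on the underlying algorithm.

```python
import statistics

# Made-up values standing in for a numeric field such as gerp.
scores = [1.5, -0.5, 3.0, 3.0, 2.0]

aggregations = {
    "avg": statistics.fmean(scores),        # average of numeric values
    "min": min(scores),                     # minimum value
    "max": max(scores),                     # maximum value
    "unique": len(set(scores)),             # number of unique values
    "sumsq": sum(v * v for v in scores),    # sum of squares
}

for name, value in aggregations.items():
    print(name, value)
```

When an aggregation is requested inside a bucketed stat, the same computation applies to the documents of each bucket rather than to the whole result set.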
Nested facets allow users to nest bucketing terms, ranges or aggregations inside one another. To specify a nested facet, separate the outer and inner expressions with the symbols >>.
E.g.: ...&fields=chromosome[5,6]>>type
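The bucket structure that a nested expression such as `chromosome[5,6]>>type` describes can be sketched locally in Python. The documents and the `type` values below are made up, and `nested_facet` is a hypothetical helper; the sketch shows the shape of the counts, not the actual response format returned by OpenCGA.

```python
from collections import Counter, defaultdict

# Made-up documents for illustration only.
variants = [
    {"chromosome": "5", "type": "SNV"},
    {"chromosome": "5", "type": "SNV"},
    {"chromosome": "5", "type": "INDEL"},
    {"chromosome": "6", "type": "SNV"},
    {"chromosome": "7", "type": "SNV"},
]


def nested_facet(docs, outer, outer_values, inner):
    """Count inner-field values within each requested outer-field bucket."""
    result = defaultdict(Counter)
    for doc in docs:
        if doc[outer] in outer_values:  # keep only the requested outer values
            result[doc[outer]][doc[inner]] += 1
    return {key: dict(counter) for key, counter in result.items()}


counts = nested_facet(variants, "chromosome", {"5", "6"}, "type")
print(counts)
# prints: {'5': {'SNV': 2, 'INDEL': 1}, '6': {'SNV': 1}}
```

The document on chromosome 7 is excluded because only the values 5 and 6 were requested for the outer field.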