Search engines are NoSQL database management systems dedicated to searching data content. Beyond general optimization for this kind of workload, they typically offer the following features:
Support for complex search expressions
Full text search
Stemming (reducing inflected words to their stem)
Ranking and grouping of search results
Geospatial search
Distributed search for high scalability
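To make a few of these features concrete, here is a toy sketch of full text search with stemming and ranking. It is an illustration only, not how Solr or Elasticsearch work internally: the `stem` rule is a deliberately naive suffix stripper, ranking is plain term frequency, and all names (`TinyIndex`, the document IDs and texts) are made up for the example.

```python
import re
from collections import Counter, defaultdict


def stem(word):
    """Naive suffix stripping; real engines use proper stemmers (e.g. Snowball)."""
    for suffix in ("ing", "es", "s", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase, split on non-alphanumeric characters, then stem each token."""
    return [stem(token) for token in re.findall(r"[a-z0-9]+", text.lower())]


class TinyIndex:
    """Minimal in-memory inverted index: term -> {doc_id: term frequency}."""

    def __init__(self):
        self.postings = defaultdict(Counter)

    def add(self, doc_id, text):
        for term in tokenize(text):
            self.postings[term][doc_id] += 1

    def search(self, query):
        """Return (doc_id, score) pairs, highest summed term frequency first."""
        scores = Counter()
        for term in tokenize(query):
            for doc_id, tf in self.postings.get(term, {}).items():
                scores[doc_id] += tf
        return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


index = TinyIndex()
index.add("d1", "Missense variants in BRCA2")
index.add("d2", "A missense variant affecting splicing; the variant is rare")

# "variants" and "variant" share one stem, so both documents match.
results = index.search("variant")
print(results)
```

Because the query and the documents go through the same `tokenize` step, inflected forms are matched without the user having to list them.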
Search engines are used in OpenCGA as a complementary engine to improve the performance of some queries and aggregations, full text search and faceted queries against the Variant database.
Apache Solr
Apache Solr 6.x is a highly reliable, scalable and fault-tolerant NoSQL database. It provides distributed indexing, replication, load-balanced querying, automated failover and recovery, centralised configuration and more.
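As a rough illustration of what a faceted query looks like, the sketch below builds a request URL for Solr's standard select handler using the common parameters `q`, `fq`, `rows`, `facet`, `facet.field` and `wt`. The host, port, collection name (`variants`) and field names (`type`, `biotypes`) are assumptions chosen for the example, not the configuration OpenCGA necessarily uses. The URL is only built, not sent.

```python
from urllib.parse import urlencode

# Assumed location of a Solr collection; adjust to the real deployment.
SOLR_SELECT_URL = "http://localhost:8983/solr/variants/select"


def build_facet_query(filters, facet_field):
    """Build a select URL that counts matching documents per facet value.

    filters     -- list of Solr filter queries (fq), e.g. ["type:SNV"]
    facet_field -- field whose distinct values should be counted
    """
    params = [
        ("q", "*:*"),          # match every document, then narrow with fq
        ("rows", "0"),         # only facet counts are wanted, not documents
        ("facet", "true"),
        ("facet.field", facet_field),
        ("wt", "json"),
    ]
    params += [("fq", f) for f in filters]
    return SOLR_SELECT_URL + "?" + urlencode(params)


# Hypothetical field names used purely for illustration.
url = build_facet_query(["type:SNV"], "biotypes")
print(url)
```

Setting `rows` to 0 keeps the response small when only the aggregation is of interest, which is the typical shape of a faceted query.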
Elasticsearch
Elasticsearch is a distributed, RESTful search and analytics engine capable of solving a growing number of use cases. As the heart of the Elastic Stack, it centrally stores your data so you can discover the expected and uncover the unexpected.
Index with Search Engine
opencga-analysis.sh variants index-search ...
Variants schema
The goal is to help the current storage engine by improving the performance of complex queries, not to replace it. There is no point in loading the whole database into the search engine and duplicating all the data, so only a subset of fields is stored. This keeps the size of the search engine database under control and the dataset manageable.
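The idea of storing only a subset of fields can be sketched as a simple projection applied to each variant before it is sent to the search engine. The field names below (`INDEXED_FIELDS` and the keys of `full_variant`) are hypothetical and chosen only to show the shape of the approach; the real search schema may differ.

```python
# Hypothetical set of fields worth indexing for search and aggregation.
INDEXED_FIELDS = ("id", "chromosome", "start", "type", "genes", "consequence_types")


def to_search_document(variant):
    """Keep only the indexed fields; everything else stays in the storage engine."""
    return {key: variant[key] for key in INDEXED_FIELDS if key in variant}


# A made-up variant record with bulky fields that should not be duplicated.
full_variant = {
    "id": "1:1000:A:T",
    "chromosome": "1",
    "start": 1000,
    "type": "SNV",
    "genes": ["GENE1"],
    "consequence_types": ["missense_variant"],
    "samples": [{"sample": "s1", "gt": "0/1"}],
    "raw_annotation": {"payload": "large nested structure"},
}

doc = to_search_document(full_variant)
print(sorted(doc))
```

The full record remains available in the storage engine, so a search hit only needs to carry enough information (here the `id`) to look it up there.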
Most of the Variant queries use filters over VariantAnnotation.