OpenCGA uses dependencies from Hortonworks HDP-2.5.0 internally. It has not been tested with other flavours of Hadoop.
Download or pull the version you want to try.
You can build the application from the sources by executing:
You can customize some configuration parameters by adding them to the build command with -D<param>=<value>. Some interesting parameters are:
- OPENCGA.INSTALLATION.DIR: changes the installation directory.
- OPENCGA.CLIENT.REST.HOST: the address of the REST server. For this tutorial we are going to use an embedded REST server.
- OPENCGA.CELLBASE.REST.HOST: specifies the CellBase installation.
- OPENCGA.CELLBASE.VERSION: specifies the CellBase version.
- OPENCGA.STORAGE.DEFAULT_ENGINE: specifies the default storage engine. The default is "mongodb", so we would need to add --storage-engine hadoop to each command. Compile with -DOPENCGA.STORAGE. to avoid that.
To see the rest of the configurable parameters, check the default-config profile in the main pom.xml.
For example, to change the default engine and the rest host, execute:
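As a rough sketch of such an invocation, the snippet below assembles a Maven command line from the two parameters named above. The Maven goals (clean install -DskipTests) and the REST host URL are assumptions made for illustration and are not taken from this page, so adjust them to your checkout and server.

```shell
# Values to bake into the build. REST_HOST is a hypothetical address.
STORAGE_ENGINE="hadoop"
REST_HOST="http://localhost:9090/opencga"

# Assemble the command; -D<param>=<value> is the syntax described above.
# The Maven goals (clean install -DskipTests) are an assumption.
MVN_CMD="mvn clean install -DskipTests"
MVN_CMD="$MVN_CMD -DOPENCGA.STORAGE.DEFAULT_ENGINE=$STORAGE_ENGINE"
MVN_CMD="$MVN_CMD -DOPENCGA.CLIENT.REST.HOST=$REST_HOST"

# Print the command for review; run it from the source root once it looks right.
echo "$MVN_CMD"
```

Printing the command first makes it easy to check the parameter names against the default-config profile before starting a long build.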
Then copy the application (the content of the build folder) into the installation directory. By default, and in this tutorial, this is /opt/opencga.
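The copy step could look like the following minimal sketch. The function name install_opencga is hypothetical; it only wraps mkdir and cp, and it assumes you have write permission on the target directory.

```shell
# Copy the contents of the build folder into the installation directory.
# Usage: install_opencga <build_dir> <install_dir>
# (install_opencga is a hypothetical helper, not part of OpenCGA.)
install_opencga() {
    build_dir="$1"
    install_dir="$2"
    if [ ! -d "$build_dir" ]; then
        echo "Build folder not found: $build_dir" >&2
        return 1
    fi
    mkdir -p "$install_dir" || return 1
    # "build_dir/." copies the folder's content, including hidden files.
    cp -r "$build_dir"/. "$install_dir"/
}
```

For this tutorial the call would be install_opencga build /opt/opencga, run from the source root (writing under /opt may require elevated permissions).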
- See Download and Installation for more information.
Needless to say, the computer where OpenCGA is installed must have access to the Hadoop cluster.
In order to interact with Hadoop, we need to provide its configuration files. In OpenCGA there are two ways of doing that, depending on how Hadoop is accessed.
a) Hadoop node. Full access.
This configuration is for Hadoop client nodes (or local installations, or Hadoop nodes) where the commands 'hadoop', 'yarn' and 'hbase' are installed and the client configuration is up to date. The script bin/opencga-env.sh will add the configuration files to the Java classpath. Nothing else is needed.
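To verify that a machine qualifies as such a client node, one option is to check that the three commands are on the PATH. The helper below is a hypothetical sketch (check_commands is not part of OpenCGA) built on the standard command -v lookup.

```shell
# Report which of the given commands are missing from PATH.
# Returns 0 if all are found, 1 otherwise.
check_commands() {
    missing=0
    for cmd in "$@"; do
        if command -v "$cmd" >/dev/null 2>&1; then
            echo "found:   $cmd"
        else
            echo "missing: $cmd"
            missing=1
        fi
    done
    return "$missing"
}

# The three client commands named above.
check_commands hadoop yarn hbase || echo "This node is not a full Hadoop client."
```

This only checks that the commands exist; it does not confirm that the client configuration points at the right cluster.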
In this scenario, you will be able to execute these commands:
b) External server. Read only.
Otherwise, we need to obtain the configuration files from the Hadoop cluster. In this scenario, just copy the configuration files into a folder called etc in the installation directory. This folder is automatically added to the classpath. With this configuration, you will only be able to execute queries.
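A minimal sketch of that copy is shown below. It assumes the relevant files follow the usual Hadoop *-site.xml naming (for example core-site.xml, hdfs-site.xml, hbase-site.xml); this page does not list which files are required, and copy_hadoop_conf is a hypothetical helper name.

```shell
# Copy Hadoop client configuration files (*-site.xml) into <install_dir>/etc.
# Usage: copy_hadoop_conf <conf_dir> <install_dir>
# Assumption: the needed files match *-site.xml; adjust the pattern if not.
copy_hadoop_conf() {
    conf_dir="$1"
    etc_dir="$2/etc"
    mkdir -p "$etc_dir" || return 1
    copied=0
    for f in "$conf_dir"/*-site.xml; do
        # If the glob matched nothing it stays literal and this test skips it.
        [ -f "$f" ] || continue
        cp "$f" "$etc_dir"/ || return 1
        copied=$((copied + 1))
    done
    echo "copied $copied file(s) to $etc_dir"
    # Succeed only if at least one file was copied.
    [ "$copied" -gt 0 ]
}
```

With the files already fetched from the cluster into a local folder, say ./hadoop-conf (a hypothetical path), the call would be copy_hadoop_conf ./hadoop-conf /opt/opencga.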
To simplify the installation, we are going to use the embedded server for the REST API.
- See Getting started in 5 min for more info.
Indexing a VCF file
For this test, we are going to use sample VCF data from the Platinum Genomes. You can use any other file, but all the examples below use the VCF file platinum-genomes-vcf-NA12877_S1.genome.vcf.gz.
You can find other files to load in this link: http://swdev.bioinfo.hpc.cam.ac.uk/downloads/datasets/vcf/platinum_genomes/gz/
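As an illustration, the snippet below builds a download URL from the listing address and the file name used in this tutorial. It assumes the example file sits directly under that listing, which this page does not state explicitly, so check the listing in a browser if the download fails.

```shell
# Base URL and file name as given in this tutorial.
BASE_URL="http://swdev.bioinfo.hpc.cam.ac.uk/downloads/datasets/vcf/platinum_genomes/gz"
VCF_FILE="platinum-genomes-vcf-NA12877_S1.genome.vcf.gz"

# Assumption: the file is located directly under the listing above.
VCF_URL="$BASE_URL/$VCF_FILE"
echo "$VCF_URL"

# Uncomment to download (-O keeps the remote file name, -f fails on HTTP errors):
# curl -f -O "$VCF_URL"
```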
Once OpenCGA is installed and running, we need to create a new project and study in catalog, and register our VCF file. You can also download all the files from the link above.
Once everything is set up, we just need to load the files. This command line will create an internal job that will be executed by the catalog daemon.
Optionally, we can use the opencga-analysis.sh command line for a synchronous execution:
For testing purposes, it may be useful to have a standalone installation of OpenCGA-Storage. You can find another build folder at opencga/opencga-storage/build/ that contains only the binaries for storage.
A simple indexing run can be done by executing the following command:
Annotate the variants database.
At this point, the last step is to annotate the variants. Although this can be done at the same time as indexing the variant files, it may be clearer to run it as a separate execution:
This will annotate all the variants in the database that have no annotation, skipping the already annotated variants.
And we are done! At this point we are ready to query variants. Here are some example commands:
- Count number of variants
./opencga.sh variant query --study platinum --count
- Get the first 10 variants from chromosome 8
./opencga.sh variant query --study platinum --region 8 --limit 10 --sort
- Count variants in gene BRCA2
./opencga.sh variant query --study platinum --gene BRCA2 --count
You can find the full list of options in the help:
./opencga.sh variant query --help
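When the same query has to be repeated for several genes, a small wrapper can generate the commands. The sketch below only prints them (a dry run) and uses just the flags shown in the examples above; gene_count_cmd is a hypothetical helper, and BRCA1 is an extra gene name added for illustration.

```shell
# Build the count query for one gene, using only flags shown above.
# Usage: gene_count_cmd <study> <gene>
gene_count_cmd() {
    echo "./opencga.sh variant query --study $1 --gene $2 --count"
}

# Print one command per gene; BRCA1 is an illustrative addition.
for gene in BRCA2 BRCA1; do
    gene_count_cmd platinum "$gene"
done
```

Reviewing the printed commands before running them avoids launching a batch of queries with a mistyped study or gene name.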
You can find other query examples at this other tutorial: Querying Variants with the Command Line