This tutorial first guides you through downloading a set of raw files from several data sources; these raw files contain the core data that will populate the CellBase knowledgebase. It then shows how to build the json documents that are loaded into the knowledgebase. Nevertheless, we have already processed all these data, and the resulting json documents are available through our FTP server for users who wish to skip these two sections. Downloading raw files from the original sources and building the data models can be tricky, so we encourage users to rely on our pre-built data models (json files) and to skip the download of raw files and the subsequent building of the data models. Our pre-built json documents (data models) are available from our FTP server.
You can then jump directly to the Load Data Models section of this tutorial.
For those users willing to build the CellBase knowledgebase from scratch, please follow the sections below.
Please note: allele population frequencies datasets are processed following a different pipeline; dedicated sections for them can be found below.
Download can be done through the CellBase CLI:
A number of datasets can be downloaded, as indicated by the built-in documentation: genome, gene, variation, variation_functional_score, regulation, protein, conservation, clinical_variants, repeats, svs. An all option is implemented for the --data parameter to allow downloading all data with a single command. Some datasets (e.g. gene) need the ENSEMBL Perl API to be properly installed in order to be fully downloaded. Please note: all data can be downloaded, built and loaded into the database without the ENSEMBL API, but some bits may be missing, e.g. gene xrefs.
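The full list of datasets and options accepted by the download command can be printed through the built-in help. As a minimal sketch (the exact help flag is an assumption; running the subcommand without arguments usually prints its usage as well):
cellbase/build/bin$ ./cellbase.sh download --help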
For example, to download all human (GRCh37) data from all sources and save it into the /tmp/data/cellbase/v4/ directory, run:
cellbase/build/bin$ ./cellbase.sh download -a GRCh37 --common /tmp/data/cellbase/v4/common/ -d all -o /tmp/data/cellbase/v4/ -s hsapiens
Please note: make sure you are located within the cellbase/build/bin directory before running the download command; some Perl scripts that use the ENSEMBL API may not run properly otherwise. Also note that the COSMIC server requires a login, and therefore the CosmicMutantExport.tsv.gz file must be manually downloaded from their web page:
Please also note that large files will be downloaded, so the time needed for completion may range from minutes to hours. If the download was successful, you can proceed to building the json objects that will be loaded into the corresponding database.
Downloading population frequencies datasets
Population frequencies datasets must be manually downloaded from the source repositories (see the example command sketched after this list):
- GONL: http://www.nlgenome.nl/
- gnomAD: https://data.broadinstitute.org/gnomAD
- 1000 Genomes Project: http://www.internationalgenome.org/
- UK10K: http://www.uk10k.org/data.html
- ESP: http://evs.gs.washington.edu/EVS/
- DiscovEHR: http://discovehrshare.com/downloads
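As an illustration only, some of these projects also publish their VCFs on public FTP sites; the path and file name below are assumptions based on the 1000 Genomes Project phase 3 release layout and may be outdated, so always check the project's own download pages:
# hypothetical example: fetch one per-chromosome 1000 Genomes phase 3 VCF; verify the exact file names in the release directory first
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz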
ENSEMBL VCFs for ENSEMBL variation data must be manually downloaded as well:
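As a hedged sketch, assuming the standard ENSEMBL FTP layout (the release number and exact path are assumptions and should be checked against the ENSEMBL FTP site for your CellBase version):
# hypothetical example: GRCh37 variation VCF from the ENSEMBL FTP, release 87
wget ftp://ftp.ensembl.org/pub/grch37/release-87/variation/vcf/homo_sapiens/Homo_sapiens.vcf.gz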
Build Data Models
The process can be carried out with the CellBase CLI: the build command integrates data from the different sources into the corresponding data models. For example, you can build all human (GRCh37) data models reading the files from the /tmp/data/cellbase/v4/homo_sapiens_grch37/ directory created in the Download Sources section and saving the result in /tmp/data/cellbase/v4/homo_sapiens_grch37/mongodb/.
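A minimal sketch of the corresponding command, assuming the build subcommand accepts the same -d/-i/-o/-s/-a parameters used by the download and load examples in this tutorial (check ./cellbase.sh build --help for the actual options):
cellbase/build/bin$ ./cellbase.sh build -d all -i /tmp/data/cellbase/v4/homo_sapiens_grch37/ -o /tmp/data/cellbase/v4/homo_sapiens_grch37/mongodb/ -s hsapiens -a GRCh37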
Note: the building process for the whole CellBase dataset may require up to 16GB of RAM and may take up to ~24h, depending on the hardware.
Building variation data models
First, the allele population frequencies datasets must be processed. To do this, the VCF files downloaded in the "Downloading population frequencies datasets" section must be loaded into an OpenCGA installation. Please refer to Index Pipelines and Getting Started in 5 minutes for further details on how to do this with an OpenCGA installation.
Once the OpenCGA installation is fully loaded, allele population frequencies must be parsed and/or calculated for each study. Please refer to Operations for further details on how to perform this operation with an OpenCGA installation.
Then, population frequencies must be exported into CellBase-compliant json files. Please refer to <exportVariants> for further details on how to perform this operation with an OpenCGA installation.
With this, the population frequencies are ready to be processed by CellBase. However, before running the CellBase CLI that generates the final .json.gz files, the ENSEMBL variation VCF file must be split into multiple files, one per chromosome:
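One way to do the split is with tabix and bgzip from htslib; this is only a sketch, assuming the downloaded VCF is bgzip-compressed and named Homo_sapiens.vcf.gz (adjust the file name and chromosome list as needed):
# index the ENSEMBL variation VCF, then extract one bgzipped VCF per chromosome
tabix -p vcf Homo_sapiens.vcf.gz
for chr in $(seq 1 22) X Y MT; do
    tabix -h Homo_sapiens.vcf.gz ${chr} | bgzip -c > chr${chr}.vcf.gz
done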
All data is now ready for generating the final variation json files.
Please note that the actual build of the variation collection is currently not performed with the build command of the CLI, but by annotating the ENSEMBL variation VCFs with the variant-annotation command line. Thus, running the following command line for each of the VCFs obtained in the previous step will generate the variation files (below showing one run for chromosome 1):
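The exact annotation command is not reproduced in this page; the line below is a rough, unverified sketch only, where every flag and path is an assumption based on the other subcommands in this tutorial (check ./cellbase.sh variant-annotation --help for the real options):
cellbase/build/bin$ ./cellbase.sh variant-annotation -i /tmp/data/cellbase/v4/homo_sapiens_grch37/variation/chr1.vcf.gz -o /tmp/data/cellbase/v4/homo_sapiens_grch37/mongodb/variation_chr1.json.gz -s hsapiens -a GRCh37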
After completion of the build process, your output directory should look like:
If the build was successful, you can proceed to loading the data models into the database.
Load Data Models
Please note that, before loading the data models into the database, the CellBase configuration.json must be appropriately configured, indicating the database host names, ports, user and password.
CellBase code is open-source and freely available at https://github.com/opencb/cellbase
Use the CellBase CLI to load the data models:
For example, to load all human (GRCh37) data models from the /tmp/data/cellbase/v4/homo_sapiens_grch37/mongodb/ directory created in the "Build Data Models" section into the cellbase_hsapiens_grch37_v4 database, creating the indexes indicated by the -Dmongodb-index-folder parameter, run:
cellbase/build/bin$ ./cellbase.sh load -d variation --database cellbase_hsapiens_grch37_v4 -i /mnt/data/downloads/cellbase/v4/homo_sapiens_grch37/mongodb/ -L debug -Dmongodb-index-folder=/home/cafetero/appl/dev/cellbase/cellbase-app/app/mongodb-scripts/
Please note that the whole loading and indexing process may need ~24h to complete, depending on the available hardware.
After a successful load of all data, the corresponding database should look like:
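To double-check the result, you can for instance list the collections with the MongoDB shell (assuming MongoDB is reachable locally on the default port; the collection names depend on the datasets you loaded):
mongo cellbase_hsapiens_grch37_v4 --eval "db.getCollectionNames()"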