
This tutorial first guides you through downloading a set of raw files from several data sources; these raw files contain the core data that populates the CellBase knowledgebase. It then shows how to build the JSON documents that are loaded into the CellBase knowledgebase. Downloading the raw files from the original sources and subsequently building the data models can be tricky, so we have already processed all of these data: we encourage users to skip those two steps and use our pre-built JSON documents (data models) instead, which are available from our FTP server:

http://bioinfo.hpc.cam.ac.uk/downloads/cellbase/v4/homo_sapiens_grch37/mongodb/

http://bioinfo.hpc.cam.ac.uk/downloads/cellbase/v4/homo_sapiens_grch38/mongodb/

Users of the pre-built data models can jump directly to the Load Data Models section of this tutorial.

For those users who wish to build the CellBase knowledgebase from scratch, please follow the sections below.

Allele population frequency datasets are processed through a different pipeline; dedicated sections for them can be found below.

Download Sources

Downloading is done through the CellBase CLI:

cellbase/build/bin$ ./cellbase.sh download

The following option is required: -d, --data 

Usage:   cellbase.sh download [options]

Options:
      -a, --assembly       STRING     Name of the assembly, if empty the first assembly in configuration.json will be used 
      --common             STRING     Directory where common multi-species data will be downloaded, this is mainly protein and expression 
                                      data [<OUTPUT>/common] 
      -C, --config         STRING     CellBase configuration.json file. Have a look at 
                                      cellbase/cellbase-core/src/main/resources/configuration.json for an example 
    * -d, --data           STRING     Comma separated list of data to download: genome, gene, variation, variation_functional_score, 
                                      regulation, protein, conservation, clinical_variants, repeats, svs and 'all' to download everything 
      -h, --help                      Display this help and exit [false]
      -L, --log-level      STRING     Set the logging level, accepted values are: debug, info, warn, error and fatal [info]
      -o, --output         STRING     The output directory, species folder will be created [/tmp]
      -s, --species        STRING     Name of the species to be downloaded, valid format include 'Homo sapiens' or 'hsapiens' [Homo 
                                      sapiens]
      -v, --verbose        BOOLEAN    [Deprecated] Set the level of the logging [false]

A number of datasets can be downloaded, as indicated by the built-in documentation: genome, gene, variation, variation_functional_score, regulation, protein, conservation, clinical_variants, repeats and svs. The --data parameter also accepts all, which downloads everything with a single command. Some datasets (genome and gene) require the ENSEMBL Perl API to be properly installed in order to be fully downloaded. Please note: all data can be downloaded, built and loaded into the database without the ENSEMBL API, but some bits may be missing, e.g. gene xrefs.

For example, to download all human (GRCh37) data from all sources and save it into the /tmp/data/cellbase/v4/ directory, run:

cellbase/build/bin$ ./cellbase.sh download -a GRCh37 --common /tmp/data/cellbase/v4/common/ -d all -o /tmp/data/cellbase/v4/ -s hsapiens

Please note: ensure you are located in the cellbase/build/bin directory before running the download command; otherwise, some Perl scripts that use the ENSEMBL API may not run properly. Also note that the COSMIC server requires a login, so the CosmicMutantExport.txt.tar.gz file must be downloaded manually from their web page:

https://cancer.sanger.ac.uk/cosmic/download

Please also note that large files are downloaded, so completion may take anywhere from minutes to hours. If the download was successful, you can proceed to building the JSON documents to be loaded into the corresponding database.

Downloading population frequencies datasets

Population frequency datasets must be downloaded manually from the source repositories. In particular, the ENSEMBL VCFs containing the ENSEMBL variation data must be fetched directly:

ftp://ftp.ensembl.org/pub/release-90/variation/vcf/homo_sapiens/Homo_sapiens.vcf.gz

ftp://ftp.ensembl.org/pub/release-90/variation/vcf/homo_sapiens/Homo_sapiens_somatic.vcf.gz
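The two VCFs above can be fetched with a short shell sketch. The destination directory below is an assumption matching the example layout used elsewhere in this tutorial, and the loop prints each wget command for review rather than running it (remove the echo to perform the actual downloads):

```shell
# Destination matching the tutorial's example layout (an assumption)
DEST=/tmp/data/cellbase/v4/homo_sapiens_grch37/variation
BASE=ftp://ftp.ensembl.org/pub/release-90/variation/vcf/homo_sapiens
mkdir -p "$DEST"
for f in Homo_sapiens.vcf.gz Homo_sapiens_somatic.vcf.gz; do
    # 'echo' prints the command for review; remove it to actually download
    echo wget -P "$DEST" "$BASE/$f"
done
```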

Build Data Models

Building is carried out through the CellBase CLI:

cellbase/build/bin$ ./cellbase.sh build
The following options are required: -d, --data -i, --input 

Usage:   cellbase.sh build [options]

Options:
      -a, --assembly         STRING     Name of the assembly, if empty the first assembly in configuration.json will be used 
      --common               STRING     Directory where common multi-species data will be downloaded, this is mainly protein and expression 
                                        data [<OUTPUT>/common] 
      -C, --config           STRING     CellBase configuration.json file. Have a look at 
                                        cellbase/cellbase-core/src/main/resources/configuration.json for an example 
    * -d, --data             STRING     Comma separated list of data to build: genome, genome_info, gene, variation, 
                                        variation_functional_score, regulation, protein, ppi, conservation, drug, clinical_variants, 
                                        repeats, svs. 'all' builds everything. 
      --flexible-gtf-parsing            By default, ENSEMBL GTF format is expected.  Nevertheless, GTF specification is quite loose and 
                                        other GTFs may be provided in which the order of the features is not as systematic as within the 
                                        ENSEMBL's GTFs. Use this option to enable a more flexible parsing of the GTF if it does not strictly 
                                        follow ENSEMBL's GTFs format. Flexible GTF requires more memory and is less efficient. [false]
      -h, --help                        Display this help and exit [false]
    * -i, --input            STRING     Input directory with the downloaded data sources to be loaded 
      -L, --log-level        STRING     Set the logging level, accepted values are: debug, info, warn, error and fatal [info]
      -o, --output           STRING     Output directory where the JSON data models are saved [/tmp]
      -s, --species          STRING     Name of the species to be built, valid format include 'Homo sapiens' or 'hsapiens' [Homo sapiens]
      -v, --verbose          BOOLEAN    [Deprecated] Set the level of the logging [false]


The build process integrates data from the different sources into the corresponding data models. For example, to build all human (GRCh37) data models, reading the files from the /tmp/data/cellbase/v4/homo_sapiens_grch37/ directory created in the Download Sources section and saving the result to /tmp/data/cellbase/v4/homo_sapiens_grch37/mongodb/, run:

cellbase/build/bin$ mkdir /tmp/data/cellbase/v4/homo_sapiens_grch37/mongodb 
cellbase/build/bin$ ./cellbase.sh build -a GRCh37 --common /tmp/data/cellbase/v4/common/ -d all -i /tmp/data/cellbase/v4/homo_sapiens_grch37/ -o /tmp/data/cellbase/v4/homo_sapiens_grch37/mongodb/ -s hsapiens

Note: building the whole CellBase dataset may require up to 16 GB of RAM and may take up to ~24 h, depending on the hardware.

Building variation data models

First, the allele population frequency datasets must be processed. To do this, the VCF files downloaded in the "Downloading population frequencies datasets" section must be loaded into an OpenCGA installation. Please refer to Index Pipelines and Getting Started in 5 minutes for further details on how to do this with an OpenCGA installation.

Once the OpenCGA installation is fully loaded, allele population frequencies must be parsed and/or calculated for each study. Please refer to Operations for further details on how to perform this operation with an OpenCGA installation.

Finally, population frequencies must be exported in a CellBase-compliant format. Please refer to <exportVariants> for further details on how to perform this operation with an OpenCGA installation.

Once the population frequency JSON files have been exported, everything is ready for building the actual models to be loaded into the database.

Please note that the variation collection is currently not built with the build command of the CLI, but by annotating the ENSEMBL variation VCFs with the variant-annotation command line. Running the following command for each VCF downloaded in the previous step will generate the variation files:

cellbase/build/bin$ ./cellbase.sh variant-annotation -a GRCh37 -s hsapiens --exclude variation,populationFrequencies,expression,geneDisease,drugInteraction -i /tmp/data/cellbase/v4/homo_sapiens_grch37/variation/Homo_sapiens.1.vcf.gz -o /tmp/data/cellbase/v4/homo_sapiens_grch37/mongodb/variation_chr1.json.gz -Dpopulation-frequencies=/tmp/data/frequencies/chr1.freq.cellbase.json.gz
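The single-chromosome command above can be repeated for every VCF with a small loop. The sketch below assumes the ENSEMBL VCFs have been split per chromosome following the naming of the example (Homo_sapiens.<chr>.vcf.gz) and prints each command for review rather than running it:

```shell
# Human chromosome list (an assumption based on the output listing below)
CHRS="$(seq 1 22) X Y MT"
for chr in $CHRS; do
    # 'echo' prints the command for review; remove it to run the annotation
    echo ./cellbase.sh variant-annotation -a GRCh37 -s hsapiens \
        --exclude variation,populationFrequencies,expression,geneDisease,drugInteraction \
        -i /tmp/data/cellbase/v4/homo_sapiens_grch37/variation/Homo_sapiens.$chr.vcf.gz \
        -o /tmp/data/cellbase/v4/homo_sapiens_grch37/mongodb/variation_chr$chr.json.gz \
        -Dpopulation-frequencies=/tmp/data/frequencies/chr$chr.freq.cellbase.json.gz
done
```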

After the build process completes, your output directory should look like this:


cellbase/build/bin$ ls /tmp/data/cellbase/v4/homo_sapiens_grch37/mongodb/
clinical_variants.full.json.gz
clinvarVersion.json
conservation_10.json.gz
conservation_11.json.gz
conservation_12.json.gz
conservation_13.json.gz
conservation_14.json.gz
conservation_15.json.gz
conservation_16.json.gz
conservation_17.json.gz
conservation_18.json.gz
conservation_19.json.gz
conservation_1.json.gz
conservation_20.json.gz
conservation_21.json.gz
conservation_22.json.gz
conservation_2.json.gz
conservation_3.json.gz
conservation_4.json.gz
conservation_5.json.gz
conservation_6.json.gz
conservation_7.json.gz
conservation_8.json.gz
conservation_9.json.gz
conservation_M.json.gz
conservation_X.json.gz
conservation_Y.json.gz
cosmic.json.gz
cosmicVersion.json
dgidbVersion.json
dgvVersion.json
disgenetVersion.json
ensemblCoreVersion.json
ensemblRegulationVersion.json
ensemblVariationVersion.json
geneExpressionAtlasVersion.json
gene.json.gz
genome_info.json
genome_info.log
genome_sequence.json.gz
genomeVersion.json
genomicSuperDups.json
gnomadVersion.json
hpoVersion.json
interproVersion.json
mirbaseVersion.json
phastConsVersion.json
phyloPVersion.json
protein.json.gz
prot_func_pred_chr_10.json
prot_func_pred_chr_11.json
prot_func_pred_chr_12.json
prot_func_pred_chr_13.json
prot_func_pred_chr_14.json
prot_func_pred_chr_15.json
prot_func_pred_chr_16.json
prot_func_pred_chr_17.json
prot_func_pred_chr_18.json
prot_func_pred_chr_19.json
prot_func_pred_chr_1.json
prot_func_pred_chr_20.json
prot_func_pred_chr_21.json
prot_func_pred_chr_22.json
prot_func_pred_chr_2.json
prot_func_pred_chr_3.json
prot_func_pred_chr_4.json
prot_func_pred_chr_5.json
prot_func_pred_chr_6.json
prot_func_pred_chr_7.json
prot_func_pred_chr_8.json
prot_func_pred_chr_9.json
prot_func_pred_chr_MT.json
prot_func_pred_chr_X.json
prot_func_pred_chr_Y.json
regulatory_region.json.gz
repeats.json.gz
simpleRepeat.json
structuralVariants.json.gz
toload
uniprotVersion.json
uniprotXrefVersion.json
variation_chr10.json.gz
variation_chr10.somatic.json.gz
variation_chr11.json.gz
variation_chr11.somatic.json.gz
variation_chr12.json.gz
variation_chr12.somatic.json.gz
variation_chr13.json.gz
variation_chr13.somatic.json.gz
variation_chr14.json.gz
variation_chr14.somatic.json.gz
variation_chr15.json.gz
variation_chr15.somatic.json.gz
variation_chr16.json.gz
variation_chr16.somatic.json.gz
variation_chr17.json.gz
variation_chr17.somatic.json.gz
variation_chr18.json.gz
variation_chr18.somatic.json.gz
variation_chr19.json.gz
variation_chr19.somatic.json.gz
variation_chr1.json.gz
variation_chr1.somatic.json.gz
variation_chr20.json.gz
variation_chr20.somatic.json.gz
variation_chr21.json.gz
variation_chr21.somatic.json.gz
variation_chr22.json.gz
variation_chr22.somatic.json.gz
variation_chr2.json.gz
variation_chr2.somatic.json.gz
variation_chr3.json.gz
variation_chr3.somatic.json.gz
variation_chr4.json.gz
variation_chr4.somatic.json.gz
variation_chr5.json.gz
variation_chr5.somatic.json.gz
variation_chr6.json.gz
variation_chr6.somatic.json.gz
variation_chr7.json.gz
variation_chr7.somatic.json.gz
variation_chr8.json.gz
variation_chr8.somatic.json.gz
variation_chr9.json.gz
variation_chr9.somatic.json.gz
variation_chrMT.json.gz
variation_chrX.json.gz
variation_chrX.somatic.json.gz
variation_chrY.json.gz
windowMasker.json

If the build was successful, you can proceed to loading the data models into the database, as described in the Load Data Models section below.

Load Data Models

Getting data models

There are two ways of getting the data models that will populate the CellBase database:

  1. For those users who wish to build the CellBase knowledgebase from scratch, please follow the tutorial from the Download Sources section.
  2. Download the pre-built data models from http://bioinfo.hpc.cam.ac.uk/downloads/cellbase/v4/homo_sapiens_grch37/mongodb/.
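For option 2, the folder can be mirrored with wget; the recursive flags below are an assumption (adjust them to your preferred local layout), and the command is printed for review rather than executed:

```shell
# Pre-built GRCh37 data models (swap grch37 for grch38 as needed)
URL=http://bioinfo.hpc.cam.ac.uk/downloads/cellbase/v4/homo_sapiens_grch37/mongodb/
# 'echo' prints the command for review; remove it to start the download
echo wget --recursive --no-parent --no-host-directories \
    --directory-prefix=/tmp/data/cellbase/v4 "$URL"
```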

Load data models

Please note that, before loading the data models into the database, the CellBase code must have been compiled with maven and injected with the database credentials, as explained in the README file.

CellBase code is open-source and freely available at https://github.com/opencb/cellbase
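A minimal sketch of that preparatory step, assuming a standard maven build (the exact flags and the credentials-injection details are described in the README); the command is printed for review rather than executed:

```shell
# Clone and compile CellBase (standard maven invocation; an assumption --
# check the README for the exact profile and credential injection)
REPO=https://github.com/opencb/cellbase
echo "git clone $REPO && cd cellbase && mvn clean install -DskipTests"
```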

Use the CellBase CLI to load the data models:

cellbase/build/bin$ ./cellbase.sh load
The following options are required: -i, --input -d, --data     --database 

Usage:   cellbase.sh load [options]

Options:
      -C, --config         STRING     CellBase configuration.json file. Have a look at 
                                      cellbase/cellbase-core/src/main/resources/configuration.json for an example 
    * -d, --data           STRING     Data model type to be loaded, i.e. genome, gene, ... 
    *     --database       STRING     Name of the database to load the data into, e.g. cellbase_hsapiens_grch37_v4 
          --field          STRING     Use this parameter when a custom update of the database documents is required. Indicate here the 
                                      full path to the document field that must be updated, e.g. annotation.populationFrequencies. This 
                                      parameter must be used together with a custom file provided at --input and the data to update 
                                      indicated at --data. 
      -h, --help                      Display this help and exit [false]
    * -i, --input          STRING     Input directory with the JSON data models to be loaded. Can also be used to specify a custom json 
                                      file to be loaded (look at the --field parameter). 
      -l, --loader         STRING     Database specific data loader to be used [org.opencb.cellbase.mongodb.loader.MongoDBCellBaseLoader]
      -L, --log-level      STRING     Set the logging level, accepted values are: debug, info, warn, error and fatal [info]
          --num-threads    INT        Number of threads used for loading data into the database [2]
      -v, --verbose        BOOLEAN    [Deprecated] Set the level of the logging [false]
      -D                              Dynamic parameters go here [{}]

For example, to load the human (GRCh37) variation data models from the /tmp/data/cellbase/v4/homo_sapiens_grch37/mongodb/ directory created in the Build Data Models section into the cellbase_hsapiens_grch37_v4 database, creating the indexes indicated by the .js scripts within cellbase/cellbase-app/app/mongodb-scripts/, run:

cellbase/build/bin$ ./cellbase.sh load -d variation --database cellbase_hsapiens_grch37_v4 -i /tmp/data/cellbase/v4/homo_sapiens_grch37/mongodb/ -L debug -Dmongodb-index-folder=cellbase/cellbase-app/app/mongodb-scripts/

Please note that the whole loading and indexing process may take ~24 h to complete, depending on the available hardware.

Warning notices

The variant annotation provided by default for the variation dataset, when building CellBase data from scratch, is the ENSEMBL variation annotation. The CellBase pre-annotated variation collection can only be obtained from the pre-built models provided at http://bioinfo.hpc.cam.ac.uk/downloads/cellbase/v4/homo_sapiens_grch37/mongodb/. Likewise, population frequencies for the 1000 Genomes Project, UK10K, GoNL, ExAC, etc. are not included by default when building the models from scratch. These data are obtained by the CellBase team through additional collaborations and can only be found in the pre-built variation data models at that same location.

After a successful load of all data, the corresponding database should look like this:

$ mongo mongodb-dev/cellbase_hsapiens_grch37_v4
MongoDB shell version: 3.0.9
connecting to: mongodb-dev/cellbase_hsapiens_grch37_v4
> show collections;
protein_protein_interaction
clinical
protein
conservation
gene
genome_info
variation_functional_score
genome_sequence
regulatory_region
protein_functional_prediction
variation
