
This tutorial first guides you through downloading a set of raw files from several data sources. These raw files contain the core data that populate the CellBase knowledgebase. The tutorial then shows how to build the JSON documents that are loaded into the CellBase knowledgebase. We have already processed all these data, and the resulting JSON documents are available through our FTP server for users who wish to skip the first two sections. Downloading raw files from the original sources and building the data models can be tricky, so we encourage users to use our pre-built data models (JSON files) instead. They are available from

http://bioinfo.hpc.cam.ac.uk/downloads/cellbase/v4/homosapiensgrch37/mongodb/

You could then directly jump to the Load data models section in this tutorial.
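
As a rough sketch of how the pre-built data models might be fetched, the snippet below prints a wget command that mirrors the JSON files into a local directory. The function name print_fetch_cmd and the default target directory are illustrative assumptions, and the command only works if the server exposes a browsable file listing, which has not been verified here.

```shell
# URL of the pre-built data models, as given in this tutorial.
MODELS_URL="http://bioinfo.hpc.cam.ac.uk/downloads/cellbase/v4/homosapiensgrch37/mongodb/"

# Print (do not run) a wget command that mirrors the JSON files.
# $1: local target directory (hypothetical default, adjust as needed).
print_fetch_cmd() {
    target_dir="${1:-/tmp/data/cellbase/v4/homo_sapiens_grch37/mongodb}"
    printf 'wget --recursive --no-parent --no-directories --accept "*.json,*.json.gz" --directory-prefix "%s" "%s"\n' \
        "$target_dir" "$MODELS_URL"
}

# Dry run: show the command so it can be reviewed before copy-pasting it.
print_fetch_cmd
```

Printing the command instead of executing it keeps the sketch safe to run; the files are large, so it is worth checking the target directory and free disk space first.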

Users who want to build the CellBase knowledgebase from scratch should follow the sections below.

Download Sources

The download can be run through the CellBase CLI:

cellbase/build/bin$ ./cellbase.sh download
The following option is required: -d, --data 

Usage:   cellbase.sh download [options]

Options:
      -a, --assembly       STRING     Name of the assembly, if empty the first assembly in configuration.json will be used [GRCh37]
          --common         STRING     Directory where common multi-species data will be downloaded, this is mainly protein and expression 
                                      data [<OUTPUT>/common] 
      -C, --config         STRING     CellBase configuration.json file. Have a look at 
                                      cellbase/cellbase-core/src/main/resources/configuration.json for an example 
    * -d, --data           STRING     Comma separated list of data to download: genome, gene, gene_disease_association, variation, 
                                      variation_functional_score, regulation, protein, conservation, clinical and . 'all' to download 
                                      everything 
      -h, --help                      Display this help and exit [false]
      -L, --log-level      STRING     Set the logging level, accepted values are: debug, info, warn, error and fatal [info]
      -o, --output         STRING     The output directory, species folder will be created [/tmp]
      -s, --species        STRING     Name of the species to be downloaded, valid format include 'Homo sapiens' or 'hsapiens' [Homo 
                                      sapiens]
      -v, --verbose        BOOLEAN    [Deprecated] Set the level of the logging [false]

As the built-in documentation indicates, the following datasets can be downloaded: genome, gene, gene_disease_association, variation, variation_functional_score, regulation, protein, conservation, clinical. Passing all to the --data parameter downloads every dataset with a single command. Two datasets (genome and gene) need the ENSEMBL Perl API to be properly installed in order to be fully downloaded. Please note: all data can be downloaded, built and loaded into the database without the ENSEMBL API, but some parts may be missing, e.g. gene xrefs.
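
The two checks described above can be sketched as small shell helpers: one validates a comma-separated --data value against the dataset names listed in the help text, the other tests whether the ENSEMBL Perl API can be loaded. The function names are hypothetical, and probing for the Bio::EnsEMBL::Registry module is an assumption about how the API is installed.

```shell
# Dataset names accepted by --data, copied from the help text.
VALID_DATASETS="genome gene gene_disease_association variation variation_functional_score regulation protein conservation clinical all"

# Return 0 if every name in the comma-separated list $1 is a known dataset;
# otherwise report the unknown names on stderr and return 1.
validate_datasets() {
    [ -n "$1" ] || return 1
    status=0
    for name in $(printf '%s' "$1" | tr ',' ' '); do
        case " $VALID_DATASETS " in
            *" $name "*) ;;
            *) echo "unknown dataset: $name" >&2; status=1 ;;
        esac
    done
    return "$status"
}

# Return 0 if Perl can load the ENSEMBL API (assumed module name).
has_ensembl_api() {
    perl -MBio::EnsEMBL::Registry -e 1 2>/dev/null
}
```

A typical use would be to call validate_datasets "genome,gene" before launching the download, and to print a warning when has_ensembl_api fails, since the genome and gene datasets may then be incomplete.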

For example, to download all human (GRCh37) data from all sources and save it into the /tmp/data/cellbase/v4/ directory, run:

cellbase/build/bin$ ./cellbase.sh download -a GRCh37 --common /tmp/data/cellbase/v4/common/ -d all -o /tmp/data/cellbase/v4/ -s hsapiens

Please note: make sure you are in the cellbase/build/bin directory before running the download command; otherwise, some Perl scripts that use the ENSEMBL API may not run properly.

Please also note that large files will be downloaded, so the time needed for completion may range from minutes to hours. The downloaded data should look like this:

homo_sapiens_grch37/
├── clinical
│   ├── All.vcf.gz
│   ├── All.vcf.gz.log
│   ├── All.vcf.gz.tbi
│   ├── All.vcf.gz.tbi.log
│   ├── ClinVar_Traits_EFO_Names.csv
│   ├── ClinVar_Traits_EFO_Names.csv.log
│   ├── ClinVar.xml.gz
│   ├── ClinVar.xml.gz.log
│   ├── gwas_catalog.tsv
│   ├── gwas_catalog.tsv.log
│   ├── variant_summary.txt.gz
│   └── variant_summary.txt.gz.log
├── conservation
│   ├── gerp
│   │   ├── hg19.GERP_scores.tar.gz
│   │   └── hg19.GERP_scores.tar.gz.log
│   ├── phastCons
│   │   ├── chr10.phastCons46way.primates.wigFix.gz
│   │   ├── chr10.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr11.phastCons46way.primates.wigFix.gz
│   │   ├── chr11.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr12.phastCons46way.primates.wigFix.gz
│   │   ├── chr12.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr13.phastCons46way.primates.wigFix.gz
│   │   ├── chr13.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr14.phastCons46way.primates.wigFix.gz
│   │   ├── chr14.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr15.phastCons46way.primates.wigFix.gz
│   │   ├── chr15.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr16.phastCons46way.primates.wigFix.gz
│   │   ├── chr16.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr17.phastCons46way.primates.wigFix.gz
│   │   ├── chr17.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr18.phastCons46way.primates.wigFix.gz
│   │   ├── chr18.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr19.phastCons46way.primates.wigFix.gz
│   │   ├── chr19.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr1.phastCons46way.primates.wigFix.gz
│   │   ├── chr1.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr20.phastCons46way.primates.wigFix.gz
│   │   ├── chr20.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr21.phastCons46way.primates.wigFix.gz
│   │   ├── chr21.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr22.phastCons46way.primates.wigFix.gz
│   │   ├── chr22.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr2.phastCons46way.primates.wigFix.gz
│   │   ├── chr2.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr3.phastCons46way.primates.wigFix.gz
│   │   ├── chr3.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr4.phastCons46way.primates.wigFix.gz
│   │   ├── chr4.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr5.phastCons46way.primates.wigFix.gz
│   │   ├── chr5.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr6.phastCons46way.primates.wigFix.gz
│   │   ├── chr6.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr7.phastCons46way.primates.wigFix.gz
│   │   ├── chr7.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr8.phastCons46way.primates.wigFix.gz
│   │   ├── chr8.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr9.phastCons46way.primates.wigFix.gz
│   │   ├── chr9.phastCons46way.primates.wigFix.gz.log
│   │   ├── chrM.phastCons46way.primates.wigFix.gz
│   │   ├── chrM.phastCons46way.primates.wigFix.gz.log
│   │   ├── chrX.phastCons46way.primates.wigFix.gz
│   │   ├── chrX.phastCons46way.primates.wigFix.gz.log
│   │   ├── chrY.phastCons46way.primates.wigFix.gz
│   │   └── chrY.phastCons46way.primates.wigFix.gz.log
│   └── phylop
│       ├── chr10.phyloP46way.primate.wigFix.gz
│       ├── chr10.phyloP46way.primate.wigFix.gz.log
│       ├── chr11.phyloP46way.primate.wigFix.gz
│       ├── chr11.phyloP46way.primate.wigFix.gz.log
│       ├── chr12.phyloP46way.primate.wigFix.gz
│       ├── chr12.phyloP46way.primate.wigFix.gz.log
│       ├── chr13.phyloP46way.primate.wigFix.gz
│       ├── chr13.phyloP46way.primate.wigFix.gz.log
│       ├── chr14.phyloP46way.primate.wigFix.gz
│       ├── chr14.phyloP46way.primate.wigFix.gz.log
│       ├── chr15.phyloP46way.primate.wigFix.gz
│       ├── chr15.phyloP46way.primate.wigFix.gz.log
│       ├── chr16.phyloP46way.primate.wigFix.gz
│       ├── chr16.phyloP46way.primate.wigFix.gz.log
│       ├── chr17.phyloP46way.primate.wigFix.gz
│       ├── chr17.phyloP46way.primate.wigFix.gz.log
│       ├── chr18.phyloP46way.primate.wigFix.gz
│       ├── chr18.phyloP46way.primate.wigFix.gz.log
│       ├── chr19.phyloP46way.primate.wigFix.gz
│       ├── chr19.phyloP46way.primate.wigFix.gz.log
│       ├── chr1.phyloP46way.primate.wigFix.gz
│       ├── chr1.phyloP46way.primate.wigFix.gz.log
│       ├── chr20.phyloP46way.primate.wigFix.gz
│       ├── chr20.phyloP46way.primate.wigFix.gz.log
│       ├── chr21.phyloP46way.primate.wigFix.gz
│       ├── chr21.phyloP46way.primate.wigFix.gz.log
│       ├── chr22.phyloP46way.primate.wigFix.gz
│       ├── chr22.phyloP46way.primate.wigFix.gz.log
│       ├── chr2.phyloP46way.primate.wigFix.gz
│       ├── chr2.phyloP46way.primate.wigFix.gz.log
│       ├── chr3.phyloP46way.primate.wigFix.gz
│       ├── chr3.phyloP46way.primate.wigFix.gz.log
│       ├── chr4.phyloP46way.primate.wigFix.gz
│       ├── chr4.phyloP46way.primate.wigFix.gz.log
│       ├── chr5.phyloP46way.primate.wigFix.gz
│       ├── chr5.phyloP46way.primate.wigFix.gz.log
│       ├── chr6.phyloP46way.primate.wigFix.gz
│       ├── chr6.phyloP46way.primate.wigFix.gz.log
│       ├── chr7.phyloP46way.primate.wigFix.gz
│       ├── chr7.phyloP46way.primate.wigFix.gz.log
│       ├── chr8.phyloP46way.primate.wigFix.gz
│       ├── chr8.phyloP46way.primate.wigFix.gz.log
│       ├── chr9.phyloP46way.primate.wigFix.gz
│       ├── chr9.phyloP46way.primate.wigFix.gz.log
│       ├── chrM.phyloP46way.primate.wigFix.gz
│       ├── chrM.phyloP46way.primate.wigFix.gz.log
│       ├── chrX.phyloP46way.primate.wigFix.gz
│       ├── chrX.phyloP46way.primate.wigFix.gz.log
│       ├── chrY.phyloP46way.primate.wigFix.gz
│       └── chrY.phyloP46way.primate.wigFix.gz.log
├── gene
│   ├── all_gene_disease_associations.txt.gz
│   ├── all_gene_disease_associations.txt.gz.log
│   ├── ALL_SOURCES_ALL_FREQUENCIES_diseases_to_genes_to_phenotypes.txt
│   ├── ALL_SOURCES_ALL_FREQUENCIES_diseases_to_genes_to_phenotypes.txt.log
│   ├── description.txt
│   ├── geneDrug
│   │   ├── dgidb.tsv
│   │   └── dgidb.tsv.log
│   ├── gene_extra_info.log
│   ├── homo_sapiens.cdna.all.fa.gz
│   ├── homo_sapiens.cdna.all.fa.gz.log
│   ├── homo_sapiens.gtf.gz
│   ├── homo_sapiens.gtf.gz.log
│   ├── homo_sapiens.pep.all.fa.gz
│   ├── homo_sapiens.pep.all.fa.gz.log
│   ├── idmapping_selected.tab.gz
│   ├── idmapping_selected.tab.gz.log
│   ├── MotifFeatures.gff.gz
│   ├── MotifFeatures.gff.gz.log
│   └── xrefs.txt
├── gene_disease_association
│   ├── all_gene_disease_associations.txt.gz
│   ├── all_gene_disease_associations.txt.gz.log
│   ├── ALL_SOURCES_ALL_FREQUENCIES_diseases_to_genes_to_phenotypes.txt
│   └── ALL_SOURCES_ALL_FREQUENCIES_diseases_to_genes_to_phenotypes.txt.log
├── genome
│   ├── Homo_sapiens.GRCh37.fa.gz
│   └── Homo_sapiens.GRCh37.fa.gz.log
├── regulation
│   ├── AnnotatedFeatures.gff.gz
│   ├── AnnotatedFeatures.gff.gz.log
│   ├── hsa_MTI.xls
│   ├── hsa_MTI.xls.log
│   ├── MotifFeatures.gff.gz
│   ├── MotifFeatures.gff.gz.log
│   ├── RegulatoryFeatures_MultiCell.gff.gz
│   ├── RegulatoryFeatures_MultiCell.gff.gz.log
│   ├── targetScanS.txt.gz
│   └── targetScanS.txt.gz.log
├── variation
│   ├── allele_code.txt.gz
│   ├── allele_code.txt.gz.log
│   ├── allele.txt.gz
│   ├── allele.txt.gz.log
│   ├── attrib.txt.gz
│   ├── attrib.txt.gz.log
│   ├── attrib_type.txt.gz
│   ├── attrib_type.txt.gz.log
│   ├── genotype_code.txt.gz
│   ├── genotype_code.txt.gz.log
│   ├── motif_feature_variation.txt.gz
│   ├── motif_feature_variation.txt.gz.log
│   ├── phenotype_feature_attrib.txt.gz
│   ├── phenotype_feature_attrib.txt.gz.log
│   ├── phenotype_feature.txt.gz
│   ├── phenotype_feature.txt.gz.log
│   ├── phenotype.txt.gz
│   ├── phenotype.txt.gz.log
│   ├── population_genotype.txt.gz
│   ├── population_genotype.txt.gz.log
│   ├── population.txt.gz
│   ├── population.txt.gz.log
│   ├── seq_region.txt.gz
│   ├── seq_region.txt.gz.log
│   ├── source.txt.gz
│   ├── source.txt.gz.log
│   ├── structural_variation_feature.txt.gz
│   ├── structural_variation_feature.txt.gz.log
│   ├── study.txt.gz
│   ├── study.txt.gz.log
│   ├── transcript_variation.txt
│   ├── transcript_variation.txt.gz.log
│   ├── transcript_variation.txt.tmp
│   ├── variation_feature.txt
│   ├── variation_feature.txt.gz.log
│   ├── variation.sorted.txt
│   ├── variation_synonym.txt
│   ├── variation_synonym.txt.gz.log
│   ├── variation.txt
│   └── variation.txt.gz.log
└── variation_functional_score
    ├── whole_genome_SNVs.tsv.gz
    └── whole_genome_SNVs.tsv.gz.log
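
Before moving on, it can help to confirm that no downloaded file is empty, which usually points to an interrupted transfer. The helper below is a minimal sketch with a hypothetical name; it ignores the .log files that accompany each download and lists every other file of size zero.

```shell
# List data files under directory $1 that are empty.
# Companion .log files are skipped, since an empty log is not an error.
list_empty_downloads() {
    find "$1" -type f ! -name '*.log' -size 0 | sort
}
```

For the example above, list_empty_downloads /tmp/data/cellbase/v4/homo_sapiens_grch37 should print nothing when every file was fetched completely.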

If the download was successful, you can proceed to build the JSON objects that will be loaded into the corresponding database: [[Build & Load Data]].

Build Data Models

The build process is run through the CellBase CLI:

cellbase/build/bin$ ./cellbase.sh build
The following options are required: -d, --data -i, --input 

Usage:   cellbase.sh build [options]

Options:
      -a, --assembly       STRING     Name of the assembly, if empty the first assembly in configuration.json will be used 
          --common         STRING     Directory where common multi-species data will be downloaded, this is mainly protein and expression 
                                      data [<OUTPUT>/common] 
      -C, --config         STRING     CellBase configuration.json file. Have a look at 
                                      cellbase/cellbase-core/src/main/resources/configuration.json for an example 
    * -d, --data           STRING     Comma separated list of data to build: genome, gene, disgenet, hpo, variation, cadd, regulation, 
                                      protein, conservation, drug, clinvar, cosmic and GWAS CAatalog. 'all' build everything. 
      -h, --help                      Display this help and exit [false]
    * -i, --input          STRING     Input directory with the downloaded data sources to be loaded 
      -L, --log-level      STRING     Set the logging level, accepted values are: debug, info, warn, error and fatal [info]
      -o, --output         STRING     Output directory where the JSON data models are saved [/tmp]
      -s, --species        STRING     Name of the species to be built, valid format include 'Homo sapiens' or 'hsapiens' [Homo sapiens]
      -v, --verbose        BOOLEAN    [Deprecated] Set the level of the logging [false]

The build process integrates data from the different sources into the corresponding data models. For example, to build all human (GRCh37) data models from the files in the /tmp/data/cellbase/v4/homo_sapiens_grch37/ directory created in section [[Download Sources]] and save the result in /tmp/data/cellbase/v4/homo_sapiens_grch37/mongodb/, run:

cellbase/build/bin$ mkdir /tmp/data/cellbase/v4/homo_sapiens_grch37/mongodb
cellbase/build/bin$ ./cellbase.sh build -a GRCh37 --common /tmp/data/cellbase/v4/common/ -d all -i /tmp/data/cellbase/v4/homo_sapiens_grch37/ -o /tmp/data/cellbase/v4/homo_sapiens_grch37/mongodb/ -s hsapiens

Note: building the whole CellBase dataset may require up to 16GB of RAM and may take up to ~24h, depending on the hardware.
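
Because a full build is long, a quick pre-flight check of the input directory can save time. The sketch below reports which of the per-dataset subdirectories seen in the download listing are absent; the function name is hypothetical, and the list of expected directories is an assumption taken from that example listing rather than from the CLI itself.

```shell
# Subdirectories present in the example download listing of this tutorial.
EXPECTED_DIRS="clinical conservation gene gene_disease_association genome regulation variation variation_functional_score"

# Print the expected subdirectories that do not exist under $1.
missing_download_dirs() {
    for d in $EXPECTED_DIRS; do
        [ -d "$1/$d" ] || echo "$d"
    done
}
```

An empty result from missing_download_dirs /tmp/data/cellbase/v4/homo_sapiens_grch37 suggests that the input passed to -i has the layout a build of all data expects.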

After completion of the build process, your output directory shall look like:

cellbase/build/bin$ ls /tmp/data/cellbase/v4/homo_sapiens_grch37/mongodb/
cadd.json.gz
clinvar.json.gz
conservation_10.json.gz
conservation_11.json.gz
conservation_12.json.gz
conservation_13.json.gz
conservation_14.json.gz
conservation_15.json.gz
conservation_16.json.gz
conservation_17.json.gz
conservation_18.json.gz
conservation_19.json.gz
conservation_1.json.gz
conservation_20.json.gz
conservation_21.json.gz
conservation_22.json.gz
conservation_2.json.gz
conservation_3.json.gz
conservation_4.json.gz
conservation_5.json.gz
conservation_6.json.gz
conservation_7.json.gz
conservation_8.json.gz
conservation_9.json.gz
conservation_M.json.gz
conservation_X.json.gz
conservation_Y.json.gz
cosmic.json.gz
gene.json.gz
genome_info.json
genome_sequence.json.gz
protein.json.gz
protein_protein_interaction.json.gz
prot_func_pred_chr_10.json.gz
prot_func_pred_chr_11.json.gz
prot_func_pred_chr_12.json.gz
prot_func_pred_chr_13.json.gz
prot_func_pred_chr_14.json.gz
prot_func_pred_chr_15.json.gz
prot_func_pred_chr_16.json.gz
prot_func_pred_chr_17.json.gz
prot_func_pred_chr_18.json.gz
prot_func_pred_chr_19.json.gz
prot_func_pred_chr_1.json.gz
prot_func_pred_chr_20.json.gz
prot_func_pred_chr_21.json.gz
prot_func_pred_chr_22.json.gz
prot_func_pred_chr_2.json.gz
prot_func_pred_chr_3.json.gz
prot_func_pred_chr_4.json.gz
prot_func_pred_chr_5.json.gz
prot_func_pred_chr_6.json.gz
prot_func_pred_chr_7.json.gz
prot_func_pred_chr_8.json.gz
prot_func_pred_chr_9.json.gz
prot_func_pred_chr_MT.json.gz
prot_func_pred_chr_X.json.gz
prot_func_pred_chr_Y.json.gz
regulatory_region.json.gz
variation_chr10.json.gz
variation_chr11.json.gz
variation_chr12.json.gz
variation_chr13.json.gz
variation_chr14.json.gz
variation_chr15.json.gz
variation_chr16.json.gz
variation_chr17.json.gz
variation_chr18.json.gz
variation_chr19.json.gz
variation_chr1.json.gz
variation_chr20.json.gz
variation_chr21.json.gz
variation_chr22.json.gz
variation_chr2.json.gz
variation_chr3.json.gz
variation_chr4.json.gz
variation_chr5.json.gz
variation_chr6.json.gz
variation_chr7.json.gz
variation_chr8.json.gz
variation_chr9.json.gz
variation_chrMT.json.gz
variation_chrX.json.gz
variation_chrY.json.gz
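
A build that is interrupted can leave truncated archives behind. As a minimal sketch, the hypothetical helper below runs gzip -t over every .json.gz file in the output directory and prints the ones that fail the integrity test.

```shell
# Print every *.json.gz file in directory $1 that fails "gzip -t".
list_corrupt_archives() {
    for f in "$1"/*.json.gz; do
        [ -e "$f" ] || continue   # no match: the glob pattern stays literal
        gzip -t "$f" 2>/dev/null || echo "$f"
    done
}
```

Running list_corrupt_archives /tmp/data/cellbase/v4/homo_sapiens_grch37/mongodb should print nothing if all archives are intact. Note that this only checks the compression layer, not the JSON content.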

If build was successful, you can proceed to loading the data models into the database: [[Load Data Models]].


Load Data Models

Getting data models

There are two ways of getting the data models that shall populate the CellBase database:

  1. Users who want to build the CellBase knowledgebase from scratch should follow the tutorial starting at [[Download Sources]].
  2. Download data models from http://bioinfo.hpc.cam.ac.uk/downloads/cellbase/v4/homosapiensgrch37/mongodb/.

Load data models

Please note that, before loading the data models into the database, the CellBase code must have been compiled with Maven and injected with the database credentials, as explained in the [[README.md | https://github.com/opencb/cellbase/blob/develop/README.md]] file.

Use the CellBase CLI to load the data models:

cellbase/build/bin$ ./cellbase.sh load
The following options are required: -i, --input -d, --data     --database 

Usage:   cellbase.sh load [options]

Options:
      -C, --config         STRING     CellBase configuration.json file. Have a look at 
                                      cellbase/cellbase-core/src/main/resources/configuration.json for an example 
    * -d, --data           STRING     Data model type to be loaded, i.e. genome, gene, ... 
    *     --database       STRING     Data model type to be loaded, i.e. genome, gene, ... 
          --field          STRING     Use this parameter when an custom update of the database documents is required. Indicate herethe 
                                      full path to the document field that must be updated, e.g. annotation.populationFrequencies. This 
                                      parameter must be used togetherwith a custom file provided at --input and the data to update 
                                      indicated at --data. 
      -h, --help                      Display this help and exit [false]
    * -i, --input          STRING     Input directory with the JSON data models to be loaded. Can also be used to specify acustom json 
                                      file to be loaded (look at the --field parameter). 
      -l, --loader         STRING     Database specific data loader to be used [org.opencb.cellbase.mongodb.loader.MongoDBCellBaseLoader]
      -L, --log-level      STRING     Set the logging level, accepted values are: debug, info, warn, error and fatal [info]
          --num-threads    INT        Number of threads used for loading data into the database [2]
      -v, --verbose        BOOLEAN    [Deprecated] Set the level of the logging [false]
      -D                              Dynamic parameters go here [{}]

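For example, to load the human (GRCh37) variation data models built in section [[Build Data Models]] into the cellbase_hsapiens_grch37_v4 database, creating the indexes as indicated in the .js scripts within cellbase/cellbase-app/app/mongodb-scripts/, run the command below. Adjust the -i and -Dmongodb-index-folder paths to your own installation (for instance, -i /tmp/data/cellbase/v4/homo_sapiens_grch37/mongodb/ if you followed the build example), and change the -d value to load the other data models: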

cellbase/build/bin$ ./cellbase.sh load -d variation --database cellbase_hsapiens_grch37_v4 -i /mnt/data/downloads/cellbase/v4/homo_sapiens_grch37/mongodb/ -L debug -Dmongodb-index-folder=/home/cafetero/appl/dev/cellbase/cellbase-app/app/mongodb-scripts/

Please note that the whole loading and indexing process may take ~24h to complete, depending on the available hardware.
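
Since the load command takes a data model type through -d, loading several types can be scripted. The sketch below only prints one load command per type so that the commands can be reviewed first. The function name is hypothetical, the index folder path is a placeholder, and the list of types (genome, gene, variation) is limited to names that appear in this tutorial; other accepted values are not assumed.

```shell
# Print one "cellbase.sh load" command per data model type (dry run).
# $1: database name, $2: input directory, $3: index scripts folder,
# remaining arguments: data model types to load.
print_load_cmds() {
    database="$1"; input_dir="$2"; index_dir="$3"
    shift 3
    for data in "$@"; do
        printf './cellbase.sh load -d %s --database %s -i %s -Dmongodb-index-folder=%s\n' \
            "$data" "$database" "$input_dir" "$index_dir"
    done
}

# Placeholder paths: adjust them to your own installation.
print_load_cmds cellbase_hsapiens_grch37_v4 \
    /tmp/data/cellbase/v4/homo_sapiens_grch37/mongodb/ \
    /path/to/cellbase/cellbase-app/app/mongodb-scripts/ \
    genome gene variation
```

The printed lines are meant to be run from cellbase/build/bin, one after another, once they have been checked.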

Warning notices

When building CellBase data from scratch, the variant annotation provided by default for the variation dataset is the ENSEMBL variation annotation. The CellBase pre-annotated variation collection can only be obtained from the pre-built models provided at http://bioinfo.hpc.cam.ac.uk/downloads/cellbase/v4/homosapiensgrch37/mongodb/. Likewise, population frequencies from the 1000 Genomes Project, the UK10K project, the GoNL project, ExAC, etc., are not included by default when building the models from scratch. These data are obtained by the CellBase team through additional collaborations and are only found in the pre-built variation data models provided at: http://bioinfo.hpc.cam.ac.uk/downloads/cellbase/v4/homosapiensgrch37/mongodb/

After all data have been loaded successfully, the corresponding database should look like:

$ mongo mongodb-dev/cellbase_hsapiens_grch37_v4
MongoDB shell version: 3.0.9
connecting to: mongodb-dev/cellbase_hsapiens_grch37_v4
> show collections;
protein_protein_interaction
clinical
protein
conservation
gene
genome_info
variation_functional_score
genome_sequence
regulatory_region
protein_functional_prediction
variation
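
The comparison against the collection list above can be automated. In this sketch, a hypothetical helper reads collection names from standard input, one per line, and prints the expected collections that are missing; the expected names are copied from the listing above.

```shell
# Collections shown in the example "show collections" output.
EXPECTED_COLLECTIONS="protein_protein_interaction clinical protein conservation gene genome_info variation_functional_score genome_sequence regulatory_region protein_functional_prediction variation"

# Read collection names from stdin and print the expected ones that are absent.
missing_collections() {
    actual="$(cat)"
    for c in $EXPECTED_COLLECTIONS; do
        # -x forces a whole-line match, so "protein" does not match
        # "protein_protein_interaction".
        printf '%s\n' "$actual" | grep -qx "$c" || echo "$c"
    done
}
```

With the legacy mongo shell used in this tutorial, the names could be piped in with mongo mongodb-dev/cellbase_hsapiens_grch37_v4 --quiet --eval 'db.getCollectionNames().forEach(function (c) { print(c); })' | missing_collections, where the host and database names are those of the example above.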
