This tutorial is not entirely up to date, and some details may not work with the current stable version. We are working to update all documentation.
This tutorial first guides you through downloading a set of raw files from several data sources. These raw files contain the core data that will populate the CellBase knowledgebase. The tutorial then shows how to build the JSON documents that are loaded into the CellBase knowledgebase.
Downloading raw files from the original sources and building the data models can be tricky. We have already processed all these data, and we encourage users to skip those two steps and use our pre-built data models (JSON files) instead. They are available through our FTP server from:
http://bioinfo.hpc.cam.ac.uk/downloads/cellbase/v4/homo_sapiens_grch37/mongodb/
http://bioinfo.hpc.cam.ac.uk/downloads/cellbase/v4/homo_sapiens_grch38/mongodb/
You can then jump directly to the Load Data Models section of this tutorial.
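As a rough illustration of fetching the pre-built models, the sketch below mirrors the GRCh37 folder with wget. The URL is the one given above; the destination directory, the `fetch_models` helper and the choice of wget options are assumptions, and the server layout has not been verified here. By default the function only prints the command.

```shell
#!/bin/sh
# Sketch: mirror the pre-built GRCh37 JSON data models to a local directory.
# BASE_URL comes from this tutorial; DEST and the wget options are assumptions.
BASE_URL="http://bioinfo.hpc.cam.ac.uk/downloads/cellbase/v4/homo_sapiens_grch37/mongodb/"
DEST="${DEST:-/tmp/data/cellbase/v4/homo_sapiens_grch37/mongodb}"

fetch_models() {
    # -r: recursive, -np: never ascend to the parent directory,
    # -nd: do not recreate the remote directory tree locally,
    # -R: skip the generated directory index pages, -P: target directory.
    set -- wget -r -np -nd -R "index.html*" -P "$DEST" "$BASE_URL"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"        # default: only print the command
    else
        "$@"             # DRY_RUN=0: actually download
    fi
}

# Usage: DRY_RUN=0 fetch_models
```

Keeping the dry run as the default makes it easy to review the command before starting a large transfer.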
For those users willing to build the CellBase knowledgebase from scratch, please follow the sections below. Allele population frequency datasets are processed following a different pipeline, and dedicated sections for them can be found below.
Download Sources
Download can be done through the Cellbase CLI:
cellbase/build/bin$ ./cellbase.sh download
The following option is required: -d, --data
Usage: cellbase.sh download [options]
Options:
-a, --assembly STRING Name of the assembly, if empty the first assembly in configuration.json will be used
--common STRING Directory where common multi-species data will be downloaded, this is mainly protein and expression
data [<OUTPUT>/common]
-C, --config STRING CellBase configuration.json file. Have a look at
cellbase/cellbase-core/src/main/resources/configuration.json for an example
* -d, --data STRING Comma separated list of data to download: genome, gene, variation, variation_functional_score,
regulation, protein, conservation, clinical_variants, repeats, svs and 'all' to download everything
-h, --help Display this help and exit [false]
-L, --log-level STRING Set the logging level, accepted values are: debug, info, warn, error and fatal [info]
-o, --output STRING The output directory, species folder will be created [/tmp]
-s, --species STRING Name of the species to be downloaded, valid format include 'Homo sapiens' or 'hsapiens' [Homo
sapiens]
-v, --verbose BOOLEAN [Deprecated] Set the level of the logging [false]
A number of datasets can be downloaded, as indicated by the built-in documentation: genome, gene, gene_disease_association, variation, variation_functional_score, regulation, protein, conservation, clinical_variants, repeats, svs. The --data parameter also accepts the value all, which downloads every dataset with a single command. Some datasets (genome and gene) need the ENSEMBL Perl API to be properly installed in order to be fully downloaded. Please note: all data can be downloaded, built and loaded into the database without the ENSEMBL API, but some pieces may be missing, e.g. gene xrefs.
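Before downloading genome or gene data, it may help to check whether the ENSEMBL Perl API is visible to the default perl interpreter. The sketch below assumes that being able to load the `Bio::EnsEMBL::Registry` module is a reasonable proxy for a working installation; the function names are hypothetical and not part of CellBase.

```shell
#!/bin/sh
# Sketch: report whether the ENSEMBL Perl API can be loaded by the default perl.
# Assumption: loading Bio::EnsEMBL::Registry indicates a usable installation.
check_ensembl_api() {
    command -v perl >/dev/null 2>&1 || return 1
    perl -MBio::EnsEMBL::Registry -e 1 >/dev/null 2>&1 || return 1
    return 0
}

ensembl_api_status() {
    if check_ensembl_api; then
        echo "ENSEMBL Perl API found"
    else
        echo "ENSEMBL Perl API not found: genome and gene downloads may be incomplete"
    fi
}

# Usage: ensembl_api_status
```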
For example, to download all human (GRCh37) data from all sources and save it into the /tmp/data/cellbase/v4/ directory, run:
cellbase/build/bin$ ./cellbase.sh download -a GRCh37 --common /tmp/data/cellbase/v4/common/ -d all -o /tmp/data/cellbase/v4/ -s hsapiens
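Since --data takes a comma-separated list, a subset of datasets can be requested instead of all. The sketch below checks a list against the names printed by the download help text and then prints the corresponding command without running it. The helper names are hypothetical, and the list of valid names is copied from the help output shown above, so it may differ in other CellBase versions.

```shell
#!/bin/sh
# Sketch: validate a comma-separated --data value against the names printed by
# "cellbase.sh download" before starting a long download.
VALID_DATA="genome gene variation variation_functional_score regulation protein conservation clinical_variants repeats svs all"

validate_data() {
    # $1: comma-separated list, e.g. "genome,gene"
    for item in $(echo "$1" | tr ',' ' '); do
        case " $VALID_DATA " in
            *" $item "*) ;;                       # known dataset name
            *) echo "unknown dataset: $item" >&2
               return 1 ;;
        esac
    done
    [ -n "$1" ]                                   # an empty list is rejected
}

download_cmd() {
    # Prints the command instead of running it; paths follow the tutorial's example.
    validate_data "$1" || return 1
    echo "./cellbase.sh download -a GRCh37 --common /tmp/data/cellbase/v4/common/ -d $1 -o /tmp/data/cellbase/v4/ -s hsapiens"
}

# Usage: download_cmd genome,gene
```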
Please note: ensure you are located within the cellbase/build/bin directory before running the download command. Some Perl scripts that use the ENSEMBL API may not run properly otherwise. Also note that the COSMIC server requires login, so the CosmicMutantExport.txt.tar.gz file must be downloaded manually from their web page:
https://cancer.sanger.ac.uk/cosmic/download
Please also note that large files will be downloaded, so the time needed for completion may range from minutes to several hours. Downloaded data should look like this:
homo_sapiens_grch37/
├── clinical
│ ├── All.vcf.gz
│ ├── All.vcf.gz.log
│ ├── All.vcf.gz.tbi
│ ├── All.vcf.gz.tbi.log
│ ├── ClinVar_Traits_EFO_Names.csv
│ ├── ClinVar_Traits_EFO_Names.csv.log
│ ├── ClinVar.xml.gz
│ ├── ClinVar.xml.gz.log
│ ├── gwas_catalog.tsv
│ ├── gwas_catalog.tsv.log
│ ├── variant_summary.txt.gz
│ └── variant_summary.txt.gz.log
├── conservation
│ ├── gerp
│ │ ├── hg19.GERP_scores.tar.gz
│ │ └── hg19.GERP_scores.tar.gz.log
│ ├── phastCons
│ │ ├── chr10.phastCons46way.primates.wigFix.gz
│ │ ├── chr10.phastCons46way.primates.wigFix.gz.log
│ │ ├── chr11.phastCons46way.primates.wigFix.gz
│ │ ├── chr11.phastCons46way.primates.wigFix.gz.log
│ │ ├── chr12.phastCons46way.primates.wigFix.gz
│ │ ├── chr12.phastCons46way.primates.wigFix.gz.log
│ │ ├── chr13.phastCons46way.primates.wigFix.gz
│ │ ├── chr13.phastCons46way.primates.wigFix.gz.log
│ │ ├── chr14.phastCons46way.primates.wigFix.gz
│ │ ├── chr14.phastCons46way.primates.wigFix.gz.log
│ │ ├── chr15.phastCons46way.primates.wigFix.gz
│ │ ├── chr15.phastCons46way.primates.wigFix.gz.log
│ │ ├── chr16.phastCons46way.primates.wigFix.gz
│ │ ├── chr16.phastCons46way.primates.wigFix.gz.log
│ │ ├── chr17.phastCons46way.primates.wigFix.gz
│ │ ├── chr17.phastCons46way.primates.wigFix.gz.log
│ │ ├── chr18.phastCons46way.primates.wigFix.gz
│ │ ├── chr18.phastCons46way.primates.wigFix.gz.log
│ │ ├── chr19.phastCons46way.primates.wigFix.gz
│ │ ├── chr19.phastCons46way.primates.wigFix.gz.log
│ │ ├── chr1.phastCons46way.primates.wigFix.gz
│ │ ├── chr1.phastCons46way.primates.wigFix.gz.log
│ │ ├── chr20.phastCons46way.primates.wigFix.gz
│ │ ├── chr20.phastCons46way.primates.wigFix.gz.log
│ │ ├── chr21.phastCons46way.primates.wigFix.gz
│ │ ├── chr21.phastCons46way.primates.wigFix.gz.log
│ │ ├── chr22.phastCons46way.primates.wigFix.gz
│ │ ├── chr22.phastCons46way.primates.wigFix.gz.log
│ │ ├── chr2.phastCons46way.primates.wigFix.gz
│ │ ├── chr2.phastCons46way.primates.wigFix.gz.log
│ │ ├── chr3.phastCons46way.primates.wigFix.gz
│ │ ├── chr3.phastCons46way.primates.wigFix.gz.log
│ │ ├── chr4.phastCons46way.primates.wigFix.gz
│ │ ├── chr4.phastCons46way.primates.wigFix.gz.log
│ │ ├── chr5.phastCons46way.primates.wigFix.gz
│ │ ├── chr5.phastCons46way.primates.wigFix.gz.log
│ │ ├── chr6.phastCons46way.primates.wigFix.gz
│ │ ├── chr6.phastCons46way.primates.wigFix.gz.log
│ │ ├── chr7.phastCons46way.primates.wigFix.gz
│ │ ├── chr7.phastCons46way.primates.wigFix.gz.log
│ │ ├── chr8.phastCons46way.primates.wigFix.gz
│ │ ├── chr8.phastCons46way.primates.wigFix.gz.log
│ │ ├── chr9.phastCons46way.primates.wigFix.gz
│ │ ├── chr9.phastCons46way.primates.wigFix.gz.log
│ │ ├── chrM.phastCons46way.primates.wigFix.gz
│ │ ├── chrM.phastCons46way.primates.wigFix.gz.log
│ │ ├── chrX.phastCons46way.primates.wigFix.gz
│ │ ├── chrX.phastCons46way.primates.wigFix.gz.log
│ │ ├── chrY.phastCons46way.primates.wigFix.gz
│ │ └── chrY.phastCons46way.primates.wigFix.gz.log
│ └── phylop
│   ├── chr10.phyloP46way.primate.wigFix.gz
│   ├── chr10.phyloP46way.primate.wigFix.gz.log
│   ├── chr11.phyloP46way.primate.wigFix.gz
│   ├── chr11.phyloP46way.primate.wigFix.gz.log
│   ├── chr12.phyloP46way.primate.wigFix.gz
│   ├── chr12.phyloP46way.primate.wigFix.gz.log
│   ├── chr13.phyloP46way.primate.wigFix.gz
│   ├── chr13.phyloP46way.primate.wigFix.gz.log
│   ├── chr14.phyloP46way.primate.wigFix.gz
│   ├── chr14.phyloP46way.primate.wigFix.gz.log
│   ├── chr15.phyloP46way.primate.wigFix.gz
│   ├── chr15.phyloP46way.primate.wigFix.gz.log
│   ├── chr16.phyloP46way.primate.wigFix.gz
│   ├── chr16.phyloP46way.primate.wigFix.gz.log
│   ├── chr17.phyloP46way.primate.wigFix.gz
│   ├── chr17.phyloP46way.primate.wigFix.gz.log
│   ├── chr18.phyloP46way.primate.wigFix.gz
│   ├── chr18.phyloP46way.primate.wigFix.gz.log
│   ├── chr19.phyloP46way.primate.wigFix.gz
│   ├── chr19.phyloP46way.primate.wigFix.gz.log
│   ├── chr1.phyloP46way.primate.wigFix.gz
│   ├── chr1.phyloP46way.primate.wigFix.gz.log
│   ├── chr20.phyloP46way.primate.wigFix.gz
│   ├── chr20.phyloP46way.primate.wigFix.gz.log
│   ├── chr21.phyloP46way.primate.wigFix.gz
│   ├── chr21.phyloP46way.primate.wigFix.gz.log
│   ├── chr22.phyloP46way.primate.wigFix.gz
│   ├── chr22.phyloP46way.primate.wigFix.gz.log
│   ├── chr2.phyloP46way.primate.wigFix.gz
│   ├── chr2.phyloP46way.primate.wigFix.gz.log
│   ├── chr3.phyloP46way.primate.wigFix.gz
│   ├── chr3.phyloP46way.primate.wigFix.gz.log
│   ├── chr4.phyloP46way.primate.wigFix.gz
│   ├── chr4.phyloP46way.primate.wigFix.gz.log
│   ├── chr5.phyloP46way.primate.wigFix.gz
│   ├── chr5.phyloP46way.primate.wigFix.gz.log
│   ├── chr6.phyloP46way.primate.wigFix.gz
│   ├── chr6.phyloP46way.primate.wigFix.gz.log
│   ├── chr7.phyloP46way.primate.wigFix.gz
│   ├── chr7.phyloP46way.primate.wigFix.gz.log
│   ├── chr8.phyloP46way.primate.wigFix.gz
│   ├── chr8.phyloP46way.primate.wigFix.gz.log
│   ├── chr9.phyloP46way.primate.wigFix.gz
│   ├── chr9.phyloP46way.primate.wigFix.gz.log
│   ├── chrM.phyloP46way.primate.wigFix.gz
│   ├── chrM.phyloP46way.primate.wigFix.gz.log
│   ├── chrX.phyloP46way.primate.wigFix.gz
│   ├── chrX.phyloP46way.primate.wigFix.gz.log
│   ├── chrY.phyloP46way.primate.wigFix.gz
│   └── chrY.phyloP46way.primate.wigFix.gz.log
├── gene
│ ├── all_gene_disease_associations.txt.gz
│ ├── all_gene_disease_associations.txt.gz.log
│ ├── ALL_SOURCES_ALL_FREQUENCIES_diseases_to_genes_to_phenotypes.txt
│ ├── ALL_SOURCES_ALL_FREQUENCIES_diseases_to_genes_to_phenotypes.txt.log
│ ├── description.txt
│ ├── geneDrug
│ │ ├── dgidb.tsv
│ │ └── dgidb.tsv.log
│ ├── gene_extra_info.log
│ ├── homo_sapiens.cdna.all.fa.gz
│ ├── homo_sapiens.cdna.all.fa.gz.log
│ ├── homo_sapiens.gtf.gz
│ ├── homo_sapiens.gtf.gz.log
│ ├── homo_sapiens.pep.all.fa.gz
│ ├── homo_sapiens.pep.all.fa.gz.log
│ ├── idmapping_selected.tab.gz
│ ├── idmapping_selected.tab.gz.log
│ ├── MotifFeatures.gff.gz
│ ├── MotifFeatures.gff.gz.log
│ └── xrefs.txt
├── gene_disease_association
│ ├── all_gene_disease_associations.txt.gz
│ ├── all_gene_disease_associations.txt.gz.log
│ ├── ALL_SOURCES_ALL_FREQUENCIES_diseases_to_genes_to_phenotypes.txt
│ └── ALL_SOURCES_ALL_FREQUENCIES_diseases_to_genes_to_phenotypes.txt.log
├── genome
│ ├── Homo_sapiens.GRCh37.fa.gz
│ └── Homo_sapiens.GRCh37.fa.gz.log
├── regulation
│ ├── AnnotatedFeatures.gff.gz
│ ├── AnnotatedFeatures.gff.gz.log
│ ├── hsa_MTI.xls
│ ├── hsa_MTI.xls.log
│ ├── MotifFeatures.gff.gz
│ ├── MotifFeatures.gff.gz.log
│ ├── RegulatoryFeatures_MultiCell.gff.gz
│ ├── RegulatoryFeatures_MultiCell.gff.gz.log
│ ├── targetScanS.txt.gz
│ └── targetScanS.txt.gz.log
├── variation
│ ├── allele_code.txt.gz
│ ├── allele_code.txt.gz.log
│ ├── allele.txt.gz
│ ├── allele.txt.gz.log
│ ├── attrib.txt.gz
│ ├── attrib.txt.gz.log
│ ├── attrib_type.txt.gz
│ ├── attrib_type.txt.gz.log
│ ├── genotype_code.txt.gz
│ ├── genotype_code.txt.gz.log
│ ├── motif_feature_variation.txt.gz
│ ├── motif_feature_variation.txt.gz.log
│ ├── phenotype_feature_attrib.txt.gz
│ ├── phenotype_feature_attrib.txt.gz.log
│ ├── phenotype_feature.txt.gz
│ ├── phenotype_feature.txt.gz.log
│ ├── phenotype.txt.gz
│ ├── phenotype.txt.gz.log
│ ├── population_genotype.txt.gz
│ ├── population_genotype.txt.gz.log
│ ├── population.txt.gz
│ ├── population.txt.gz.log
│ ├── seq_region.txt.gz
│ ├── seq_region.txt.gz.log
│ ├── source.txt.gz
│ ├── source.txt.gz.log
│ ├── structural_variation_feature.txt.gz
│ ├── structural_variation_feature.txt.gz.log
│ ├── study.txt.gz
│ ├── study.txt.gz.log
│ ├── transcript_variation.txt
│ ├── transcript_variation.txt.gz.log
│ ├── transcript_variation.txt.tmp
│ ├── variation_feature.txt
│ ├── variation_feature.txt.gz.log
│ ├── variation.sorted.txt
│ ├── variation_synonym.txt
│ ├── variation_synonym.txt.gz.log
│ ├── variation.txt
│ └── variation.txt.gz.log
└── variation_functional_score
  ├── whole_genome_SNVs.tsv.gz
  └── whole_genome_SNVs.tsv.gz.log
If download was successful, you can proceed to building the json objects that should be loaded into the corresponding database.
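A quick sanity check after downloading is to confirm that the top-level folders shown in the tree above exist. The sketch below lists any that are missing; the folder names are taken from that tree, and the `missing_download_dirs` helper is hypothetical. A partial download (for example `-d genome,gene`) will naturally report the folders that were not requested.

```shell
#!/bin/sh
# Sketch: list top-level folders from the example tree that are missing
# under a species directory.
EXPECTED_DIRS="clinical conservation gene gene_disease_association genome regulation variation variation_functional_score"

missing_download_dirs() {
    # $1: species directory, e.g. /tmp/data/cellbase/v4/homo_sapiens_grch37
    for d in $EXPECTED_DIRS; do
        [ -d "$1/$d" ] || echo "$d"
    done
}

# Usage: missing_download_dirs /tmp/data/cellbase/v4/homo_sapiens_grch37
```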
Build Data Models
The process may be carried out by using the Cellbase CLI:
cellbase/build/bin$ ./cellbase.sh build
The following options are required: -d, --data -i, --input
Usage: cellbase.sh build [options]
Options:
-a, --assembly STRING Name of the assembly, if empty the first assembly in configuration.json will be used
--common STRING Directory where common multi-species data will be downloaded, this is mainly protein and expression
data [<OUTPUT>/common]
-C, --config STRING CellBase configuration.json file. Have a look at
cellbase/cellbase-core/src/main/resources/configuration.json for an example
* -d, --data STRING Comma separated list of data to build: genome, gene, disgenet, hpo, variation, cadd, regulation,
protein, conservation, drug, clinvar, cosmic and GWAS CAatalog. 'all' build everything.
-h, --help Display this help and exit [false]
* -i, --input STRING Input directory with the downloaded data sources to be loaded
-L, --log-level STRING Set the logging level, accepted values are: debug, info, warn, error and fatal [info]
-o, --output STRING Output directory where the JSON data models are saved [/tmp]
-s, --species STRING Name of the species to be built, valid format include 'Homo sapiens' or 'hsapiens' [Homo sapiens]
-v, --verbose BOOLEAN [Deprecated] Set the level of the logging [false]
The build process will integrate data from the different sources into the corresponding data models. For example, to build all human (GRCh37) data models, reading the files from the /tmp/data/cellbase/v4/homo_sapiens_grch37/ directory created in section Download Sources and saving the result at /tmp/data/cellbase/v4/homo_sapiens_grch37/mongodb/, run:
cellbase/build/bin$ mkdir /tmp/data/cellbase/v4/homo_sapiens_grch37/mongodb
cellbase/build/bin$ ./cellbase.sh build -a GRCh37 --common /tmp/data/cellbase/v4/common/ -d all -i /tmp/data/cellbase/v4/homo_sapiens_grch37/ -o /tmp/data/cellbase/v4/homo_sapiens_grch37/mongodb/ -s hsapiens
Note: the build process for the whole CellBase dataset may require up to 16GB of RAM and may take up to ~24h, depending on the hardware.
After completion of the build process, your output directory should look like this:
cellbase/build/bin$ ls /tmp/data/cellbase/v4/homo_sapiens_grch37/mongodb/
cadd.json.gz
clinvar.json.gz
conservation_10.json.gz
conservation_11.json.gz
conservation_12.json.gz
conservation_13.json.gz
conservation_14.json.gz
conservation_15.json.gz
conservation_16.json.gz
conservation_17.json.gz
conservation_18.json.gz
conservation_19.json.gz
conservation_1.json.gz
conservation_20.json.gz
conservation_21.json.gz
conservation_22.json.gz
conservation_2.json.gz
conservation_3.json.gz
conservation_4.json.gz
conservation_5.json.gz
conservation_6.json.gz
conservation_7.json.gz
conservation_8.json.gz
conservation_9.json.gz
conservation_M.json.gz
conservation_X.json.gz
conservation_Y.json.gz
cosmic.json.gz
gene.json.gz
genome_info.json
genome_sequence.json.gz
protein.json.gz
protein_protein_interaction.json.gz
prot_func_pred_chr_10.json.gz
prot_func_pred_chr_11.json.gz
prot_func_pred_chr_12.json.gz
prot_func_pred_chr_13.json.gz
prot_func_pred_chr_14.json.gz
prot_func_pred_chr_15.json.gz
prot_func_pred_chr_16.json.gz
prot_func_pred_chr_17.json.gz
prot_func_pred_chr_18.json.gz
prot_func_pred_chr_19.json.gz
prot_func_pred_chr_1.json.gz
prot_func_pred_chr_20.json.gz
prot_func_pred_chr_21.json.gz
prot_func_pred_chr_22.json.gz
prot_func_pred_chr_2.json.gz
prot_func_pred_chr_3.json.gz
prot_func_pred_chr_4.json.gz
prot_func_pred_chr_5.json.gz
prot_func_pred_chr_6.json.gz
prot_func_pred_chr_7.json.gz
prot_func_pred_chr_8.json.gz
prot_func_pred_chr_9.json.gz
prot_func_pred_chr_MT.json.gz
prot_func_pred_chr_X.json.gz
prot_func_pred_chr_Y.json.gz
regulatory_region.json.gz
variation_chr10.json.gz
variation_chr11.json.gz
variation_chr12.json.gz
variation_chr13.json.gz
variation_chr14.json.gz
variation_chr15.json.gz
variation_chr16.json.gz
variation_chr17.json.gz
variation_chr18.json.gz
variation_chr19.json.gz
variation_chr1.json.gz
variation_chr20.json.gz
variation_chr21.json.gz
variation_chr22.json.gz
variation_chr2.json.gz
variation_chr3.json.gz
variation_chr4.json.gz
variation_chr5.json.gz
variation_chr6.json.gz
variation_chr7.json.gz
variation_chr8.json.gz
variation_chr9.json.gz
variation_chrMT.json.gz
variation_chrX.json.gz
variation_chrY.json.gz
Downloading population frequencies datasets
Population frequency datasets must be downloaded manually from their source repositories:
- GONL: http://www.nlgenome.nl/
- gnomAD: https://data.broadinstitute.org/gnomAD
- 1000 Genomes Project: http://www.internationalgenome.org/
- UK10K: http://www.uk10k.org/data.html
- ESP: http://evs.gs.washington.edu/EVS/
- DiscovEHR: http://discovehrshare.com/downloads
Build Data Models
The process may be carried out by using the Cellbase CLI:
cellbase/build/bin$ ./cellbase.sh build
The following options are required: -d, --data -i, --input
Usage: cellbase.sh build [options]
Options:
-a, --assembly STRING Name of the assembly, if empty the first assembly in configuration.json will be used
--common STRING Directory where common multi-species data will be downloaded, this is mainly protein and expression
data [<OUTPUT>/common]
-C, --config STRING CellBase configuration.json file. Have a look at
cellbase/cellbase-core/src/main/resources/configuration.json for an example
* -d, --data STRING Comma separated list of data to build: genome, genome_info, gene, variation,
variation_functional_score, regulation, protein, ppi, conservation, drug, clinical_variants,
repeats, svs. 'all' builds everything.
--flexible-gtf-parsing By default, ENSEMBL GTF format is expected. Nevertheless, GTF specification is quite loose and
other GTFs may be provided in which the order of the features is not as systematic as within the
ENSEMBL's GTFs. Use this option to enable a more flexible parsing of the GTF if it does not strictly
follow ENSEMBL's GTFs format. Flexible GTF requires more memory and is less efficient. [false]
-h, --help Display this help and exit [false]
* -i, --input STRING Input directory with the downloaded data sources to be loaded
-L, --log-level STRING Set the logging level, accepted values are: debug, info, warn, error and fatal [info]
-o, --output STRING Output directory where the JSON data models are saved [/tmp]
-s, --species STRING Name of the species to be built, valid format include 'Homo sapiens' or 'hsapiens' [Homo sapiens]
-v, --verbose BOOLEAN [Deprecated] Set the level of the logging [false]
The build process will integrate data from the different sources into the corresponding data models. For example, to build all human (GRCh37) data models, reading the files from the /tmp/data/cellbase/v4/homo_sapiens_grch37/ directory created in section Download Sources and saving the result at /tmp/data/cellbase/v4/homo_sapiens_grch37/mongodb/, run:
cellbase/build/bin$ mkdir /tmp/data/cellbase/v4/homo_sapiens_grch37/mongodb
cellbase/build/bin$ ./cellbase.sh build -a GRCh37 --common /tmp/data/cellbase/v4/common/ -d all -i /tmp/data/cellbase/v4/homo_sapiens_grch37/ -o /tmp/data/cellbase/v4/homo_sapiens_grch37/mongodb/ -s hsapiens
Note: the build process for the whole CellBase dataset may require up to 16GB of RAM and may take up to ~24h, depending on the hardware.
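Given the memory figure in the note above, it can be worth checking the machine before launching a full build. The sketch below compares total RAM against 16 GiB; reading /proc/meminfo is Linux-specific, and both helper names and the exact threshold are assumptions (a machine sold as "16GB" usually reports slightly less, so treat a failure as a warning rather than a hard stop).

```shell
#!/bin/sh
# Sketch: warn when the machine has less RAM than the ~16GB mentioned in the note.
# Reading /proc/meminfo is Linux-specific (an assumption about the build host).
REQUIRED_KB=$((16 * 1024 * 1024))   # 16 GiB expressed in kB

enough_ram() {
    # $1: total memory in kB; succeeds when it is at least REQUIRED_KB
    [ "$1" -ge "$REQUIRED_KB" ]
}

total_ram_kb() {
    # Prints the MemTotal value (in kB) reported by the kernel
    awk '/^MemTotal:/ {print $2}' /proc/meminfo
}

# Usage:
#   enough_ram "$(total_ram_kb)" || echo "warning: less than 16GB of RAM" >&2
```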
After completion of the build process, your output directory shall look like:
cellbase/build/bin$ ls /tmp/data/cellbase/v4/homo_sapiens_grch37/mongodb/
clinical_variants.full.json.gz
clinvarVersion.json
conservation_10.json.gz
conservation_11.json.gz
conservation_12.json.gz
conservation_13.json.gz
conservation_14.json.gz
conservation_15.json.gz
conservation_16.json.gz
conservation_17.json.gz
conservation_18.json.gz
conservation_19.json.gz
conservation_1.json.gz
conservation_20.json.gz
conservation_21.json.gz
conservation_22.json.gz
conservation_2.json.gz
conservation_3.json.gz
conservation_4.json.gz
conservation_5.json.gz
conservation_6.json.gz
conservation_7.json.gz
conservation_8.json.gz
conservation_9.json.gz
conservation_M.json.gz
conservation_X.json.gz
conservation_Y.json.gz
cosmic.json.gz
cosmicVersion.json
dgidbVersion.json
dgvVersion.json
disgenetVersion.json
ensemblCoreVersion.json
ensemblRegulationVersion.json
ensemblVariationVersion.json
geneExpressionAtlasVersion.json
gene.json.gz
genome_info.json
genome_info.log
genome_sequence.json.gz
genomeVersion.json
genomicSuperDups.json
gnomadVersion.json
hpoVersion.json
interproVersion.json
mirbaseVersion.json
phastConsVersion.json
phyloPVersion.json
protein.json.gz
prot_func_pred_chr_10.json
prot_func_pred_chr_11.json
prot_func_pred_chr_12.json
prot_func_pred_chr_13.json
prot_func_pred_chr_14.json
prot_func_pred_chr_15.json
prot_func_pred_chr_16.json
prot_func_pred_chr_17.json
prot_func_pred_chr_18.json
prot_func_pred_chr_19.json
prot_func_pred_chr_1.json
prot_func_pred_chr_20.json
prot_func_pred_chr_21.json
prot_func_pred_chr_22.json
prot_func_pred_chr_2.json
prot_func_pred_chr_3.json
prot_func_pred_chr_4.json
prot_func_pred_chr_5.json
prot_func_pred_chr_6.json
prot_func_pred_chr_7.json
prot_func_pred_chr_8.json
prot_func_pred_chr_9.json
prot_func_pred_chr_MT.json
prot_func_pred_chr_X.json
prot_func_pred_chr_Y.json
regulatory_region.json.gz
repeats.json.gz
simpleRepeat.json
structuralVariants.json.gz
toload
uniprotVersion.json
uniprotXrefVersion.json
variation_chr10.json.gz
variation_chr10.somatic.json.gz
variation_chr11.json.gz
variation_chr11.somatic.json.gz
variation_chr12.json.gz
variation_chr12.somatic.json.gz
variation_chr13.json.gz
variation_chr13.somatic.json.gz
variation_chr14.json.gz
variation_chr14.somatic.json.gz
variation_chr15.json.gz
variation_chr15.somatic.json.gz
variation_chr16.json.gz
variation_chr16.somatic.json.gz
variation_chr17.json.gz
variation_chr17.somatic.json.gz
variation_chr18.json.gz
variation_chr18.somatic.json.gz
variation_chr19.json.gz
variation_chr19.somatic.json.gz
variation_chr1.json.gz
variation_chr1.somatic.json.gz
variation_chr20.json.gz
variation_chr20.somatic.json.gz
variation_chr21.json.gz
variation_chr21.somatic.json.gz
variation_chr22.json.gz
variation_chr22.somatic.json.gz
variation_chr2.json.gz
variation_chr2.somatic.json.gz
variation_chr3.json.gz
variation_chr3.somatic.json.gz
variation_chr4.json.gz
variation_chr4.somatic.json.gz
variation_chr5.json.gz
variation_chr5.somatic.json.gz
variation_chr6.json.gz
variation_chr6.somatic.json.gz
variation_chr7.json.gz
variation_chr7.somatic.json.gz
variation_chr8.json.gz
variation_chr8.somatic.json.gz
variation_chr9.json.gz
variation_chr9.somatic.json.gz
variation_chrMT.json.gz
variation_chrX.json.gz
variation_chrX.somatic.json.gz
variation_chrY.json.gz
windowMasker.json
If the build was successful, you can proceed to loading the data models into the database, as described in the Load Data Models section.
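One inexpensive way to gain confidence in a finished build is to test the integrity of the compressed outputs. The sketch below runs `gzip -t` on every *.json.gz file in a directory and prints those that fail. The `corrupt_models` helper is hypothetical, and a passing check only shows that the files are valid gzip, not that their contents are complete.

```shell
#!/bin/sh
# Sketch: print every *.json.gz file in a build output directory that fails "gzip -t".
corrupt_models() {
    # $1: build output directory
    for f in "$1"/*.json.gz; do
        [ -e "$f" ] || continue            # no matches: the glob stays literal
        gzip -t "$f" 2>/dev/null || echo "$f"
    done
}

# Usage: corrupt_models /tmp/data/cellbase/v4/homo_sapiens_grch37/mongodb
```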
Load Data Models
Getting data models
There are two ways of getting the data models that shall populate the CellBase database:
- Build the CellBase knowledgebase from scratch by following this tutorial from the Download Sources section.
- Download pre-built data models from http://bioinfo.hpc.cam.ac.uk/downloads/cellbase/v4/homo_sapiens_grch37/mongodb/.
Load data models
Please note that before loading the data models into the database, the CellBase code must have been compiled with Maven with the database credentials injected, as explained in the README file.
CellBase code is open-source and freely available at https://github.com/opencb/cellbase
Use the CellBase CLI to load the data models:
cellbase/build/bin$ ./cellbase.sh load
The following options are required: -i, --input -d, --data --database
Usage: cellbase.sh load [options]
Options:
-C, --config STRING CellBase configuration.json file. Have a look at
cellbase/cellbase-core/src/main/resources/configuration.json for an example
* -d, --data STRING Data model type to be loaded, i.e. genome, gene, ...
* --database STRING Data model type to be loaded, i.e. genome, gene, ...
--field STRING Use this parameter when an custom update of the database documents is required. Indicate herethe
full path to the document field that must be updated, e.g. annotation.populationFrequencies. This
parameter must be used togetherwith a custom file provided at --input and the data to update
indicated at --data.
-h, --help Display this help and exit [false]
* -i, --input STRING Input directory with the JSON data models to be loaded. Can also be used to specify acustom json
file to be loaded (look at the --field parameter).
-l, --loader STRING Database specific data loader to be used [org.opencb.cellbase.mongodb.loader.MongoDBCellBaseLoader]
-L, --log-level STRING Set the logging level, accepted values are: debug, info, warn, error and fatal [info]
--num-threads INT Number of threads used for loading data into the database [2]
-v, --verbose BOOLEAN [Deprecated] Set the level of the logging [false]
-D Dynamic parameters go here [{}]
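For example, to load human (GRCh37) data models from the /tmp/data/cellbase/v4/homo_sapiens_grch37/mongodb/ directory created in section Build Data Models into the cellbase_hsapiens_grch37_v4 database, creating the indexes as indicated in the .js scripts within cellbase/cellbase-app/app/mongodb-scripts/, run a command like the one below. Note that this example loads only the variation data (-d variation) and uses paths from the authors' environment, so adjust -i and -Dmongodb-index-folder to your own locations: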
cellbase/build/bin$ ./cellbase.sh load -d variation --database cellbase_hsapiens_grch37_v4 -i /mnt/data/downloads/cellbase/v4/homo_sapiens_grch37/mongodb/ -L debug -Dmongodb-index-folder=/home/cafetero/appl/dev/cellbase/cellbase-app/app/mongodb-scripts/
Please, note that the whole loading and indexing process may need ~24h to complete, depending on the available hardware.
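To load several data model types, one option is to run the load command once per type. The sketch below only prints the commands so they can be reviewed first. It assumes that -d takes a single type per invocation, and the type names passed in the usage line are examples drawn from this tutorial that may not match every CellBase version; the `load_cmds` helper is hypothetical.

```shell
#!/bin/sh
# Sketch: print one "cellbase.sh load" command per data model type.
# Assumptions: -d takes one type per run; paths and database name follow
# the examples in this tutorial.
MODELS_DIR="/tmp/data/cellbase/v4/homo_sapiens_grch37/mongodb/"
DATABASE="cellbase_hsapiens_grch37_v4"

load_cmds() {
    # "$@": data model types, e.g. genome gene variation
    for data in "$@"; do
        echo "./cellbase.sh load -d $data --database $DATABASE -i $MODELS_DIR -L info"
    done
}

# Usage: load_cmds genome gene variation
```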
Warning notices
When building CellBase data from scratch, the variant annotation provided by default for the variation dataset is the ENSEMBL variation annotation. The CellBase pre-annotated variation collection can only be obtained from the pre-built models provided at http://bioinfo.hpc.cam.ac.uk/downloads/cellbase/v4/homo_sapiens_grch37/mongodb/. Likewise, population frequencies for the 1000 Genomes Project, the UK10K project, the GoNL project, ExAC, etc., are not included by default when building the models from scratch. These data are obtained by the CellBase team thanks to additional collaborations and will only be found in the pre-built variation data models provided at: http://bioinfo.hpc.cam.ac.uk/downloads/cellbase/v4/homo_sapiens_grch37/mongodb/
After successful load of all data, the corresponding database shall look like:
$ mongo mongodb-dev/cellbase_hsapiens_grch37_v4
MongoDB shell version: 3.0.9
connecting to: mongodb-dev/cellbase_hsapiens_grch37_v4
> show collections;
protein_protein_interaction
clinical
protein
conservation
gene
genome_info
variation_functional_score
genome_sequence
regulatory_region
protein_functional_prediction
variation
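The collection list above can also be checked from a script. The sketch below reads collection names on standard input and prints any expected name that is absent. The expected names are copied from the session above; the `missing_collections` helper is hypothetical, and the commented usage line assumes the legacy mongo shell with the same host and database.

```shell
#!/bin/sh
# Sketch: report expected collections that are absent from a list of names read on stdin.
EXPECTED_COLLECTIONS="protein_protein_interaction clinical protein conservation gene genome_info variation_functional_score genome_sequence regulatory_region protein_functional_prediction variation"

missing_collections() {
    actual=" $(tr '\n' ' ') "          # stdin: one collection name per line
    for c in $EXPECTED_COLLECTIONS; do
        case "$actual" in
            *" $c "*) ;;               # collection is present
            *) echo "$c" ;;            # collection is missing
        esac
    done
}

# Usage with the legacy mongo shell (host and database as in the session above):
#   mongo mongodb-dev/cellbase_hsapiens_grch37_v4 --quiet \
#       --eval 'db.getCollectionNames().forEach(function (n) { print(n); })' | missing_collections
```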