Code availability

CellBase is open-source and freely available at https://github.com/opencb/cellbase

Download data sources

The first step in creating a CellBase instance is to download the data files. This can be done through the CellBase CLI:

...


The built-in documentation lists the datasets that can be downloaded: `genome, gene, gene_disease_association, variation, variation_functional_score, regulation, protein, conservation, clinical`. The `--data` parameter also accepts `all`, which downloads every dataset with a single command. Two datasets (`genome` and `gene`) need the Ensembl Perl API to be properly installed in order to be downloaded in full. Please note: all data can be downloaded, built and loaded into the database without the Ensembl API, but some parts may be missing, e.g. gene xrefs.

For example, to download all human (GRCh37) data from all sources and save it into the `/tmp/data/cellbase/v4/` directory, run:

`cellbase/build/bin$ ./cellbase.sh download -a GRCh37 --common /tmp/data/cellbase/v4/common/ -d all -o /tmp/data/cellbase/v4/ -s hsapiens`

**Please note:** make sure you are in the `cellbase/build/bin` directory before running the `download` command. Otherwise, some Perl scripts that use the Ensembl API may not run properly.
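
One way to avoid running the command from the wrong directory is a small wrapper function. The sketch below is not part of CellBase: the function name `cellbase_download` and the `CELLBASE_HOME` variable (assumed to point at your `cellbase` checkout) are hypothetical. It changes into `build/bin` inside a subshell, so your current working directory is left untouched, and passes all arguments through to `./cellbase.sh download`.

```shell
# Hypothetical helper, not shipped with CellBase.
# Assumes CELLBASE_HOME points at the root of the cellbase checkout.
cellbase_download() {
    (
        # Fail early if CELLBASE_HOME is unset or the directory is missing.
        cd "${CELLBASE_HOME:?CELLBASE_HOME is not set}/build/bin" || exit 1
        # Forward every argument unchanged to the CellBase CLI.
        ./cellbase.sh download "$@"
    )
}
```

With `CELLBASE_HOME` set, the example above could then be run from any directory as `cellbase_download -a GRCh37 --common /tmp/data/cellbase/v4/common/ -d all -o /tmp/data/cellbase/v4/ -s hsapiens`.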

Please also note that large files will be downloaded, so completion may take anywhere from minutes to hours. The downloaded data should look like this:
```
homo_sapiens_grch37/
├── clinical
│   ├── All.vcf.gz
│   ├── All.vcf.gz.log
│   ├── All.vcf.gz.tbi
│   ├── All.vcf.gz.tbi.log
│   ├── ClinVar_Traits_EFO_Names.csv
│   ├── ClinVar_Traits_EFO_Names.csv.log
│   ├── ClinVar.xml.gz
│   ├── ClinVar.xml.gz.log
│   ├── gwas_catalog.tsv
│   ├── gwas_catalog.tsv.log
│   ├── variant_summary.txt.gz
│   └── variant_summary.txt.gz.log
├── conservation
│   ├── gerp
│   │   ├── hg19.GERP_scores.tar.gz
│   │   └── hg19.GERP_scores.tar.gz.log
│   ├── phastCons
│   │   ├── chr10.phastCons46way.primates.wigFix.gz
│   │   ├── chr10.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr11.phastCons46way.primates.wigFix.gz
│   │   ├── chr11.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr12.phastCons46way.primates.wigFix.gz
│   │   ├── chr12.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr13.phastCons46way.primates.wigFix.gz
│   │   ├── chr13.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr14.phastCons46way.primates.wigFix.gz
│   │   ├── chr14.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr15.phastCons46way.primates.wigFix.gz
│   │   ├── chr15.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr16.phastCons46way.primates.wigFix.gz
│   │   ├── chr16.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr17.phastCons46way.primates.wigFix.gz
│   │   ├── chr17.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr18.phastCons46way.primates.wigFix.gz
│   │   ├── chr18.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr19.phastCons46way.primates.wigFix.gz
│   │   ├── chr19.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr1.phastCons46way.primates.wigFix.gz
│   │   ├── chr1.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr20.phastCons46way.primates.wigFix.gz
│   │   ├── chr20.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr21.phastCons46way.primates.wigFix.gz
│   │   ├── chr21.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr22.phastCons46way.primates.wigFix.gz
│   │   ├── chr22.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr2.phastCons46way.primates.wigFix.gz
│   │   ├── chr2.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr3.phastCons46way.primates.wigFix.gz
│   │   ├── chr3.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr4.phastCons46way.primates.wigFix.gz
│   │   ├── chr4.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr5.phastCons46way.primates.wigFix.gz
│   │   ├── chr5.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr6.phastCons46way.primates.wigFix.gz
│   │   ├── chr6.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr7.phastCons46way.primates.wigFix.gz
│   │   ├── chr7.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr8.phastCons46way.primates.wigFix.gz
│   │   ├── chr8.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr9.phastCons46way.primates.wigFix.gz
│   │   ├── chr9.phastCons46way.primates.wigFix.gz.log
│   │   ├── chrM.phastCons46way.primates.wigFix.gz
│   │   ├── chrM.phastCons46way.primates.wigFix.gz.log
│   │   ├── chrX.phastCons46way.primates.wigFix.gz
│   │   ├── chrX.phastCons46way.primates.wigFix.gz.log
│   │   ├── chrY.phastCons46way.primates.wigFix.gz
│   │   └── chrY.phastCons46way.primates.wigFix.gz.log
│   └── phylop
│       ├── chr10.phyloP46way.primate.wigFix.gz
│       ├── chr10.phyloP46way.primate.wigFix.gz.log
│       ├── chr11.phyloP46way.primate.wigFix.gz
│       ├── chr11.phyloP46way.primate.wigFix.gz.log
│       ├── chr12.phyloP46way.primate.wigFix.gz
│       ├── chr12.phyloP46way.primate.wigFix.gz.log
│       ├── chr13.phyloP46way.primate.wigFix.gz
│       ├── chr13.phyloP46way.primate.wigFix.gz.log
│       ├── chr14.phyloP46way.primate.wigFix.gz
│       ├── chr14.phyloP46way.primate.wigFix.gz.log
│       ├── chr15.phyloP46way.primate.wigFix.gz
│       ├── chr15.phyloP46way.primate.wigFix.gz.log
│       ├── chr16.phyloP46way.primate.wigFix.gz
│       ├── chr16.phyloP46way.primate.wigFix.gz.log
│       ├── chr17.phyloP46way.primate.wigFix.gz
│       ├── chr17.phyloP46way.primate.wigFix.gz.log
│       ├── chr18.phyloP46way.primate.wigFix.gz
│       ├── chr18.phyloP46way.primate.wigFix.gz.log
│       ├── chr19.phyloP46way.primate.wigFix.gz
│       ├── chr19.phyloP46way.primate.wigFix.gz.log
│       ├── chr1.phyloP46way.primate.wigFix.gz
│       ├── chr1.phyloP46way.primate.wigFix.gz.log
│       ├── chr20.phyloP46way.primate.wigFix.gz
│       ├── chr20.phyloP46way.primate.wigFix.gz.log
│       ├── chr21.phyloP46way.primate.wigFix.gz
│       ├── chr21.phyloP46way.primate.wigFix.gz.log
│       ├── chr22.phyloP46way.primate.wigFix.gz
│       ├── chr22.phyloP46way.primate.wigFix.gz.log
│       ├── chr2.phyloP46way.primate.wigFix.gz
│       ├── chr2.phyloP46way.primate.wigFix.gz.log
│       ├── chr3.phyloP46way.primate.wigFix.gz
│       ├── chr3.phyloP46way.primate.wigFix.gz.log
│       ├── chr4.phyloP46way.primate.wigFix.gz
│       ├── chr4.phyloP46way.primate.wigFix.gz.log
│       ├── chr5.phyloP46way.primate.wigFix.gz
│       ├── chr5.phyloP46way.primate.wigFix.gz.log
│       ├── chr6.phyloP46way.primate.wigFix.gz
│       ├── chr6.phyloP46way.primate.wigFix.gz.log
│       ├── chr7.phyloP46way.primate.wigFix.gz
│       ├── chr7.phyloP46way.primate.wigFix.gz.log
│       ├── chr8.phyloP46way.primate.wigFix.gz
│       ├── chr8.phyloP46way.primate.wigFix.gz.log
│       ├── chr9.phyloP46way.primate.wigFix.gz
│       ├── chr9.phyloP46way.primate.wigFix.gz.log
│       ├── chrM.phyloP46way.primate.wigFix.gz
│       ├── chrM.phyloP46way.primate.wigFix.gz.log
│       ├── chrX.phyloP46way.primate.wigFix.gz
│       ├── chrX.phyloP46way.primate.wigFix.gz.log
│       ├── chrY.phyloP46way.primate.wigFix.gz
│       └── chrY.phyloP46way.primate.wigFix.gz.log
├── gene
│   ├── all_gene_disease_associations.txt.gz
│   ├── all_gene_disease_associations.txt.gz.log
│   ├── ALL_SOURCES_ALL_FREQUENCIES_diseases_to_genes_to_phenotypes.txt
│   ├── ALL_SOURCES_ALL_FREQUENCIES_diseases_to_genes_to_phenotypes.txt.log
│   ├── description.txt
│   ├── geneDrug
│   │   ├── dgidb.tsv
│   │   └── dgidb.tsv.log
│   ├── gene_extra_info.log
│   ├── homo_sapiens.cdna.all.fa.gz
│   ├── homo_sapiens.cdna.all.fa.gz.log
│   ├── homo_sapiens.gtf.gz
│   ├── homo_sapiens.gtf.gz.log
│   ├── homo_sapiens.pep.all.fa.gz
│   ├── homo_sapiens.pep.all.fa.gz.log
│   ├── idmapping_selected.tab.gz
│   ├── idmapping_selected.tab.gz.log
│   ├── MotifFeatures.gff.gz
│   ├── MotifFeatures.gff.gz.log
│   └── xrefs.txt
├── gene_disease_association
│   ├── all_gene_disease_associations.txt.gz
│   ├── all_gene_disease_associations.txt.gz.log
│   ├── ALL_SOURCES_ALL_FREQUENCIES_diseases_to_genes_to_phenotypes.txt
│   └── ALL_SOURCES_ALL_FREQUENCIES_diseases_to_genes_to_phenotypes.txt.log
├── genome
│   ├── Homo_sapiens.GRCh37.fa.gz
│   └── Homo_sapiens.GRCh37.fa.gz.log
├── regulation
│   ├── AnnotatedFeatures.gff.gz
│   ├── AnnotatedFeatures.gff.gz.log
│   ├── hsa_MTI.xls
│   ├── hsa_MTI.xls.log
│   ├── MotifFeatures.gff.gz
│   ├── MotifFeatures.gff.gz.log
│   ├── RegulatoryFeatures_MultiCell.gff.gz
│   ├── RegulatoryFeatures_MultiCell.gff.gz.log
│   ├── targetScanS.txt.gz
│   └── targetScanS.txt.gz.log
├── variation
│   ├── allele_code.txt.gz
│   ├── allele_code.txt.gz.log
│   ├── allele.txt.gz
│   ├── allele.txt.gz.log
│   ├── attrib.txt.gz
│   ├── attrib.txt.gz.log
│   ├── attrib_type.txt.gz
│   ├── attrib_type.txt.gz.log
│   ├── genotype_code.txt.gz
│   ├── genotype_code.txt.gz.log
│   ├── motif_feature_variation.txt.gz
│   ├── motif_feature_variation.txt.gz.log
│   ├── phenotype_feature_attrib.txt.gz
│   ├── phenotype_feature_attrib.txt.gz.log
│   ├── phenotype_feature.txt.gz
│   ├── phenotype_feature.txt.gz.log
│   ├── phenotype.txt.gz
│   ├── phenotype.txt.gz.log
│   ├── population_genotype.txt.gz
│   ├── population_genotype.txt.gz.log
│   ├── population.txt.gz
│   ├── population.txt.gz.log
│   ├── seq_region.txt.gz
│   ├── seq_region.txt.gz.log
│   ├── source.txt.gz
│   ├── source.txt.gz.log
│   ├── structural_variation_feature.txt.gz
│   ├── structural_variation_feature.txt.gz.log
│   ├── study.txt.gz
│   ├── study.txt.gz.log
│   ├── transcript_variation.txt
│   ├── transcript_variation.txt.gz.log
│   ├── transcript_variation.txt.tmp
│   ├── variation_feature.txt
│   ├── variation_feature.txt.gz.log
│   ├── variation.sorted.txt
│   ├── variation_synonym.txt
│   ├── variation_synonym.txt.gz.log
│   ├── variation.txt
│   └── variation.txt.gz.log
└── variation_functional_score
    ├── whole_genome_SNVs.tsv.gz
    └── whole_genome_SNVs.tsv.gz.log
```
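
A quick sanity check on the result is to confirm that every top-level dataset directory from the listing above exists. The following is a minimal sketch under that assumption only: the function name `check_cellbase_download` is hypothetical, the directory names are copied from the listing, and it does not inspect file contents or the `.log` files.

```shell
# Hypothetical check, not part of CellBase.
# Directory names are taken from the example listing above.
check_cellbase_download() {
    root="$1"
    missing=0
    for d in clinical conservation gene gene_disease_association \
             genome regulation variation variation_functional_score; do
        if [ ! -d "$root/$d" ]; then
            # Report each absent dataset directory by name.
            echo "missing: $d"
            missing=1
        fi
    done
    # Returns 0 when all directories are present, 1 otherwise.
    return "$missing"
}
```

For the example download this would be called as `check_cellbase_download /tmp/data/cellbase/v4/homo_sapiens_grch37`, assuming the species directory is created directly under the output directory.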



If the download was successful, you can proceed to building the JSON objects to be loaded into the corresponding database: [[Build & Load Data]].