This tutorial first guides you through downloading a set of raw files from several data sources. These raw files contain the core data that populate the CellBase knowledgebase. The tutorial then shows how to build the JSON documents that are loaded into the CellBase knowledgebase. We have already processed all these data, and the resulting JSON documents are available from our FTP server. If you prefer to skip the sections below, download the JSON documents directly from http://bioinfo.hpc.cam.ac.uk/downloads/cellbase/v4/homo_sapiens_grch37/mongodb/ and jump to the [[Load Data Models]] tutorial.
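
For readers taking the shortcut, the prebuilt documents could be mirrored with `wget`. This is a minimal sketch, not an official procedure: it assumes the server exposes a browsable directory index at the URL above, and the function name `fetch_prebuilt_json`, the `DRY_RUN` switch and the destination directory are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# Sketch: mirror the prebuilt JSON documents instead of building them locally.
BASE_URL="http://bioinfo.hpc.cam.ac.uk/downloads/cellbase/v4/homo_sapiens_grch37/mongodb/"

fetch_prebuilt_json() {
    dest="$1"
    # The URL path has five components
    # (downloads/cellbase/v4/homo_sapiens_grch37/mongodb), so --cut-dirs=5
    # places the files directly under $dest.
    set -- --recursive --no-parent --no-host-directories --cut-dirs=5 \
           --reject "index.html*" --directory-prefix "$dest" "$BASE_URL"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Print the command instead of running it.
        echo "wget $*"
    else
        mkdir -p "$dest" && wget "$@"
    fi
}

# Example (hypothetical destination directory):
#   DRY_RUN=1 fetch_prebuilt_json /tmp/data/cellbase/v4/json
```

Running it first with `DRY_RUN=1` shows the exact `wget` invocation before any large transfer starts.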

To build the CellBase knowledgebase from scratch, follow the sections below.

### Download data sources


The download can be run through the CellBase CLI:
```
cellbase/build/bin$ ./cellbase.sh download
The following option is required: -d, --data

Usage:   cellbase.sh download [options]

Options:
      -a, --assembly       STRING     Name of the assembly, if empty the first assembly in configuration.json will be used [GRCh37]
          --common         STRING     Directory where common multi-species data will be downloaded, this is mainly protein and expression
                                      data [<OUTPUT>/common]
      -C, --config         STRING     CellBase configuration.json file. Have a look at
                                      cellbase/cellbase-core/src/main/resources/configuration.json for an example
    * -d, --data           STRING     Comma separated list of data to download: genome, gene, gene_disease_association, variation,
                                      variation_functional_score, regulation, protein, conservation, clinical and . 'all' to download
                                      everything
      -h, --help                      Display this help and exit [false]
      -L, --log-level      STRING     Set the logging level, accepted values are: debug, info, warn, error and fatal [info]
      -o, --output         STRING     The output directory, species folder will be created [/tmp]
      -s, --species        STRING     Name of the species to be downloaded, valid format include 'Homo sapiens' or 'hsapiens' [Homo
                                      sapiens]
      -v, --verbose        BOOLEAN    [Deprecated] Set the level of the logging [false]

```


As the built-in documentation indicates, the following datasets can be downloaded: `genome, gene, gene_disease_association, variation, variation_functional_score, regulation, protein, conservation, clinical`. Passing `all` to the `--data` parameter downloads every dataset with a single command. Some datasets (`genome` and `gene`) require a working installation of the Ensembl Perl API to be downloaded in full. Please note: all data can be downloaded, built and loaded into the database without the Ensembl API, but some parts may be missing, e.g. gene xrefs.
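
Before downloading `genome` or `gene`, it may help to check whether the Ensembl Perl API is visible to the `perl` on your `PATH`. The sketch below assumes that loading `Bio::EnsEMBL::Registry` (a module shipped with the Ensembl core API) is a sufficient indicator. The helper name `perl_module_available` is hypothetical, and the CellBase scripts may depend on further Ensembl modules not tested here.

```shell
#!/bin/sh
# Return success if the given Perl module can be loaded by the perl on PATH.
perl_module_available() {
    perl -M"$1" -e 1 >/dev/null 2>&1
}

# Assumption: Bio::EnsEMBL::Registry being loadable means the API is installed.
if perl_module_available Bio::EnsEMBL::Registry; then
    echo "Ensembl Perl API found."
else
    echo "Ensembl Perl API not found: some gene data (e.g. xrefs) may be missing."
fi
```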

For example, to download all human (GRCh37) data from all sources and save it into the `/tmp/data/cellbase/v4/` directory, run:

`cellbase/build/bin$ ./cellbase.sh download -a GRCh37 --common /tmp/data/cellbase/v4/common/ -d all -o /tmp/data/cellbase/v4/ -s hsapiens`

**Please note:** make sure you are in the `cellbase/build/bin` directory before running the `download` command. Otherwise, some Perl scripts that use the Ensembl API may not run properly.
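
One way to satisfy the working-directory requirement without changing your current shell's directory is to run the command in a subshell. The wrapper below is a sketch: `run_download` is a hypothetical helper, and the `genome,gene` value simply follows the comma-separated format described in the help text for `--data`.

```shell
#!/bin/sh
# Run "cellbase.sh download" from inside the CellBase bin directory.
# $1 is the bin directory; remaining arguments are passed to the CLI unchanged.
run_download() {
    bin_dir="$1"
    shift
    # The parentheses start a subshell, so the caller's working directory
    # is left untouched after the command finishes.
    ( cd "$bin_dir" && ./cellbase.sh download "$@" )
}

# Example: download only the genome and gene datasets for human GRCh37.
#   run_download cellbase/build/bin -a GRCh37 -s hsapiens \
#       -d genome,gene -o /tmp/data/cellbase/v4/
```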

Please also note that large files will be downloaded, so completion may take anywhere from minutes to hours. The downloaded data should look like this:

```
homo_sapiens_grch37/
├── clinical
│   ├── All.vcf.gz
│   ├── All.vcf.gz.log
│   ├── All.vcf.gz.tbi
│   ├── All.vcf.gz.tbi.log
│   ├── ClinVar_Traits_EFO_Names.csv
│   ├── ClinVar_Traits_EFO_Names.csv.log
│   ├── ClinVar.xml.gz
│   ├── ClinVar.xml.gz.log
│   ├── gwas_catalog.tsv
│   ├── gwas_catalog.tsv.log
│   ├── variant_summary.txt.gz
│   └── variant_summary.txt.gz.log
├── conservation
│   ├── gerp
│   │   ├── hg19.GERP_scores.tar.gz
│   │   └── hg19.GERP_scores.tar.gz.log
│   ├── phastCons
│   │   ├── chr10.phastCons46way.primates.wigFix.gz
│   │   ├── chr10.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr11.phastCons46way.primates.wigFix.gz
│   │   ├── chr11.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr12.phastCons46way.primates.wigFix.gz
│   │   ├── chr12.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr13.phastCons46way.primates.wigFix.gz
│   │   ├── chr13.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr14.phastCons46way.primates.wigFix.gz
│   │   ├── chr14.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr15.phastCons46way.primates.wigFix.gz
│   │   ├── chr15.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr16.phastCons46way.primates.wigFix.gz
│   │   ├── chr16.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr17.phastCons46way.primates.wigFix.gz
│   │   ├── chr17.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr18.phastCons46way.primates.wigFix.gz
│   │   ├── chr18.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr19.phastCons46way.primates.wigFix.gz
│   │   ├── chr19.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr1.phastCons46way.primates.wigFix.gz
│   │   ├── chr1.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr20.phastCons46way.primates.wigFix.gz
│   │   ├── chr20.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr21.phastCons46way.primates.wigFix.gz
│   │   ├── chr21.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr22.phastCons46way.primates.wigFix.gz
│   │   ├── chr22.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr2.phastCons46way.primates.wigFix.gz
│   │   ├── chr2.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr3.phastCons46way.primates.wigFix.gz
│   │   ├── chr3.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr4.phastCons46way.primates.wigFix.gz
│   │   ├── chr4.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr5.phastCons46way.primates.wigFix.gz
│   │   ├── chr5.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr6.phastCons46way.primates.wigFix.gz
│   │   ├── chr6.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr7.phastCons46way.primates.wigFix.gz
│   │   ├── chr7.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr8.phastCons46way.primates.wigFix.gz
│   │   ├── chr8.phastCons46way.primates.wigFix.gz.log
│   │   ├── chr9.phastCons46way.primates.wigFix.gz
│   │   ├── chr9.phastCons46way.primates.wigFix.gz.log
│   │   ├── chrM.phastCons46way.primates.wigFix.gz
│   │   ├── chrM.phastCons46way.primates.wigFix.gz.log
│   │   ├── chrX.phastCons46way.primates.wigFix.gz
│   │   ├── chrX.phastCons46way.primates.wigFix.gz.log
│   │   ├── chrY.phastCons46way.primates.wigFix.gz
│   │   └── chrY.phastCons46way.primates.wigFix.gz.log
│   └── phylop
│       ├── chr10.phyloP46way.primate.wigFix.gz
│       ├── chr10.phyloP46way.primate.wigFix.gz.log
│       ├── chr11.phyloP46way.primate.wigFix.gz
│       ├── chr11.phyloP46way.primate.wigFix.gz.log
│       ├── chr12.phyloP46way.primate.wigFix.gz
│       ├── chr12.phyloP46way.primate.wigFix.gz.log
│       ├── chr13.phyloP46way.primate.wigFix.gz
│       ├── chr13.phyloP46way.primate.wigFix.gz.log
│       ├── chr14.phyloP46way.primate.wigFix.gz
│       ├── chr14.phyloP46way.primate.wigFix.gz.log
│       ├── chr15.phyloP46way.primate.wigFix.gz
│       ├── chr15.phyloP46way.primate.wigFix.gz.log
│       ├── chr16.phyloP46way.primate.wigFix.gz
│       ├── chr16.phyloP46way.primate.wigFix.gz.log
│       ├── chr17.phyloP46way.primate.wigFix.gz
│       ├── chr17.phyloP46way.primate.wigFix.gz.log
│       ├── chr18.phyloP46way.primate.wigFix.gz
│       ├── chr18.phyloP46way.primate.wigFix.gz.log
│       ├── chr19.phyloP46way.primate.wigFix.gz
│       ├── chr19.phyloP46way.primate.wigFix.gz.log
│       ├── chr1.phyloP46way.primate.wigFix.gz
│       ├── chr1.phyloP46way.primate.wigFix.gz.log
│       ├── chr20.phyloP46way.primate.wigFix.gz
│       ├── chr20.phyloP46way.primate.wigFix.gz.log
│       ├── chr21.phyloP46way.primate.wigFix.gz
│       ├── chr21.phyloP46way.primate.wigFix.gz.log
│       ├── chr22.phyloP46way.primate.wigFix.gz
│       ├── chr22.phyloP46way.primate.wigFix.gz.log
│       ├── chr2.phyloP46way.primate.wigFix.gz
│       ├── chr2.phyloP46way.primate.wigFix.gz.log
│       ├── chr3.phyloP46way.primate.wigFix.gz
│       ├── chr3.phyloP46way.primate.wigFix.gz.log
│       ├── chr4.phyloP46way.primate.wigFix.gz
│       ├── chr4.phyloP46way.primate.wigFix.gz.log
│       ├── chr5.phyloP46way.primate.wigFix.gz
│       ├── chr5.phyloP46way.primate.wigFix.gz.log
│       ├── chr6.phyloP46way.primate.wigFix.gz
│       ├── chr6.phyloP46way.primate.wigFix.gz.log
│       ├── chr7.phyloP46way.primate.wigFix.gz
│       ├── chr7.phyloP46way.primate.wigFix.gz.log
│       ├── chr8.phyloP46way.primate.wigFix.gz
│       ├── chr8.phyloP46way.primate.wigFix.gz.log
│       ├── chr9.phyloP46way.primate.wigFix.gz
│       ├── chr9.phyloP46way.primate.wigFix.gz.log
│       ├── chrM.phyloP46way.primate.wigFix.gz
│       ├── chrM.phyloP46way.primate.wigFix.gz.log
│       ├── chrX.phyloP46way.primate.wigFix.gz
│       ├── chrX.phyloP46way.primate.wigFix.gz.log
│       ├── chrY.phyloP46way.primate.wigFix.gz
│       └── chrY.phyloP46way.primate.wigFix.gz.log
├── gene
│   ├── all_gene_disease_associations.txt.gz
│   ├── all_gene_disease_associations.txt.gz.log
│   ├── ALL_SOURCES_ALL_FREQUENCIES_diseases_to_genes_to_phenotypes.txt
│   ├── ALL_SOURCES_ALL_FREQUENCIES_diseases_to_genes_to_phenotypes.txt.log
│   ├── description.txt
│   ├── geneDrug
│   │   ├── dgidb.tsv
│   │   └── dgidb.tsv.log
│   ├── gene_extra_info.log
│   ├── homo_sapiens.cdna.all.fa.gz
│   ├── homo_sapiens.cdna.all.fa.gz.log
│   ├── homo_sapiens.gtf.gz
│   ├── homo_sapiens.gtf.gz.log
│   ├── homo_sapiens.pep.all.fa.gz
│   ├── homo_sapiens.pep.all.fa.gz.log
│   ├── idmapping_selected.tab.gz
│   ├── idmapping_selected.tab.gz.log
│   ├── MotifFeatures.gff.gz
│   ├── MotifFeatures.gff.gz.log
│   └── xrefs.txt
├── gene_disease_association
│   ├── all_gene_disease_associations.txt.gz
│   ├── all_gene_disease_associations.txt.gz.log
│   ├── ALL_SOURCES_ALL_FREQUENCIES_diseases_to_genes_to_phenotypes.txt
│   └── ALL_SOURCES_ALL_FREQUENCIES_diseases_to_genes_to_phenotypes.txt.log
├── genome
│   ├── Homo_sapiens.GRCh37.fa.gz
│   └── Homo_sapiens.GRCh37.fa.gz.log
├── regulation
│   ├── AnnotatedFeatures.gff.gz
│   ├── AnnotatedFeatures.gff.gz.log
│   ├── hsa_MTI.xls
│   ├── hsa_MTI.xls.log
│   ├── MotifFeatures.gff.gz
│   ├── MotifFeatures.gff.gz.log
│   ├── RegulatoryFeatures_MultiCell.gff.gz
│   ├── RegulatoryFeatures_MultiCell.gff.gz.log
│   ├── targetScanS.txt.gz
│   └── targetScanS.txt.gz.log
├── variation
│   ├── allele_code.txt.gz
│   ├── allele_code.txt.gz.log
│   ├── allele.txt.gz
│   ├── allele.txt.gz.log
│   ├── attrib.txt.gz
│   ├── attrib.txt.gz.log
│   ├── attrib_type.txt.gz
│   ├── attrib_type.txt.gz.log
│   ├── genotype_code.txt.gz
│   ├── genotype_code.txt.gz.log
│   ├── motif_feature_variation.txt.gz
│   ├── motif_feature_variation.txt.gz.log
│   ├── phenotype_feature_attrib.txt.gz
│   ├── phenotype_feature_attrib.txt.gz.log
│   ├── phenotype_feature.txt.gz
│   ├── phenotype_feature.txt.gz.log
│   ├── phenotype.txt.gz
│   ├── phenotype.txt.gz.log
│   ├── population_genotype.txt.gz
│   ├── population_genotype.txt.gz.log
│   ├── population.txt.gz
│   ├── population.txt.gz.log
│   ├── seq_region.txt.gz
│   ├── seq_region.txt.gz.log
│   ├── source.txt.gz
│   ├── source.txt.gz.log
│   ├── structural_variation_feature.txt.gz
│   ├── structural_variation_feature.txt.gz.log
│   ├── study.txt.gz
│   ├── study.txt.gz.log
│   ├── transcript_variation.txt
│   ├── transcript_variation.txt.gz.log
│   ├── transcript_variation.txt.tmp
│   ├── variation_feature.txt
│   ├── variation_feature.txt.gz.log
│   ├── variation.sorted.txt
│   ├── variation_synonym.txt
│   ├── variation_synonym.txt.gz.log
│   ├── variation.txt
│   └── variation.txt.gz.log
└── variation_functional_score
    ├── whole_genome_SNVs.tsv.gz
    └── whole_genome_SNVs.tsv.gz.log

```
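
A quick sanity check on the result is to confirm that each top-level dataset directory from the listing above exists. This sketch only tests for directories, not for file completeness. The function name `check_download_dir` is hypothetical, and the directory list is taken from the example tree, so it may differ for other species or when only a subset of datasets was requested.

```shell
#!/bin/sh
# Report which expected dataset directories are missing under a species folder.
# Returns 0 when all are present, 1 otherwise.
check_download_dir() {
    root="$1"
    missing=0
    for d in clinical conservation gene gene_disease_association genome \
             regulation variation variation_functional_score; do
        if [ ! -d "$root/$d" ]; then
            echo "missing: $d"
            missing=1
        fi
    done
    return "$missing"
}

# Example:
#   check_download_dir /tmp/data/cellbase/v4/homo_sapiens_grch37
```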

If the download was successful, you can proceed to build the JSON objects that will be loaded into the corresponding database: [[Build & Load Data]].