Working with Alignment Data

This tutorial shows how to use the OpenCGA alignment command line to run each step of the alignment (mapping) pipeline. The pipeline takes raw sequence data in FastQ files and produces alignments in BAM files. The BAM files can then be used for further analysis, such as alignment statistics, coverage computation or variant calling.

Prerequisites

A working OpenCGA installation is required. To set up a testing environment, please follow the steps in the installation guide.

In addition, you need to download the following data files:

Raw sequence data file: input.fastq

The alignment pipeline

Quality control step: FastQC subcommand

In order to use the input.fastq file, it has to be linked to the OpenCGA catalog:

$ ./build/bin/opencga.sh files link -i ~/input.fastq --path fastq/ --parents

Once the FastQ file is linked, you can run the FastQC command:

$ ./build/bin/opencga.sh alignments fastqc --file input.fastq

For the input.fastq file, the FastQC command creates a report file called input_fastqc.html that can be downloaded from the OpenCGA catalog to the local directory /tmp by using the following command:

$ ./build/bin/opencga.sh files download --file input_fastqc.html --to /tmp

Here is the FastQC report file: input_fastqc.html.
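
The three commands above can be chained in a small script. The following is a minimal sketch, not part of OpenCGA: the run and fastqc_steps helpers and the OPENCGA and DRY_RUN variables are hypothetical names, while the subcommands and flags are copied from the commands in this section. It assumes an uncompressed .fastq input, so that the report name can be derived by replacing the extension with _fastqc.html. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch: chain the link, FastQC and download steps from this section.
# OPENCGA, DRY_RUN, run and fastqc_steps are hypothetical names.
OPENCGA="${OPENCGA:-./build/bin/opencga.sh}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

fastqc_steps() {
    fastq="$1"                        # local FastQ path, e.g. ~/input.fastq
    out_dir="$2"                      # local directory for the report
    name=$(basename "$fastq")         # input.fastq
    report="${name%.*}_fastqc.html"   # input_fastqc.html (assumes a .fastq extension)

    run "$OPENCGA" files link -i "$fastq" --path fastq/ --parents || return 1
    run "$OPENCGA" alignments fastqc --file "$name" || return 1
    run "$OPENCGA" files download --file "$report" --to "$out_dir"
}

fastqc_steps "$HOME/input.fastq" /tmp
```

Set DRY_RUN=0 to execute the commands instead of printing them. If the FastQC analysis runs as a queued job in your installation, wait for it to finish before the download step.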

Mapping step: BWA subcommand
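
The OpenCGA options for this step are not documented here yet. As a rough orientation only, the mapping stages correspond to the following standalone bwa and samtools invocations. This is a sketch of the underlying tools, not of the OpenCGA subcommand: the mapping_plan helper and the reference.fa file name are hypothetical, and the helper only prints the commands.

```shell
#!/bin/sh
# Sketch: print the standalone bwa/samtools commands for the mapping stages.
# mapping_plan and reference.fa are hypothetical names; paths containing
# spaces are not handled.
mapping_plan() {
    ref="$1"                       # reference genome in FASTA format
    fastq="$2"                     # raw reads in FastQ format
    base=$(basename "$fastq")
    base="${base%.*}"              # input.fastq -> input

    cat <<EOF
bwa index $ref
bwa mem $ref $fastq > $base.sam
samtools view -b -o $base.bam $base.sam
samtools sort -o $base.sorted.bam $base.bam
samtools index $base.sorted.bam
EOF
}

mapping_plan reference.fa input.fastq
```

To execute the printed commands rather than just inspect them, the output can be piped to a shell, for example mapping_plan reference.fa input.fastq | sh -e, provided bwa and samtools are installed.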


fastqc -> bwa index -> bwa mem -> samtools sam to bam -> samtools sort bam -> alignment index -> alignment query

Ingesting Clinical Data (creating Variable Sets and Annotation Sets)

We are going to use the Variable Sets and Annotation Sets from the examples in the Annotation and Clinical Data section. Here are the files needed to load them using the command line: demo.tar.gz

First, we need to load both Variable Sets by running the following commands:

./opencga.sh variables create --json demo/individual_vs.json -n individual_private_details --confidential --description "Private details of the individual" -s 1kG_phase3 --of yaml
./opencga.sh variables create --json demo/sample_vs.json -n sample_metadata --description "Sample origin" -s 1kG_phase3 --of yaml

From that moment on, we can annotate any annotable entry with any of the Variable Sets. For example, to annotate both the sample and the individual we created, we run the following commands:

# Annotate the sample sample1 using the variable set 'sample_metadata'
./opencga.sh samples annotation-sets-create --annotation-set-name sampleAnnotName --annotations demo/sample_as.json --id sample1 --variable-set-id sample_metadata


# Annotate the individual individual1 using the variable set 'individual_private_details'
./opencga.sh individuals annotation-sets-create --annotation-set-name individualAnnotName --annotations demo/individual_as.json --id individual1 --variable-set-id individual_private_details

Querying Clinical Data

Querying individuals

# Querying all individuals annotated with gender = MALE. Result: The only individual we have created
./opencga.sh individuals search --annotation gender=MALE --variable-set individual_private_details

# Querying all individuals annotated with age < 60. Result: None because the individual we annotated has age = 60
./opencga.sh individuals search --annotation "age<60" --variable-set individual_private_details

# But we can obtain it if we change the query to age <= 60 as follows
./opencga.sh individuals search --annotation "age<=60" --variable-set individual_private_details

# Querying all individuals with age <= 60 and gender = FEMALE. No results because our individual is a MALE.
./opencga.sh individuals search --annotation "age<=60;gender=FEMALE" --variable-set individual_private_details

# Now we change the query to age <= 60 and gender = MALE. We again get the individual we expected.
./opencga.sh individuals search --annotation "age<=60;gender=MALE" --variable-set individual_private_details
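
In the queries above, conditions separated by ';' must all hold, and the whole expression is quoted because characters such as '<' and ';' are special to the shell. When the conditions come from a script, they can be assembled with a small helper. This is a minimal sketch: join_conditions and search_individuals are hypothetical helper names, and only the opencga.sh options already used in this section appear in it.

```shell
#!/bin/sh
# Sketch: build an --annotation expression from separate conditions.
# join_conditions and search_individuals are hypothetical helpers.

# Join all arguments with ';' so that every condition must hold.
join_conditions() {
    joined=""
    for cond in "$@"; do
        if [ -z "$joined" ]; then
            joined="$cond"
        else
            joined="$joined;$cond"
        fi
    done
    printf '%s\n' "$joined"
}

# Usage: search_individuals VARIABLE_SET CONDITION...
# The expansion is quoted so the joined expression stays a single argument,
# even when a value contains spaces.
search_individuals() {
    variable_set="$1"
    shift
    ./opencga.sh individuals search --annotation "$(join_conditions "$@")" --variable-set "$variable_set"
}

join_conditions "age<=60" "gender=MALE"   # prints: age<=60;gender=MALE
```

With these helpers, the last query above becomes search_individuals individual_private_details "age<=60" "gender=MALE".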

Querying samples

# Querying all samples annotated with tissue = "umbilical cord blood". Result: The only sample we have created
./opencga.sh samples search --annotation tissue="umbilical cord blood" --variable-set sample_metadata


# Querying all samples annotated with tissue = "umbilical cord blood" and cell type = "multipotent progenitor"
./opencga.sh samples search --annotation "tissue=umbilical cord blood;cell_type=multipotent progenitor" --variable-set sample_metadata
