Page tree

Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Prerequisites 

A working setup of OpenCGA is required to setup a Testing environment, please follow the steps on installation guide.

The alignment pipeline

Historically, the alignment pipeline performs alignment for This tutorial details how to use the OpenCGA alignment command line to run the alignment/mapping pipeline steps. The alignment pipeline outputs alignments in BAM files from raw sequence data in FastQ format and outputs alignments in BAM files. BAM files can be used for further analysis, such as alignment statistics, coverage computation or variant calling.

This tutorial details how to run the OpenCGA commands to perform the alignment pipeline. In near future, OpenCGA will provide a single high-level command to execute each alignment pipeline command.

asdf

fastqc -> bwa index -> bwa mem -> samtools sam to bam -> samtools sort bam -> alignmnent index -> alignment queryIngesting Clinical Data (creating Variable Sets and Annotation Sets)

We are going to use the Variable Sets and Annotation Sets used in the examples of the Annotation and Clinical Data section. Here are the files needed to load those Variable Sets and Annotation Sets using the command line: demo.tar.gz

First, we will need to load both Variable Sets. To do so, we will run the following command lines

Prerequisites 

A working setup of OpenCGA is required to setup a testing environment, please follow the steps on installation guide.

In addition, you need to download the following data files:

The alignment pipeline

Quality control for raw sequence data: FastQC subcommand

In order to use the input.fastq file, it has to be linked to the OpenCGA catalog:

Code Block
languagebash
themeRDark
$ ./opencga.sh variablesfiles createlink --jsoni demo~/individual_vsinput.jsonfastq -n individual_private_details --confidential --description "Private details of the individual" -s 1kG_phase3 --of yaml-path test/ --parents

Once linked the FastQ file, you can run the FastQC command:

Code Block
languagebash
themeRDark
$ ./opencga.sh alignments fastqc-run --file input.fastq

For the input.fastq file, the FastQC command creates a report file called input_fastqc.html that can be downloaded from the OpenCGA catalog to the local directory /tmp by using the following command:

Code Block
languagebash
themeRDark
$ ./opencga.sh files download --file input_fastqc.html --to /tmp


Here is the FastQC report file: input_fastqc.html.

Mapping raw sequences: BWA subcommand

First, link the reference.fasta file to the OpenCGA catalog:

Code Block
languagebash
themeRDark
$ ./opencga.sh variablesfiles createlink --jsoni demo~/sample_vsreference.jsonfasta --n sample_metadata --description "Sample origin" -s 1kG_phase3 --of yaml
From that moment on, we can annotate using any of the Variable Sets any of the Annotable entries. For example, to annotate both the sample and the individual we created we will run the following commands
path test/ --parents

Then, you can run the bwa index command to index database sequences in the FASTA format:

Code Block
languagebash
themeRDark
$ ./opencga.sh alignments bwa-run --command index --fasta-file reference.fata

Internally, the index for the reference.fasta file created by the bwa index command consists of the following files:

# Annotate the sample sample1 using the variable set 'sample_metadata'
Code Block
languagebash
themeRDark
reference.fasta.bwt
reference.fasta.pac
reference.fasta.ann
reference.fasta.amb
reference.fasta.sa

Once created the index, you can map the FastQ file by using the bwa mem command:

Code Block
languagebash
themeRDark
$ ./opencga.sh samplesalignments annotationbwa-sets-createrun --annotationcommand mem -set-name sampleAnnotName index-base-annotations demo/sample_as.json --id sample1file reference.fasta --fastq1-file input.fastq --variablesam-set-id sample_metadata


# Annotate the individual individual1 using the variable set 'individual_private_details'file output.sam


In the previous command, the result alignments are saved in SAM format in the output.sam file .

Converting to and sorting BAM files

In order to convert a SAM file into BAM file, use the samtools view command with the parameter b (for more parameters details, see http://www.htslib.org/doc/samtools-view.1.html):

Code Block
languagebash
themeRDark
$ ./opencga.sh individualsalignments annotationsamtools-sets-createrun --annotationinput-set-name individualAnnotName --annotations demo/individual_as.jsonfile alignments.sam --output-filename alignments.bam --idcommand individual1view --variable-set-id individual_private_details

Querying Clinical Data

Querying individuals
Db=null

To sort a BAM file, use the samtools sort command:

# Querying all individuals annotated with gender = MALE. Result: The only individual we have created .
Code Block
languagebash
themeRDark
$ ./opencga.sh individualsalignments searchsamtools-run --input-annotation gender=MALEfile alignments.bam --variableoutput-set individual_private_details

# Querying all individuals annotated with age < 60. Result: None because the individual we annotated has age = 60
filename alignments.sorted.bam --command sort

Indexing and querying BAM files

Once the BAM file is sorted, you can index. A BAM index consists of a BAI file. Use the following command to create the index file (it will be called alignments.sorted.bam.bai):

Code Block
languagebash
themeRDark
$ ./opencga.sh individualsalignments searchindex --annotation "age<60" --variable-set individual_private_details

# But we can obtain it if we change the query to age <= 60 as follows
file alignments.sorted.bam

To query for alignments use the query command applying a set of filters (e.g., regions, insert size, maximum number of hits of mismatches, minimum mapping quality...). The following command query alignments for a given region in chromosome 20 and from the position 500000 to 1000000:

Code Block
languagebash
themeRDark
$ ./opencga.sh individualsalignments searchquery --annotation "age<=60"file alignments.sorted.bam --variable-set individual_private_details

# Querying all individuals with age <= 60 and gender = FEMALE. No results because our individual is a MALE.region 20:500000-1000000  --study study1 --rpc REST

Computing and querying BAM coverage

In order to compute the BAM coverage, use the command coverage-run. The coverage is saved in a BigWig format file.

Code Block
languagebash
themeRDark
$ ./opencga.sh individualsalignments searchcoverage-run --annotation "age<=60;gender=FEMALE"file alignments.sorted.bam --variablewindow-set individual_private_details

# Now we change the query to age <=60 and gender = MALE. We get again the individual we expected.
size 50

To query for the coverage in a given region use the command coverage-query. The following command query for the coverage from the position 78700 to 78900 in chromosome 20:

Code Block
languagebash
themeRDark
$ ./opencga.sh individualsalignments searchcoverage-query --annotation "age<=60;gender=MALE" --variable-set individual_private_details
Querying samples
file alignments.sorted.bam --region 20:78700-78900 --study study1


Computing  BAM statistics

In order to compute the statistics for a given BAM file, use the command stats-run:

Code Block
languagebash
themeRDark
# Querying all samples annotated with tissue = "umbilical cord blood". Result: The only sample we have created
$ ./opencga.sh alignments stats-run --file alignments.sorted.bam

BAM statistics can be viewed by executing the command stats-info:

Code Block
languagebash
themeRDark
$ ./opencga.sh samplesalignments searchstats-info --annotation tissue="umbilical cord blood" --variable-set sample_metadata


# Querying all samples annotated with tissue = "umbilical cord blood" and cell type = "multipotent progenitor"
file alignments.sorted.bam

In addition, OpenCGA provides the command stats-query to fetch BAM files according to their statistics. The following command fetch BAM files whose average mapping quality is greater than 35:

Code Block
languagebash
themeRDark
$ ./opencga.sh samplesalignments searchstats-query --average-annotationquality "tissue=umbilical cord blood;cell_type=multipotent progenitor" --variable-set sample_metadata
>25"




Table of Contents:

Table of Contents
indent20px