To describe alignments with additional information about the fragment and the read, we adopt the read alignment schema defined by the Global Alliance for Genomics and Health (GA4GH). A ReadAlignment object is equivalent to an alignment line in a SAM file. Note that the header lines of a SAM file (i.e., the lines with program and chromosome information) are stored separately in a plain text file.

Data serialization systems such as Avro, Thrift and Protocol Buffers can use the ReadAlignment schema to read and write alignments in a compact, fast binary format. In addition, ReadAlignment objects can be stored in the Parquet format and then processed by a number of different systems, including Spark, Hive and Impala.

Schema

A read alignment object consists of the following fields:

  • id (string)
    The read alignment ID. This ID is unique within the read group this alignment belongs to. For performance reasons, this field may be omitted by a backend. If provided, its intended use is to make caching and UI display easier for genome browsers and other lightweight clients.

  • read_group_id (string)
    The ID of the read group this read belongs to. Every read must belong to exactly one read group.

  • fragment_name (string)
    The fragment name. Equivalent to QNAME (query template name) in SAM.

  • improper_placement (boolean)
    The orientation and the distance between reads from the fragment are inconsistent with the sequencing protocol (inverse of SAM flag 0x2).

  • duplicate_fragment (boolean)
    The fragment is a PCR or optical duplicate (SAM flag 0x400).

  • number_reads (integer)
    The number of reads in the fragment (extension to SAM flag 0x1).

  • fragment_length (integer)
    The observed length of the fragment, equivalent to TLEN in SAM.

  • read_number (integer)
    The read ordinal in the fragment, 0-based and less than number_reads. This field replaces SAM flags 0x40 and 0x80 and is intended to represent multiple reads per fragment more cleanly.

  • failed_vendor_quality_checks (boolean)
    The read fails platform or vendor quality checks (SAM flag 0x200).

  • alignment (see LinearAlignment at GA4GH's documentation)
    The linear alignment for this alignment record. This field will be null if the read is unmapped.

  • secondary_alignment (boolean)
    Whether this alignment is secondary. Equivalent to SAM flag 0x100. A secondary alignment represents an alternative to the primary alignment for this read. Aligners may return secondary alignments if a read can map ambiguously to multiple coordinates in the genome. By convention, each read has exactly one alignment where both secondary_alignment and supplementary_alignment are false.

  • supplementary_alignment (boolean)
    Whether this alignment is supplementary. Equivalent to SAM flag 0x800. Supplementary alignments are used in the representation of a chimeric alignment. In a chimeric alignment, a read is split into multiple linear alignments that map to different reference contigs. The first linear alignment in the read will be designated as the representative alignment; the remaining linear alignments will be designated as supplementary alignments. These alignments may have different mapping quality scores.

    In each linear alignment in a chimeric alignment, the read will be hard clipped. The aligned_sequence and aligned_quality fields of the alignment record will only represent the bases for its respective linear alignment.

  • aligned_sequence (string)
    The bases of the read sequence contained in this alignment record (equivalent to SEQ in SAM).

  • aligned_quality (array<integer>)
    The quality of the read sequence contained in this alignment record (equivalent to QUAL in SAM), one value per base.

    aligned_sequence and aligned_quality may be shorter than the full read sequence and quality. This will occur if the alignment is part of a chimeric alignment, or if the read was trimmed. When this occurs, the CIGAR for this read will begin or end with a hard clip operator that indicates the length of the excised sequence.

  • next_mate_position (see Position at GA4GH's documentation)
    The mapping of the primary alignment of the (read_number + 1) % number_reads read in the fragment. It replaces mate position and mate strand in SAM.

  • info (map<string, ListValue>)
    A map of additional read alignment information used to store SAM's optional fields (more information at https://samtools.github.io/hts-specs/SAMtags.pdf).
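
Several of the fields above are derived from bits of the SAM FLAG column. The following Python sketch illustrates that mapping. It is not the converter used to produce ReadAlignment objects: the names decode_flag and next_read_number are hypothetical, and it assumes at most two reads per fragment.

```python
# SAM flag bits named in the field descriptions above.
FLAG_PAIRED = 0x1
FLAG_PROPER_PAIR = 0x2
FLAG_LAST_SEGMENT = 0x80
FLAG_SECONDARY = 0x100
FLAG_QC_FAIL = 0x200
FLAG_DUPLICATE = 0x400
FLAG_SUPPLEMENTARY = 0x800


def decode_flag(flag):
    """Map a SAM FLAG integer to the flag-derived ReadAlignment fields."""
    paired = bool(flag & FLAG_PAIRED)
    return {
        # Inverse of 0x2; SAM only defines this bit for paired reads.
        "improper_placement": not (flag & FLAG_PROPER_PAIR),
        "duplicate_fragment": bool(flag & FLAG_DUPLICATE),
        # Assumption: a fragment has two reads if paired, otherwise one.
        "number_reads": 2 if paired else 1,
        # Assumption: 0x80 marks the second read; any other read is number 0.
        "read_number": 1 if flag & FLAG_LAST_SEGMENT else 0,
        "failed_vendor_quality_checks": bool(flag & FLAG_QC_FAIL),
        "secondary_alignment": bool(flag & FLAG_SECONDARY),
        "supplementary_alignment": bool(flag & FLAG_SUPPLEMENTARY),
    }


def next_read_number(read_number, number_reads):
    """Ordinal of the read that next_mate_position refers to."""
    return (read_number + 1) % number_reads
```

Calling decode_flag with the FLAG value of a SAM line returns a plain dictionary keyed by the schema field names. Fragments with more than two reads would need a different rule for number_reads and read_number.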

Example

This section shows an alignment in SAM format and the equivalent ReadAlignment.

Example of alignment in SAM format:

1_229454865_229455276_0:0:0_0:0:0_0    83    1    229455177    60    100M    =    229454865    -412    AGTGCTATTTGGATTCATCCCATATGGGCCCCATCTTGTGGTCTGAGGCCTGACAGGGCTCACCTGCAAGCTCGGTTCTCTGCTGTCTTTGATATGGACT    ????????????????????????????????????????????????????????????????????????????????????????????????????    NM:i:0    AS:i:100    XS:i:0

ReadAlignment object for the previous alignment (JSON format):

{
  "id" : {
    "string" : "1_229454865_229455276_0:0:0_0:0:0_0"
  },
  "readGroupId" : "no-group",
  "fragmentName" : "1",
  "improperPlacement" : {
    "boolean" : false
  },
  "duplicateFragment" : {
    "boolean" : false
  },
  "numberReads" : {
    "int" : 2
  },
  "fragmentLength" : {
    "int" : -412
  },
  "readNumber" : {
    "int" : 0
  },
  "failedVendorQualityChecks" : {
    "boolean" : false
  },
  "alignment" : {
    "org.ga4gh.models.LinearAlignment" : {
      "position" : {
        "referenceName" : "1",
        "position" : 229455176,
        "strand" : "NEG_STRAND"
      },
      "mappingQuality" : {
        "int" : 60
      },
      "cigar" : [ {
        "operation" : "ALIGNMENT_MATCH",
        "operationLength" : 100,
        "referenceSequence" : null
      } ]
    }
  },
  "secondaryAlignment" : {
    "boolean" : false
  },
  "supplementaryAlignment" : {
    "boolean" : false
  },
  "alignedSequence" : {
    "string" : "AGTGCTATTTGGATTCATCCCATATGGGCCCCATCTTGTGGTCTGAGGCCTGACAGGGCTCACCTGCAAGCTCGGTTCTCTGCTGTCTTTGATATGGACT"
  },
  "alignedQuality" : [ 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30 ],
  "nextMatePosition" : {
    "org.ga4gh.models.Position" : {
      "referenceName" : "1",
      "position" : 229454865,
      "strand" : "POS_STRAND"
    }
  },
  "info" : {
    "AS" : [ "i", "100" ],
    "XS" : [ "i", "0" ],
    "NM" : [ "i", "0" ]
  }
}
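
To make the correspondence between the two representations concrete, the following Python sketch derives a few of the values from the columns of a SAM line. It is only an illustration under stated assumptions: the function name sam_line_to_fields and the flat dictionary it returns are hypothetical (the actual object is nested, as shown in the JSON), columns are split on any whitespace, and qualities are assumed to be Phred+33 encoded.

```python
FLAG_REVERSE_STRAND = 0x10  # SAM flag: read mapped to the reverse strand


def sam_line_to_fields(line):
    """Derive selected ReadAlignment values from one SAM alignment line."""
    # Real SAM files are tab-delimited; splitting on any whitespace also
    # handles lines whose tabs were converted to spaces.
    cols = line.split()
    qname, flag, rname, pos, mapq = cols[0], int(cols[1]), cols[2], cols[3], cols[4]
    tlen, seq, qual = cols[8], cols[9], cols[10]

    info = {}
    for field in cols[11:]:
        # Optional fields have the form TAG:TYPE:VALUE.
        tag, value_type, value = field.split(":", 2)
        info[tag] = [value_type, value]

    return {
        "fragment_name": qname,
        "reference_name": rname,
        # SAM POS is 1-based; the alignment position is stored 0-based.
        "position": int(pos) - 1,
        "strand": "NEG_STRAND" if flag & FLAG_REVERSE_STRAND else "POS_STRAND",
        "mapping_quality": int(mapq),
        "fragment_length": int(tlen),
        "aligned_sequence": seq,
        # Assumption: Phred+33 encoding, so '?' (ASCII 63) becomes 30.
        "aligned_quality": [ord(c) - 33 for c in qual],
        "info": info,
    }
```

Applied to the SAM line of this example, the sketch yields position 229455176, strand NEG_STRAND and a quality of 30 for every base, matching the JSON above.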
