2 Alignments

Principle

Reads coming from the sequencer are just sequences, but have no context. The Alignment takes each Read and maps it against a Reference Genome to get its Genomic Position.

mapping.png

The Alignment tries to find the best position on the Genome. It's computationally expensive, but straightforward for clean reads.

Some Difficulties:
  • Allow for variations in the reads (sequencing errors, polymorphisme, SNVs/INDELs...)
  • Reads mapping to several regions of the genome (pseudo-genes, repeated regions)
  • Difficult regions of the genome (AAAAAAAAAAAAAAAAA, ATATATATATATATATATATATAT, ...)
  • Highly variable regions of the genomes (ex: HLA)
  • Chimeric reads: that have different parts that map to different regions of the genome. Can be a hint of structural variants, recombination,...
  • Reads that map nowhere: shitty reads, foreign reads (bacteria, virus, contamination)

2.1. Reference Genomes

A representation of the human genome. Comparison of sequenced data (after alignment) against the reference genome allows to highlight variations.

2.1.1. A small genome

In the DATA directory, there are only the sequences for 2 chromosomes :

  • chromosome 12
  • mitochondria
ls -l DATA/*fasta
bash

2.1.2. What are the sizes of the chromosome 12 and of the mitochondria ?

cat DATA/reference.chr12.fasta | grep -v '>' | tr -d '\n' | wc -c
cat DATA/reference.chrMT.fasta | grep -v '>' | tr -d '\n' | wc -c
bash
Command explained

grep -v '>' ignores the lines that start with > (the sequence identifiers)
tr -d '\n' removes the end-of-line characters (so that the whole sequence is on a unique line)
wc -c counts the number of characters

2.1.3. What is the start of the sequence for the chromosome 12 ?

less DATA/reference.chr12.fasta
bash
Type q to quit

2.1.4. Why is that ? How many characters are there before it changes ?

cat DATA/reference.chr12.fasta | grep -v '>' | tr -d "\n" | grep -o . | cat -n | grep -v 'N' | head
bash
Command explained

grep -v '>' ignores the lines that contain > (the sequence identifiers)
tr -d '\n' removes the end-of-line characters (so that the whole sequence is on a unique line)
grep -o . prints each character on a line
cat -n numbers the lines
grep -v 'N' ignores the lines that contain N

2.1.5 Information about the reference genome

Limitations
  • Haploid
  • Gaps (5% > 150Mb): repeats, telomeres/centromeres (where DNA is not fully unpacked durement sequencing preparation)
  • Errors
  • Biased towards European Ancestry
Human Genome Versions
Year Release Name UCSC Version Note
2003 NCBI Build 34 hg16
2004 NCBI Build 35 hg17
2006 NCBI Build 36.1 hg18
2009 GRCh37 hg19 Style widely used
2013 GRCh38 hg38 Current Version
2022 T2T-CHM13 - From Telomere-to-Telomere
Postponed GRCh39 - Postponed due to T2T and Human Pangenome Reference Consortium
GRCh38

GRCh38.png
Blue : Gaps
Pink : Partially Assembled Regions

Pangenome

"No single genome can represent the diversity in the human population. Using a single reference genome creates reference biases, adversely affecting variant discovery, gene–disease association studies and the accuracy of genetic analyses."

The Human Pangemone Reference Consortium was started to help migrate common genomic analysis to use a pangenome so diverse populations are better represented.

2.2. Tools

For Short Reads to do WES/WGS

Aglorithm Sensitivity <100bp Sensitivity >100bp Specificity <100bp Specificity >100bp Speed <100bp Speed >100bp Tandem Low Tandem High
BWA ⭐⭐⭐ ⭐⭐ ⭐⭐⭐ ⭐⭐⭐ ⭐⭐⭐ ⭐⭐
Bowtie2 ⭐⭐⭐ ⭐⭐ ⭐⭐ ⭐⭐
NovoAlign ⭐⭐⭐ ⭐⭐⭐ ⭐⭐ ⭐⭐⭐ ⭐⭐
Smalt ⭐⭐⭐ ⭐⭐ ⭐⭐ ⭐⭐
Stampy ⭐⭐ ⭐⭐⭐ ⭐⭐ ⭐⭐⭐ ⭐⭐

Also:

  • Minimap2
  • GEM
  • SNAP

For RNAseq:

  • HISAT2
  • STAR
GATK Best Practices

The proposed aligner for the WES/WGS workflow is BWA-MEM

2.3. Command Lines

2.3.1. Create a Reference Genome by merging Reference Chromosomes

cat DATA/reference.chr12.fasta DATA/reference.chrMT.fasta > ref.fa
bash

2.3.2. Indexing Reference Genome

bwa index ref.fa
samtools faidx ref.fa
gatk CreateSequenceDictionary -R ref.fa
bash

Runtime : ~ 3 minutes

2.3.3. Mapping child's Reads against Reference Genome

bwa mem  -R '@RG\tID:child\tLB:unknown\tPL:ILLUMINA\tSM:child\tPU:unknown\tCN:default' ref.fa DATA/child.R1.fastq.gz DATA/child.R2.fastq.gz > child.unsorted.sam
bash
  • bwa mem ➡ the name of the program
  • -R '@RG\tID:child-aa\tSM:child' ➡Create a Read-Group to identify reads
  • ref.fa ➡ Reference Genome, fasta file
  • DATA/child.R1.aa.fq.gz ➡ Forward Reads, fastq file
  • DATA/child.R2.aa.fq.gz ➡ Reverse Reads, fastq file
  • child_unsorted.aa.sam ➡ output, sam file

Runtime 6 minutes

2.4. SAM File Format

SAM

Alignment files are SAM files (Simple Alignment Map).
It is a text file (human readable) composed of a Header (lines starting with "@" and alignments).

2.4.1. Count the number of lines in the sam file of the child?

wc -l child.unsorted.sam 
bash

2.4.2. How many lines are there in the header ?

cat child.unsorted.sam | grep '^@'
bash

The regex expression ^@ mean starting with @

2.4.3. Look at the header lines. Check the length for the chromosome 12 and MT

(header lines start with "@")

2.4.4. How many alignements are there in the SAM file ?

2.4.5. Compare this number with the number of reads in child.R1.fastq.gz et child.R2.fastq.gz

2.4.6. Identifiez les colonnes d'un fichier SAM:

# Column Note
1 READ NAME From the FastQ file
2 FLAG Properties of the Read (as a binary code)
3 CHROMOSOME Chrom. of the alignment site
4 POS Position of the alignment site
5 MAPPING QUALITY
6 CIGAR Compact Representation of the alignment
7 MATE CHROMOSOME Chrom. of the mate's alignment site
8 MATE POS Position of the mate's alignment site
9 DISTANCE READ-MATE Distance between this read and its mate (negative if mate if before)
10 DNA SEQ Sequence from the FastQ file
11 QUALITIES Qualities from the FastQ file
12+ PGM Program Information and Annotations
cat child.unsorted.sam | grep -v '^@' | head -1 | tr "\t" "\n" | cat -n
bash
Command line explained

grep -v '^@' ignores the lines that start with @
head -1 keeps only the 1st line
tr "\t" "\n" transpose the columns in lines
cat -n numbers the lines

2.4.7. Look at the @RG in the header, check that this read group is references in one of the Annotation column.

head child.unsorted.sam

The second column contains "SAM FLAG": meta information about the mapping of the read.

2.4.8. Which SAM flags are the most frequent ?

cat child.unsorted.sam | grep -v '^@' | cut -f2 | sort | uniq -c | sort -n
bash
Command line explained

grep -v '^@' ignores the lines that start with @
cut -f2 keeps only the 2nd column
sort regroups the values
uniq -c counts each values
sort -n sorts the values by count

2.4.9. Check the meaning of those flags

Hexadecimal Decimal Name Note
0x1 1 read paired
0x2 2 read mapped in proper pair
0x4 4 read unmapped
0x8 8 mate unmapped
0x10 16 read reverse strand
0x20 32 mate reverse strand
0x40 64 first in pair
0x80 128 second in pair
0x100 256 not primary alignment it is typically used to flag alternative mappings when multiple mappings are presented in a SAM
0x200 512 read fails platform/vendor quality checks
0x400 1024 read is PCR or optical duplicate
0x800 2048 supplementary alignment indicates that the corresponding alignment line is part of a chimeric alignment

2.4.10. Compare SAM and fastq reads

Look at the following reads, check that the sequence and quality are the same in the SAM file and fastq file.

# ID Flag
#1066 HWI-1KL149:98:C78LJACXX:1:1101:3399:41377 163
#1111 HWI-1KL149:98:C78LJACXX:1:1101:3497:89865 83
To extract the 1066th read of the SAM from the fatsq
  • Get the identifier, flag, sequence and quality
cat child.unsorted.sam | grep -v "^@" | head -1066 | tail -1 | cut -f1,2,10,11 | tr "\t" "\n"
bash
  • Get the sequence from the fastq:
# Check the identifier 
# (here it's HWI-1KL149:98:C78LJACXX:1:1101:3399:41377)
zcat DATA/child.R1.fastq.gz DATA/child.R2.fastq.gz | grep -A3 "HWI-1KL149:98:C78LJACXX:1:1101:3399:41377"


# or more directly 
zcat DATA/child.R1.fastq.gz DATA/child.R2.fastq.gz | grep -A3 `cat child.unsorted.sam | grep -v "^@" | head -1066 | tail -1 | cut -f1`

2.4.11. Look at the CIGAR for the following reads

For the file child.unsorted.sam

How to

Adapt the first command from the previous section. Get only the 6 column.
Use the following commands

  • grep -v "^@"
  • head -XXXX
  • tail -1
  • cut -f6
  • #1
  • #845
  • #787
  • #243
  • #9777
  • #253
Letter Description
M Alignment Match (can be a sequence match or mismatch)
I Insertion to the reference
D Deletion from the reference
S Soft clipping (clipped sequences present in SEQ)
H Hard clipping (clipped sequences NOT present in SEQ)
* Not aligned

2.5. Converting SAM into BAM

SAM files are very BIG, so there are 2 binary formats that intend to reduce this size:

  • BAM: gzipped version of the SAM, most used format
  • CRAM: increasingly used format. The reference genome fasta file is needed to read back the CRAM file

To speed up the algorithms, we'll use samtools to convert the text/SAM file child.unsorted.sam to a binary/BAM file child.unsorted.bam.

Compressed file are smaller, but that a longer time to process as then need to be unpacked on-the-fly. The fact that the compressed file are associated to an index, the process is quickened.

2.5.1. Creating child.unsorted.bam

samtools view -bS -o child.unsorted.bam child.unsorted.sam
bash
Command Line explained
  • Option -S : l'entrée est du SAM
  • Option -b : la sortie est du binaire (BAM)
  • Option -o : nom du fichier de sortie

samtools view is also used to explore a BAM/SAM file

2.5.2. Print samtools view's options

samtools view
bash

2.5.3. Print the body of the BAM file

samtools view child.unsorted.bam | head
bash

By default, only the alignments are shown. Options allow to show the header

2.5.4. Show the header of the BAM file

samtools view -H child.unsorted.bam
bash

2.5.5. Print the whole BAM file

samtools view -h child.unsorted.bam | head
bash

2.5.6. Count the BAM reads coming from the child R1 file

The reads coming from the child.R1.fastq.gz file have the flag "first in pair".

Use :

2.6. Processing

2.6.1. Sorting and Indexing

At this moment, reads in the BAM file are in the same order as the unsorted fastq reads produces by the Sequencer ➡that have no meaningful order.

Algorithms tends to work on particular regions or positions at any given time. So they would work faster if reads that are aligned to adjacent genomic regions were located nearby in the BAM file.

We'll use sambamba to sort the reads.

sambamba sort -o child.sorted.bam child.unsorted.bam
bash
Command line explained
  • -o FILE the name of the output file

This produces a new BAM file child.sorted.bam and it index file child.sorted.bam.bai

2.6.2. Check that the file is sorted

samtools view -H child.sorted.bam 
bash

find the SO keyword.

Check the coordinates (columns 3 and 4) of the first reads.

samtools view child.sorted.bam | cut -f 3,4 | head 
bash

2.6.3. Marking Deplucate Reads

During sample preparation e.g. library construction using PCR, duplicates can arise. They are reads originating from a single fragment of DNA.

Having the same fragment present tens, hundreds or thousands of time would give to must weight to this fragment. Thus reads originating from this fragment should be ignored.

In this step we'll Mark those reads ad duplicates so that then can be ignored when necessary.

Avoid this step in the case of amplicons.
sambamba markdup child.sorted.bam child.dedup.bam
bash

A value of 1024 is added to the flag of the marked alignments, which signifies "read is PCR or optical duplicate".

2.6.4. Read Groups

We can define a Read Groups (with a name and a series of properties) and mark given read as belonging to this read group.

In our simple case where a single library preparation derived from a single biological sample was run on a single lane of a flow cell, all the reads from that lane run belong to the same Read Group.

When multiplexing is involved, then each subset of reads originating from a separate library run on that lane will constitute a separate Read Group.

Read groups have various properties:

  • ID: Read group Identifier
  • PU: Platform Unit
  • SM: Sample
  • PL: Platform/technology used to produce the read
  • LB: DNA preparation library identifier

2.6.5 BQSR

Adjusts systematic technical errors regarding base call quality, using machine learning to model errors empirically.
To build the model, we need to provide

  • our input BAM file (with -I)
  • the reference genome (with -R)
  • a database of known polymorphic sites to skip over with -known-sites

Building the Model

gatk BaseRecalibrator \
	-I child.dedup.bam \
	-R ref.fa \
	-known-sites /DATA/bioinfo/chr12M.dbSNP.146.b37.vcf.gz \
	-known-sites /DATA/bioinfo/chr12M.Mills_and_1000G_gold_standard.indels.b37.vcf.gz \
	-known-sites /DATA/bioinfo/chr12M.1000G_phase1.indels.b37.vcf.gz \
	-O child.table
bash

Runtime ~ 3 minutes

Applying the recalibration

gatk ApplyBQSR \
    -I child.dedup.bam \
    -R ref.fa \
    -bqsr child.table \
    -O child.bqsr.bam     
bash

Runtime ~ 1 minute

2.7. Visualisation

2.7.1. By query

Show only the reads in the region [40340000-40350000] of chromosome 12

samtools view child.bqsr.bam 12:40340000-40350000
bash

2.7.2 With samtools tview

samtools tview allows you to visualise the aligned reads from a BAM file.

samtools tview child.bqsr.bam ref.fa
bash

Type g and go to the following position ➡ 12:2690702

Type n to color by nucleotide, and observe what is happening at the position 2690702

The column 2690702 is here

2690691   2690701   2690711
...........^...............
...........|...............
...........|...............
...........|...............
...........|...............
...........|...............
...........V...............

What would be the base call for this position ?

Type q to quit

2.7.3 With IGV

igv allows to a more comfortable visualisation.

igv child.bqsr.bam
bash

Enter the position ➡ 12:2690702 and look at the reads for this position to infer the base call.

alignment.png