tags:
  - ngs
  - bam
type:
  - course
subject:
  - bioinformatics
class: m2ggb
Year: 2023
yaml

2 Alignments

contact: ludwig@univ-brest.fr

Principle

Reads coming from the sequencer are just sequences, but have no context. The Alignment takes each Read and maps it against a Reference Genome to get its Genomic Position.

The Alignment tries to find the best position on the Genome. It's computationally expensive, but straightforward for clean reads.

Some Difficulties:

Allow for variations in the reads (sequencing errors, polymorphisme, SNVs/INDELs...)
Reads mapping to several regions of the genome (pseudo-genes, repeated regions)
Difficult regions of the genome (AAAAAAAAAAAAAAAAA, ATATATATATATATATATATATAT, ...)
Highly variable regions of the genomes (ex: HLA)
Chimeric reads: that have different parts that map to different regions of the genome. Can be a hint of structural variants, recombination,...
Reads that map nowhere: shitty reads, foreign reads (bacteria, virus, contamination)

2.1. Reference Genomes

A representation of the human genome. Comparison of sequenced data (after alignment) against the reference genome allows to highlight variations.

2.1.1. A small genome

In the DATA directory, there are only the sequences for 2 chromosomes :

chromosome 12
mitochondria

ls -l DATA/*fasta
bash

2.1.2. What are the sizes of the chromosome 12 and of the mitochondria ?

cat DATA/reference.chr12.fasta | grep -v '>' | tr -d '\n' | wc -c
cat DATA/reference.chrMT.fasta | grep -v '>' | tr -d '\n' | wc -c
bash

Command explained

grep -v '>' ignores the lines that start with > (the sequence identifiers)
tr -d '\n' removes the end-of-line characters (so that the whole sequence is on a unique line)
wc -c counts the number of characters

2.1.3. What is the start of the sequence for the chromosome 12 ?

less DATA/reference.chr12.fasta
bash

Type q to quit

2.1.4. Why is that ? How many characters are there before it changes ?

cat DATA/reference.chr12.fasta | grep -v '>' | tr -d "\n" | grep -o . | cat -n | grep -v 'N' | head
bash

Command explained

grep -v '>' ignores the lines that contain > (the sequence identifiers)
tr -d '\n' removes the end-of-line characters (so that the whole sequence is on a unique line)
grep -o . prints each character on a line
cat -n numbers the lines
grep -v 'N' ignores the lines that contain N

2.1.5 Information about the reference genome

Limitations

Haploid
Gaps (5% > 150Mb): repeats, telomeres/centromeres (where DNA is not fully unpacked durement sequencing preparation)
Errors
Biased towards European Ancestry

Human Genome Versions

Year	Release Name	UCSC Version	Note
2003	NCBI Build 34	hg16
2004	NCBI Build 35	hg17
2006	NCBI Build 36.1	hg18
2009	GRCh37	hg19	Style widely used
2013	GRCh38	hg38	Current Version
2022	T2T-CHM13	-	From Telomere-to-Telomere
Postponed	GRCh39	-	Postponed due to T2T and Human Pangenome Reference Consortium

GRCh38

Blue : Gaps
Pink : Partially Assembled Regions

Pangenome

"No single genome can represent the diversity in the human population. Using a single reference genome creates reference biases, adversely affecting variant discovery, gene–disease association studies and the accuracy of genetic analyses."

The Human Pangemone Reference Consortium was started to help migrate common genomic analysis to use a pangenome so diverse populations are better represented.

2.2. Tools

For Short Reads to do WES/WGS

Aglorithm	Sensitivity <100bp	Sensitivity >100bp	Specificity <100bp	Specificity >100bp	Speed <100bp	Speed >100bp	Tandem Low	Tandem High
BWA	⭐	⭐⭐⭐	⭐⭐	⭐⭐⭐	⭐⭐⭐	⭐⭐⭐	⭐⭐	⭐
Bowtie2	⭐	⭐⭐⭐	⭐	⭐	⭐⭐	⭐⭐	⭐⭐	⭐
NovoAlign	⭐⭐⭐	⭐⭐⭐	⭐⭐	⭐⭐⭐	⭐	⭐	⭐⭐	⭐
Smalt	⭐	⭐⭐⭐	⭐	⭐	⭐⭐	⭐⭐	⭐⭐	⭐
Stampy	⭐⭐	⭐⭐⭐	⭐⭐	⭐⭐⭐	⭐	⭐	⭐⭐	⭐

Also:

Minimap2
GEM
SNAP

For RNAseq:

HISAT2
STAR

GATK Best Practices

The proposed aligner for the WES/WGS workflow is BWA-MEM

2.3. Command Lines

2.3.1. Create a Reference Genome by merging Reference Chromosomes

cat DATA/reference.chr12.fasta DATA/reference.chrMT.fasta > ref.fa
bash

2.3.2. Indexing Reference Genome

bwa index ref.fa
samtools faidx ref.fa
gatk CreateSequenceDictionary -R ref.fa
bash

Runtime : ~ 3 minutes

2.3.3. Mapping child's Reads against Reference Genome

bwa mem  -R '@RG\tID:child\tLB:unknown\tPL:ILLUMINA\tSM:child\tPU:unknown\tCN:default' ref.fa DATA/child.R1.fastq.gz DATA/child.R2.fastq.gz > child.unsorted.sam
bash

bwa mem ➡ the name of the program
-R '@RG\tID:child-aa\tSM:child' ➡Create a Read-Group to identify reads
ref.fa ➡ Reference Genome, fasta file
DATA/child.R1.aa.fq.gz ➡ Forward Reads, fastq file
DATA/child.R2.aa.fq.gz ➡ Reverse Reads, fastq file
child_unsorted.aa.sam ➡ output, sam file

Runtime 6 minutes

2.4. SAM File Format

SAM

Alignment files are SAM files (Simple Alignment Map).
It is a text file (human readable) composed of a Header (lines starting with "@" and alignments).

2.4.1. Count the number of lines in the sam file of the child?

wc -l child.unsorted.sam 
bash

2.4.2. How many lines are there in the header ?

cat child.unsorted.sam | grep '^@'
bash

The regex expression ^@ mean starting with @

2.4.3. Look at the header lines. Check the length for the chromosome 12 and MT

(header lines start with "@")

2.4.4. How many alignements are there in the SAM file ?

2.4.5. Compare this number with the number of reads in `child.R1.fastq.gz` et `child.R2.fastq.gz`

2.4.6. Identifiez les colonnes d'un fichier SAM:

#	Column	Note
1	READ NAME	From the FastQ file
2	FLAG	Properties of the Read (as a binary code)
3	CHROMOSOME	Chrom. of the alignment site
4	POS	Position of the alignment site
5	MAPPING QUALITY
6	CIGAR	Compact Representation of the alignment
7	MATE CHROMOSOME	Chrom. of the mate's alignment site
8	MATE POS	Position of the mate's alignment site
9	DISTANCE READ-MATE	Distance between this read and its mate (negative if mate if before)
10	DNA SEQ	Sequence from the FastQ file
11	QUALITIES	Qualities from the FastQ file
12+	PGM	Program Information and Annotations

cat child.unsorted.sam | grep -v '^@' | head -1 | tr "\t" "\n" | cat -n
bash

Command line explained

grep -v '^@' ignores the lines that start with @
head -1 keeps only the 1st line
tr "\t" "\n" transpose the columns in lines
cat -n numbers the lines

2.4.7. Look at the `@RG` in the header, check that this read group is references in one of the Annotation column.

head child.unsorted.sam

The second column contains "SAM FLAG": meta information about the mapping of the read.

2.4.8. Which SAM flags are the most frequent ?

cat child.unsorted.sam | grep -v '^@' | cut -f2 | sort | uniq -c | sort -n
bash

Command line explained

grep -v '^@' ignores the lines that start with @
cut -f2 keeps only the 2nd column
sort regroups the values
uniq -c counts each values
sort -n sorts the values by count

2.4.9. Check the meaning of those flags

With http://broadinstitute.github.io/picard/explain-flags.html

Hexadecimal	Decimal	Name	Note
0x1	1	read paired
0x2	2	read mapped in proper pair
0x4	4	read unmapped
0x8	8	mate unmapped
0x10	16	read reverse strand
0x20	32	mate reverse strand
0x40	64	first in pair
0x80	128	second in pair
0x100	256	not primary alignment	it is typically used to flag alternative mappings when multiple mappings are presented in a SAM
0x200	512	read fails platform/vendor quality checks
0x400	1024	read is PCR or optical duplicate
0x800	2048	supplementary alignment	indicates that the corresponding alignment line is part of a chimeric alignment

2.4.10. Compare SAM and fastq reads

Look at the following reads, check that the sequence and quality are the same in the SAM file and fastq file.

#	ID	Flag
#1066	`HWI-1KL149:98:C78LJACXX:1:1101:3399:41377`	163
#1111	`HWI-1KL149:98:C78LJACXX:1:1101:3497:89865`	83

To extract the 1066th read of the SAM from the fatsq

Get the identifier, flag, sequence and quality

cat child.unsorted.sam | grep -v "^@" | head -1066 | tail -1 | cut -f1,2,10,11 | tr "\t" "\n"
bash

Get the sequence from the fastq:

# Check the identifier 
# (here it's HWI-1KL149:98:C78LJACXX:1:1101:3399:41377)
zcat DATA/child.R1.fastq.gz DATA/child.R2.fastq.gz | grep -A3 "HWI-1KL149:98:C78LJACXX:1:1101:3399:41377"


# or more directly 
zcat DATA/child.R1.fastq.gz DATA/child.R2.fastq.gz | grep -A3 `cat child.unsorted.sam | grep -v "^@" | head -1066 | tail -1 | cut -f1`

2.4.11. Look at the CIGAR for the following reads

For the file child.unsorted.sam

How to

Adapt the first command from the previous section. Get only the 6 column.
Use the following commands

grep -v "^@"
head -XXXX
tail -1
cut -f6

#1
#845
#787
#243
#9777
#253

Letter	Description
M	Alignment Match (can be a sequence match or mismatch)
I	Insertion to the reference
D	Deletion from the reference
S	Soft clipping (clipped sequences present in SEQ)
H	Hard clipping (clipped sequences NOT* present in SEQ)*
*	Not aligned

2.5. Converting SAM into BAM

SAM files are very BIG, so there are 2 binary formats that intend to reduce this size:

BAM: gzipped version of the SAM, most used format
CRAM: increasingly used format. The reference genome fasta file is needed to read back the CRAM file

To speed up the algorithms, we'll use samtools to convert the text/SAM file child.unsorted.sam to a binary/BAM file child.unsorted.bam.

Compressed file are smaller, but that a longer time to process as then need to be unpacked on-the-fly. The fact that the compressed file are associated to an index, the process is quickened.

2.5.1. Creating `child.unsorted.bam`

samtools view -bS -o child.unsorted.bam child.unsorted.sam
bash

Command Line explained

Option -S : l'entrée est du SAM
Option -b : la sortie est du binaire (BAM)
Option -o : nom du fichier de sortie

samtools view is also used to explore a BAM/SAM file

2.5.2. Print samtools view's options

samtools view
bash

2.5.3. Print the body of the BAM file

samtools view child.unsorted.bam | head
bash

By default, only the alignments are shown. Options allow to show the header

2.5.4. Show the header of the BAM file

samtools view -H child.unsorted.bam
bash

2.5.5. Print the whole BAM file

samtools view -h child.unsorted.bam | head
bash

2.5.6. Count the BAM reads coming from the child R1 file

The reads coming from the child.R1.fastq.gz file have the flag "first in pair".

Use :

http://broadinstitute.github.io/picard/explain-flags.html
the -f flag option of samtools view.

2.6. Processing

2.6.1. Sorting and Indexing

At this moment, reads in the BAM file are in the same order as the unsorted fastq reads produces by the Sequencer ➡that have no meaningful order.

Algorithms tends to work on particular regions or positions at any given time. So they would work faster if reads that are aligned to adjacent genomic regions were located nearby in the BAM file.

We'll use sambamba to sort the reads.

sambamba sort -o child.sorted.bam child.unsorted.bam
bash

Command line explained

-o FILE the name of the output file

This produces a new BAM file child.sorted.bam and it index file child.sorted.bam.bai

2.6.2. Check that the file is sorted

samtools view -H child.sorted.bam 
bash

find the SO keyword.

Check the coordinates (columns 3 and 4) of the first reads.

samtools view child.sorted.bam | cut -f 3,4 | head 
bash

2.6.3. Marking Deplucate Reads

During sample preparation e.g. library construction using PCR, duplicates can arise. They are reads originating from a single fragment of DNA.

Having the same fragment present tens, hundreds or thousands of time would give to must weight to this fragment. Thus reads originating from this fragment should be ignored.

In this step we'll Mark those reads ad duplicates so that then can be ignored when necessary.

Avoid this step in the case of amplicons.

sambamba markdup child.sorted.bam child.dedup.bam
bash

A value of 1024 is added to the flag of the marked alignments, which signifies "read is PCR or optical duplicate".

2.6.4. Read Groups

We can define a Read Groups (with a name and a series of properties) and mark given read as belonging to this read group.

In our simple case where a single library preparation derived from a single biological sample was run on a single lane of a flow cell, all the reads from that lane run belong to the same Read Group.

When multiplexing is involved, then each subset of reads originating from a separate library run on that lane will constitute a separate Read Group.

Read groups have various properties:

ID: Read group Identifier
PU: Platform Unit
SM: Sample
PL: Platform/technology used to produce the read
LB: DNA preparation library identifier

2.6.5 BQSR

Adjusts systematic technical errors regarding base call quality, using machine learning to model errors empirically.
To build the model, we need to provide

our input BAM file (with -I)
the reference genome (with -R)
a database of known polymorphic sites to skip over with -known-sites

Building the Model

gatk BaseRecalibrator \
	-I child.dedup.bam \
	-R ref.fa \
	-known-sites /DATA/bioinfo/chr12M.dbSNP.146.b37.vcf.gz \
	-known-sites /DATA/bioinfo/chr12M.Mills_and_1000G_gold_standard.indels.b37.vcf.gz \
	-known-sites /DATA/bioinfo/chr12M.1000G_phase1.indels.b37.vcf.gz \
	-O child.table
bash

Runtime ~ 3 minutes

Applying the recalibration

gatk ApplyBQSR \
    -I child.dedup.bam \
    -R ref.fa \
    -bqsr child.table \
    -O child.bqsr.bam     
bash

Runtime ~ 1 minute

2.7. Visualisation

2.7.1. By query

Show only the reads in the region [40340000-40350000] of chromosome 12

samtools view child.bqsr.bam 12:40340000-40350000
bash

2.7.2 With samtools tview

samtools tview allows you to visualise the aligned reads from a BAM file.

samtools tview child.bqsr.bam ref.fa
bash

Type g and go to the following position ➡ 12:2690702

Type n to color by nucleotide, and observe what is happening at the position 2690702

The column 2690702 is here

2690691   2690701   2690711
...........^...............
...........|...............
...........|...............
...........|...............
...........|...............
...........|...............
...........V...............

What would be the base call for this position ?

Type q to quit

2.7.3 With IGV

igv allows to a more comfortable visualisation.

igv child.bqsr.bam
bash

Enter the position ➡ 12:2690702 and look at the reads for this position to infer the base call.

Next Step: 3 Variant Calling