tags:
- ngs
- bam
type:
- course
subject:
- bioinformatics
class: m2ggb
Year: 2023
yaml
2 Alignments
contact: ludwig@univ-brest.fr
Reads coming from the sequencer are just sequences, but have no context. The Alignment takes each Read and maps it against a Reference Genome to get its Genomic Position.
The Alignment tries to find the best position on the Genome. It's computationally expensive, but straightforward for clean reads.
A representation of the human genome. Comparison of sequenced data (after alignment) against the reference genome allows to highlight variations.
In the DATA
directory, there are only the sequences for 2 chromosomes :
ls -l DATA/*fasta
bash
cat DATA/reference.chr12.fasta | grep -v '>' | tr -d '\n' | wc -c
cat DATA/reference.chrMT.fasta | grep -v '>' | tr -d '\n' | wc -c
bash
grep -v '>'
ignores the lines that start with >
(the sequence identifiers)
tr -d '\n'
removes the end-of-line characters (so that the whole sequence is on a unique line)
wc -c
counts the number of characters
less DATA/reference.chr12.fasta
bash
cat DATA/reference.chr12.fasta | grep -v '>' | tr -d "\n" | grep -o . | cat -n | grep -v 'N' | head
bash
grep -v '>'
ignores the lines that contain >
(the sequence identifiers)
tr -d '\n'
removes the end-of-line characters (so that the whole sequence is on a unique line)
grep -o .
prints each character on a line
cat -n
numbers the lines
grep -v 'N'
ignores the lines that contain N
Year | Release Name | UCSC Version | Note |
---|---|---|---|
2003 | NCBI Build 34 | hg16 | |
2004 | NCBI Build 35 | hg17 | |
2006 | NCBI Build 36.1 | hg18 | |
2009 | GRCh37 | hg19 | Style widely used |
2013 | GRCh38 | hg38 | Current Version |
2022 | T2T-CHM13 | - | From Telomere-to-Telomere |
Postponed | GRCh39 | - | Postponed due to T2T and Human Pangenome Reference Consortium |
Blue : Gaps
Pink : Partially Assembled Regions
"No single genome can represent the diversity in the human population. Using a single reference genome creates reference biases, adversely affecting variant discovery, gene–disease association studies and the accuracy of genetic analyses."
The Human Pangemone Reference Consortium was started to help migrate common genomic analysis to use a pangenome so diverse populations are better represented.
For Short Reads to do WES/WGS
Aglorithm | Sensitivity <100bp | Sensitivity >100bp | Specificity <100bp | Specificity >100bp | Speed <100bp | Speed >100bp | Tandem Low | Tandem High |
---|---|---|---|---|---|---|---|---|
BWA | ⭐ | ⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐ | ⭐ |
Bowtie2 | ⭐ | ⭐⭐⭐ | ⭐ | ⭐ | ⭐⭐ | ⭐⭐ | ⭐⭐ | ⭐ |
NovoAlign | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐ | ⭐ | ⭐ | ⭐⭐ | ⭐ |
Smalt | ⭐ | ⭐⭐⭐ | ⭐ | ⭐ | ⭐⭐ | ⭐⭐ | ⭐⭐ | ⭐ |
Stampy | ⭐⭐ | ⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐ | ⭐ | ⭐ | ⭐⭐ | ⭐ |
Also:
For RNAseq:
The proposed aligner for the WES/WGS workflow is BWA-MEM
cat DATA/reference.chr12.fasta DATA/reference.chrMT.fasta > ref.fa
bash
bwa index ref.fa
samtools faidx ref.fa
gatk CreateSequenceDictionary -R ref.fa
bash
Runtime : ~ 3 minutes
bwa mem -R '@RG\tID:child\tLB:unknown\tPL:ILLUMINA\tSM:child\tPU:unknown\tCN:default' ref.fa DATA/child.R1.fastq.gz DATA/child.R2.fastq.gz > child.unsorted.sam
bash
bwa mem
➡ the name of the program-R '@RG\tID:child-aa\tSM:child'
➡Create a Read-Group to identify reads ref.fa
➡ Reference Genome, fasta fileDATA/child.R1.aa.fq.gz
➡ Forward Reads, fastq fileDATA/child.R2.aa.fq.gz
➡ Reverse Reads, fastq filechild_unsorted.aa.sam
➡ output, sam fileRuntime 6 minutes
Alignment files are SAM files (Simple Alignment Map).
It is a text file (human readable) composed of a Header (lines starting with "@" and alignments).
wc -l child.unsorted.sam
bash
cat child.unsorted.sam | grep '^@'
bash
The regex expression ^@
mean starting with @
(header lines start with "@")
child.R1.fastq.gz
et child.R2.fastq.gz
# | Column | Note |
---|---|---|
1 | READ NAME | From the FastQ file |
2 | FLAG | Properties of the Read (as a binary code) |
3 | CHROMOSOME | Chrom. of the alignment site |
4 | POS | Position of the alignment site |
5 | MAPPING QUALITY | |
6 | CIGAR | Compact Representation of the alignment |
7 | MATE CHROMOSOME | Chrom. of the mate's alignment site |
8 | MATE POS | Position of the mate's alignment site |
9 | DISTANCE READ-MATE | Distance between this read and its mate (negative if mate if before) |
10 | DNA SEQ | Sequence from the FastQ file |
11 | QUALITIES | Qualities from the FastQ file |
12+ | PGM | Program Information and Annotations |
cat child.unsorted.sam | grep -v '^@' | head -1 | tr "\t" "\n" | cat -n
bash
grep -v '^@'
ignores the lines that start with @
head -1
keeps only the 1st line
tr "\t" "\n"
transpose the columns in lines
cat -n
numbers the lines
@RG
in the header, check that this read group is references in one of the Annotation column.head child.unsorted.sam
The second column contains "SAM FLAG": meta information about the mapping of the read.
cat child.unsorted.sam | grep -v '^@' | cut -f2 | sort | uniq -c | sort -n
bash
grep -v '^@'
ignores the lines that start with @
cut -f2
keeps only the 2nd column
sort
regroups the values
uniq -c
counts each values
sort -n
sorts the values by count
Hexadecimal | Decimal | Name | Note |
---|---|---|---|
0x1 | 1 | read paired | |
0x2 | 2 | read mapped in proper pair | |
0x4 | 4 | read unmapped | |
0x8 | 8 | mate unmapped | |
0x10 | 16 | read reverse strand | |
0x20 | 32 | mate reverse strand | |
0x40 | 64 | first in pair | |
0x80 | 128 | second in pair | |
0x100 | 256 | not primary alignment | it is typically used to flag alternative mappings when multiple mappings are presented in a SAM |
0x200 | 512 | read fails platform/vendor quality checks | |
0x400 | 1024 | read is PCR or optical duplicate | |
0x800 | 2048 | supplementary alignment | indicates that the corresponding alignment line is part of a chimeric alignment |
Look at the following reads, check that the sequence and quality are the same in the SAM file and fastq file.
# | ID | Flag |
---|---|---|
#1066 | HWI-1KL149:98:C78LJACXX:1:1101:3399:41377 |
163 |
#1111 | HWI-1KL149:98:C78LJACXX:1:1101:3497:89865 |
83 |
1066th
read of the SAM from the fatsqcat child.unsorted.sam | grep -v "^@" | head -1066 | tail -1 | cut -f1,2,10,11 | tr "\t" "\n"
bash
# Check the identifier
# (here it's HWI-1KL149:98:C78LJACXX:1:1101:3399:41377)
zcat DATA/child.R1.fastq.gz DATA/child.R2.fastq.gz | grep -A3 "HWI-1KL149:98:C78LJACXX:1:1101:3399:41377"
# or more directly
zcat DATA/child.R1.fastq.gz DATA/child.R2.fastq.gz | grep -A3 `cat child.unsorted.sam | grep -v "^@" | head -1066 | tail -1 | cut -f1`
For the file child.unsorted.sam
Adapt the first command from the previous section. Get only the 6
Use the following commands
grep -v "^@"
head -XXXX
tail -1
cut -f6
Letter | Description |
---|---|
M | Alignment Match (can be a sequence match or mismatch) |
I | Insertion to the reference |
D | Deletion from the reference |
S | Soft clipping (clipped sequences present in SEQ) |
H | Hard clipping (clipped sequences NOT present in SEQ) |
* | Not aligned |
SAM
files are very BIG, so there are 2 binary formats that intend to reduce this size:
fasta
file is needed to read back the CRAM
fileTo speed up the algorithms, we'll use samtools
to convert the text/SAM file child.unsorted.sam
to a binary/BAM file child.unsorted.bam
.
Compressed file are smaller, but that a longer time to process as then need to be unpacked on-the-fly. The fact that the compressed file are associated to an index, the process is quickened.
child.unsorted.bam
samtools view -bS -o child.unsorted.bam child.unsorted.sam
bash
-S
: l'entrée est du SAM-b
: la sortie est du binaire (BAM)-o
: nom du fichier de sortiesamtools view
is also used to explore a BAM/SAM file
samtools view
bash
samtools view child.unsorted.bam | head
bash
By default, only the alignments are shown. Options allow to show the header
samtools view -H child.unsorted.bam
bash
samtools view -h child.unsorted.bam | head
bash
The reads coming from the child.R1.fastq.gz
file have the flag "first in pair".
Use :
-f flag
option of samtools view
.At this moment, reads in the BAM file are in the same order as the unsorted fastq reads produces by the Sequencer ➡that have no meaningful order.
Algorithms tends to work on particular regions or positions at any given time. So they would work faster if reads that are aligned to adjacent genomic regions were located nearby in the BAM file.
We'll use sambamba
to sort the reads.
sambamba sort -o child.sorted.bam child.unsorted.bam
bash
-o FILE
the name of the output fileThis produces a new BAM file child.sorted.bam
and it index file child.sorted.bam.bai
samtools view -H child.sorted.bam
bash
find the SO keyword.
Check the coordinates (columns 3 and 4) of the first reads.
samtools view child.sorted.bam | cut -f 3,4 | head
bash
During sample preparation e.g. library construction using PCR, duplicates can arise. They are reads originating from a single fragment of DNA.
Having the same fragment present tens, hundreds or thousands of time would give to must weight to this fragment. Thus reads originating from this fragment should be ignored.
In this step we'll Mark those reads ad duplicates so that then can be ignored when necessary.
sambamba markdup child.sorted.bam child.dedup.bam
bash
A value of 1024
is added to the flag of the marked alignments, which signifies "read is PCR or optical duplicate".
We can define a Read Groups (with a name and a series of properties) and mark given read as belonging to this read group.
In our simple case where a single library preparation derived from a single biological sample was run on a single lane of a flow cell, all the reads from that lane run belong to the same Read Group.
When multiplexing is involved, then each subset of reads originating from a separate library run on that lane will constitute a separate Read Group.
Read groups have various properties:
Adjusts systematic technical errors regarding base call quality, using machine learning to model errors empirically.
To build the model, we need to provide
-I
)-R
)-known-sites
gatk BaseRecalibrator \
-I child.dedup.bam \
-R ref.fa \
-known-sites /DATA/bioinfo/chr12M.dbSNP.146.b37.vcf.gz \
-known-sites /DATA/bioinfo/chr12M.Mills_and_1000G_gold_standard.indels.b37.vcf.gz \
-known-sites /DATA/bioinfo/chr12M.1000G_phase1.indels.b37.vcf.gz \
-O child.table
bash
Runtime ~ 3 minutes
gatk ApplyBQSR \
-I child.dedup.bam \
-R ref.fa \
-bqsr child.table \
-O child.bqsr.bam
bash
Runtime ~ 1 minute
Show only the reads in the region [40340000-40350000] of chromosome 12
samtools view child.bqsr.bam 12:40340000-40350000
bash
samtools tview
allows you to visualise the aligned reads from a BAM file.
samtools tview child.bqsr.bam ref.fa
bash
Type g
and go to the following position ➡ 12:2690702
Type n
to color by nucleotide, and observe what is happening at the position 2690702
The column 2690702 is here
2690691 2690701 2690711
...........^...............
...........|...............
...........|...............
...........|...............
...........|...............
...........|...............
...........V...............
What would be the base call for this position ?
Type q
to quit
igv
allows to a more comfortable visualisation.
igv child.bqsr.bam
bash
Enter the position ➡ 12:2690702 and look at the reads for this position to infer the base call.
Next Step: 3 Variant Calling