Functions

General Information

Optional Arguments for all Functions

  • --out ResultFile : File that will contain the function’s results.vcf(.gz)

  • --log LogFile : File that will contain the function’s log.log

  • --gz : Force all outputs to be bgzipped

ASCII output files are automatically bgzipped if the filename given by the user ends with .gz.

If the user provides the --gz argument, all output files will be bgzipped and .gz will be appended automatically to the filenames (if missing). Output streamed to STD_OUT will also be bgzipped.
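
The naming rule above can be illustrated with a short Python sketch. It is not the tool's own code; the function name resolve_output is hypothetical and only mirrors the behaviour described in this section.

```python
from typing import Tuple


def resolve_output(filename: str, force_gz: bool = False) -> Tuple[str, bool]:
    """Return (final filename, True if the output is bgzipped)."""
    if filename.endswith(".gz"):
        # A .gz suffix always triggers compression.
        return filename, True
    if force_gz:
        # --gz given: compress and append the missing suffix.
        return filename + ".gz", True
    return filename, False
```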

VCF Filter Functions

CompoundHeterozygous

Keeps only variants that respect the Compound Heterozygous pattern of inheritance.

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

  • --nohomo TRUE|FALSE : Reject if a case is homozygous to the alternate allele or if a control has none of the alleles ?

  • --ped samples.ped : File describing the VCF’s samples (See File Formats in the documentation)

  • --missing TRUE|FALSE : Missing genotypes allowed ?

Description

Keeps only variants that respect the Compound Heterozygous pattern of inheritance.
In the Compound Heterozygous pattern of inheritance, two variants V1 and V2 from the same gene are valid if
  • All cases have V1 and V2

  • No control has V1 and V2

Thus, a variant is rejected if
  • one case doesn’t have V1 and V2

  • one control has V1 and V2

With --missing true, missing genotypes are considered compatible with the transmission pattern.
The --nohomo option rejects alternate alleles if a case is homozygous to an alternate allele or if at least one control is not heterozygous to an alternate allele of V1/V2 (useful if all the controls are supposed to be parents of cases).
Results can be difficult to read, since several combinations of valid variants might exist. An extra INFO field, COMPOUND, is therefore added to detail the relations between variants.
This field reads as
COMPOUND=A1>P1(gA|gB|gC)&P2(gD|gE|gF),A2>P3(gG|gH|gI)&P4(gJ|gK|gL),...
Where:
  1. Ax is the number of the allele involved

  2. Px is the partner allele in the form chr:pos:ref:alt

  3. gX is the symbol of the gene common to this allele and its partner
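
To make the COMPOUND layout easier to consume downstream, here is a minimal Python sketch of a parser. It assumes that Ax is written as a plain integer, that partners never contain parentheses or commas, and that gene symbols never contain "|" or "&". The function name parse_compound is hypothetical.

```python
import re
from typing import Dict, List, Tuple

# One partner is expected to look like "chr:pos:ref:alt(gA|gB|gC)".
_PARTNER = re.compile(r"^(?P<allele>[^()]+)\((?P<genes>[^()]*)\)$")


def parse_compound(value: str) -> Dict[int, List[Tuple[str, List[str]]]]:
    """Map each allele number to its list of (partner, [gene symbols])."""
    result: Dict[int, List[Tuple[str, List[str]]]] = {}
    for entry in value.split(","):
        # "A>P1(...)&P2(...)": allele number first, then its partners.
        allele_number, partners = entry.split(">", 1)
        parsed = []
        for partner in partners.split("&"):
            match = _PARTNER.match(partner)
            if match is None:
                raise ValueError(f"Unexpected partner syntax: {partner!r}")
            parsed.append((match["allele"], match["genes"].split("|")))
        result[int(allele_number)] = parsed
    return result
```

For example, parse_compound("1>1:1000:A:G(GENE1|GENE2)") yields one entry for allele 1 with a single partner shared through GENE1 and GENE2.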

Warning

The input VCF File must have been previously annotated with vep.

Note

In case of multiallelic variants : If at least one alternate allele satisfies all the conditions, the whole variant line is kept.


DeNovo

Keeps only variants that are compatible with a De Novo pattern of inheritance.

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

  • --ped samples.ped : File describing the VCF’s samples (See File Formats in the documentation)

  • --missing TRUE|FALSE : Missing genotypes allowed ?

Description

Keeps only variants that are compatible with a De Novo pattern of inheritance, that is, variants that are
  • present in every Case

  • absent in every Control

Warning

Father/Mother/Child(ren) Trios are expected

Note

In case of multiallelic variants : If at least one alternate allele satisfies all the conditions, the whole variant line is kept.


DeNovoRecessive

Keeps only variants that strictly respect these genotypes : parent1 0/1 + parent2 0/0 + child 1/1

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

  • --ped samples.ped : File describing the VCF’s samples (See File Formats in the documentation)

Description

Keeps only variants that strictly respect these genotypes : parent1 0/1 + parent2 0/0 + child 1/1
parent1 and parent2 are interchangeable
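
The genotype test can be sketched in Python as below. This is an illustration only: the function name is hypothetical, and treating phased genotypes (0|1) like unphased ones is an assumption, not something stated in this documentation.

```python
def is_denovo_recessive(parent1: str, parent2: str, child: str) -> bool:
    """True when one parent is 0/1, the other is 0/0 and the child is 1/1."""

    def alleles(genotype: str) -> list:
        # Assumption: "|" and "/" are equivalent and allele order is irrelevant.
        return sorted(genotype.replace("|", "/").split("/"))

    # Sorting the two parents makes them interchangeable.
    parents = sorted([alleles(parent1), alleles(parent2)])
    return parents == [["0", "0"], ["0", "1"]] and alleles(child) == ["1", "1"]
```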

Warning

Will only run if the input file has a trio with 1 case and 2 controls

Note

In case of multiallelic variants : If at least one alternate allele satisfies all the conditions, the whole variant line is kept.


Dominant

Keeps only variants that respect the Dominant pattern of inheritance

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

  • --ped samples.ped : File describing the VCF’s samples (See File Formats in the documentation)

  • --missing TRUE|FALSE : Missing genotypes allowed ?

  • --nohomo TRUE|FALSE : Reject if a case is homozygous to alternate allele ?

  • --mode Mode : strict : true for all cases | loose : true for at least one case

Description

Keeps only variants that respect the Dominant pattern of inheritance
In the Dominant pattern of inheritance
  • Cases should have the causal variant

  • Controls cannot have the causal variant

Thus, a variant is rejected if
  • one case doesn’t possess the alternate allele (strict mode)

  • one control possesses the alternate allele

If --missing true, missing genotypes are considered compatible with the transmission pattern.
The --nohomo option rejects alternate alleles if at least one case is homozygous (for example, if you expect that the resulting phenotype would not be consistent).
In strict mode, all cases must have the alternate allele. In loose mode, only one case has to have the allele (more permissive for larger panels).
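
The decision rules for one alternate allele can be summarised with the following Python sketch. It is not the tool's implementation: genotypes are encoded here as the number of copies of the alternate allele (0, 1, 2) or None when missing, and counting an allowed missing genotype as a carrier is an assumption.

```python
from typing import List, Optional


def keep_dominant(
    cases: List[Optional[int]],
    controls: List[Optional[int]],
    missing: bool = False,
    nohomo: bool = False,
    mode: str = "strict",
) -> bool:
    """Return True if the allele is compatible with a dominant pattern."""
    if not missing and any(g is None for g in cases + controls):
        return False  # missing genotypes are not allowed
    if any(g is not None and g > 0 for g in controls):
        return False  # a control carries the alternate allele
    if nohomo and any(g == 2 for g in cases):
        return False  # a case is homozygous to the alternate allele
    # Assumption: an allowed missing genotype counts as compatible.
    carriers = [g is None or g > 0 for g in cases]
    return all(carriers) if mode == "strict" else any(carriers)
```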

Note

In case of multiallelic variants : If at least one alternate allele satisfies all the conditions, the whole variant line is kept.


FilterConsequenceLevel

Filters the variants according to their consequences

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

  • --csq vep.consequence : Least severe consequence [empty | intergenic_variant | feature_truncation | regulatory_region_variant | feature_elongation | regulatory_region_amplification | regulatory_region_ablation | TF_binding_site_variant | TFBS_amplification | TFBS_ablation | downstream_gene_variant | upstream_gene_variant | non_coding_transcript_variant | NMD_transcript_variant | intron_variant | non_coding_transcript_exon_variant | 3_prime_UTR_variant | 5_prime_UTR_variant | mature_miRNA_variant | coding_sequence_variant | synonymous_variant | stop_retained_variant | start_retained_variant | incomplete_terminal_codon_variant | splice_region_variant | protein_altering_variant | missense_variant | inframe_deletion | inframe_insertion | transcript_amplification | start_lost | stop_lost | frameshift_variant | stop_gained | splice_donor_variant | splice_acceptor_variant | transcript_ablation]

Description

Filters the variants according to their consequences
The consequence of the variant must be at least as severe as the one given
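
A minimal Python sketch of the severity comparison follows. It assumes that the order in which the consequences are listed for --csq runs from least to most severe; the names SEVERITY, RANK and passes_level are hypothetical.

```python
# Consequences in the order listed for --csq (assumed least to most severe).
SEVERITY = [
    "empty", "intergenic_variant", "feature_truncation",
    "regulatory_region_variant", "feature_elongation",
    "regulatory_region_amplification", "regulatory_region_ablation",
    "TF_binding_site_variant", "TFBS_amplification", "TFBS_ablation",
    "downstream_gene_variant", "upstream_gene_variant",
    "non_coding_transcript_variant", "NMD_transcript_variant",
    "intron_variant", "non_coding_transcript_exon_variant",
    "3_prime_UTR_variant", "5_prime_UTR_variant", "mature_miRNA_variant",
    "coding_sequence_variant", "synonymous_variant", "stop_retained_variant",
    "start_retained_variant", "incomplete_terminal_codon_variant",
    "splice_region_variant", "protein_altering_variant", "missense_variant",
    "inframe_deletion", "inframe_insertion", "transcript_amplification",
    "start_lost", "stop_lost", "frameshift_variant", "stop_gained",
    "splice_donor_variant", "splice_acceptor_variant", "transcript_ablation",
]
RANK = {name: index for index, name in enumerate(SEVERITY)}


def passes_level(consequences, least_severe):
    """True if any consequence is at least as severe as the threshold."""
    threshold = RANK[least_severe]
    return any(RANK[csq] >= threshold for csq in consequences)
```

With --csq missense_variant, a variant annotated only as synonymous_variant would be dropped, while one annotated as stop_gained would be kept.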

Warning

The input VCF File must have been previously annotated with vep.

Note

In case of multiallelic variants : If at least one alternate allele satisfies all the conditions, the whole variant line is kept.


FilterF2

Filters variants to keep only those contributing to F2 data.

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

  • --ped samples.ped : File describing the VCF’s samples (See File Formats in the documentation)

  • --prefix prefix : Output filename prefix

  • --outdir ResultsDirectory : The directory that will contain results files

Description

Filters variants to keep only those contributing to F2 data.
F2 data are described in PubMedId: 23128226, figure 3a
Six sets of results are given, one for:
  1. All variants

  2. All SNVs

  3. variants without rs

  4. SNVs without rs

  5. variants with rs

  6. SNVs with rs

Warning

The input VCF File must have been previously annotated with vep.

Note

In case of multiallelic variants : If at least one alternate allele satisfies all the conditions, the whole variant line is kept.


FilterFrequencies

Keeps only variants with frequencies below the threshold in all of the selected populations.

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

  • --threshold 0.0-1.0 : maximum frequency in any population

  • --pop pop1,pop2,...,popN : List of Populations to test (from AF, AFR_AF, AMR_AF, EAS_AF, EUR_AF, SAS_AF, AA_AF, EA_AF, gnomAD_AF, gnomAD_AFR_AF, gnomAD_AMR_AF, gnomAD_ASJ_AF, gnomAD_EAS_AF, gnomAD_FIN_AF, gnomAD_NFE_AF, gnomAD_OTH_AF, gnomAD_SAS_AF, MAX_AF)

Description

Keeps only variants with frequencies below the threshold in all of the selected populations.
If the variant’s frequency exceeds the threshold for any of the selected populations, the variant is filtered out.
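
The rule can be sketched in Python for a single alternate allele. This is an illustration, not the tool's code: the function name is hypothetical, and treating a population without annotation as not exceeding the threshold is an assumption.

```python
def keep_by_frequency(frequencies: dict, populations: list, threshold: float) -> bool:
    """frequencies maps a population key (e.g. "gnomAD_AF") to a frequency."""
    for population in populations:
        frequency = frequencies.get(population)
        # Assumption: a missing annotation never exceeds the threshold.
        if frequency is not None and frequency > threshold:
            return False
    return True
```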

Warning

The input VCF File must have been previously annotated with vep.

Note

In case of multiallelic variants : If at least one alternate allele satisfies all the conditions, the whole variant line is kept.


FilterGeneCsqLevel

Filters the variants according to their consequences on a list of genes.

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

  • --genes genes.txt : File listing genes to keep

  • --csq vep.consequence : Least severe consequence [empty | intergenic_variant | feature_truncation | regulatory_region_variant | feature_elongation | regulatory_region_amplification | regulatory_region_ablation | TF_binding_site_variant | TFBS_amplification | TFBS_ablation | downstream_gene_variant | upstream_gene_variant | non_coding_transcript_variant | NMD_transcript_variant | intron_variant | non_coding_transcript_exon_variant | 3_prime_UTR_variant | 5_prime_UTR_variant | mature_miRNA_variant | coding_sequence_variant | synonymous_variant | stop_retained_variant | start_retained_variant | incomplete_terminal_codon_variant | splice_region_variant | protein_altering_variant | missense_variant | inframe_deletion | inframe_insertion | transcript_amplification | start_lost | stop_lost | frameshift_variant | stop_gained | splice_donor_variant | splice_acceptor_variant | transcript_ablation]

Description

Filters the variants according to their consequences on a list of genes.
Only the variants impacting at least one of the genes in the list are kept.
The consequence of the variant on the gene must be at least as severe as the one given

Warning

The input VCF File must have been previously annotated with vep.

Note

In case of multiallelic variants : If at least one alternate allele satisfies all the conditions, the whole variant line is kept.


FilterGeneCsqList

Filters the variants to keep only those that affect one of the given genes with one of the given consequences.

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

  • --genes genes.txt : List of the genes to keep

  • --csq csq1,csq2,...,csqN : List (comma separated) of VEP consequences to keep

Description

Filters the variants to keep only those that affect one of the given genes with one of the given consequences.
If the variant has at least one of the effects from --csq on one of the genes in the file from --genes, then the variant is kept.
The list of effects can be empty : --csq null
VEP consequence must be selected from : [empty | intergenic_variant | feature_truncation | regulatory_region_variant | feature_elongation | regulatory_region_amplification | regulatory_region_ablation | TF_binding_site_variant | TFBS_amplification | TFBS_ablation | downstream_gene_variant | upstream_gene_variant | non_coding_transcript_variant | NMD_transcript_variant | intron_variant | non_coding_transcript_exon_variant | 3_prime_UTR_variant | 5_prime_UTR_variant | mature_miRNA_variant | coding_sequence_variant | synonymous_variant | stop_retained_variant | start_retained_variant | incomplete_terminal_codon_variant | splice_region_variant | protein_altering_variant | missense_variant | inframe_deletion | inframe_insertion | transcript_amplification | start_lost | stop_lost | frameshift_variant | stop_gained | splice_donor_variant | splice_acceptor_variant | transcript_ablation]

Warning

The input VCF File must have been previously annotated with vep.

Note

In case of multiallelic variants : If at least one alternate allele satisfies all the conditions, the whole variant line is kept.


FilterGenotype

Filters the variants to match the given genotype filter.

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

  • --filter "SAMPLE1:geno1:keep1,SAMPLE2:geno2:keep2,...,SAMPLEN:genoN:keepN" : List (comma separated) of samples, their associated genotypes and whether they are to be kept

Description

Filters the variants to match the given genotype filter.
If the genotype of at least one sample mismatches, the variant is excluded.
Filter format : SAMPLE1:geno1:keep1,SAMPLE2:geno2:keep2,...,SAMPLEN:genoN:keepN
Keep=true|false tells whether to keep (true) or exclude (false) variants with a matching genotype for this sample
Example : SA:0/0:false,SB:0/1:true,SC:0/1:true,SD:1/1:false will keep variants that are 0/1 for SB and SC, and that aren’t 0/0 for SA or 1/1 for SD
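
The filter semantics can be illustrated with the Python sketch below. It is not the tool's code: the function names are hypothetical, and genotypes are compared as exact strings (no normalisation of phasing or allele order), which is an assumption.

```python
def parse_filter(spec):
    """Turn "S1:0/1:true,S2:0/0:false" into (sample, genotype, keep) rules."""
    rules = []
    for item in spec.split(","):
        sample, genotype, keep = item.split(":")
        rules.append((sample, genotype, keep.lower() == "true"))
    return rules


def variant_passes(genotypes, rules):
    """genotypes maps each sample name to its genotype for one variant line."""
    for sample, genotype, keep in rules:
        matches = genotypes[sample] == genotype
        # keep=true requires a match, keep=false requires a mismatch.
        if matches != keep:
            return False
    return True
```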

FilterGnomADFrequency

Filters out variants with frequencies above threshold in GnomAD

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

  • --threshold 0.0-1.0 : Maximum GnomAD Frequency

Description

Filters out variants with frequencies above threshold in GnomAD
In case of multiallelic variant, if any alternate allele passes the filter, the variant is kept

Warning

The input VCF File must have been previously annotated with vep.

Note

In case of multiallelic variants : If at least one alternate allele satisfies all the conditions, the whole variant line is kept.


FilterKnownID

Keeps only variants with an empty 3rd field (ID)

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

Description

Keeps only variants with an empty 3rd field (ID)
The field must be empty or equal to “.”

FilterNew

Keeps only the variants not found in either dbSNP, 1KG or GnomAD

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

Description

Keeps only the variants not found in either dbSNP, 1KG or GnomAD

Warning

The input VCF File must have been previously annotated with vep.

Note

In case of multiallelic variants : If at least one alternate allele satisfies all the conditions, the whole variant line is kept.


FoundInAllCases

Keeps variants found in every “Case” sample

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

  • --ped samples.ped : File describing the VCF’s samples (See File Formats in the documentation)

Description

Keeps variants found in every “Case” sample
Case samples are defined by a “1” in the 6th field of the --ped file.
In case of a multiallelic variant, if any variant allele is found or missing, the whole variant is kept.

Note

In case of multiallelic variants : If at least one alternate allele satisfies all the conditions, the whole variant line is kept.


HQ

Extract HQ variants.

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

Description

Extract HQ variants, defined in 1.12 of the supplementary information of PubMedID=27535533 as
  1. VQSR PASS

  2. At least 80% of the genotypes have DP above 10 and GQ above 20

  3. At least one variant genotype has DP above 10 and GQ above 20
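
The three criteria can be sketched in Python as below. This is only an illustration: it assumes that "VQSR PASS" means a FILTER value of PASS, that "above" is a strict comparison, and that the 80% is computed over all samples. The function name and the genotype encoding are hypothetical.

```python
def is_hq(filter_field, genotypes):
    """genotypes: one (alt_allele_count or None, DP, GQ) tuple per sample."""

    def well_covered(dp, gq):
        return dp is not None and gq is not None and dp > 10 and gq > 20

    # Criterion 1: the variant passed VQSR.
    if filter_field != "PASS" or not genotypes:
        return False
    good = [well_covered(dp, gq) for _, dp, gq in genotypes]
    # Criterion 2: at least 80% of genotypes are well covered (integer math).
    enough_good = 10 * sum(good) >= 8 * len(genotypes)
    # Criterion 3: at least one well-covered genotype carries the variant.
    good_variant = any(
        ok and alt is not None and alt > 0
        for ok, (alt, _, _) in zip(good, genotypes)
    )
    return enough_good and good_variant
```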


KeepHomoAlt

Returns a VCF containing only the positions homozygous to alt for the given samples

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

  • --sample s1,s2,...,sN : List (comma separated) of samples to test

Description

Returns a VCF containing only the positions homozygous to alt for the given samples

Note

In case of multiallelic variants : The variant is kept if different samples are homozygous to different alternate alleles


MergeVQSR

Merges SNP and INDEL results files from VQSR

Mandatory Arguments

  • --snp snp.vcf : File containing SNP output from VQSR

  • --indel indel.vcf : File containing INDEL output from VQSR

Description

Merges SNP and INDEL results files from VQSR

MonoAllelicSNV

Keep only the lines containing monoallelic SNVs

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

Description

Keep only the lines containing monoallelic SNVs

NotFoundInAnyControl

Removes Variants that are found in controls.

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

  • --ped samples.ped : File describing the VCF’s samples (See File Formats in the documentation)

Description

Removes Variants that are found in controls.
Control samples are defined by a “2” in the 6th field of the --ped file.
In case of a multiallelic variant, if any variant allele isn’t found, the whole variant is kept.

Note

In case of multiallelic variants : If at least one alternate allele satisfies all the conditions, the whole variant line is kept.


QC

Run a Quality Control on VCF Variants

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

  • --ped samples.ped : File describing the VCF’s samples (See File Formats in the documentation)

  • --opt custom.parameters.tsv : file containing the various thresholds for the QC (see Documentation)

  • --report fiteredVariant.tsv : output file listing all the variants that were filtered, and why

Description

Run a Quality Control on VCF Variants
A report file gives the reason(s) each variant has been filtered
For more Details, see https://gitlab.com/gmarenne/ravaq
For each group G, the info field has new annotations
  • G_AN: AlleleNumber for this group

  • G_AC: AlleleCounts for this group

  • G_AF: AlleleFrequencies for this group

Warning

The VCF File must contain the following INFO : QD,FS,SOR,MQ,ReadPosRankSum,InbreedingCoeff,MQRankSum


RandomVariants

Keeps only a portion of the variants from a VCF file.

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

  • --ratio 0.0-1.0 : Probability of keeping each variant

  • --file positions.txt : File listing Positions to keep regardless of given probability in format chr:position

Description

Keeps only a portion of the variants from a VCF file.
Each line has a --ratio chance of being kept.
Positions listed in the file given by --file are always kept.
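
The sampling can be sketched in Python as follows. This is an illustration only: the function name is hypothetical, header lines (starting with #) are assumed to be always kept, and the always_keep set stands for the chr:position entries read from --file.

```python
import random


def sample_variants(lines, ratio, always_keep=frozenset(), rng=None):
    """Yield header lines, listed positions, and a random share of the rest."""
    rng = rng or random.Random()
    for line in lines:
        if line.startswith("#"):
            yield line  # assumption: header lines are always kept
            continue
        chrom, pos = line.split("\t", 2)[:2]
        # rng.random() is in [0, 1), so ratio 1.0 keeps every line.
        if f"{chrom}:{pos}" in always_keep or rng.random() < ratio:
            yield line
```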

Recessive

Keeps only variants that respect the Recessive pattern of inheritance.

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

  • --ped samples.ped : File describing the VCF’s samples (See File Formats in the documentation)

  • --missing TRUE|FALSE : Missing genotypes allowed ?

  • --nohomo TRUE|FALSE : Reject if a control is homozygous to reference allele ?

  • --mode Mode : strict : true for all cases | loose : true for at least one case

Description

Keeps only variants that respect the Recessive pattern of inheritance.
In the Recessive pattern of inheritance
  • Cases should be homozygous to the causal allele

  • Controls should not be homozygous to the causal allele

Thus, a variant is rejected if
  • one case isn’t homozygous to the alternate allele (strict mode)

  • one control is homozygous to the alternate allele

If --missing true, missing genotypes are considered compatible with the transmission pattern.
The --nohomo option rejects alternate alleles if at least one control is not heterozygous to the alternate allele (useful if all the controls are supposed to be parents of cases).
In strict mode, all cases must be homozygous to the alternate allele. In loose mode, only one case has to be homozygous to the allele (more permissive for larger panels).
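
As an illustration, the recessive rules for one alternate allele could look like this in Python. This is a sketch under assumptions, not the tool's code: genotypes are encoded as the number of copies of the alternate allele (0, 1, 2) or None when missing, and an allowed missing genotype is counted as compatible.

```python
from typing import List, Optional


def keep_recessive(
    cases: List[Optional[int]],
    controls: List[Optional[int]],
    missing: bool = False,
    nohomo: bool = False,
    mode: str = "strict",
) -> bool:
    """Return True if the allele is compatible with a recessive pattern."""
    if not missing and any(g is None for g in cases + controls):
        return False  # missing genotypes are not allowed
    if any(g == 2 for g in controls):
        return False  # a control is homozygous to the alternate allele
    if nohomo and any(g == 0 for g in controls):
        return False  # a control is homozygous to the reference allele
    # Assumption: an allowed missing genotype counts as compatible.
    homozygous = [g is None or g == 2 for g in cases]
    return all(homozygous) if mode == "strict" else any(homozygous)
```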

Note

In case of multiallelic variants : If at least one alternate allele satisfies all the conditions, the whole variant line is kept.


Recode

Reads all lines in a VCF file

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

Description

Reads all lines in a VCF file
Outputs the input VCF file after applying the various command line filters

RemoveNonSNV

Removes variant lines that contain no SNVs

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

Description

Removes variant lines that contain no SNVs

Note

In case of multiallelic variants : If at least one alternate allele satisfies all the conditions, the whole variant line is kept.


RemoveNonVariant

Remove variants where only 0/0 and ./. genotypes are present

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

Description

Remove variants where only 0/0 and ./. genotypes are present

SplitByChromosome

Splits a given vcf file by chromosome

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

  • --outdir ResultsDirectory : The directory that will contain results files

Description

Splits a given vcf file and produces one resulting vcf file by chromosome.

SplitByGene

Creates an output VCF file for each gene.

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

  • --outdir ResultsDirectory : The directory that will contain results files

Description

Creates an output VCF file for each gene.
Some variants can be in several output files, if they impact several genes.

Warning

The input VCF File must have been previously annotated with vep.

Note

In case of multiallelic variants : If at least one alternate allele satisfies all the conditions, the whole variant line is kept.


SplitFromDB

Generates two new VCF files with variants present/absent in 1kG/GnomAD.

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

  • --outdir ResultsDirectory : The directory that will contain results files

Description

Generates two new VCF files :
  • inDB.MYVCF.vcf (with variants present in 1kG/GnomAD)

  • notInDB.MYVCF.vcf (with variant absent from 1kG/GnomAD)

Warning

The input VCF File must have been previously annotated with vep.

Note

In case of multiallelic variants : If at least one alternate allele satisfies all the conditions, the whole variant line is kept in “inDB.MYVCF.vcf”.


StrictCompoundHeterozygous

Keeps only variants that strictly respect the Compound Heterozygous pattern of inheritance.

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

  • --nohomo TRUE|FALSE : Reject if a case is homozygous to the alternate allele or if a control has none of the alleles ?

  • --ped samples.ped : File describing the VCF’s samples (See File Formats in the documentation)

Description

Keeps only variants that strictly respect the Compound Heterozygous pattern of inheritance.
In the Compound Heterozygous pattern of inheritance, two variants V1 and V2 from the same gene are valid if
  • All cases have V1 and V2 from their parents

  • Each control (parent) has one of V1/V2, while the other parent has the other variant

A variant is kept if and only if :
  • All cases have V1 and V2

  • All cases have a parent (control) with V1 and not V2, and the other parent with V2 and not V1

The --nohomo option rejects alternate alleles if a sample is homozygous to them.
Results can be difficult to read, since several combinations of valid variants might exist. An extra INFO field, COMPOUND, is therefore added to detail the relations between variants.
This field reads as
COMPOUND=A1>P1(gA|gB|gC)&P2(gD|gE|gF),A2>P3(gG|gH|gI)&P4(gJ|gK|gL),...
Where:
  1. Ax is the number of the allele involved

  2. Px is the partner allele in the form chr:pos:ref:alt

  3. gX is the symbol of the gene common to this allele and its partner

Warning

The input VCF File must have been previously annotated with vep.

Warning

This function expects a complete definition of the sample, where all cases are affected children and both their parents are identified controls.

Note

In case of multiallelic variants : If at least one alternate allele satisfies all the conditions, the whole variant line is kept.

VCF Transformation Functions

ClearInfoField

Replaces the Info column by “.”

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

Description

Removes all annotation from the VCF file by replacing the content of the Info column by “.”

FilterCsqExtractGene

Filters Variants according to consequences. Replaces ID by gene_chr_pos.

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

  • --csq vep.consequence : Least severe consequence [empty | intergenic_variant | feature_truncation | regulatory_region_variant | feature_elongation | regulatory_region_amplification | regulatory_region_ablation | TF_binding_site_variant | TFBS_amplification | TFBS_ablation | downstream_gene_variant | upstream_gene_variant | non_coding_transcript_variant | NMD_transcript_variant | intron_variant | non_coding_transcript_exon_variant | 3_prime_UTR_variant | 5_prime_UTR_variant | mature_miRNA_variant | coding_sequence_variant | synonymous_variant | stop_retained_variant | start_retained_variant | incomplete_terminal_codon_variant | splice_region_variant | protein_altering_variant | missense_variant | inframe_deletion | inframe_insertion | transcript_amplification | start_lost | stop_lost | frameshift_variant | stop_gained | splice_donor_variant | splice_acceptor_variant | transcript_ablation]

Description

Filters Variants according to consequence.
Replaces ID by gene_chr_pos.

Warning

The input VCF File must have been previously annotated with vep.

Note

In case of multiallelic variants : Only Ref and one alternate allele are Kept. The Kept alternate is (in this order) : 1. the most severe; 2. the most frequent in the file; 3. the first one.


GenerateHomoRefForPosition

Generates a VCF with Homozygous-to-Reference Genotypes for every given position and each sample (Alternate is a transition A↔G, C↔T)

Mandatory Arguments

  • --ref Reference.fasta : Fasta File containing the reference genome

  • --pos positions.tsv : List of positions in the results VCF file

  • --sample samples.txt : List of samples in the results VCF File

Description

Given a Reference Genome, a list of positions and a list of individuals, generates a VCF file with Homozygous-to-Reference Genotypes for every given position and each sample.
The reference must be in .fasta format, with its associated .fai index in the same directory
The position file must contain one position per line, in the format : chr pos
The sample file must contain one sample per line
Each given position is looked up
The Ref of the VCF is taken from the given reference genome
Alt = Transition(Ref) : A↔G / C↔T
Format for each position is GT:DP:GQ:AD:PL
Each genotype is 0/0:30:99:30,0:0,50,500
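
How one output line is assembled can be sketched in Python. This is an illustration only: the reference base is passed in directly instead of being read from the fasta, and the "." placeholders used for ID, QUAL, FILTER and INFO are assumptions. The function name is hypothetical.

```python
# Alt = Transition(Ref) : A<->G / C<->T
TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}
HOMO_REF = "0/0:30:99:30,0:0,50,500"


def homo_ref_line(chrom, pos, ref_base, n_samples):
    """Build one VCF data line with a 0/0 genotype for every sample."""
    ref = ref_base.upper()
    alt = TRANSITION[ref]
    # Assumption: ID, QUAL, FILTER and INFO are left as ".".
    fixed = [chrom, str(pos), ".", ref, alt, ".", ".", ".", "GT:DP:GQ:AD:PL"]
    return "\t".join(fixed + [HOMO_REF] * n_samples)
```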

MissingToMajor

Replaces every missing genotype by the most frequent allele present

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

Description

Replaces every missing genotype by the most frequent allele present
Updates AC,AF,AN annotations
The genotype is homozygous to the most frequent allele A in the form A/A:0:0:0,0,0...
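
The replacement can be sketched in Python for the GT part only (the trailing :0:0:0,0,0... fields and the AC/AF/AN update are left out). Breaking ties in favour of the lowest allele index and falling back to the reference when every genotype is missing are assumptions; the function names are hypothetical.

```python
from collections import Counter


def major_allele(genotypes):
    """Return the most frequent allele index (as a string) on one line."""
    counts = Counter(
        allele
        for gt in genotypes
        for allele in gt.replace("|", "/").split("/")
        if allele != "."
    )
    if not counts:
        return "0"  # assumption: fall back to the reference allele
    # Assumption: ties are broken in favour of the lowest allele index.
    return min(counts, key=lambda a: (-counts[a], int(a)))


def fill_missing(genotypes):
    """Replace fully missing genotypes by A/A, A being the major allele."""
    major = major_allele(genotypes)
    return [f"{major}/{major}" if gt in ("./.", ".|.", ".") else gt for gt in genotypes]
```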

Note

In case of multiallelic variants : The major allele is the most frequent allele among ref and all alternates.


MissingToRef

Replaces every missing genotype by 0/0:0:0:0....

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

Description

Replaces every missing genotype by 0/0:0:0:0....
Updates AC/AN/AF annotations

Scramble

Outputs the same VCF but randomly reassigns the genotypes among the samples

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

Description

This function can be used to anonymize a VCF file. The AC/AN/AF of each variant will stay consistent, but the haplotypes will be broken.
For each line, the genotypes are randomly reassigned among the samples.
The random reassignment is different for each line.
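
The per-line reassignment can be sketched in Python. This is an illustration, not the tool's code: it assumes the standard VCF layout with nine fixed columns before the sample columns, and the function name is hypothetical.

```python
import random


def scramble_line(line, rng):
    """Shuffle the sample columns of one VCF data line."""
    fields = line.rstrip("\n").split("\t")
    # The first nine columns (CHROM..FORMAT) are left untouched.
    fixed, genotypes = fields[:9], fields[9:]
    rng.shuffle(genotypes)  # a fresh permutation is drawn for every line
    return "\t".join(fixed + genotypes)
```

Because the genotypes are only permuted, allele counts on the line are unchanged, which is why AC/AN/AF stay consistent.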

SetGenotypeFromProbability

Assigns a genotype to each sample, for each position, from the GenotypeProbability annotation. If a genotype is already present, it can be kept or replaced

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

  • --overwrite TRUE|FALSE : overwrite existing genotypes ?

Description

The highest probability (given by the annotation GP=p1,p2,p3) determines the genotype that will be assigned
  • highest=p1 → 0/0

  • highest=p2 → 0/1

  • highest=p3 → 1/1

If a genotype is already present, it can be kept or replaced
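
The mapping can be sketched in Python as below. This is an illustration only: the function name is hypothetical, and both the tie-breaking (the first highest probability wins) and the treatment of ./. as "no genotype present" are assumptions.

```python
GENOTYPES = ("0/0", "0/1", "1/1")


def genotype_from_gp(gp, current=None, overwrite=False):
    """Pick the genotype with the highest probability in "p1,p2,p3"."""
    # Keep an existing genotype unless overwriting was requested.
    if current not in (None, ".", "./.") and not overwrite:
        return current
    probabilities = [float(p) for p in gp.split(",")]
    if len(probabilities) != 3:
        raise ValueError("expected three probabilities (monoallelic variant)")
    # Assumption: on ties, the first highest probability wins.
    return GENOTYPES[probabilities.index(max(probabilities))]
```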

Warning

The input VCF file must contain Genotype Probability (GP=p1,p2,p3) for each genotype

Note

In case of multiallelic variants : An error will be thrown, as this function expects only monoallelic variants. The affected variant line will be dropped.


SplitMultiAllelic

Splits multiallelic variants into several lines of monoallelic variants

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

Description

Splits multiallelic variants into several lines of monoallelic variants

VCFToReference

Outputs the given VCF File and reverts genotypes when ref/alt alleles are inverted according to given reference (as a fasta file)

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

  • --ref Reference.fasta : Fasta File containing the reference genome

Description

This function can be used when ref/alt alleles might be inverted (for example when the vcf file has been converted from a plink file)
At each position, the given reference genome is checked to see which allele matches the reference
If none of the alleles matches the reference, the line is dropped and a warning is displayed
AC/AN/AF are updated
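
The inversion logic for a monoallelic line can be sketched in Python. This is an illustration under assumptions: genotypes are plain GT strings, the reference base is passed in directly instead of being read from the fasta, and the AC/AN/AF update is left out. The function name is hypothetical.

```python
def to_reference(ref, alt, genotypes, fasta_base):
    """Return (ref, alt, genotypes) aligned on the reference, or None to drop."""
    if fasta_base == ref:
        return ref, alt, genotypes  # already consistent
    if fasta_base != alt:
        return None  # neither allele matches: the line is dropped
    # Ref and alt are inverted: swap them and flip 0 <-> 1 in genotypes.
    swap = str.maketrans("01", "10")
    return alt, ref, [gt.translate(swap) for gt in genotypes]
```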

Warning

Annotations (INFO/AD/PL/…) are not updated

Note

In case of multiallelic variants : An error will be thrown, as this function expects only monoallelic variants. The affected variant line will be dropped.

VCF Annotation Functions

AddAlleleBalance

Adds the annotations : AB, ABhet, ABhom, OND to a VCF file

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

Description

Adds the following annotations :
  • AB : Allele balance for each het genotype (alleleDepth(gt1) / (alleleDepth(gt1) + alleleDepth(gt2)))

  • ABhet : Allele Balance for heterozygous calls (ref/(ref+alt)), for each variant

  • ABhom : Allele Balance for homozygous calls (A/(A+O)) where A is the allele (ref or alt) and O is anything other, for each variant

  • OND : Overall non-diploid ratio (alleles/(alleles+non-alleles)), for each variant

The algorithms are taken from GATK, with the following change : results are also available for INDELs and multiallelic variants (use with caution).
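
The per-genotype AB value can be sketched in Python. This is an illustration, not the tool's code: allele depths are assumed to come from the AD field, indexed by allele number, and returning None for homozygous, missing or zero-depth genotypes is an assumption.

```python
def allele_balance(genotype, allele_depths):
    """AB = AD(gt1) / (AD(gt1) + AD(gt2)) for a heterozygous genotype."""
    if "." in genotype:
        return None  # missing genotype
    first, second = (int(a) for a in genotype.replace("|", "/").split("/"))
    if first == second:
        return None  # AB is only defined for heterozygous calls
    total = allele_depths[first] + allele_depths[second]
    return allele_depths[first] / total if total else None
```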

AddDbSNP

Adds/updates dbSNP information to the VCF from a dbSNP release file

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

  • --ref dbsnp.vcf : dbSNP reference VCF File (can be bgzipped)

Description

Adds dbSNP RS in ID field and INFO field
Adds RS= and dbSNPBuildID= in INFO field from the input file --ref.

Note

In case of multiallelic variants : Annotation is added/updated for each alternate allele (comma-separated).


AddGroupACANAF

Adds AN,AC,AF annotations for each group described in the ped file

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

  • --ped samples.ped : File describing the VCF’s samples (See File Formats in the documentation)

Description

For each group G, the info field has new annotations
  • G_AN AlleleNumber for this group

  • G_AC AlleleCounts for this group

  • G_AF AlleleFrequencies for this group
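
The per-group counts can be sketched in Python as follows. This is an illustration only: the function name is hypothetical, sample-to-group assignments are passed as a dictionary instead of being read from the ped file, and a group whose genotypes are all missing is simply omitted (an assumption).

```python
from collections import defaultdict


def group_counts(genotypes, groups, n_alt):
    """genotypes: sample -> GT string; groups: sample -> group name."""
    an = defaultdict(int)
    ac = defaultdict(lambda: [0] * n_alt)
    for sample, gt in genotypes.items():
        group = groups[sample]
        for allele in gt.replace("|", "/").split("/"):
            if allele == ".":
                continue  # missing alleles are not counted
            an[group] += 1
            if allele != "0":
                ac[group][int(allele) - 1] += 1
    info = {}
    for group in an:
        info[f"{group}_AN"] = an[group]
        info[f"{group}_AC"] = ac[group]
        info[f"{group}_AF"] = [count / an[group] for count in ac[group]]
    return info
```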

Note

In case of multiallelic variants : Annotation is added/updated for each alternate allele (comma-separated).


AddWorstAndCanonicalConsequence

For each variant, add the most severe consequence from vep and add the consequence from vep for the annotation marked as Canonical.

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

Description

For each variant, add the most severe consequence from vep and add the consequence from vep for the annotation marked as Canonical.
The worst consequence is annotated with the keyword WORSTCSQ
The gene for the worst consequence is annotated with the keyword WORSTGENE
The canonical consequence is annotated with the keyword CANONICALCSQ
The gene for the canonical consequence is annotated with the keyword CANONICALGENE
If more than one annotation is marked as canonical, the most severe of them is kept
If no annotation is marked as canonical, the most severe consequence is kept
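The selection rules above can be sketched in Python as follows. This is only an illustration: the `SEVERITY` list is a shortened subset of the VEP consequences, and the tuple layout `(consequence, gene, is_canonical)` is a hypothetical stand-in for parsed vep annotations.

```python
# Shortened, illustrative subset of VEP consequences,
# ordered from least to most severe.
SEVERITY = ["intergenic_variant", "intron_variant", "synonymous_variant",
            "missense_variant", "frameshift_variant", "stop_gained"]
RANK = {csq: i for i, csq in enumerate(SEVERITY)}

def worst(annotations):
    """Return the annotation with the most severe consequence."""
    return max(annotations, key=lambda a: RANK[a[0]])

def worst_and_canonical(annotations):
    """annotations: list of (consequence, gene, is_canonical) tuples.

    Returns (worst annotation, canonical annotation).
    """
    canonical = [a for a in annotations if a[2]]
    # Several canonical annotations: keep the most severe of them.
    # No canonical annotation: fall back to the most severe overall.
    return worst(annotations), worst(canonical or annotations)
```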

Warning

The input VCF File must have been previously annotated with vep.

Note

In case of multiallelic variants : Annotation is added/updated for each alternate allele (comma-separated).


ReaffectdbSNP

Puts all observed RS numbers in the ID column

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

Description

Takes the rs numbers from the Existing_variation annotation (from vep) and adds them to the ID column of the VCF. Puts “.” if no RS has been found

Warning

The input VCF File must have been previously annotated with vep.

Note

In case of multiallelic variants : All RS IDs from all alternate alleles are listed in the ID column.


UpdateACANAF

Resets the AC AN and AF values for the given VCF file

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

Description

Adds/Updates the AC AN and AF values for the VCF.

Note

In case of multiallelic variants : Annotation is added/updated for each alternate allele (comma-separated).

Analysis Functions

CheckReference

For every position in the vcf file, compares the reference from the VCF to the one in the fasta

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

  • --ref Reference.fasta : Fasta File containing the reference genome

Description

For every position in the vcf file, gets the reference from the given fasta and prints :

CHROM | POS | VCF_REF | FASTA_REF

Warning

Lines containing indels are ignored


Chi2

Performs a chi² association test on an input VCF file

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

  • --ped samples.ped : File describing the VCF’s samples (See File Formats in the documentation)

Description

Does a simple association test on the data present in the input vcf file.
Computes the number of case samples with and without variants, and the number of control samples with and without variants.
Then performs a chi² test on those values.
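As an illustration, the chi² statistic for such a 2x2 table (cases/controls, with/without variants) can be computed as below. This is a sketch under assumptions: the function name is hypothetical, and no continuity correction is applied, which may differ from what the tool does.

```python
def chi2_2x2(case_with, case_without, control_with, control_without):
    """Pearson chi-square statistic of a 2x2 contingency table (1 degree of freedom)."""
    a, b, c, d = case_with, case_without, control_with, control_without
    n = a + b + c + d
    # Product of the four marginal totals.
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # degenerate table: an empty row or column
    return n * (a * d - b * c) ** 2 / denom
```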

Note

In case of multiallelic variants : Each alternate allele is processed independently.


CommonVariants

Displays the list of variants that are common to two VCF files

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

  • --file smallest.file.vcf : the smallest of the two input VCF files (can be gzipped)

Description

Displays the list of variants that are common to two VCF files
Output is given as a list of canonical variants.

Warning

For faster execution, use --vcf with the largest file and --file with the smallest one

Note

In case of multiallelic variants : Each alternate allele is processed independently.


CompareGenotype

Compares the genotypes of the samples in the first and second VCF file.

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

  • --ped samples.ped : File describing the VCF’s samples (See File Formats in the documentation)

  • --vcf2 File2.vcf(.gz) : the second input VCF file (can be bgzipped)

Description

Compares the genotypes of the samples in the first and second VCF file.
Both VCF files are supposed to contain the same samples. This function compares the genotypes of each sample for each variant across the files.
This can be useful, for example, to compare two calling algorithms.
Output format is :

Sample | Group | Total | Concord | Discord | LeftMissing | RightMissing | %Concord

Note

In case of multiallelic variants : Alternate alleles are expected to be the same and in the same order in both files
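The per-sample counters can be sketched in Python as follows. The names are hypothetical and two assumptions are made: phase is ignored (0|1 equals 1/0), and %Concord is computed over the variants genotyped in both files.

```python
def normalise(gt):
    """Return a sorted tuple of alleles, or None if the genotype is missing."""
    alleles = gt.replace("|", "/").split("/")
    return None if "." in alleles else tuple(sorted(alleles))

def compare_sample(left_gts, right_gts):
    """left_gts / right_gts: genotype strings of one sample, variant by variant."""
    stats = {"Total": 0, "Concord": 0, "Discord": 0,
             "LeftMissing": 0, "RightMissing": 0}
    for left, right in zip(left_gts, right_gts):
        stats["Total"] += 1
        l, r = normalise(left), normalise(right)
        if l is None:
            stats["LeftMissing"] += 1
        if r is None:
            stats["RightMissing"] += 1
        if l is not None and r is not None:
            stats["Concord" if l == r else "Discord"] += 1
    compared = stats["Concord"] + stats["Discord"]
    stats["%Concord"] = 100.0 * stats["Concord"] / compared if compared else 0.0
    return stats
```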


CompareToGnomAD

Compares the variants present in a VCF file to those present in a GnomAD VCF file

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

  • --file GnomAD.site.vcf.gz : GnomAD VCF File (can be gzipped)

Description

Compares the variants present in a VCF file to those present in a GnomAD VCF file
Output format will be:

#CHR | POS | ID | REF | ALT | QUAL | CSQ | GENE | AC | AF | AN | GnomAD_AC | GnomAD_AF | GnomAD_AN

Warning

The input VCF File must have been previously annotated with vep.

Note

In case of multiallelic variants : Each alternate allele is processed independently.


CountFromPublicDB

Returns the number of Variants, SNVs, INDEL, in dbSNP, 1kG, GnomAD.

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

Description

Returns the number of Variants, SNVs, INDEL, in dbSNP, 1kG, GnomAD.
Output format is (For all/SNVs/Indels):

Total | dbSNP | 1kG | GnomAD | Not dbSNP | Not 1kG | Not GnomAD

Warning

The input VCF File must have been previously annotated with vep.

Note

In case of multiallelic variants : Each alternate allele is processed independently.


CountGenotypes

Counts the genotypes 0/1 and 1/1 for each variant

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

  • --ped samples.ped : File describing the VCF’s samples (See File Formats in the documentation)

Description

Counts the genotypes 0/1 and 1/1 for each variant
The output format is:

CHROM | POS | REF | ALT | CONSEQUENCE | TOTAL_HETEROZYGOUS | TOTAL_HOMOZYGOUS_ALT

Followed by the number of heterozygous and homozygous for each group defined in the ped file.

Warning

The input VCF File must have been previously annotated with vep.

Note

In case of multiallelic variants : Each alternate allele is processed independently.


CountMissing

For each sample in the PED file, prints a summary of missingness

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

  • --ped samples.ped : File describing the VCF’s samples (See File Formats in the documentation)

  • --threshold 0.0-1.0 : Maximum ratio of Missing Individuals per position

Description

For each sample in the PED file, prints a summary, in the format

#SAMPLE | TOTAL | GENOTYPED | NB_MISSING | %_MISSING | REF | ALT | Total_Variants | Kept_Variants

where
  • SAMPLE the sample name

  • TOTAL total variants kept

  • GENOTYPED variants with non missing genotypes for this sample

  • NB_MISSING variants with missing genotypes for this sample

  • %_MISSING percent of genotypes missing for this sample

  • REF number of variants homozygous to the ref for this sample

  • ALT number of variants not homozygous to the ref for this sample

The header of the output also contains the total number of variants present in the input file and the number of variants that are kept

Warning

Kept variants are those with less than --threshold genotypes missing


CountVariants

Counts the number of variants for each sample and prints a summary for each group

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

  • --ped samples.ped : File describing the VCF’s samples (See File Formats in the documentation)

Description

Counts the number of variants for each sample and prints a summary for each group
Results Format :

FamilyID | ID | MotherID | FatherID | Sex | Phenotype | Group | NbVariants

Note

In case of multiallelic variants : Each alternate allele is processed independently.


DbSNPMismatch

Check if there is a discrepancy between the ID Column and the VEP annotation for RS ID.

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

Description

Check if there is a discrepancy between the ID Column and the VEP annotation for RS ID.
Output for the lines with discrepancies have the following format :

CHR | POS | ID | REF | ALT | VEP_Annotation

Warning

The input VCF File must have been previously annotated with vep.

Note

In case of multiallelic variants : RS IDs of every alternate allele are put in the ID field.


ExtractAlleleCounts

For every variant, exports the variant allele count for each sample

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

Description

For every variant, exports the variant allele count for each sample
Output has the following format

#CHROM | POS | ID | REF | ALT

Followed by the allele count for each sample
Allele counts can be 0, 1 or 2 for diploids
Missing genotypes have “.” as an allele count
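A minimal Python sketch of this per-sample value is given below. The function name is hypothetical, and it assumes that a partially missing genotype (such as ./1) is reported as missing.

```python
def allele_count(gt, alt_index=1):
    """Number of copies of the alternate allele `alt_index` in a GT string.

    Returns "." for a missing genotype.
    """
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return "."  # assumption: any missing allele makes the genotype missing
    return sum(1 for a in alleles if int(a) == alt_index)
```

For a multiallelic site, calling the function once per alternate allele (`alt_index` 1, 2, ...) matches the independent processing described in the note.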

Note

In case of multiallelic variants : Each alternate allele is processed independently.


ExtractNeighbours

Creates a bed file of the positions where at least one sample has 2 SNVs that could be in the same triplet (regardless of the reading frame)

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

Description

Scans the whole VCF file. For each pair of successive variants V1 and V2,
if at least one sample carries both V1 and V2, a bed region is printed. The region is defined as :

chr | V1_pos | V2_pos

V1 and V2 must be on the same chromosome and V2_pos-V1_pos = 1 or V2_pos-V1_pos = 2
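The scan can be sketched in Python as follows. This is a simplified illustration with hypothetical names: it works on already-parsed, position-sorted records and omits the check that the alleles are SNVs.

```python
def carries(gt):
    """True if the genotype contains at least one non-reference allele."""
    return any(a not in ("0", ".") for a in gt.replace("|", "/").split("/"))

def neighbour_regions(variants):
    """variants: position-sorted list of (chrom, pos, [GT string per sample]).

    Returns the (chr, V1_pos, V2_pos) regions for successive variants that are
    1 or 2 bases apart and carried together by at least one sample.
    """
    regions = []
    for (c1, p1, g1), (c2, p2, g2) in zip(variants, variants[1:]):
        if c1 != c2 or p2 - p1 not in (1, 2):
            continue
        if any(carries(a) and carries(b) for a, b in zip(g1, g2)):
            regions.append((c1, p1, p2))
    return regions
```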

Note

In case of multiallelic variants : chr V1_pos V2_pos is printed if at least one alternate allele of V1_pos and of V2_pos is a SNP, and if one sample has a variant at both positions (not necessarily the SNP one).


ExtractPrivateToGroup

Extracts All Variants that are private to a Group.

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

  • --ped samples.ped : File describing the VCF’s samples (See File Formats in the documentation)

Description

Extracts All Variants that are private to a Group.
Only the variants found in a single group of samples (as defined in the ped file) are extracted
The list of the N samples in the group that have the variant is given
Output Format :

#CHR | POS | GROUP | SAMPLES

Note

In case of multiallelic variants : Each alternate allele is processed independently.


F2

Computes F2 data.

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

  • --ped samples.ped : File describing the VCF’s samples (See File Formats in the documentation)

  • --prefix prefix : prefix of the output files

  • --outdir ResultsDirectory : The directory that will contain results files

Description

Computes F2 data.
F2 data are described in PubMedId: 23128226, figure 3a
Six sets of results are given, one for:
  1. All variants

  2. All SNVs

  3. variants without rs (new)

  4. SNVs without rs (new)

  5. variants with rs (known)

  6. SNVs with rs (known)

Warning

The difference between known and new is determined by looking at the vep annotation, not the ID column.

Warning

The input VCF File must have been previously annotated with vep.

Note

In case of multiallelic variants : Each alternate allele is processed independently.


F2Individuals

Computes F2 data by samples and not by groups (Each sample is its own group).

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

  • --ped samples.ped : File describing the VCF’s samples (See File Formats in the documentation)

  • --prefix prefix : prefix of the output files

  • --outdir ResultsDirectory : The directory that will contain results files

Description

Computes F2 data by samples and not by groups (Each sample is its own group).
F2 data are described in PubMedId: 23128226, figure 3a
Six sets of results are given, one for:
  1. All variants

  2. All SNVs

  3. variants without rs (new)

  4. SNVs without rs (new)

  5. variants with rs (known)

  6. SNVs with rs (known)

Warning

The difference between known and new is determined by looking at the vep annotation, not the ID column.

Warning

The input VCF File must have been previously annotated with vep.

Note

In case of multiallelic variants : Each alternate allele is processed independently.


FrequencyCorrelation

Prints the frequency correlation of variants between local samples and GnomAD

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

  • --outdir ResultsDirectory : The directory that will contain results files

Description

Prints the frequency correlation of variants between local samples and GnomAD
For each variant, prints :

CHR | POS | REF | ALT | Local | GnomAD

Outputs one line per VEP Consequence

Warning

The input VCF File must have been previously annotated with vep.

Note

In case of multiallelic variants : Each alternate allele is processed independently.


FrequencyForPrivate

Prints the Allele frequency in the file and each group, for variants not found in dbSNP, 1kG or GnomAD.

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

  • --ped samples.ped : File describing the VCF’s samples (See File Formats in the documentation)

Description

For each variant in the file, if the variant is not found in dbSNP, 1KG or GnomAD :
Prints the frequency in the file, and in each group, as well as its consequence

Warning

The input VCF File must have been previously annotated with vep.

Note

In case of multiallelic variants : Each alternate allele is processed independently.


GeneList

Prints the list of all genes covered by the VCF file

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

Description

Prints the list of all genes covered by the VCF file
The genes are extracted from the SYMBOL annotation from VEP.

Warning

The input VCF File must have been previously annotated with vep.

Note

In case of multiallelic variants : Each alternate allele is processed independently.


GetWorstConsequence

Print the worst consequence/gene for each variant allele.

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

Description

Print the worst consequence/gene for each variant allele.
For each allele of each variant, the output is in the format:

#CHR | POS | ID | REF | ALT | WORST_CSQ | AFFECTED_GENE

Warning

The input VCF File must have been previously annotated with vep.

Note

In case of multiallelic variants : Each alternate allele is processed independently.


InbreedingCoeffDistribution

Outputs a sorted list of all Inbreeding Coeff from a VCF File.

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

Description

Outputs a sorted list of all Inbreeding Coeff from a VCF File.
The output file has no header; the values are sorted in ascending order

Warning

Input file must contain the Inbreeding Coeff. annotation


IQSBySample

Computes the IQS score for each sample between sequencing data and data imputed from genotyping.

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

  • --ped samples.ped : File describing the VCF’s samples (See File Formats in the documentation)

  • --cpu Integer : number of cores

  • --file imputed.vcf(.gz) : VCF File Containing imputed data (can be gzipped)

Description

Computes the IQS score between sequencing data and data imputed from genotyping.
Here the IQS score is computed for each sample.
Output format is :

#SAMPLE | GROUP | IQS | NB_VARIANTS | TOTAL_VARIANTS

Note

In case of multiallelic variants : Each alternate allele is processed independently.


IQSByVariant

Computes the IQS score for each variant between sequencing data and data imputed from genotyping.

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

  • --cpu Integer : number of cores

  • --file imputed.vcf(.gz) : VCF File Containing imputed data (can be gzipped)

Description

Computes the IQS score between sequencing data and data imputed from genotyping.
Here the IQS score is computed for each variant.
Output format is :

chr | pos | rs | ref | alt | gene | consequence | Freq_VCF | Freq_GnomAD_NFE | Freq_MaxPop | Max_Pop | IQS | Info

Warning

Extra information is available if the input file was annotated with VEP

Note

In case of multiallelic variants : Each alternate allele is processed independently.


JFSSummary

Outputs the Joint Site Frequency Spectrum Summary statistics

Mandatory Arguments

  • --file GROUP1.GROUP2.XXX.YYY.ZZZ.tsv : input tsv file

Description

Outputs the Joint Site Frequency Spectrum Summary statistics
The input file contains the JFS data comparing two groups of samples.
These data, generated by the function JointFrequencySpectrum, are in the following format :
a (2n+1)x(2n+1) matrix, where n is the number of samples in each population. The number at matrix[A][B] is the number of variants for which the first group has A variant alleles and the second group has B variant alleles
The output information is
  • N = Number of haplotypes in each population (2×n, where n is the number of samples per population)

  • V = Total number of variants

  • threshold : pooled sample allele frequency (i + j)/2N <= 0.05

  • FST = overall measure of genetic diversity

  • AS = allele sharing statistic (probability that two individuals carrying an allele count of n come from different populations, normalized by the expected probability in panmictic population)

  • WS = weighted symmetry (measures how evenly rare alleles are distributed between the two populations)


JointFrequencySpectrum

Creates a JointFrequencySpectrum result file for each group defined in the ped file.

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

  • --ped samples.ped : File describing the VCF’s samples (See File Formats in the documentation)

  • --outdir ResultsDirectory : The directory that will contain results files

Description

Creates a JointFrequencySpectrum result file for each group defined in the ped file.
Samples from the same group MUST be split into 2 subgroups, so that they can be compared
Example : GroupA1 GroupA2 GroupB1 GroupB2 GroupC1 GroupC2
Each group MUST have the same number of samples.
The format of the output is :
G*G output files (where G is the number of groups). Each file is named VCFINPUTFILE.group1.group2.tsv
Each of these files contains a (2n+1)x(2n+1) matrix, where n is the number of samples in each population. The number at matrix[A][B], is the number of variants for which the first group has A variant alleles and the second group has B variant alleles
Output file must then be processed with the function JFSSummary
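The matrix described above can be illustrated with a short Python sketch. It is not the tool's code: the names are hypothetical, sites are assumed biallelic, and missing alleles are simply not counted as variant alleles.

```python
def alt_alleles(gts):
    """Number of alternate alleles carried by a group at one site."""
    return sum(a == "1" for gt in gts for a in gt.replace("|", "/").split("/"))

def joint_frequency_spectrum(variants, n):
    """variants: list of (gts_group1, gts_group2), each holding n GT strings.

    Returns a (2n+1)x(2n+1) matrix where matrix[A][B] is the number of variants
    with A variant alleles in group 1 and B variant alleles in group 2.
    """
    size = 2 * n + 1
    matrix = [[0] * size for _ in range(size)]
    for gts1, gts2 in variants:
        matrix[alt_alleles(gts1)][alt_alleles(gts2)] += 1
    return matrix
```

With n diploid samples per group, each group carries between 0 and 2n variant alleles, which is why the matrix has 2n+1 rows and columns.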

Note

In case of multiallelic variants : Each alternate allele is processed independently.


Kappa

Kappa comparison between two VCF files.

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

  • --vcf2 File2.vcf : the second input VCF File (can be gzipped)

  • --tsv output.tsv : the result TSV File

  • --outdir ResultsDirectory : The directory that will contain results files

Description

Output format is :

CHROM | POS | ID | MAF_FILE1 | MAF_FILE2 | KAPPA_With_Missing | KAPPA_Ignore_Missing

Note

In case of multiallelic variants : Results are given for the first alternate allele, which is expected to be the same in both files.


MaleFemale

Show Male/Female Allele Frequencies

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

  • --ped samples.ped : File describing the VCF’s samples (See File Formats in the documentation)

Description

Show Male/Female Allele Frequencies
Output format:

CHROM | POS | ID | REF | ALT | FILTER | GENE/CSQ | AF | MALE_AF | FEMALE_AF

Warning

The input VCF File must have been previously annotated with vep.

Note

In case of multiallelic variants : Each alternate allele is processed independently.


MeanQuality

Prints information and quality statistics for each variant.

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

Description

For each variant in the given VCF file, prints :

#CHROM | POS | IN_dbSBP | IN_GnomAD | meanDP_with_missing | meanGQ_with_missing | meanDP_without_missing | meanGQ_without_missing

Warning

The input VCF File must have been previously annotated with vep.


MultiAllelicProportion

Slides a 1kb window over the genome and outputs a list of regions ordered by the proportion of multi-allelic variations (desc.)

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

Description

Slides a 1kb window over the genome and outputs a list of regions ordered by the proportion of multi-allelic variations (desc.)
Output format is :

Chr | pos_n | pos_n+Window_size | nb_multialleleic variants


NumberOfCsqPerGene

Given a VCF file and a list of genes, prints the number of variants per gene for each consequence

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

  • --genes genes.txt : File listing genes

Description

Given a VCF file and a list of genes, prints the number of variants per gene for each consequence
Multiallelic sites are considered for each alternate allele

Warning

The input VCF File must have been previously annotated with vep.

Note

In case of multiallelic variants : Each alternate allele is processed independently.


NumberOfLinesFromTabix

Gets the number of lines indexed by a tabix file

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

Description

Gets the number of lines indexed by a tabix file

Warning

The bgzipped VCF file FILENAME.vcf.gz must have an associated tabix file FILENAME.vcf.gz.tbi


PrivateAndShared

For the given VCF, gives the number of variants private to each group and shared among all groups.

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

  • --ped samples.ped : File describing the VCF’s samples (See File Formats in the documentation)

Description

For the given VCF, gives the number of variants private to each group and shared among all groups.
Output :
  • The total number of variants in the file

  • The number of variants present in ALL the groups defined in the Ped file

  • The number of variants private to each group

Warning

The input VCF File must have been previously annotated with vep.

Note

In case of multiallelic variants : Each alternate allele is processed independently.


PrivateVSPanel

Check how many of the variants from the input file are filtered as Already_existing when adding samples from the reference file.

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

  • --ref reference.vcf : the panel VCF File (can be gzipped)

Description

Check how many of the variants from the input file are filtered as Already_existing when adding samples from the reference file.
Takes all the variants in the given VCF file (--vcf)
Compares them to each sample from the reference file (--ref)
Gives a count of remaining (new) variants (by consequence) each time we add a sample.

Warning

The input VCF File must have been previously annotated with vep.

Note

In case of multiallelic variants : Each alternate allele is processed independently.


QCParametersDistribution

Reports the distributions of each parameter used by QC

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

  • --ped samples.ped : File describing the VCF’s samples (See File Formats in the documentation)

Description

Reports the distributions of each parameter used by QC
One parameter per line, sorted values

Warning

The VCF File must contain the following INFO : QD,FS,SOR,MQ,ReadPosRankSum,InbreedingCoeff,MQRankSum


SampleStats

Prints stats about each sample (Mean Depths, TS/TV, Het/HomAlt).

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

  • --ped samples.ped : File describing the VCF’s samples (See File Formats in the documentation)

Description

Prints stats about each sample (Mean Depths, TS/TV, Het/HomAlt).
Output format is :

Sample | Group | Sites | Genotyped | Missing | %Missing | MeanDepths | Variants | Singletons | TS | TV | TS/TV | Het | HetRatio | HomAlt

Note

In case of multiallelic variants : Each alternate allele is processed independently.


SharedAlleleMatrix

Returns a series of matrices [individuals/individuals] with the number of shared alleles.

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

  • --outdir ResultsDirectory : The directory that will contain results files

Description

Returns a series of matrices [individuals/individuals] with the number of shared alleles.
Matrices are newSNP, SNP.f<0.005, SNP.f<0.01, SNP.f<0.05

Warning

The input VCF File must have been previously annotated with vep.

Note

In case of multiallelic variants : Each alternate allele is processed independently.


VQSLod

Print VQSLod statistics for each tranche.

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

Description

Print VQSLod statistics for each tranche.
Output format is :

Tranche | Mean | Min | D1 | D2 | D3 | D4 | Median | D6 | D7 | D8 | D9 | Max

Warning

File must contain VQSLOD annotations

Formatting Functions

ShowFields

Shows selected fields of a VCF File

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

  • --query "field1,field2,...,info:key1;key2;...,geno:key1;key2;..." : Output columns

Description

Shows selected fields of a VCF File
Query Syntax is Field_1,Field_2,...,Field_n
where Field_x, is one of CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO,FORMAT
or info:key1;key2;...;keyN ex: info:AbHet;AC;AN;AF
or geno:key1;key2;...;keyN ex : geno:GT;AD;GQ
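To make the query syntax concrete, here is a small Python sketch that splits a query string into output columns. The function name and the returned tuple layout are hypothetical; this only illustrates the syntax, not how the tool parses it.

```python
def parse_query(query):
    """Split a --query string into (kind, name) columns.

    kind is "field" for a plain VCF column, "info" for an INFO key
    and "geno" for a per-sample FORMAT key.
    """
    columns = []
    for field in query.split(","):
        if field.lower().startswith(("info:", "geno:")):
            kind, keys = field.split(":", 1)
            # Keys inside info:/geno: are separated by semicolons.
            columns.extend((kind.lower(), key) for key in keys.split(";"))
        else:
            columns.append(("field", field))
    return columns
```

For example, `parse_query("CHROM,POS,info:AC;AF,geno:GT")` yields two plain columns, two INFO columns and one genotype column, in that order.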

TSV2HTML

Converts a TSV file to an HTML file

Mandatory Arguments

  • --file table.tsv : the input TSV File

  • --link PositiveInteger : put link in header, starting at column INDEX (counting from 0)

  • --title MyTitle : title of the result HTML page

Description

Converts a TSV file to an HTML file

VCF2HTML

Generates an HTML legible file for the given VCF file

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

Description

Creates an HTML file that contains the variants of the VCF file.
For each variant, all the VCF fields are displayed.
All vep annotations are formatted and shown.

Warning

The input VCF File must have been previously annotated with vep.


VCF2TSV

Creates a TSV file, readable in Excel.

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

Description

Creates a TSV file that can be opened in Excel.
For each variant, all the VCF fields are displayed.
All vep annotations are formatted and shown.

VCF2TSVGeneCsq

Creates a TSV file, readable in Excel, keeps only annotations for given genes and consequences

Mandatory Arguments

  • --vcf input.vcf(.gz) : VCF file to use as an input. Can be bgzipped

  • --genes genes.txt : Filename of gene list

  • --csq vep.consequence : Least severe consequence [empty | intergenic_variant | feature_truncation | regulatory_region_variant | feature_elongation | regulatory_region_amplification | regulatory_region_ablation | TF_binding_site_variant | TFBS_amplification | TFBS_ablation | downstream_gene_variant | upstream_gene_variant | non_coding_transcript_variant | NMD_transcript_variant | intron_variant | non_coding_transcript_exon_variant | 3_prime_UTR_variant | 5_prime_UTR_variant | mature_miRNA_variant | coding_sequence_variant | synonymous_variant | stop_retained_variant | start_retained_variant | incomplete_terminal_codon_variant | splice_region_variant | protein_altering_variant | missense_variant | inframe_deletion | inframe_insertion | transcript_amplification | start_lost | stop_lost | frameshift_variant | stop_gained | splice_donor_variant | splice_acceptor_variant | transcript_ablation]

Description

Creates a TSV file that can be opened in Excel.
For each variant, all the VCF fields are displayed.
All vep annotations are formatted and shown.
Only the variants impacting a gene within the given list are displayed.
Only the variants with consequence at least as severe as the one given are displayed.

Warning

The input VCF File must have been previously annotated with vep.

Other Functions

CoverageStats

Gets the coverage statistics for an input file

Mandatory Arguments

  • --tsv cov.tsv.gz : File containing depth-of-coverage

  • --chrom chr1 : Chromosome name

Description

Gets the coverage statistics for an input file
The input file has one line per chromosome position [1-chrSize] and one column per sample. Each cell contains the depth of coverage for the given sample at the given position.
The output format is :

chr | pos | mean | median | 1 | 5 | 10 | 15 | 20 | 25 | 30 | 40 | 50 | 100


ExtendBed

Adds padding to the left and right of each region in the bed, and merges overlapping regions

Mandatory Arguments

  • --bed regions.bed : the Bed file to pad

  • --pad PositiveInteger : number of bases to add left and right of each region

Description

Adds padding to the left and right of each region in the bed, and merges overlapping regions
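The pad-then-merge operation can be sketched in Python as below. The function name is hypothetical, and the sketch makes assumptions the documentation does not state: starts are clamped at 0, ends are not clamped to the chromosome length, and regions that touch after padding are merged.

```python
def extend_and_merge(regions, pad):
    """regions: list of (chrom, start, end). Returns padded, merged regions."""
    # Pad each region, clamping the start at 0, then sort by chromosome and start.
    padded = sorted((c, max(0, s - pad), e + pad) for c, s, e in regions)
    merged = []
    for chrom, start, end in padded:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps (or touches) the previous region: extend it.
            merged[-1] = (chrom, merged[-1][1], max(merged[-1][2], end))
        else:
            merged.append((chrom, start, end))
    return merged
```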

GeneCards

Generates a script to retrieve GeneCards HTML pages for each gene in the given list.

Mandatory Arguments

  • --file genes.txt : file listing genes

Description

Generates a script to retrieve GeneCards HTML pages for each gene in the given list.

GeneCardsParser

Exports summary data from a GeneCards HTML file as an unformatted table

Mandatory Arguments

  • --file input.html : input GeneCards HTML file

Description

Exports summary data from a GeneCards HTML file as an unformatted table
All HTML markup is removed.
The data are formatted as such :

#Gene | GeneCards | Entrez Gene | UniProtKB/Swiss-Prot


GzPaste

Unix paste command for gzipped files

Mandatory Arguments

  • --files file1.gz,file2.gz,...,fileN.gz : list (comma separated) of gzipped files to paste

Description

Equivalent to the unix paste command without any special option.
Each input file can be either gzipped or not (a mix is possible)
Use --gz to gzip the output
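For readers who want the same behaviour in a script, here is a minimal Python sketch of a paste over a mix of plain and gzipped files. The helper names are hypothetical; gzip input is detected from its two magic bytes rather than from the filename, which is an assumption of this sketch.

```python
import gzip
from itertools import zip_longest

def open_text(path):
    """Open a file as text, transparently handling gzip (detected by magic bytes)."""
    with open(path, "rb") as handle:
        magic = handle.read(2)
    return gzip.open(path, "rt") if magic == b"\x1f\x8b" else open(path, "rt")

def gz_paste(paths, delimiter="\t"):
    """Yield the lines of all files joined side by side, like Unix paste."""
    handles = [open_text(p) for p in paths]
    try:
        # Shorter files contribute empty fields once exhausted, as paste does.
        for row in zip_longest(*handles, fillvalue=""):
            yield delimiter.join(cell.rstrip("\n") for cell in row)
    finally:
        for handle in handles:
            handle.close()
```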

IsInBed

Check if a given chromosome:position is contained in a bedfile

Mandatory Arguments

  • --chrom chromosome : chromosome name : (chr)[1-25]/X/Y/M/MT

  • --pos PositiveInteger : Position

  • --bed region.bed : the Bed File to process

Description

Check if a given chromosome:position is contained in a bedfile
If it is : gives the region’s limits
Otherwise : gives the regions before and after the position
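The lookup can be sketched in Python as follows. The function name and return values are hypothetical, and the interval is treated as closed on both ends, which may not match the coordinate convention the tool applies to bed files.

```python
def locate(regions, chrom, pos):
    """regions: list of (chrom, start, end).

    Returns ("in", region) if pos falls inside a region, otherwise
    ("out", region_before, region_after), where either may be None.
    """
    before = after = None
    for region in sorted(r for r in regions if r[0] == chrom):
        _, start, end = region
        if start <= pos <= end:
            return ("in", region)
        if end < pos:
            before = region      # last region seen that ends before pos
        elif after is None:
            after = region       # first region that starts after pos
    return ("out", before, after)
```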

NormalizePed

Extract x subgroups of y samples for each group present in the Ped file

Mandatory Arguments

  • --ped samples.ped : The input PED file to process

  • --number PositiveInteger : Number Of subgroups for each group

  • --size PositiveInteger : Group Size

Description

Extract x subgroups of y samples for each group present in the Ped file
If the input ped file has three groups A,B,C of 50 individuals each, using the command with --number 3 --size 10 will create 9 groups :
A A2 A3 B B2 B3 C C2 C3, with 10 individuals in each, randomly picked from groups A, B and C.
This function is useful to divide groups, for instance to have 1 learning set and several computing sets.
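The subgroup extraction can be illustrated with a short Python sketch. The names are hypothetical, the input is a plain dictionary rather than a parsed ped file, and the sketch assumes that the subgroups of a group do not share samples, which the description does not state.

```python
import random

def normalize_ped(groups, number, size, seed=None):
    """groups: {group name: [sample ids]}. Returns {subgroup name: [sample ids]}.

    Subgroups are named G, G2, G3, ... as in the example above.
    """
    rng = random.Random(seed)
    result = {}
    for name, samples in groups.items():
        # Draw without replacement; raises ValueError if the group is too small.
        picked = rng.sample(samples, number * size)
        for i in range(number):
            label = name if i == 0 else f"{name}{i + 1}"
            result[label] = picked[i * size:(i + 1) * size]
    return result
```

Passing a fixed `seed` makes the random draw reproducible, which can help when the same learning and computing sets must be regenerated.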

RandomPed

Keeps N random samples from a Ped File

Mandatory Arguments

  • --ped samples.ped : The input PED file to process

  • --threshold PositiveInteger : Number Of Samples

Description

Keeps N random samples from a Ped File

SimplifyBED

Returns a simplified bed (with the smallest number of regions covering all the positions in the input bed file).

Mandatory Arguments

  • --bed region.bed : the Bed File to process

Description

Returns a simplified bed (with the smallest number of regions covering all the positions in the input bed file).
This is useful when the input bed file contains several overlapping regions.

Graphics

GraphCompareFrequencies

Compares the frequencies of common variants in 2 populations (output of FrequencyCorrelation / CompareToGnomAD)

Mandatory Arguments

  • --width PositiveInteger : Graph’s Width in Pixels

  • --height PositiveInteger : Graph’s Height in Pixels

  • --tsv input.tsv : input data

  • --name dataset : Graph Title

  • --outdir ResultsDirectory : The directory that will contain results files

  • --x PositiveInteger : 0-based index of the column containing X values

  • --y PositiveInteger : 0-based index of the column containing Y values

Description

Compares the frequencies of common variants in 2 populations (output of FrequencyCorrelation / CompareToGnomAD)
4 graphs will be created: a JFS and a graph, each in linear and log scale
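
To illustrate what 0-based column indices mean for --x and --y, here is a small sketch that pulls two columns out of tab-separated text. The function name read_xy, the sample data, and the convention of skipping lines starting with # are assumptions for the example, not a description of the actual input format.

```python
import csv
import io

def read_xy(tsv_text, x, y):
    """Return (X, Y) pairs taken from the 0-based columns x and y."""
    rows = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return [
        (float(row[x]), float(row[y]))
        for row in rows
        if row and not row[0].startswith("#")  # skip comment lines (assumption)
    ]

data = "#id\tset1\tset2\nvar1\t0.10\t0.12\nvar2\t0.50\t0.45\n"
# Column 0 is the identifier, so the frequencies are columns 1 and 2.
print(read_xy(data, x=1, y=2))
```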

Example

GraphCompareFrequencies example

GraphCountGenotypes

Create a graph for the results of CountGenotypes

Mandatory Arguments

  • --width PositiveInteger : Graph’s Width in Pixels

  • --height PositiveInteger : Graph’s Height in Pixels

  • --tsv input.tsv : input data

  • --csq vep.consequence : Least severe consequence [empty | intergenic_variant | feature_truncation | regulatory_region_variant | feature_elongation | regulatory_region_amplification | regulatory_region_ablation | TF_binding_site_variant | TFBS_amplification | TFBS_ablation | downstream_gene_variant | upstream_gene_variant | non_coding_transcript_variant | NMD_transcript_variant | intron_variant | non_coding_transcript_exon_variant | 3_prime_UTR_variant | 5_prime_UTR_variant | mature_miRNA_variant | coding_sequence_variant | synonymous_variant | stop_retained_variant | start_retained_variant | incomplete_terminal_codon_variant | splice_region_variant | protein_altering_variant | missense_variant | inframe_deletion | inframe_insertion | transcript_amplification | start_lost | stop_lost | frameshift_variant | stop_gained | splice_donor_variant | splice_acceptor_variant | transcript_ablation]

  • --outdir ResultsDirectory : The directory that will contain results files

Description

Create a graph for the results of CountGenotypes

Example

GraphCountGenotypes example

GraphF2

Create a graph for the results of F2 or F2Individuals

Mandatory Arguments

  • --width PositiveInteger : Graph’s Width in Pixels

  • --height PositiveInteger : Graph’s Height in Pixels

  • --tsv input.tsv : input data

  • --name title : Title (will be printed on the graph)

  • --outdir ResultsDirectory : The directory that will contain results files

Description

Create a graph for the results of F2 or F2Individuals

Example

GraphF2 example

GraphJFS

Create a graph for the results of JointFrequencySpectrum

Mandatory Arguments

  • --width PositiveInteger : Graph’s Width in Pixels

  • --height PositiveInteger : Graph’s Height in Pixels

  • --tsv input.tsv : input data

  • --name title : Title (will be printed on the graph)

  • --x Set1 : Name of the first Set

  • --y Set2 : Name of the second Set

  • --max Scale Max : Top number of variants on the legend. Enter “null” to use the maximal value from the data

  • --outdir ResultsDirectory : The directory that will contain results files

Description

Create a graph for the results of JointFrequencySpectrum

Warning

Expects an NxN matrix, where matrix[a][b] is the number of variants seen a times in the first set and b times in the second set.
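
As an illustration of the expected matrix layout, the sketch below builds such a matrix from per-variant counts. The function name joint_frequency_spectrum and the input list are hypothetical. The real matrix is produced by JointFrequencySpectrum, whose file layout is not described here.

```python
def joint_frequency_spectrum(counts, n):
    """Build an n x n matrix from (a, b) pairs.

    Each pair gives how many times one variant was seen in the first
    set (a) and in the second set (b); both must be below n.
    """
    matrix = [[0] * n for _ in range(n)]
    for a, b in counts:
        matrix[a][b] += 1
    return matrix

# Four variants: two of them are seen once in each set.
counts = [(0, 1), (1, 1), (1, 1), (2, 0)]
for row in joint_frequency_spectrum(counts, n=3):
    print(row)
```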

Example

GraphJFS example

GraphSampleStats

Create a graph for the results of SampleStats

Mandatory Arguments

  • --width PositiveInteger : Graph’s Width in Pixels

  • --height PositiveInteger : Graph’s Height in Pixels

  • --tsv input.tsv : input data

  • --name title : Title (will be printed on the graph)

  • --outdir ResultsDirectory : The directory that will contain results files

Description

Create a graph for the results of SampleStats
There will be a graph for each of the following values
  • Number of Variants

  • Mean Depth

  • TS/TV

  • Het/HomAlt

  • Missing

Example

GraphSampleStats example