Functions¶
General Information¶
Optional Arguments for all Functions¶
--out ResultFile
: File that will contain the function’s results.vcf(.gz)
--log LogFile
: File that will contain the function’s log.log
--gz
: Force all outputs to be bgzipped
Produced ASCII files are automatically bgzipped if the filename given by the user ends with .gz.
If the user provides the --gz argument, all output files will be bgzipped and .gz will be appended automatically to the filenames (if missing). Output streamed to STD_OUT will also be bgzipped.
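The naming rule above can be sketched in a few lines of Python. This is only an illustration of the documented behaviour, not the tool's actual code; the function name `resolve_output_name` is hypothetical.

```python
def resolve_output_name(filename, force_gz=False):
    """Return (final filename, True if the output would be bgzipped).

    Hypothetical helper mirroring the documented rule:
    - a name ending with .gz is bgzipped automatically;
    - with --gz (force_gz), .gz is appended when missing.
    """
    if force_gz and not filename.endswith(".gz"):
        filename += ".gz"
    return filename, filename.endswith(".gz")


print(resolve_output_name("results.vcf"))
print(resolve_output_name("results.vcf", force_gz=True))
```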
VCF Filter Functions¶
CompoundHeterozygous¶
Keeps only variants that respect the Compound Heterozygous pattern of inheritance.
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
--nohomo TRUE|FALSE
: Reject if a case is homozygous to alternate allele or if a control has none of the allele ?
--ped samples.ped
: File describing the VCF’s samples (See File Formats in the documentation)
--missing TRUE|FALSE
: Missing genotypes allowed ?
Description
Keeps only variants that respect the Compound Heterozygous pattern of inheritance. In the Compound Heterozygous pattern of inheritance, two variants V1 and V2 from the same gene are valid if:
All cases have V1 and V2
No control has V1 and V2
Thus, a variant is rejected if:
one case doesn’t have V1 and V2
one control has V1 and V2
With --missing true, missing genotypes are considered compatible with the transmission pattern.
The --nohomo option rejects alternate alleles if a case is homozygous to an alternate allele or if at least one control is not heterozygous to an alternate allele of V1/V2 (useful if all the controls are supposed to be parents of cases).
Results can be difficult to read, since several combinations of valid variants might exist. An extra INFO field COMPOUND is therefore added, detailing the relations between variants.
This field reads as
COMPOUND=A1>P1(gA|gB|gC)&P2(gD|gE|gF),A2>P3(gG|gH|gI)&P4(gJ|gK|gL),...
Where:
Ax
is the number of the allele involved
Px
is the partner allele in the form chr:pos:ref:alt
gX
is the symbol of the gene common to this allele and its partner
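To make the COMPOUND layout concrete, here is a minimal Python sketch that splits such a value into alleles, partners and genes. It assumes the field follows the pattern shown above exactly and that gene symbols contain no parentheses; the function name `parse_compound` and the sample value are made up for illustration.

```python
import re

# One partner: "chr:pos:ref:alt(gene1|gene2|...)"
PARTNER = re.compile(r"([^()&]+)\(([^()]*)\)")


def parse_compound(value):
    """Map each allele number to a list of (partner, [gene symbols])."""
    result = {}
    for entry in value.split(","):
        # Split on the first '>' only: allele number, then its partners
        allele, partners = entry.split(">", 1)
        result[allele] = [
            (m.group(1), m.group(2).split("|"))
            for m in PARTNER.finditer(partners)
        ]
    return result


# Hypothetical value, written by hand to follow the documented pattern
example = "1>1:100:A:G(GENE1|GENE2)&1:200:C:T(GENE1),2>1:300:G:A(GENE3)"
for allele, partners in parse_compound(example).items():
    print(allele, partners)
```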
Warning
The input VCF File must have been previously annotated with vep.
Note
In case of multiallelic variants : If at least one alternate allele satisfies all the conditions, the whole variant line is kept.
DeNovo¶
Keeps only variants that are compatible with a De Novo pattern of inheritance.
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
--ped samples.ped
: File describing the VCF’s samples (See File Formats in the documentation)
--missing TRUE|FALSE
: Missing genotypes allowed ?
Description
Keeps only variants that are compatible with a De Novo pattern of inheritance, that is, variants that are:
present in every Case
absent in every Control
Warning
Father/Mother/Child(ren) Trios are expected
Note
In case of multiallelic variants : If at least one alternate allele satisfies all the conditions, the whole variant line is kept.
DeNovoRecessive¶
Keeps only variants that strictly respect these genotypes: parent1 0/1 + parent2 0/0 + child 1/1
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
--ped samples.ped
: File describing the VCF’s samples (See File Formats in the documentation)
Description
Keeps only variants that strictly respect these genotypes: parent1 0/1 + parent2 0/0 + child 1/1. parent1 and parent2 are interchangeable.
Warning
Will only run if the input file has a trio with 1 case and 2 controls
Note
In case of multiallelic variants : If at least one alternate allele satisfies all the conditions, the whole variant line is kept.
Dominant¶
Keeps only variants that respect the Dominant pattern of inheritance
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
--ped samples.ped
: File describing the VCF’s samples (See File Formats in the documentation)
--missing TRUE|FALSE
: Missing genotypes allowed ?
--nohomo TRUE|FALSE
: Reject if a case is homozygous to alternate allele ?
--mode Mode
: strict : true for all cases | loose : true for at least one case
Description
Keeps only variants that respect the Dominant pattern of inheritance. In the Dominant pattern of inheritance:
Cases should have the causal variant
Controls cannot have the causal variant
Thus, a variant is rejected if
one case doesn’t possess the alternate allele (strict mode)
one control possesses the alternate allele
If --missing true, missing genotypes are considered compatible with the transmission pattern.
The --nohomo option rejects alternate alleles if at least one case is homozygous (for example, if you expect that the resulting phenotype would not be consistent).
In the strict mode, all cases must have the alternate allele. In the loose mode, only one case has to have the allele (more permissive for larger panels).
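The rules above can be sketched for a single alternate allele as follows. This is a simplified reading of the description, not the tool's implementation: genotypes are plain strings such as "0/1", the function name `dominant_keeps` is hypothetical, and a missing genotype is assumed to count as incompatible when `missing` is false.

```python
def split_gt(gt):
    """Split a genotype string such as '0/1' or '0|1' into its alleles."""
    return gt.replace("|", "/").split("/")


def dominant_keeps(cases, controls, missing=False, nohomo=False,
                   mode="strict", alt="1"):
    """Return True if the allele is compatible with a dominant pattern."""
    for gt in controls:
        alleles = split_gt(gt)
        if "." in alleles:
            if not missing:      # assumption: missing control is incompatible
                return False
        elif alt in alleles:     # a control carries the allele: rejected
            return False
    compatible = []
    for gt in cases:
        alleles = split_gt(gt)
        if "." in alleles:
            compatible.append(missing)
            continue
        if nohomo and len(alleles) == 2 and alleles[0] == alleles[1] == alt:
            return False         # --nohomo: a homozygous case rejects the allele
        compatible.append(alt in alleles)
    # strict: every case must carry the allele; loose: at least one case
    return all(compatible) if mode == "strict" else any(compatible)


print(dominant_keeps(["0/1", "0/0"], ["0/0"], mode="strict"))
print(dominant_keeps(["0/1", "0/0"], ["0/0"], mode="loose"))
```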
Note
In case of multiallelic variants : If at least one alternate allele satisfies all the conditions, the whole variant line is kept.
FilterConsequenceLevel¶
Filters the variants according to their consequences
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
--csq vep.consequence
: Least severe consequence [empty | intergenic_variant | feature_truncation | regulatory_region_variant | feature_elongation | regulatory_region_amplification | regulatory_region_ablation | TF_binding_site_variant | TFBS_amplification | TFBS_ablation | downstream_gene_variant | upstream_gene_variant | non_coding_transcript_variant | NMD_transcript_variant | intron_variant | non_coding_transcript_exon_variant | 3_prime_UTR_variant | 5_prime_UTR_variant | mature_miRNA_variant | coding_sequence_variant | synonymous_variant | stop_retained_variant | start_retained_variant | incomplete_terminal_codon_variant | splice_region_variant | protein_altering_variant | missense_variant | inframe_deletion | inframe_insertion | transcript_amplification | start_lost | stop_lost | frameshift_variant | stop_gained | splice_donor_variant | splice_acceptor_variant | transcript_ablation]
Description
Filters the variants according to their consequences. The consequence of the variant must be at least as severe as the one given.
Warning
The input VCF File must have been previously annotated with vep.
Note
In case of multiallelic variants : If at least one alternate allele satisfies all the conditions, the whole variant line is kept.
FilterF2¶
Filters variants to keep only those contributing to F2 data.
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
Description
Filters variants to keep only those contributing to F2 data. F2 data are described in PubMedId: 23128226, figure 3a
Note
In case of multiallelic variants : If at least one alternate allele satisfies all the conditions, the whole variant line is kept.
FilterFrequencies¶
Keeps only variants with frequencies below the threshold in all of the selected populations.
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
--threshold 0.0-1.0
: maximum frequency in any population
--pop pop1,pop2,...,popN
: Comma-separated list of populations to test (for example from AF, AFR_AF, AMR_AF, EAS_AF, EUR_AF, SAS_AF, AA_AF, EA_AF, gnomAD_AF, gnomAD_AFR_AF, gnomAD_AMR_AF, gnomAD_ASJ_AF, gnomAD_EAS_AF, gnomAD_FIN_AF, gnomAD_NFE_AF, gnomAD_OTH_AF, gnomAD_SAS_AF, MAX_AF)
Description
Keeps only variants with frequencies below the threshold in all of the selected populations. If the variant’s frequency exceeds the threshold for any of the selected populations, the variant is filtered out.
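As an illustration, the decision for one allele could look like the Python sketch below. It assumes that a frequency exactly equal to the threshold is kept and that a population with no annotated frequency never rejects the variant; neither point is stated in this documentation, and the function name is hypothetical.

```python
def passes_frequency_filter(freqs, pops, threshold):
    """freqs maps a population key (e.g. 'gnomAD_AF') to a frequency or None."""
    for pop in pops:
        freq = freqs.get(pop)
        # Assumption: an absent frequency is treated as "not seen"
        if freq is not None and freq > threshold:
            return False  # exceeds the threshold in one selected population
    return True


annotated = {"gnomAD_AF": 0.001, "EUR_AF": 0.02}
print(passes_frequency_filter(annotated, ["gnomAD_AF", "EUR_AF"], 0.01))
print(passes_frequency_filter(annotated, ["gnomAD_AF"], 0.01))
```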
Warning
The input VCF File must have been previously annotated with vep.
Note
In case of multiallelic variants : If at least one alternate allele satisfies all the conditions, the whole variant line is kept.
FilterGeneCsqLevel¶
Filters the variants according to their consequences on a list of genes.
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
--genes genes.txt
: File listing genes to keep
--csq vep.consequence
: Least severe consequence [empty | intergenic_variant | feature_truncation | regulatory_region_variant | feature_elongation | regulatory_region_amplification | regulatory_region_ablation | TF_binding_site_variant | TFBS_amplification | TFBS_ablation | downstream_gene_variant | upstream_gene_variant | non_coding_transcript_variant | NMD_transcript_variant | intron_variant | non_coding_transcript_exon_variant | 3_prime_UTR_variant | 5_prime_UTR_variant | mature_miRNA_variant | coding_sequence_variant | synonymous_variant | stop_retained_variant | start_retained_variant | incomplete_terminal_codon_variant | splice_region_variant | protein_altering_variant | missense_variant | inframe_deletion | inframe_insertion | transcript_amplification | start_lost | stop_lost | frameshift_variant | stop_gained | splice_donor_variant | splice_acceptor_variant | transcript_ablation]
Description
Filters the variants according to their consequences on a list of genes. Only the variants impacting at least one of the genes in the list are kept. The consequence of the variant on the gene must be at least as severe as the one given.
Warning
The input VCF File must have been previously annotated with vep.
Note
In case of multiallelic variants : If at least one alternate allele satisfies all the conditions, the whole variant line is kept.
FilterGeneCsqList¶
Filters the variants to keep only those that affect one of the given genes with one of the given consequences.
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
--genes genes.txt
: List of the genes to keep
--csq csq1,csq2,...,csqN
: List (comma separated) of VEP consequences to keep
Description
Filters the variants to keep only those that affect one of the given genes with one of the given consequences.
If the variant has at least one of the effects from --csq on one of the genes in the file from --genes, then the variant is kept.
The list of effects can be empty : --csq null
VEP consequence must be selected from : [empty | intergenic_variant | feature_truncation | regulatory_region_variant | feature_elongation | regulatory_region_amplification | regulatory_region_ablation | TF_binding_site_variant | TFBS_amplification | TFBS_ablation | downstream_gene_variant | upstream_gene_variant | non_coding_transcript_variant | NMD_transcript_variant | intron_variant | non_coding_transcript_exon_variant | 3_prime_UTR_variant | 5_prime_UTR_variant | mature_miRNA_variant | coding_sequence_variant | synonymous_variant | stop_retained_variant | start_retained_variant | incomplete_terminal_codon_variant | splice_region_variant | protein_altering_variant | missense_variant | inframe_deletion | inframe_insertion | transcript_amplification | start_lost | stop_lost | frameshift_variant | stop_gained | splice_donor_variant | splice_acceptor_variant | transcript_ablation]
Warning
The input VCF File must have been previously annotated with vep.
Note
In case of multiallelic variants : If at least one alternate allele satisfies all the conditions, the whole variant line is kept.
FilterGenotype¶
Filters the variants to match the given genotype filter.
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
--filter "sample1:geno1:keep1,sample2:geno2:keep2,...,sampleN:genoN:keepN"
: List (comma separated) of samples, their associated genotypes and whether they are to be kept
Description
Filters the variants to match the given genotype filter.
If the genotype of at least one sample mismatches, the variant is excluded.
Filter format : sample1:geno1:keep1,sample2:geno2:keep2,...,sampleN:genoN:keepN
keep=true|false tells whether to keep (true) or exclude (false) matching genotypes for this sample
Example : SA:0/0:false,SB:0/1:true,SC:0/1:true,SD:1/1:false will keep variants that are 0/1 for SB and SC, and that aren’t 0/0 for SA or 1/1 for SD
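The filter string can be read with a short Python sketch like the one below. It compares genotypes as exact strings, so whether the real tool treats 1/0 or phased genotypes as equivalent to 0/1 is left open; `parse_filter` and `variant_passes` are hypothetical names.

```python
def parse_filter(spec):
    """Turn 'SA:0/0:false,SB:0/1:true' into (sample, genotype, keep) rules."""
    rules = []
    for item in spec.split(","):
        sample, geno, keep = item.split(":")
        rules.append((sample, geno, keep.lower() == "true"))
    return rules


def variant_passes(genotypes, rules):
    """genotypes maps a sample name to its genotype string."""
    for sample, geno, keep in rules:
        matches = genotypes[sample] == geno
        # keep=true requires a match, keep=false requires a mismatch
        if matches != keep:
            return False
    return True


rules = parse_filter("SA:0/0:false,SB:0/1:true,SC:0/1:true,SD:1/1:false")
print(variant_passes({"SA": "0/1", "SB": "0/1", "SC": "0/1", "SD": "0/0"}, rules))
```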
FilterGnomADFrequency¶
Filters out variants with frequencies above threshold in GnomAD
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
--threshold 0.0-1.0
: Maximum GnomAD Frequency
Description
Filters out variants with frequencies above threshold in GnomAD. In case of multiallelic variant, if any alternate allele passes the filter, the variant is kept.
Warning
The input VCF File must have been previously annotated with vep.
Note
In case of multiallelic variants : If at least one alternate allele satisfies all the conditions, the whole variant line is kept.
FilterGnomADNFEFrequency¶
Filters out variants with frequencies above threshold in GnomAD_NFE
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
--threshold 0.0-1.0
: Maximum GnomAD_NFE Frequency
Description
Filters out variants with frequencies above threshold in GnomAD_NFE. In case of multiallelic variant, if any alternate allele passes the filter, the variant is kept.
Warning
The input VCF File must have been previously annotated with vep.
Note
In case of multiallelic variants : If at least one alternate allele satisfies all the conditions, the whole variant line is kept.
FilterKnownID¶
Keeps only variants with an empty 3rd field
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
Description
Keeps only variants with an empty 3rd field. The field must be empty or equal to “.”
FilterNew¶
Keeps only the variants not found in either dbSNP, 1KG or GnomAD
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
Description
Keeps only the variants not found in either dbSNP, 1KG or GnomAD
Warning
The input VCF File must have been previously annotated with vep.
Note
In case of multiallelic variants : If at least one alternate allele satisfies all the conditions, the whole variant line is kept.
FilterNumericInfo¶
Filters variants according to the numerical values of INFO fields
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
--query "VALUE1>0|VALUE2>0"
: A query describing the variants to filter out
Description
Provide a logical definition for variants to remove. Example:
“VALUE=17.5”
“VALUE1>0|VALUE2>0”
“VALUE>20&VALUE<50”
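One possible way to evaluate such a query against a variant's INFO values is sketched below in Python. It only covers the three operators seen in the examples (=, >, <), assumes that & binds more tightly than | and that a key absent from the INFO field never matches; none of this is confirmed by the documentation, and `matches_query` is a hypothetical name.

```python
import operator
import re

OPS = {"=": operator.eq, ">": operator.gt, "<": operator.lt}
TERM = re.compile(r"\s*(\w+)\s*([=<>])\s*(-?\d+(?:\.\d+)?)\s*$")


def matches_query(query, info):
    """Return True if the INFO values match the query (variant to remove)."""
    def term(text):
        m = TERM.match(text)
        if m is None:
            raise ValueError("unsupported term: " + text)
        key, op, value = m.groups()
        # Assumption: a missing key never matches
        return key in info and OPS[op](float(info[key]), float(value))

    # OR of AND-groups: "A>0&B<5|C=1" reads as (A>0 and B<5) or (C=1)
    return any(
        all(term(t) for t in group.split("&"))
        for group in query.split("|")
    )


print(matches_query("VALUE>20&VALUE<50", {"VALUE": 30}))
print(matches_query("VALUE1>0|VALUE2>0", {"VALUE1": 0, "VALUE2": 0}))
```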
Note
In case of multiallelic variants : null
FilterSeenInGnomAD¶
Filters out variants that are seen in gnomAD
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
Description
Variants that are seen in gnomAD for at least one allele are filtered out. Variants are filtered if gnomad_genome_AN > 0 or gnomad_exome_AN > 0.
Note
In case of multiallelic variants : Each alternate allele is processed independently.
FoundInAllCases¶
Keeps variants found in every “Case” sample
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
--ped samples.ped
: File describing the VCF’s samples (See File Formats in the documentation)
Description
Keeps variants found in every “Case” sample.
Case samples are defined by a “1” in the 6th field of the --ped file.
In case of a multiallelic variant, if any variant allele is found or missing, the whole variant is kept.
Note
In case of multiallelic variants : If at least one alternate allele satisfies all the conditions, the whole variant line is kept.
HQ¶
Extract HQ variants.
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
Description
Extract HQ variants. Defined in 1.12 of the supplementary information of PubMedID=27535533 as:
VQSR PASS
At least 80% of the genotypes have DP above 10 and GQ above 20
At least one variant genotype has DP above 10 and GQ above 20
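The three criteria can be sketched as follows in Python. The sketch assumes "above" means strictly greater, that the 80% is computed over all samples, and that "VQSR PASS" corresponds to a FILTER value of PASS; `is_hq` and `is_variant` are hypothetical names.

```python
def is_variant(gt):
    """True if the genotype carries at least one non-reference allele."""
    return any(a not in ("0", ".") for a in gt.replace("|", "/").split("/"))


def is_hq(filter_field, genotypes):
    """genotypes: list of (gt, dp, gq) tuples, one per sample."""
    if filter_field != "PASS":
        return False
    good = [g for g in genotypes if g[1] > 10 and g[2] > 20]
    # At least 80% of genotypes are good (integer arithmetic avoids rounding)
    if 5 * len(good) < 4 * len(genotypes):
        return False
    # At least one good genotype must carry a variant allele
    return any(is_variant(gt) for gt, _, _ in good)


samples = [("0/1", 30, 99), ("0/0", 25, 60), ("0/0", 12, 40),
           ("0/0", 15, 50), ("0/0", 5, 10)]
print(is_hq("PASS", samples))
```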
KeepHomoAlt¶
Returns a VCF containing only the positions homozygous to alt for the given samples
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
--sample s1,s2,...sN
: list (comma separated) of samples to test
Description
Returns a VCF containing only the positions homozygous to alt for the given samples
Note
In case of multiallelic variants : The variant is kept if different samples are homozygous to different alternative alleles
MergeVQSR¶
Merges SNP and INDEL results files from VQSR
Mandatory Arguments
--snp snp.vcf
: File containing SNP output from VQSR
--indel indel.vcf
: File containing INDEL output from VQSR
Description
Merges SNP and INDEL results files from VQSR
Note
In case of multiallelic variants : If at least one alternate allele satisfies all the conditions, the whole variant line is kept.
MonoAllelicSNV¶
Keep only the lines containing monoallelic SNVs
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
Description
Keep only the lines containing monoallelic SNVs
NotFoundInAnyControl¶
Removes Variants that are found in controls.
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
--ped samples.ped
: File describing the VCF’s samples (See File Formats in the documentation)
Description
Removes Variants that are found in controls.
Control samples are defined by a “2” in the 6th field of the --ped file.
In case of a multiallelic variant, if any variant allele isn’t found, the whole variant is kept.
Note
In case of multiallelic variants : If at least one alternate allele satisfies all the conditions, the whole variant line is kept.
QC¶
Run a Quality Control on VCF Variants
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
--opt custom.parameters.tsv
: file containing the various thresholds for the QC (see Documentation)
--report filteredVariant.tsv
: output file listing all the variants that were filtered, and why
--ped samples.ped
: File describing the VCF’s samples (See File Formats in the documentation)
Description
Run a Quality Control on VCF Variants. A report file gives the reason(s) each variant has been filtered. For more details, see https://gitlab.com/gmarenne/ravaq
For each group G, the info field has new annotations:
G_AN: AlleleNumber for this group
G_AC: AlleleCounts for this group
G_AF: AlleleFrequencies for this group
Warning
The VCF File must contain the following INFO : QD,FS,SOR,MQ,ReadPosRankSum,InbreedingCoeff,MQRankSum
RandomVariants¶
Keeps only a portion of the variants from a VCF file.
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
--ratio 0.0-1.0
: Probability of keeping each variant
--file positions.txt
: File listing Positions to keep regardless of given probability in format chr:position
Description
Keeps only a portion of the variants from a VCF file.
Each line has a --ratio chance of being kept.
Positions listed in the file --file are always kept.
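The per-line decision can be sketched in Python as below. This is an illustration only: the function name `keep_line` is hypothetical, and the injectable `rng` argument exists just to make the sketch testable.

```python
import random


def keep_line(chrom, pos, ratio, always_keep, rng=random):
    """Decide whether one VCF line is kept.

    always_keep is a set of 'chr:position' strings, as read from --file.
    """
    if f"{chrom}:{pos}" in always_keep:
        return True              # listed positions bypass the random draw
    return rng.random() < ratio  # kept with probability `ratio`


listed = {"1:12345"}
print(keep_line("1", 12345, 0.0, listed))
```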
Recessive¶
Keeps only variants that respect the Recessive pattern of inheritance.
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
--ped samples.ped
: File describing the VCF’s samples (See File Formats in the documentation)
--missing TRUE|FALSE
: Missing genotypes allowed ?
--nohomo TRUE|FALSE
: Reject if a control is homozygous to reference allele ?
--mode Mode
: strict : true for all cases | loose : true for at least one case
Description
Keeps only variants that respect the Recessive pattern of inheritance. In the Recessive pattern of inheritance:
Cases should be homozygous to the causal allele
Controls should not be homozygous to the causal allele
Thus, a variant is rejected if
one case isn’t homozygous to the alternate allele (strict mode)
one control is homozygous to the alternate allele
If --missing true, missing genotypes are considered compatible with the transmission pattern.
The --nohomo option rejects alternate alleles if at least one control is not heterozygous to the alternate allele (useful if all the controls are supposed to be parents of cases).
In the strict mode, all cases must be homozygous to the alternate allele. In the loose mode, only one case has to be homozygous to the allele (more permissive for larger panels).
Note
In case of multiallelic variants : If at least one alternate allele satisfies all the conditions, the whole variant line is kept.
Recode¶
Reads all lines in a VCF file
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
Description
Reads all lines in a VCF file. Outputs the input VCF file after applying the various command line filters.
RemoveNonSNV¶
Removes variant lines that contain no SNVs
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
Description
Removes variant lines that contain no SNVs
Note
In case of multiallelic variants : If at least one alternate allele satisfies all the conditions, the whole variant line is kept.
RemoveNonVariant¶
Remove variants where only 0/0 and ./. genotypes are present
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
Description
Remove variants where only 0/0 and ./. genotypes are present
SplitByChromosome¶
Splits a given vcf file by chromosome
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
--outdir ResultsDirectory
: The directory that will contain results files
Description
Splits a given vcf file and produces one resulting vcf file by chromosome.
SplitByGene¶
Creates an output VCF file for each gene.
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
--outdir ResultsDirectory
: The directory that will contain results files
Description
Creates an output VCF file for each gene. Some variants can be in several output files, if they impact several genes.
Warning
The input VCF File must have been previously annotated with vep.
Note
In case of multiallelic variants : If at least one alternate allele satisfies all the conditions, the whole variant line is kept.
SplitFromDB¶
Generates two new VCF files with variants present/absent in 1kG/GnomAD.
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
--outdir ResultsDirectory
: The directory that will contain results files
Description
Generates two new VCF files :
inDB.MYVCF.vcf (with variants present in 1kG/GnomAD)
notInDB.MYVCF.vcf (with variant absent from 1kG/GnomAD)
Warning
The input VCF File must have been previously annotated with vep.
Note
In case of multiallelic variants : If at least one alternate allele satisfies all the conditions, the whole variant line is kept in “inDB.MYVCF.vcf”.
StrictCompoundHeterozygous¶
Keeps only variants that strictly respect the Compound Heterozygous pattern of inheritance.
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
--nohomo TRUE|FALSE
: Reject if a case is homozygous to alternate allele or if a control has none of the allele ?
--ped samples.ped
: File describing the VCF’s samples (See File Formats in the documentation)
Description
Keeps only variants that strictly respect the Compound Heterozygous pattern of inheritance. In the Compound Heterozygous pattern of inheritance, two variants V1 and V2 from the same gene are valid if:
All cases have V1 and V2 from their parents
All controls (parents) have one of V1/V2 while the other parents have V2/V1
A variant is kept if and only if :
All cases have V1 and V2
All cases have a parent (control) with V1 and not V2, and the other parent with V2 and not V1
The --nohomo option rejects alternate alleles if a sample is homozygous to it.
Results can be difficult to read, since several combinations of valid variants might exist. An extra INFO field COMPOUND is therefore added, detailing the relations between variants.
This field reads as
COMPOUND=A1>P1(gA|gB|gC)&P2(gD|gE|gF),A2>P3(gG|gH|gI)&P4(gJ|gK|gL),...
Where:
Ax
is the number of the allele involved
Px
is the partner allele in the form chr:pos:ref:alt
gX
is the symbol of the gene common to this allele and its partner
Warning
The input VCF File must have been previously annotated with vep.
Warning
This function expects a complete definition of the sample, where all cases are affected children and both their parents are identified controls.
Note
In case of multiallelic variants : If at least one alternate allele satisfies all the conditions, the whole variant line is kept.
VCF Transformation Functions¶
ClearInfoField¶
Replaces the Info column by “.”
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
Description
Removes all annotation from the VCF file by replacing the content of the Info column by “.”
FilterCsqExtractGene¶
Filters Variants according to consequences. Replaces ID by gene_chr_pos.
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
--csq vep.consequence
: Least severe consequence [empty | intergenic_variant | feature_truncation | regulatory_region_variant | feature_elongation | regulatory_region_amplification | regulatory_region_ablation | TF_binding_site_variant | TFBS_amplification | TFBS_ablation | downstream_gene_variant | upstream_gene_variant | non_coding_transcript_variant | NMD_transcript_variant | intron_variant | non_coding_transcript_exon_variant | 3_prime_UTR_variant | 5_prime_UTR_variant | mature_miRNA_variant | coding_sequence_variant | synonymous_variant | stop_retained_variant | start_retained_variant | incomplete_terminal_codon_variant | splice_region_variant | protein_altering_variant | missense_variant | inframe_deletion | inframe_insertion | transcript_amplification | start_lost | stop_lost | frameshift_variant | stop_gained | splice_donor_variant | splice_acceptor_variant | transcript_ablation]
Description
Filters Variants according to consequence. Replaces ID by gene_chr_pos.
Warning
The input VCF File must have been previously annotated with vep.
Note
In case of multiallelic variants : Only Ref and one alternate allele are Kept. The Kept alternate is (in this order) : 1. the most severe; 2. the most frequent in the file; 3. the first one.
GenerateHomoRefForPosition¶
Generates a VCF with Homozygous-to-Reference Genotypes for every given position and each sample (Alternate is a transition A↔G, C↔T)
Mandatory Arguments
--ref Reference.fasta
: Fasta File containing the reference genome
--pos positions.tsv
: List of positions in the results VCF file
--sample samples.txt
: List of samples in the results VCF File
Description
Given a Reference Genome, a list of positions and a list of individuals, generates a VCF file with Homozygous-to-Reference Genotypes for every given position and each sample.
The reference must be in .fasta format, with its associated .fai index in the same directory
The position file must contain one position per line, in the format : chr pos
The sample file must contain one sample per line
Each given position is looked up
The Ref of the VCF is taken from the given reference genome
Alt = Transition(Ref) : A↔G / C↔T
Format for each position is GT:DP:GQ:AD:PL
Each genotype is 0/0:30:99:30,0:0,50,500
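A minimal Python sketch of one output line, following the rules listed above. The reference base is passed in directly rather than read from the fasta file, and the "." values used for ID, QUAL, FILTER and INFO are assumptions; `homoref_line` is a hypothetical name.

```python
TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}
GENOTYPE = "0/0:30:99:30,0:0,50,500"


def homoref_line(chrom, pos, ref, samples):
    """Build one tab-separated VCF data line for the given position."""
    ref = ref.upper()
    alt = TRANSITION[ref]  # raises KeyError for bases such as N
    fields = [chrom, str(pos), ".", ref, alt, ".", ".", ".", "GT:DP:GQ:AD:PL"]
    # Every sample receives the same homozygous-to-reference genotype
    return "\t".join(fields + [GENOTYPE] * len(samples))


print(homoref_line("1", 12345, "a", ["S1", "S2"]))
```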
MissingToMajor¶
Replaces every missing genotype by the most frequent allele present
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
Description
Replaces every missing genotype by the most frequent allele present
Updates AC/AN/AF annotations
The genotype is homozygous to the most frequent allele A in the form A/A:0:0:0,0,0...
Note
In case of multiallelic variants : The major allele is the most frequent allele from ref and each alternate.
MissingToRef¶
Replaces every missing genotype by 0/0:0:0:0....
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
Description
Replaces every missing genotype by 0/0:0:0:0....
Updates AC/AN/AF annotations
Scramble¶
Outputs the same VCF but randomly reassigns the genotypes among the samples
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
Description
This function can be used to anonymize a VCF file. The AC/AN/AF of each variant will stay consistent, but the haplotypes will be broken. For each line, the genotypes are randomly reassigned among the samples. The random reassignment is different for each line.
SetGenotypeFromProbability¶
Assigns a genotype to each sample, for each position, from the GenotypeProbability annotation. If a genotype is already present, it can be kept or replaced
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
--overwrite TRUE|FALSE
: overwrite existing genotypes ?
Description
The highest probability (given by annotation GP=p1,p2,p3) determines the genotype that will be assigned
highest=p1 → 0/0
highest=p2 → 0/1
highest=p3 → 1/1
If a genotype is already present, it can be kept or replaced
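The mapping from GP to genotype can be sketched in Python as follows. The sketch assumes that ties go to the first highest probability and that "./." or "." denotes an absent genotype; `genotype_from_gp` is a hypothetical name.

```python
GENOTYPES = ("0/0", "0/1", "1/1")


def genotype_from_gp(gp, current=None, overwrite=False):
    """Pick the genotype with the highest probability in 'p1,p2,p3'."""
    if current not in (None, ".", "./.") and not overwrite:
        return current  # existing genotype is kept unless overwrite is set
    probs = [float(p) for p in gp.split(",")]
    if len(probs) != 3:
        raise ValueError("expected three probabilities, got " + gp)
    # Assumption: on ties, the first highest probability wins
    return GENOTYPES[probs.index(max(probs))]


print(genotype_from_gp("0.1,0.7,0.2"))
```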
Warning
The input VCF file must contain Genotype Probability (GP=p1,p2,p3) for each genotype
Note
In case of multiallelic variants : An error will be thrown, as this function expects only monoallelic variants. The affected variant line will be dropped.
SimulateVCFFromExisting¶
Simulate a VCF File from an existing VCF File by mixing genotypes of samples from different ancestries
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
--ped samples.ped
: File describing the VCF’s samples (See File Formats in the documentation)
--tsv variantlist.tsv
: File containing the list of biallelic variants with affected gene
Description
Simulate a VCF File from an existing VCF File by mixing genotypes of samples from different ancestries
Note
In case of multiallelic variants : The affected variant line will be silently dropped.
SplitMultiAllelic¶
Splits multiallelic variants into several lines of monoallelic variants
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
Description
Splits multiallelic variants into several lines of monoallelic variants
VCFToReference¶
Outputs the given VCF File and reverts genotypes when ref/alt alleles are inverted according to given reference (as a fasta file)
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
--ref Reference.fasta
: Fasta File containing the reference genome
Description
This function can be used when ref/alt alleles might be inverted (for example when the vcf file has been converted from a plink file). At each position, the given reference genome is checked to see which allele matches the reference. If none of the alleles matches the reference, the line is dropped and a warning is displayed. AC/AN/AF are updated.
Warning
Annotations (INFO/AD/PL/…) are not updated
Note
In case of multiallelic variants : An error will be thrown, as this function expects only monoallelic variants. The affected variant line will be dropped.
VCF Annotation Functions¶
AddAlleleBalance¶
Adds the annotations : AB, ABhet, ABhom, OND to a VCF file
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
Description
Adds the following annotations :
AB : Allele balance for each het genotype (alleleDepth(gt1) / (alleleDepth(gt1) + alleleDepth(gt2)))
ABhet : Allele Balance for heterozygous calls (ref/(ref+alt)), for each variant
ABhom : Allele Balance for homozygous calls (A/(A+O)) where A is the allele (ref or alt) and O is anything other, for each variant
OND : Overall non-diploid ratio (alleles/(alleles+non-alleles)), for each variant
The algorithm is taken from GATK, with the following change: results are available for INDELs and multiallelic variants (use with caution)
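For the per-genotype AB value, a minimal Python sketch is given below. It assumes the allele depths come from the AD annotation, indexed by allele number, and returns None where no balance can be computed; it does not reproduce the GATK code, and `allele_balance` is a hypothetical name.

```python
def allele_balance(gt, ad):
    """AB for one heterozygous genotype.

    gt: genotype string such as '0/1'; ad: list of allele depths (AD).
    """
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles or len(alleles) != 2:
        return None
    first, second = int(alleles[0]), int(alleles[1])
    if first == second:
        return None  # homozygous genotypes have no AB
    total = ad[first] + ad[second]
    # alleleDepth(gt1) / (alleleDepth(gt1) + alleleDepth(gt2))
    return ad[first] / total if total else None


print(allele_balance("0/1", [30, 10]))
```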
AddDbSNP¶
Adds/updates dbSNP information to the VCF from a dbSNP release file
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
--ref dbsnp.vcf
: dbSNP reference VCF File (can be gzipped)
Description
Adds dbSNP RS in the ID field and INFO field.
Adds RS= and dbSNPBuildID= in the INFO field from the input file --ref.
Note
In case of multiallelic variants : Annotation is added/updated for each alternate allele (comma-separated).
AddGroupACANAF¶
Adds AN, AC and AF annotations for each group described in the ped file
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
--ped samples.ped
: File describing the VCF’s samples (See File Formats in the documentation)
Description
For each group G, the info field has new annotations
G_AN: AlleleNumber for this group
G_AC: AlleleCounts for this group
G_AF: AlleleFrequencies for this group
Note
In case of multiallelic variants : Annotation is added/updated for each alternate allele (comma-separated).
AddWorstAndCanonicalConsequence¶
For each variant, add the most severe consequence from vep and add the consequence from vep for the annotation marked as Canonical.
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
Description
For each variant, add the most severe consequence from vep and add the consequence from vep for the annotation marked as Canonical.
The worst consequence is annotated with the keyword WORST_CSQ
The gene for the worst consequence is annotated with the keyword WORST_GENE
The canonical consequence is annotated with the keyword CANONICAL_CSQ
The gene for the canonical consequence is annotated with the keyword CANONICAL_GENE
If more than one annotation is marked as canonical, the most severe of them is kept
If no annotation is marked as canonical, the most severe consequence is kept
Warning
The input VCF File must have been previously annotated with vep.
Note
In case of multiallelic variants : Annotation is added/updated for each alternate allele (comma-separated).
ReaffectdbSNP¶
Puts all observed RS numbers in the ID column
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
Description
Takes the rs numbers from the Existing_variation annotation (from vep) and adds them to the ID column of the VCF. Puts “.” if no RS has been found
Warning
The input VCF File must have been previously annotated with vep.
Note
In case of multiallelic variants : All RS IDs from all alternate alleles are listed in the ID column.
UpdateACANAF¶
Resets the AC, AN and AF values for the given VCF file
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
Description
Adds/Updates the AC, AN and AF values for the VCF.
Note
In case of multiallelic variants : Annotation is added/updated for each alternate allele (comma-separated).
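To make the relationship between the three values concrete, here is a minimal Python sketch that recomputes AC, AN and AF from a list of genotype strings. The function name, the input layout and the choice to skip missing alleles are assumptions for illustration; the tool's own implementation may differ.

```python
def ac_an_af(genotypes, n_alt):
    """Compute (AC, AN, AF) from genotype strings such as "0/1" or "1|2".

    AC and AF are lists with one entry per alternate allele, matching the
    comma-separated layout used for multiallelic variants.
    """
    an = 0                # number of called alleles
    ac = [0] * n_alt      # count of each alternate allele
    for gt in genotypes:
        for allele in gt.replace("|", "/").split("/"):
            if allele == ".":
                continue  # missing alleles are not counted in AN (assumption)
            an += 1
            index = int(allele)
            if index > 0:
                ac[index - 1] += 1
    af = [count / an if an else 0.0 for count in ac]
    return ac, an, af
```

With the genotypes `0/1`, `1/1`, `./.` and `0/2` at a site with two alternate alleles, six alleles are called, three of them are the first alternate allele and one is the second.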
Analysis Functions¶
CheckReference¶
For every position in the vcf file, compares the reference from the VCF to the one in the fasta
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
--ref Reference.fasta
: Fasta File containing the reference genome
Description
For every position in the vcf file, gets the reference from the given fasta and prints :
CHROM | POS | VCF_REF | FASTA_REF
Warning
Lines containing indels are ignored
Chi2¶
Performs a chi² association test on an input VCF file
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
--ped samples.ped
: File describing the VCF’s samples (See File Formats in the documentation)
Description
Does a simple association test on the data present in the input vcf file. Computes the number of case samples with and without variants, and the number of control samples with and without variants, then does a chi² test on those values.
Note
In case of multiallelic variants : Each alternate allele is processed independently, while the ‘*’ allele is ignored.
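The test described above operates on a 2x2 table (cases/controls, with/without the variant). The sketch below shows the textbook Pearson chi² statistic for such a table, with a p-value for one degree of freedom. Whether the tool applies a continuity correction is not stated in this documentation, so the uncorrected form is an assumption, as are the function names.

```python
import math


def chi2_2x2(case_with, case_without, ctrl_with, ctrl_without):
    """Pearson chi-square statistic for a 2x2 table, without continuity correction."""
    n = case_with + case_without + ctrl_with + ctrl_without
    cases = case_with + case_without
    controls = ctrl_with + ctrl_without
    carriers = case_with + ctrl_with
    non_carriers = case_without + ctrl_without
    denominator = cases * controls * carriers * non_carriers
    if denominator == 0:
        return None  # an empty row or column makes the test undefined
    difference = case_with * ctrl_without - case_without * ctrl_with
    return n * difference ** 2 / denominator


def chi2_p_value_1df(statistic):
    """Upper-tail p-value of a chi-square statistic with one degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2))
```

A table in which cases and controls carry the variant in the same proportion yields a statistic of 0 and a p-value of 1.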
CommonVariants¶
Displays the list of variants that are common to two VCF files
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
--file smallest.file.vcf
: the smallest of the two input VCF files (can be gzipped)
Description
Displays the list of variants that are common to two VCF files. Output is given as a list of canonical variants.
Warning
For faster execution, use --vcf with the largest file and --file with the smallest one
Note
In case of multiallelic variants : Each alternate allele is processed independently, while the ‘*’ allele is ignored.
CommonVariantsInSamplePairs¶
Counts the variants that are common to each pair of samples
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
--ref minor
: Also Process REF allele [0] ?
Description
Counts the variants that are common to each pair of samples. For each pair of samples in the file (A vs B, but not B vs A), output :
number of variants A_het & B_het absent in other samples
number of variants A_het & B_het present in other samples
number of variants A_het & B_hom absent in other samples
number of variants A_het & B_hom present in other samples
number of variants A_hom & B_het absent in other samples
number of variants A_hom & B_het present in other samples
number of variants A_hom & B_hom absent in other samples
number of variants A_hom & B_hom present in other samples
number of variants common to A and B in any form
Note
In case of multiallelic variants : Each alternate allele is processed independently.
CompareGenotype¶
Compares the genotypes of the samples in the first and second VCF file.
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
--ped samples.ped
: File describing the VCF’s samples (See File Formats in the documentation)
--vcf2 File2.vcf(.gz)
: the second input VCF file (can be bgzipped)
Description
Compares the genotypes of the samples in the first and second VCF file. Both VCF files are supposed to contain the same samples. This function compares the genotypes of each sample for each variant across the files. This can be useful, for example, to compare two calling algorithms. Output format is :
Sample | Group | Total | Concord | Discord | LeftMissing | RightMissing | %Concord
Note
In case of multiallelic variants : Alternate alleles are expected to be the same and in the same order in both files
CompareToGnomAD¶
Compares the variants present in a VCF file to those present in a GnomAD VCF file
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
--file GnomAD.site.vcf.gz
: GnomAD VCF File (can be gzipped)
Description
Compares the variants present in a VCF file to those present in a GnomAD VCF file. Output format will be:
#CHR | POS | ID | REF | ALT | QUAL | CSQ | GENE | AC | AF | AN | GnomAD_AC | GnomAD_AF | GnomAD_AN
Warning
The input VCF File must have been previously annotated with vep.
Note
In case of multiallelic variants : Each alternate allele is processed independently, while the ‘*’ allele is ignored.
CountFromPublicDB¶
Returns the number of Variants, SNVs, INDEL, in dbSNP, 1kG, GnomAD.
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
Description
Returns the number of Variants, SNVs, INDEL, in dbSNP, 1kG, GnomAD. Output format is (For all/SNVs/Indels):
Total | dbSNP | 1kG | GnomAD | Not dbSNP | Not 1kG | Not GnomAD
Warning
The input VCF File must have been previously annotated with vep.
Note
In case of multiallelic variants : Each alternate allele is processed independently, while the ‘*’ allele is ignored.
CountGenotypes¶
Counts the genotypes 0/1 and 1/1 for each variant
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
--ped samples.ped
: File describing the VCF’s samples (See File Formats in the documentation)
Description
Counts the genotypes 0/1 and 1/1 for each variant
The output format is:
CHROM | POS | REF | ALT | CONSEQUENCE | TOTAL_HETEROZYGOUS | TOTAL_HOMOZYGOUS_ALT
Followed by the number of heterozygous and homozygous genotypes for each group defined in the ped file.
Warning
The input VCF File must have been previously annotated with vep.
Note
In case of multiallelic variants : Each alternate allele is processed independently, while the ‘*’ allele is ignored.
CountMissing¶
For each sample in the PED file, prints a summary of missingness
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
--ped samples.ped
: File describing the VCF’s samples (See File Formats in the documentation)
--threshold 0.0-1.0
: Maximum ratio of Missing Individuals per position
Description
For each sample in the PED file, prints a summary, in the format
#SAMPLE | TOTAL | GENOTYPED | NB_MISSING | %_MISSING | REF | ALT
Total_Variants | Kept_Variants
where
SAMPLE : the sample name
TOTAL : total variants kept
GENOTYPED : variants with non missing genotypes for this sample
NB_MISSING : variants with missing genotypes for this sample
%_MISSING : percent of genotypes missing for this sample
REF : number of variants homozygous to the ref for this sample
ALT : number of variants not homozygous to the ref for this sample
The header of the output also contains the total number of variants present in the input file and the number of variants that are kept
Warning
Kept variants are those with less than --threshold genotypes missing
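The filtering rule in the warning above can be sketched as follows. This is only an illustration: the function names are hypothetical, the comparison is read as a strict "less than" on the ratio of missing genotypes, and a genotype is treated as missing when it is `.`, `./.` or `.|.`.

```python
MISSING_GENOTYPES = {".", "./.", ".|."}


def missing_ratio(genotypes):
    """Ratio of samples whose genotype is missing at one position."""
    missing = sum(1 for gt in genotypes if gt in MISSING_GENOTYPES)
    return missing / len(genotypes)


def is_kept(genotypes, threshold):
    """A variant is kept when its missing ratio is strictly below the threshold."""
    return missing_ratio(genotypes) < threshold
```

With a threshold of 0.5, a position where one of four samples is missing is kept, while a position where two of four samples are missing is not.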
CountVariants¶
Counts the number of variants for each sample and prints a summary for each group
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
--ped samples.ped
: File describing the VCF’s samples (See File Formats in the documentation)
Description
Counts the number of variants for each sample and prints a summary for each group. Results Format :
FamilyID | ID | MotherID | FatherID | Sex | Phenotype | Group | NbVariants
Note
In case of multiallelic variants : Each alternate allele is processed independently, while the ‘*’ allele is ignored.
CountVariantsFoundIn¶
Counts the variants of each type for each sample, and globally
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
--ref gnomad.2.1.canonical
: File containing list of files with List of variants found in the reference file (in canonical format)
Description
Variants are filtered, then the count is made by category
Note
In case of multiallelic variants : Each alternate allele is processed independently, while the ‘*’ allele is ignored.
DbSNPMismatch¶
Check if there is a discrepancy between the ID Column and the VEP annotation for RS ID.
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
Description
Check if there is a discrepancy between the ID Column and the VEP annotation for RS ID. Output for the lines with discrepancies has the following format :
CHR | POS | ID | REF | ALT | VEP_Annotation
Warning
The input VCF File must have been previously annotated with vep.
Note
In case of multiallelic variants : RS IDs of every alternate allele are put in the ID field.
ExtractAlleleCounts¶
For every variant, exports the variant allele count for each sample
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
Description
For every variant, exports the variant allele count for each sample. Output has the following format
#CHROM | POS | ID | REF | ALT
Followed by the allele count for each sample. Allele counts can be 0, 1 or 2 for diploids. Missing genotypes have “.” as an allele count.
Note
In case of multiallelic variants : Each alternate allele is processed independently.
ExtractCanonical¶
Converts a VCF file to a list of canonical variants
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
Description
Only variants with AC > 0 are kept
Note
In case of multiallelic variants : Each alternate allele is processed independently, while the ‘*’ allele is ignored.
ExtractNeighbours¶
Creates a bed file of the positions where at least one sample has 2 SNVs that could be in the same triplet (regardless of the reading frame)
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
Description
Scans the whole VCF file. For each pair of successive variants V1 and V2, if at least one sample has both V1 and V2, then a bed region is printed. The region is defined as :
chr | V1_pos | V2_pos
V1 and V2 must be on the same chromosome and V2_pos-V1_pos = 1 or V2_pos-V1_pos = 2
Note
In case of multiallelic variants : chr V1_pos V2_pos is printed if at least one alternate allele of V1_pos and V2_pos is a SNP, and if one sample has a variant on both sides (not necessarily the SNP one).
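The neighbour rule described above can be sketched in a few lines of Python. The input layout (a position-sorted list of `(chrom, pos, carriers)` tuples, where `carriers` is the set of samples having the variant) and the function name are assumptions for this example; SNV checks and multiallelic handling are left out.

```python
def neighbour_regions(variants):
    """Return (chrom, V1_pos, V2_pos) for successive variants shared by a sample.

    variants: list of (chrom, pos, carriers) tuples sorted by position,
    where carriers is the set of sample names carrying the variant.
    """
    regions = []
    for (chrom1, pos1, carriers1), (chrom2, pos2, carriers2) in zip(variants, variants[1:]):
        same_chromosome = chrom1 == chrom2
        close_enough = pos2 - pos1 in (1, 2)   # could fall in the same triplet
        shared_sample = bool(carriers1 & carriers2)
        if same_chromosome and close_enough and shared_sample:
            regions.append((chrom1, pos1, pos2))
    return regions
```

Only successive variants are compared, so a pair separated by an intermediate variant is not reported by this sketch.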
ExtractPrivateToGroup¶
Extracts All Variants that are private to a Group.
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
--ped samples.ped
: File describing the VCF’s samples (See File Formats in the documentation)
Description
Extracts All Variants that are private to a Group. Only the variants found in a single group of samples (as defined in the ped file) are extracted. The list of the N samples in the group that have the variant is given. Output Format :
#CHR | POS | GROUP | SAMPLES
Note
In case of multiallelic variants : Each alternate allele is processed independently, while the ‘*’ allele is ignored.
F2¶
Computes F2 data.
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
--ped samples.ped
: File describing the VCF’s samples (See File Formats in the documentation)
--prefix prefix
: prefix of the output files
--outdir ResultsDirectory
: The directory that will contain results files
Description
Computes F2 data. F2 data are described in PubMedId: 23128226, figure 3a. Six sets of results are given, one for:
All variants
All SNVs
variants without rs (new)
SNVs without rs (new)
variants with rs (known)
SNVs with rs (known)
Warning
The distinction between known and new is made by looking at the vep annotation, not the ID column.
Warning
The input VCF File must have been previously annotated with vep.
Note
In case of multiallelic variants : Each alternate allele is processed independently, while the ‘*’ allele is ignored.
F2Individuals¶
Computes F2 data by samples and not by groups (Each sample is its own group).
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
--ped samples.ped
: File describing the VCF’s samples (See File Formats in the documentation)
--prefix prefix
: prefix of the output files
--outdir ResultsDirectory
: The directory that will contain results files
Description
Computes F2 data by samples and not by groups (Each sample is its own group). F2 data are described in PubMedId: 23128226, figure 3a. Six sets of results are given, one for:
All variants
All SNVs
variants without rs (new)
SNVs without rs (new)
variants with rs (known)
SNVs with rs (known)
Warning
The distinction between known and new is made by looking at the vep annotation, not the ID column.
Warning
The input VCF File must have been previously annotated with vep.
Note
In case of multiallelic variants : Each alternate allele is processed independently, while the ‘*’ allele is ignored.
FrequencyCorrelation¶
Prints the frequency correlation of variants between local samples and GnomAD
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
--outdir ResultsDirectory
: The directory that will contain results files
Description
Prints the frequency correlation of variants between local samples and GnomAD. For each variant, prints :
CHR | POS | REF | ALT | Local | GnomAD
Outputs one line per VEP Consequence
Warning
The input VCF File must have been previously annotated with vep.
Note
In case of multiallelic variants : Each alternate allele is processed independently, while the ‘*’ allele is ignored.
FrequencyForPrivate¶
Prints the Allele frequency in the file and each group, for variants not found in dbSNP, 1kG or GnomAD.
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
--ped samples.ped
: File describing the VCF’s samples (See File Formats in the documentation)
Description
For each variant in the file, if the variant is not found in dbSNP, 1KG or GnomAD : Prints the frequency in the file, and in each group, as well as its consequence
Warning
The input VCF File must have been previously annotated with vep.
Note
In case of multiallelic variants : Each alternate allele is processed independently, while the ‘*’ allele is ignored.
GeneList¶
Prints the list of all genes covered by the VCF file
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
Description
Prints the list of all genes covered by the VCF file. The genes are extracted from the SYMBOL annotation from VEP.
Warning
The input VCF File must have been previously annotated with vep.
Note
In case of multiallelic variants : Each alternate allele is processed independently.
GetQCMetrics¶
Gets all the Metrics used by the QC function
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
--file metrics.my.project
: output filename prefix
Description
Gets all the Metrics used by the QC function
Warning
The VCF requires the following annotations : QUAL_BY_DEPTH,INBREEDING_COEF,FS,SOR,MQ,READPOSRANKSUM,AD,PL,GT
Note
In case of multiallelic variants : The affected variant line will be silently dropped.
GetWorstConsequence¶
Print the worst consequence/gene for each variant allele.
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
Description
Print the worst consequence/gene for each variant allele. For each allele of each variant, the output is in the format:
#CHR | POS | ID | REF | ALT | WORST_CSQ | AFFECTED_GENE
Warning
The input VCF File must have been previously annotated with vep.
Note
In case of multiallelic variants : Each alternate allele is processed independently, while the ‘*’ allele is ignored.
InbreedingCoeffDistribution¶
Outputs a sorted list of all Inbreeding Coeff from a VCF File.
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
Description
Outputs a sorted list of all Inbreeding Coeff from a VCF File. The output file has no header, and the values are sorted in ascending order.
Warning
Input file must contain the Inbreeding Coeff. annotation
IQSBySample¶
Computes the IQS score for each sample between sequencing data and data imputed from genotyping.
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
--ped samples.ped
: File describing the VCF’s samples (See File Formats in the documentation)
--cpu Integer
: number of cores
--file imputed.vcf(.gz)
: VCF File Containing imputed data (can be gzipped)
Description
Computes the IQS score between sequencing data and data imputed from genotyping (Ref PMID26458263, see https://lysine.univ-brest.fr/redmine/issues/84). Here the IQS score is computed for each sample. Output format is :
#SAMPLE | GROUP | IQS | NB_VARIANTS | TOTAL_VARIANTS
Note
In case of multiallelic variants : Each alternate allele is processed independently.
IQSByVariant¶
Computes the IQS score for each variant between sequencing data and data imputed from genotyping.
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
--cpu Integer
: number of cores
--file imputed.vcf(.gz)
: VCF File Containing imputed data (can be gzipped)
Description
Computes the IQS score between sequencing data and data imputed from genotyping (Ref PMID26458263, see https://lysine.univ-brest.fr/redmine/issues/84). Here the IQS score is computed for each variant. Output format is :
chr | pos | rs | ref | alt | gene | consequence | Freq_VCF | Freq_GnomAD_NFE | Freq_MaxPop | Max_Pop | IQS | Info
Warning
Extra information is available if the input file was annotated with VEP
Note
In case of multiallelic variants : Each alternate allele is processed independently.
JFSSummary¶
Outputs the Joint Site Frequency Spectrum Summary statistics
Mandatory Arguments
--file GROUP1.GROUP2.XXX.YYY.ZZZ.tsv
: input tsv file
Description
Outputs the Joint Site Frequency Spectrum Summary statistics (see https://www.nature.com/articles/ejhg2013297). The input file contains the JFS data comparing two groups of samples. Those data, generated by the function JointFrequencySpectrum, are in the following format : a (2n+1)x(2n+1) matrix, where n is the number of samples in each population. The number at matrix[A][B] is the number of variants for which the first group has A variant alleles and the second group has B variant alleles. The output contains :
N = Number of haplotypes in each population (2 x n, where n is the number of samples per population)
V = Total number of variants
threshold : pooled sample allele frequency (i + j)/2N <= 0.05
FST = overall measure of genetic diversity
AS = allele sharing statistic (probability that two individuals carrying an allele count of n come from different populations, normalized by the expected probability in panmictic population)
WS = weighted symmetry (measures how evenly rare alleles are distributed between the two populations)
JointFrequencySpectrum¶
Creates a JointFrequencySpectrum result file for each group defined in the ped file.
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
--ped samples.ped
: File describing the VCF’s samples (See File Formats in the documentation)
--outdir ResultsDirectory
: The directory that will contain results files
Description
Creates a JointFrequencySpectrum result file for each group defined in the ped file (see https://www.nature.com/articles/ejhg2013297). Samples from the same group MUST be split into 2 subgroups, so as to be compared. Example : GroupA1 GroupA2 GroupB1 GroupB2 GroupC1 GroupC2. Each group MUST HAVE the same number of samples. The format of the output is : G*G output files (where G is the number of groups). Each file is named VCF_INPUT_FILE.group1.group2.tsv and contains a (2n+1)x(2n+1) matrix, where n is the number of samples in each population. The number at matrix[A][B] is the number of variants for which the first group has A variant alleles and the second group has B variant alleles. Output files must then be processed with the function JFSSummary.
Note
In case of multiallelic variants : Each alternate allele is processed independently.
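The matrix layout described above can be illustrated with a short Python sketch. It assumes the per-variant alternate allele counts of the two groups have already been computed; the function name and the input lists are hypothetical and only serve to show how matrix[A][B] is filled.

```python
def joint_frequency_spectrum(counts_group1, counts_group2, n):
    """Build the (2n+1)x(2n+1) joint frequency matrix for two groups of n samples.

    counts_group1[i] and counts_group2[i] are the numbers of variant alleles
    (0 to 2n) carried by each group for variant i.
    """
    size = 2 * n + 1
    matrix = [[0] * size for _ in range(size)]
    for a, b in zip(counts_group1, counts_group2):
        matrix[a][b] += 1  # one more variant with A alleles in group 1, B in group 2
    return matrix
```

With n samples per group, allele counts range from 0 to 2n for diploids, which is why each dimension has 2n+1 cells.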
Kappa¶
Kappa Comparison between two vcf files.
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
--vcf2 File2.vcf
: the second input VCF File (can be gzipped)
--tsv output.tsv
: the result TSV File
--outdir ResultsDirectory
: The directory that will contain results files
Description
Kappa Comparison between two vcf files. See : https://journals.sagepub.com/doi/abs/10.1177/001316446002000104?journalCode=epma and https://en.wikipedia.org/wiki/Cohen%27s_kappa
Output format is :
CHROM | POS | ID | MAF_FILE1 | MAF_FILE2 | KAPPA_With_Missing | KAPPA_Ignore_Missing
Note
In case of multiallelic variants : Results are given for the first alternate allele, which is expected to be the same in both files.
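For readers unfamiliar with the statistic, the sketch below computes Cohen's kappa between two lists of genotype calls for the same samples at one variant. It follows the standard definition (observed agreement corrected by chance agreement); the function name and the handling of the degenerate case are assumptions, and the treatment of missing genotypes in the two output columns is not reproduced here.

```python
from collections import Counter


def cohen_kappa(calls1, calls2):
    """Cohen's kappa between two equally long lists of categorical calls."""
    if len(calls1) != len(calls2) or not calls1:
        raise ValueError("both call lists must be non-empty and of equal length")
    n = len(calls1)
    observed = sum(a == b for a, b in zip(calls1, calls2)) / n
    freq1, freq2 = Counter(calls1), Counter(calls2)
    # Chance agreement: probability that both files give the same call at random
    expected = sum(freq1[call] * freq2[call] for call in freq1) / (n * n)
    if expected == 1:
        return 1.0  # both files contain a single identical category
    return (observed - expected) / (1 - expected)
```

A value of 1 means perfect agreement, while 0 means the agreement is no better than expected by chance.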
MaleFemale¶
Show Male/Female Allele Frequencies
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
--ped samples.ped
: File describing the VCF’s samples (See File Formats in the documentation)
Description
Show Male/Female Allele Frequencies. Output format:
CHROM | POS | ID | REF | ALT | FILTER | GENE/CSQ | AF | MALE_AF | FEMALE_AF
Warning
The input VCF File must have been previously annotated with vep.
Note
In case of multiallelic variants : Each alternate allele is processed independently.
MeanQuality¶
Prints information and quality statistics for each variant.
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
Description
For each variant in the given vcf file, prints :
#CHROM | POS | IN_dbSBP | IN_GnomAD | meanDP_with_missing | meanGQ_with_missing | meanDP_without_missing | meanGQ_without_missing
Warning
The input VCF File must have been previously annotated with vep.
MultiAllelicProportion¶
Slides a 1kb window over the genome and outputs a list of regions ordered by the proportion of multi-allelic variations (desc.)
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
Description
Slides a 1kb window over the genome and outputs a list of regions ordered by the proportion of multi-allelic variations (desc.). Output format is :
Chr | pos_n | pos_n+Window_size | nb_multialleleic variants
NumberOfCsqPerGene¶
Given a VCF file and a list of genes, prints the number of variants per gene for each consequence
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
--genes genes.txt
: File listing genes
Description
Given a VCF file and a list of genes, prints the number of variants per gene for each consequence. Multiallelic sites are considered for each alternate allele.
Warning
The input VCF File must have been previously annotated with vep.
Note
In case of multiallelic variants : Each alternate allele is processed independently, while the ‘*’ allele is ignored.
NumberOfLinesFromTabix¶
Gets the number of lines indexed by a tabix file
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
Description
Gets the number of lines indexed by a tabix file
Warning
The bgzipped VCF file FILENAME.vcf.gz must have an associated tabix file FILENAME.vcf.gz.tbi
PrivateVSPanel¶
Check how many of the variants from the input file are filtered as Already_existing when adding samples from the reference file.
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
--ref reference.vcf
: the panel VCF File (can be gzipped)
Description
Check how many of the variants from the input file are filtered as Already_existing when adding samples from the reference file. Takes all the variants in the given vcfFile. Compares to each sample from the refFile. Gives a count of remaining (new) variants (by consequence) each time we add a sample.
Warning
The input VCF File must have been previously annotated with vep.
Note
In case of multiallelic variants : Each alternate allele is processed independently, while the ‘*’ allele is ignored.
QCParametersDistribution¶
Reports the distributions of each parameter used by QC
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
--ped samples.ped
: File describing the VCF’s samples (See File Formats in the documentation)
Description
Reports the distributions of each parameter used by QC. One parameter per line, sorted values
Warning
The VCF File must contain the following INFO : QD,FS,SOR,MQ,ReadPosRankSum,InbreedingCoeff,MQRankSum
SampleStats¶
Prints statistics about each sample (Mean Depths, TS/TV, Het/HomAlt).
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
--ped samples.ped
: File describing the VCF’s samples (See File Formats in the documentation)
Description
Prints statistics about each sample (Mean Depths, TS/TV, Het/HomAlt). Output format is :
Sample | Group | Sites | Genotyped | Missing | %Missing | MeanDepths | Variants | Singletons | TS | TV | TS/TV | Het | HetRatio | HomAlt
Note
In case of multiallelic variants : Each alternate allele is processed independently.
VQSLod¶
Print VQSLod statistics for each tranche.
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
Description
Print VQSLod statistics for each tranche. Output format is :
Tranche | Mean | Min | D1 | D2 | D3 | D4 | Median | D6 | D7 | D8 | D9 | Max
Warning
File must contain VQSLOD annotations
Formatting Functions¶
ShowFields¶
Shows selected fields of a VCF File
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
--query "field1,field2,...,info:key1;key2;...,geno:key1;key2;..."
: Output columns
Description
Shows selected fields of a VCF File
Query Syntax is Field_1,Field_2,...,Field_n
where Field_x is one of CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO,FORMAT
or info:key1;key2;...;keyN
ex: info:AbHet;AC;AN;AF
or geno:key1;key2;...;keyN
ex : geno:GT;AD;GQ
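To show how the query syntax above decomposes into output columns, here is a small Python parser sketch. It is only an illustration of the grammar as documented; the function name, the returned tuple layout and the error handling are hypothetical and do not describe the tool's internals.

```python
STANDARD_FIELDS = {"CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"}


def parse_query(query):
    """Split a query string into a list of (kind, name) output columns."""
    columns = []
    for field in query.split(","):
        field = field.strip()
        prefix = field[:5].lower()
        if prefix in ("info:", "geno:"):
            kind = prefix[:4]
            # Keys inside an info:/geno: block are separated by semicolons
            keys = [key for key in field[5:].split(";") if key]
            columns.extend((kind, key) for key in keys)
        elif field.upper() in STANDARD_FIELDS:
            columns.append(("field", field.upper()))
        else:
            raise ValueError(f"unknown field in query: {field!r}")
    return columns
```

For instance, the query `CHROM,POS,info:AC;AF,geno:GT` expands to two standard fields, two INFO keys and one genotype key, in that order.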
TSV2HTML¶
Converts a TSV to a HTML
Mandatory Arguments
--file table.tsv
: the input TSV File
--link PositiveInteger
: put link in header, starting at column INDEX (counting from 0)
--title MyTitle
: title of the result HTML page
Description
Converts a TSV to a HTML
VCF2HTML¶
Generates a human-readable HTML file for the given VCF file
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
Description
Creates an HTML file that contains the variants of the VCF file. For each variant, all the VCF fields are displayed. All vep annotations are formatted and shown.
Warning
The input VCF File must have been previously annotated with vep.
VCF2TSV¶
Creates a TSV file, readable in Excel.
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
Description
Creates a TSV file that can be opened in Excel. For each variant, all the VCF fields are displayed. All vep annotations are formatted and shown.
VCF2TSVGeneCsq¶
Creates a TSV file, readable in Excel, keeps only annotations for given genes and consequences
Mandatory Arguments
--vcf input.vcf(.gz)
: VCF file to use as an input. Can be bgzipped
--genes genes.txt
: Filename of gene list
--csq vep.consequence
: Least severe consequence [empty | intergenic_variant | feature_truncation | regulatory_region_variant | feature_elongation | regulatory_region_amplification | regulatory_region_ablation | TF_binding_site_variant | TFBS_amplification | TFBS_ablation | downstream_gene_variant | upstream_gene_variant | non_coding_transcript_variant | NMD_transcript_variant | intron_variant | non_coding_transcript_exon_variant | 3_prime_UTR_variant | 5_prime_UTR_variant | mature_miRNA_variant | coding_sequence_variant | synonymous_variant | stop_retained_variant | start_retained_variant | incomplete_terminal_codon_variant | splice_region_variant | protein_altering_variant | missense_variant | inframe_deletion | inframe_insertion | transcript_amplification | start_lost | stop_lost | frameshift_variant | stop_gained | splice_donor_variant | splice_acceptor_variant | transcript_ablation]
Description
Creates a TSV file that can be opened in Excel. For each variant, all the VCF fields are displayed. All vep annotations are formatted and shown. Only the variants impacting a gene within the given list are displayed. Only the variants with a consequence at least as severe as the one given are displayed.
Warning
The input VCF File must have been previously annotated with vep.
Other Functions¶
BAMCoverage¶
Prints the information from a BAM file needed to compute coverage
Mandatory Arguments
--bam sample1.bam
: The bam file to process
--bed regions.bed
: Regions
Description
TODO set
BAMView¶
Print the content of a BAM file
Mandatory Arguments
--bam sample1.bam
: The bam file to process
Description
Analogous to samtools view
CoverageStats¶
Gets the coverage statistics for an input file
Mandatory Arguments
--tsv cov.tsv.gz
: File containing depth-of-coverage
--chrom chr1
: Chromosome name
Description
Gets the coverage statistics for an input file. The input file has one line per chromosome position [1-chrSize] and one column per sample. Each cell contains the depth of coverage for the given sample at the given position. The output format is :
chr | pos | mean | median | 1 | 5 | 10 | 15 | 20 | 25 | 30 | 40 | 50 | 100
ExtendBed¶
Adds a padding to the left and right of each region in the bed, and merges overlapping regions
Mandatory Arguments
--bed regions.bed
: the Bed file to pad
--pad PositiveInteger
: number of bases to add left and right of each region
Description
Adds a padding to the left and right of each region in the bed, and merges overlapping regions
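The pad-then-merge behaviour can be sketched as below. The function name and the in-memory `(chrom, start, end)` tuples are assumptions for this example, as are two details the documentation does not specify: padded starts are clamped at 0, and regions that merely touch after padding are merged as well.

```python
def extend_and_merge(regions, pad):
    """Pad each (chrom, start, end) region by `pad` bases and merge overlaps."""
    padded = sorted(
        (chrom, max(0, start - pad), end + pad) for chrom, start, end in regions
    )
    merged = []
    for chrom, start, end in padded:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps (or touches) the previous region: extend it
            previous = merged[-1]
            merged[-1] = (chrom, previous[1], max(previous[2], end))
        else:
            merged.append((chrom, start, end))
    return merged
```

Sorting the padded regions first means each region only needs to be compared with the last merged one.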
GeneCards¶
Generates a script to retrieve GeneCards HTML pages for each gene in the given list.
Mandatory Arguments
--file genes.txt
: file listing genes
Description
Generates a script to retrieve GeneCards HTML pages for each gene in the given list.
GeneCardsParser¶
Exports summary data from a GeneCards HTML file as an unformatted table
Mandatory Arguments
--file input.html
: input genecards HTML file
Description
Exports summary data from a GeneCards HTML file as an unformatted table. All HTML markup is removed. The data are formatted as such :
#Gene | GeneCards | Entrez Gene | UniProtKB/Swiss-Prot
GzPaste¶
Unix paste command for gzipped files
Mandatory Arguments
--files file1.gz,file2.gz,...,fileN.gz
: list (comma separated) of gzipped files to paste
Description
Equivalent to the unix paste command without any special option.
Each input file can be either gzipped or not (mixes are possible)
Use --gz to gzip the output
IsInBed¶
Check if a given chromosome:position is contained in a bedFile
Mandatory Arguments
--chrom chromosome
: chromosome name : (chr)[1-25]/X/Y/M/MT
--pos PositiveInteger
: Position
--bed region.bed
: the Bed File to process
Description
Check if a given chromosome:position is contained in a bedFile. If it is : gives the region’s limits. Otherwise : gives the regions before and after the position.
NormalizePed¶
Extract x subgroups of y samples for each group present in the Ped file
Mandatory Arguments
--ped samples.ped
: The input PED file to process
--number PositiveInteger
: Number of subgroups for each group
--size PositiveInteger
: Group Size
Description
Extract x subgroups of y samples for each group present in the Ped file
For example, if the input ped file has three groups A, B and C of 50 individuals each, running the command with --number 3 --size 10
will create 9 groups:
A A2 A3 B B2 B3 C C2 C3, with 10 individuals in each, randomly picked from groups A, B and C respectively.
This function is useful to divide groups, for instance to get 1 learning set and several computing sets.
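The example above can be mimicked with a short sketch. The function name `split_groups` and the `seed` parameter are hypothetical, and the sketch assumes that the subgroups drawn from one group do not share individuals, which the description does not state explicitly.

```python
import random


def split_groups(groups, number, size, seed=None):
    """Draw `number` disjoint subgroups of `size` samples from each group."""
    rng = random.Random(seed)
    result = {}
    for name, samples in groups.items():
        if number * size > len(samples):
            raise ValueError(f"group {name} is too small")
        picked = rng.sample(samples, number * size)
        for i in range(number):
            # The first subgroup keeps the group name, the others get a suffix.
            label = name if i == 0 else f"{name}{i + 1}"
            result[label] = picked[i * size:(i + 1) * size]
    return result


# Three hypothetical groups of 50 individuals each.
groups = {g: [f"{g}_{i}" for i in range(50)] for g in "ABC"}
subgroups = split_groups(groups, number=3, size=10, seed=1)
print(list(subgroups))
```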
RandomPed¶
Keeps N random samples from a Ped File
Mandatory Arguments
--ped samples.ped
: The input PED file to process
--threshold PositiveInteger
: Number of Samples
Description
Keeps N random samples from a Ped File
SimplifyBED¶
Returns a simplified bed (with the smallest number of regions covering all the positions in the input bed file).
Mandatory Arguments
--bed region.bed
: the Bed File to process
Description
Returns a simplified bed (with the smallest number of regions covering all the positions in the input bed file). This is useful when the input bed file contains several overlapping regions.
Graphics¶
GraphCompareFrequencies¶
Compares the frequencies of common variants in 2 populations (output of FrequencyCorrelation / CompareToGnomAD)
Mandatory Arguments
--width PositiveInteger
: Graph’s Width in Pixels
--height PositiveInteger
: Graph’s Height in Pixels
--tsv input.tsv
: input data
--name dataset
: Graph Title
--outdir ResultsDirectory
: The directory that will contain results files
--x PositiveInteger
: 0-based index of the column containing X values
--y PositiveInteger
: 0-based index of the column containing Y values
Description
Compares the frequencies of common variants in 2 populations (output of FrequencyCorrelation / CompareToGnomAD). 4 graphs will be created: linear/log JFS/graph
GraphCountGenotypes¶
Create a graph for the results of CountGenotypes
Mandatory Arguments
--width PositiveInteger
: Graph’s Width in Pixels
--height PositiveInteger
: Graph’s Height in Pixels
--tsv input.tsv
: input data
--csq vep.consequence
: Least severe consequence [empty | intergenic_variant | feature_truncation | regulatory_region_variant | feature_elongation | regulatory_region_amplification | regulatory_region_ablation | TF_binding_site_variant | TFBS_amplification | TFBS_ablation | downstream_gene_variant | upstream_gene_variant | non_coding_transcript_variant | NMD_transcript_variant | intron_variant | non_coding_transcript_exon_variant | 3_prime_UTR_variant | 5_prime_UTR_variant | mature_miRNA_variant | coding_sequence_variant | synonymous_variant | stop_retained_variant | start_retained_variant | incomplete_terminal_codon_variant | splice_region_variant | protein_altering_variant | missense_variant | inframe_deletion | inframe_insertion | transcript_amplification | start_lost | stop_lost | frameshift_variant | stop_gained | splice_donor_variant | splice_acceptor_variant | transcript_ablation]
--outdir ResultsDirectory
: The directory that will contain results files
Description
Create a graph for the results of CountGenotypes
GraphF2¶
Create a graph for the results of F2 or F2Individuals
Mandatory Arguments
--width PositiveInteger
: Graph’s Width in Pixels
--height PositiveInteger
: Graph’s Height in Pixels
--tsv input.tsv
: input data
--name title
: Title (will be printed on the graph)
--color group1:#FF0000,group2:#00FF00,group3:#0000FF
: HTML color for some/all groups
--outdir ResultsDirectory
: The directory that will contain results files
Description
Create a graph for the results of F2 or F2Individuals
GraphJFS¶
Create a graph for the results of JointFrequencySpectrum
Mandatory Arguments
--width PositiveInteger
: Graph’s Width in Pixels
--height PositiveInteger
: Graph’s Height in Pixels
--tsv input.tsv
: input data
--name title
: Title (will be printed on the graph)
--x Set1
: Name of the first Set
--y Set2
: Name of the second Set
--max Scale Max
: Highest number of variants shown on the legend. Enter “null” to use the maximum value from the data
--outdir ResultsDirectory
: The directory that will contain results files
Description
Create a graph for the results of JointFrequencySpectrum
Warning
Expects an NxN matrix, where matrix[a][b] is the number of variants seen a times in the first set and b times in the second set.
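To make the expected layout concrete, the sketch below builds such a matrix from per-variant counts. The function name `joint_frequency_spectrum` and the input pairs are hypothetical, and it assumes that counts start at 0, so N is the largest possible count plus one.

```python
def joint_frequency_spectrum(counts, n):
    """Build an n x n matrix from (count_in_set1, count_in_set2) pairs."""
    matrix = [[0] * n for _ in range(n)]
    for a, b in counts:
        # One more variant seen a times in the first set and b times in the second.
        matrix[a][b] += 1
    return matrix


# Hypothetical per-variant counts for two sets, each count between 0 and 2.
counts = [(0, 1), (2, 2), (0, 1), (1, 0)]
for row in joint_frequency_spectrum(counts, 3):
    print(row)
```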
GraphSampleStats¶
Create a graph for the results of SampleStats
Mandatory Arguments
--width PositiveInteger
: Graph’s Width in Pixels
--height PositiveInteger
: Graph’s Height in Pixels
--tsv input.tsv
: input data
--name title
: Title (will be printed on the graph)
--outdir ResultsDirectory
: The directory that will contain results files
Description
Create a graph for the results of SampleStats. There will be one graph for each of the following values:
Number of Variants
Mean Depth
TS/TV
Het/HomAlt
Missing