.. _functions:

Functions
*********

General Information
===================

Optional Arguments for all Functions
------------------------------------

* :code:`--out ResultFile` : File that will contain the function's results.vcf(.gz)
* :code:`--log LogFile` : File that will contain the function's log.log
* :code:`--gz` : Force all outputs to be bgzipped

Produced ASCII files are automatically bgzipped if the filename given by the user ends with .gz

If the user provides the :code:`--gz` argument, all output files will be bgzipped and .gz will be appended automatically to the filenames (if missing). Output streamed to STD_OUT will also be bgzipped.

VCF Filter Functions
====================

CompoundHeterozygous
--------------------

Keeps only variants that respect the Compound Heterozygous pattern of inheritance.

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped
* :code:`--nohomo TRUE|FALSE` : Reject if a case is homozygous to alternate allele or if a control has none of the allele ?
* :code:`--ped samples.ped` : File describing the VCF's samples (See File Formats in the documentation)
* :code:`--missing TRUE|FALSE` : Missing genotypes allowed ?

**Description**

| Keeps only variants that respect the Compound Heterozygous pattern of inheritance.
| In the Compound Heterozygous pattern of inheritance, two variants V1 and V2 from the same gene are valid if

- All cases have V1 and V2
- No control has V1 and V2

| Thus, a variant is rejected if

- one case doesn't have V1 and V2
- one control has V1 and V2

| With :code:`--missing true`, missing genotypes are considered compatible with the transmission pattern.
| The :code:`--nohomo` option rejects alternate alleles if a case is homozygous to an alternate allele or if at least one control is not heterozygous to an alternate allele of V1/V2.
  (If all the controls are supposed to be parents of cases)
| It might be difficult to read the results, since several combinations of valid variants might exist. An extra INFO field COMPOUND is therefore added, detailing the relations between variants.
| This field reads as
| :code:`COMPOUND=A1>P1(gA|gB|gC)&P2(gD|gE|gF),A2>P3(gG|gH|gI)&P4(gJ|gK|gL),...`
| Where:

1. :code:`Ax` is the number of the allele involved
2. :code:`Px` is the partner allele in the form chr:pos:ref:alt
3. :code:`gX` is the symbol of the gene common to this allele and its partner

.. warning:: The input VCF File must have been previously annotated with vep.

.. note:: In case of multiallelic variants : If at least one alternate allele satisfies all the conditions, the whole variant line is kept.

----------

DeNovo
------

Keeps only variants that are compatible with a De Novo pattern of inheritance.

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped
* :code:`--ped samples.ped` : File describing the VCF's samples (See File Formats in the documentation)
* :code:`--missing TRUE|FALSE` : Missing genotypes allowed ?

**Description**

| Keeps only variants that are compatible with a De Novo pattern of inheritance.
| A compatible variant is

- present in every Case
- absent in every Control

.. warning:: Father/Mother/Child(ren) Trios are expected

.. note:: In case of multiallelic variants : If at least one alternate allele satisfies all the conditions, the whole variant line is kept.

----------

DeNovoRecessive
---------------

Keeps only variants that strictly respect these genotypes: *parent1* 0/1 + *parent2* 0/0 + *child* 1/1

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped
* :code:`--ped samples.ped` : File describing the VCF's samples (See File Formats in the documentation)

**Description**

| Keeps only variants that strictly respect these genotypes: *parent1* 0/1 + *parent2* 0/0 + *child* 1/1
| parent1 and parent2 are interchangeable

.. warning:: Will only run if the input file has a trio with 1 case and 2 controls

.. note:: In case of multiallelic variants : If at least one alternate allele satisfies all the conditions, the whole variant line is kept.

----------

Dominant
--------

Keeps only variants that respect the Dominant pattern of inheritance

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped
* :code:`--ped samples.ped` : File describing the VCF's samples (See File Formats in the documentation)
* :code:`--missing TRUE|FALSE` : Missing genotypes allowed ?
* :code:`--nohomo TRUE|FALSE` : Reject if a case is homozygous to alternate allele ?
* :code:`--mode Mode` : strict : true for all cases | loose : true for at least one case

**Description**

| Keeps only variants that respect the Dominant pattern of inheritance
| In the Dominant pattern of inheritance

- Cases should have the causal variant
- Controls cannot have the causal variant

| Thus, a variant is rejected if

- one case doesn't possess the alternate allele (strict mode)
- one control possesses the alternate allele

| If :code:`--missing true`, missing genotypes are considered compatible with the transmission pattern.
| The :code:`--nohomo` option rejects alternate alleles if at least one case is homozygous. (For example, if you expect that the resulting phenotype would not be consistent.)
| In the strict mode, all cases must have the alternate allele. In the loose mode, only one case has to have the allele (More permissive for larger panels).

.. note:: In case of multiallelic variants : If at least one alternate allele satisfies all the conditions, the whole variant line is kept.

----------

FilterConsequenceLevel
----------------------

Filters the variants according to their consequences

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input.
  Can be bgzipped
* :code:`--csq vep.consequence` : Least severe consequence [empty | intergenic_variant | feature_truncation | regulatory_region_variant | feature_elongation | regulatory_region_amplification | regulatory_region_ablation | TF_binding_site_variant | TFBS_amplification | TFBS_ablation | downstream_gene_variant | upstream_gene_variant | non_coding_transcript_variant | NMD_transcript_variant | intron_variant | non_coding_transcript_exon_variant | 3_prime_UTR_variant | 5_prime_UTR_variant | mature_miRNA_variant | coding_sequence_variant | synonymous_variant | stop_retained_variant | start_retained_variant | incomplete_terminal_codon_variant | splice_region_variant | protein_altering_variant | missense_variant | inframe_deletion | inframe_insertion | transcript_amplification | start_lost | stop_lost | frameshift_variant | stop_gained | splice_donor_variant | splice_acceptor_variant | transcript_ablation]

**Description**

| Filters the variants according to their consequences
| The consequence of the variant must be at least as severe as the one given

.. warning:: The input VCF File must have been previously annotated with vep.

.. note:: In case of multiallelic variants : If at least one alternate allele satisfies all the conditions, the whole variant line is kept.

----------

FilterF2
--------

Filters variants to keep only those contributing to F2 data.

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped
* :code:`--ped samples.ped` : File describing the VCF's samples (See File Formats in the documentation)
* :code:`--prefix prefix` : Output filename prefix
* :code:`--outdir ResultsDirectory` : The directory that will contain results files

**Description**

| Filters variants to keep only those contributing to F2 data.
| F2 data are described in PubMedId: 23128226, figure 3a
| Six sets of results are given, one for:

1. All variants
2. All SNVs
3. variants without rs
4. SNVs without rs
5. variants with rs
6. SNVs with rs

.. warning:: The input VCF File must have been previously annotated with vep.

.. note:: In case of multiallelic variants : If at least one alternate allele satisfies all the conditions, the whole variant line is kept.

----------

FilterFrequencies
-----------------

Keeps only variants with frequencies below the threshold in all of the selected populations.

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped
* :code:`--threshold 0.0-1.0` : maximum frequency in any population
* :code:`--pop pop1,pop2,...,popN` : List of Populations to test (from AF, AFR_AF, AMR_AF, EAS_AF, EUR_AF, SAS_AF, AA_AF, EA_AF, gnomAD_AF, gnomAD_AFR_AF, gnomAD_AMR_AF, gnomAD_ASJ_AF, gnomAD_EAS_AF, gnomAD_FIN_AF, gnomAD_NFE_AF, gnomAD_OTH_AF, gnomAD_SAS_AF, MAX_AF)

**Description**

| Keeps only variants with frequencies below the threshold in all of the selected populations.
| If the variant's frequency exceeds the threshold for any of the selected populations, the variant is filtered out.

.. warning:: The input VCF File must have been previously annotated with vep.

.. note:: In case of multiallelic variants : If at least one alternate allele satisfies all the conditions, the whole variant line is kept.

----------

FilterGeneCsqLevel
------------------

Filters the variants according to their consequences on a list of genes.

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input.
  Can be bgzipped
* :code:`--genes genes.txt` : File listing genes to keep
* :code:`--csq vep.consequence` : Least severe consequence [empty | intergenic_variant | feature_truncation | regulatory_region_variant | feature_elongation | regulatory_region_amplification | regulatory_region_ablation | TF_binding_site_variant | TFBS_amplification | TFBS_ablation | downstream_gene_variant | upstream_gene_variant | non_coding_transcript_variant | NMD_transcript_variant | intron_variant | non_coding_transcript_exon_variant | 3_prime_UTR_variant | 5_prime_UTR_variant | mature_miRNA_variant | coding_sequence_variant | synonymous_variant | stop_retained_variant | start_retained_variant | incomplete_terminal_codon_variant | splice_region_variant | protein_altering_variant | missense_variant | inframe_deletion | inframe_insertion | transcript_amplification | start_lost | stop_lost | frameshift_variant | stop_gained | splice_donor_variant | splice_acceptor_variant | transcript_ablation]

**Description**

| Filters the variants according to their consequences on a list of genes.
| Only the variants impacting at least one of the genes in the list are kept.
| The consequence of the variant on the gene must be at least as severe as the one given

.. warning:: The input VCF File must have been previously annotated with vep.

.. note:: In case of multiallelic variants : If at least one alternate allele satisfies all the conditions, the whole variant line is kept.

----------

FilterGeneCsqList
-----------------

Filters the variants to keep only those that affect one of the given genes with one of the given consequences.

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped
* :code:`--genes genes.txt` : List of the genes to keep
* :code:`--csq csq1,csq2,...,csqN` : List (comma separated) of VEP consequences to keep

**Description**

| Filters the variants to keep only those that affect one of the given genes with one of the given consequences.
| If the variant has at least one of the effects from :code:`--csq` on one of the genes in the file from :code:`--genes`, then the variant is kept.
| The list of effects can be empty : :code:`--csq null`
| VEP consequence must be selected from : [empty | intergenic_variant | feature_truncation | regulatory_region_variant | feature_elongation | regulatory_region_amplification | regulatory_region_ablation | TF_binding_site_variant | TFBS_amplification | TFBS_ablation | downstream_gene_variant | upstream_gene_variant | non_coding_transcript_variant | NMD_transcript_variant | intron_variant | non_coding_transcript_exon_variant | 3_prime_UTR_variant | 5_prime_UTR_variant | mature_miRNA_variant | coding_sequence_variant | synonymous_variant | stop_retained_variant | start_retained_variant | incomplete_terminal_codon_variant | splice_region_variant | protein_altering_variant | missense_variant | inframe_deletion | inframe_insertion | transcript_amplification | start_lost | stop_lost | frameshift_variant | stop_gained | splice_donor_variant | splice_acceptor_variant | transcript_ablation]

.. warning:: The input VCF File must have been previously annotated with vep.

.. note:: In case of multiallelic variants : If at least one alternate allele satisfies all the conditions, the whole variant line is kept.

----------

FilterGenotype
--------------

Filters the variants to match the given genotype filter.

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped
* :code:`--filter "SAMPLE1:geno1:keep1,SAMPLE2:geno2:keep2,...,SAMPLEN:genoN:keepN"` : List (comma separated) of samples, their associated genotypes and whether they are to be kept

**Description**

| Filters the variants to match the given genotype filter.
| If the genotype of at least one sample mismatches, the variant is **Excluded**.
| Filter format : :code:`SAMPLE1:geno1:keep1,SAMPLE2:geno2:keep2,...,SAMPLEN:genoN:keepN`
| :code:`Keep=true|false` tells whether to keep (true) or exclude (false) variants where this sample has the matching genotype
| Example :code:`SA:0/0:false,SB:0/1:true,SC:0/1:true,SD:1/1:false` will keep variants that are 0/1 for *SB* and *SC*, and that aren't 0/0 for *SA* or 1/1 for *SD*

----------

FilterGnomADFrequency
---------------------

Filters out variants with frequencies above threshold in GnomAD

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped
* :code:`--threshold 0.0-1.0` : Maximum GnomAD Frequency

**Description**

| Filters out variants with frequencies above threshold in GnomAD
| In case of a multiallelic variant, if any alternate allele passes the filter, the variant is kept

.. warning:: The input VCF File must have been previously annotated with vep.

.. note:: In case of multiallelic variants : If at least one alternate allele satisfies all the conditions, the whole variant line is kept.

----------

FilterKnownID
-------------

Keeps only variants with an empty 3rd field (ID)

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped

**Description**

| Keeps only variants with an empty 3rd field (ID)
| The field must be empty or equal to "."

----------

FilterNew
---------

Keeps only the variants not found in any of dbSNP, 1KG or GnomAD

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped

**Description**

| Keeps only the variants not found in any of dbSNP, 1KG or GnomAD

.. warning:: The input VCF File must have been previously annotated with vep.

.. note:: In case of multiallelic variants : If at least one alternate allele satisfies all the conditions, the whole variant line is kept.

----------

FoundInAllCases
---------------

Keeps variants found in every "Case" sample

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input.
  Can be bgzipped
* :code:`--ped samples.ped` : File describing the VCF's samples (See File Formats in the documentation)

**Description**

| Keeps variants found in every "Case" sample
| Case samples are defined by a "1" in the 6th field of the :code:`--ped` file.
| In case of a multiallelic variant, if any variant allele is found or missing, the whole variant is kept.

.. note:: In case of multiallelic variants : If at least one alternate allele satisfies all the conditions, the whole variant line is kept.

----------

HQ
--

Extract HQ variants.

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped

**Description**

| Extract HQ variants. Defined in 1.12 of the supplementary information of PubMedID=27535533 as

1. VQSR PASS
2. At least 80% of the genotypes have DP above 10 and GQ above 20
3. At least one variant genotype has DP above 10 and GQ above 20

----------

KeepHomoAlt
-----------

Returns a VCF containing only the positions homozygous to alt for the given SAMPLES

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped
* :code:`--sample s1,s2,...sN` : list (comma separated) of samples to test

**Description**

| Returns a VCF containing only the positions homozygous to alt for the given SAMPLES

.. note:: In case of multiallelic variants : The variant is kept if different samples are homozygous to different alternate alleles

----------

MergeVQSR
---------

Merges SNP and INDEL results files from VQSR

**Mandatory Arguments**

* :code:`--snp snp.vcf` : File containing SNP output from VQSR
* :code:`--indel indel.vcf` : File containing INDEL output from VQSR

**Description**

| Merges SNP and INDEL results files from VQSR

----------

MonoAllelicSNV
--------------

Keeps only the lines containing monoallelic SNVs

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input.
  Can be bgzipped

**Description**

| Keeps only the lines containing monoallelic SNVs

----------

NotFoundInAnyControl
--------------------

Removes variants that are found in controls.

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped
* :code:`--ped samples.ped` : File describing the VCF's samples (See File Formats in the documentation)

**Description**

| Removes variants that are found in controls.
| Control samples are defined by a "2" in the 6th field of the :code:`--ped` file.
| In case of a multiallelic variant, if any variant allele isn't found, the whole variant is kept.

.. note:: In case of multiallelic variants : If at least one alternate allele satisfies all the conditions, the whole variant line is kept.

----------

QC
--

Runs a Quality Control on VCF variants

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped
* :code:`--ped samples.ped` : File describing the VCF's samples (See File Formats in the documentation)
* :code:`--opt custom.parameters.tsv` : file containing the various thresholds for the QC (see Documentation)
* :code:`--report fiteredVariant.tsv` : output file listing all the variants that were filtered, and why

**Description**

| Runs a Quality Control on VCF variants
| A report file gives the reason(s) each variant has been filtered
| For more details, see https://gitlab.com/gmarenne/ravaq
| For each group G, the info field has new annotations

- G_AN: AlleleNumber for this group
- G_AC: AlleleCounts for this group
- G_AF: AlleleFrequencies for this group

.. warning:: The VCF File must contain the following INFO : QD,FS,SOR,MQ,ReadPosRankSum,InbreedingCoeff,MQRankSum

----------

RandomVariants
--------------

Keeps only a portion of the variants from a VCF file.

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input.
  Can be bgzipped
* :code:`--ratio 0.0-1.0` : Probability of keeping each variant
* :code:`--file positions.txt` : File listing positions to keep regardless of the given probability, in the format chr:position

**Description**

| Keeps only a portion of the variants from a VCF file.
| Each line has a :code:`--ratio` chance of being kept.
| Positions listed in the file :code:`--file` are always kept

----------

Recessive
---------

Keeps only variants that respect the Recessive pattern of inheritance.

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped
* :code:`--ped samples.ped` : File describing the VCF's samples (See File Formats in the documentation)
* :code:`--missing TRUE|FALSE` : Missing genotypes allowed ?
* :code:`--nohomo TRUE|FALSE` : Reject if a control is homozygous to reference allele ?
* :code:`--mode Mode` : strict : true for all cases | loose : true for at least one case

**Description**

| Keeps only variants that respect the Recessive pattern of inheritance.
| In the Recessive pattern of inheritance

- Cases should be homozygous to the causal allele
- Controls should not be homozygous to the causal allele

| Thus, a variant is rejected if

- one case isn't homozygous to the alternate allele (strict mode)
- one control is homozygous to the alternate allele

| If :code:`--missing true`, missing genotypes are considered compatible with the transmission pattern.
| The :code:`--nohomo` option rejects alternate alleles if at least one control is not heterozygous to the alternate allele. (If all the controls are supposed to be parents of cases)
| In the strict mode, all cases must be homozygous to the alternate allele. In the loose mode, only one case has to be homozygous to the allele (More permissive for larger panels).

.. note:: In case of multiallelic variants : If at least one alternate allele satisfies all the conditions, the whole variant line is kept.
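The Recessive rules above can be sketched for a single biallelic variant as follows. This is only an illustration of the documented rules, not the tool's implementation: the function name ``recessive_keeps`` and its parameters are hypothetical, genotypes are assumed to be plain ``GT`` strings, and in loose mode a missing case genotype is assumed to count as compatible when missing genotypes are allowed.

```python
def alleles(genotype):
    """Split a GT string such as "0/1" or "1|1" into its allele codes."""
    return genotype.replace("|", "/").split("/")


def is_missing(genotype):
    return "." in alleles(genotype)


def is_hom_alt(genotype):
    return alleles(genotype) == ["1", "1"]


def is_het(genotype):
    return sorted(alleles(genotype)) == ["0", "1"]


def recessive_keeps(cases, controls, missing=False, nohomo=False, mode="strict"):
    """Return True if the variant is compatible with the Recessive pattern.

    Hypothetical helper: cases and controls are lists of GT strings.
    """
    def case_ok(gt):
        if is_missing(gt):
            return missing          # compatible only if missing genotypes are allowed
        return is_hom_alt(gt)       # cases must be homozygous to the alternate allele

    def control_ok(gt):
        if is_missing(gt):
            return missing
        if is_hom_alt(gt):
            return False            # a homozygous-alt control rejects the variant
        # With nohomo, every control must be heterozygous (e.g. parents of cases).
        return is_het(gt) if nohomo else True

    if mode == "strict":
        cases_ok = all(case_ok(gt) for gt in cases)
    else:                           # loose: one compatible case is enough
        cases_ok = any(case_ok(gt) for gt in cases)
    return cases_ok and all(control_ok(gt) for gt in controls)
```

For instance, ``recessive_keeps(["1/1", "0/1"], ["0/1"])`` is rejected in strict mode because one case is not homozygous, while the same call with ``mode="loose"`` keeps the variant.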

----------

Recode
------

Reads all lines in a VCF file

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped

**Description**

| Reads all lines in a VCF file
| Outputs the input VCF file after applying the various command line filters

----------

RemoveNonSNV
------------

Removes variant lines that contain no SNVs

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped

**Description**

| Removes variant lines that contain no SNVs

.. note:: In case of multiallelic variants : If at least one alternate allele satisfies all the conditions, the whole variant line is kept.

----------

RemoveNonVariant
----------------

Removes variants where only 0/0 and ./. genotypes are present

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped

**Description**

| Removes variants where only 0/0 and ./. genotypes are present

----------

SplitByChromosome
-----------------

Splits a given vcf file by chromosome

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped
* :code:`--outdir ResultsDirectory` : The directory that will contain results files

**Description**

| Splits a given vcf file and produces one resulting vcf file per chromosome.

----------

SplitByGene
-----------

Creates an output VCF file for each gene.

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped
* :code:`--outdir ResultsDirectory` : The directory that will contain results files

**Description**

| Creates an output VCF file for each gene.
| Some variants can be in several output files, if they impact several genes.

.. warning:: The input VCF File must have been previously annotated with vep.

.. note:: In case of multiallelic variants : If at least one alternate allele satisfies all the conditions, the whole variant line is kept.
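To illustrate why a variant can end up in several SplitByGene outputs, here is a minimal sketch of the grouping step. It assumes the gene symbols of each variant have already been extracted from the vep annotation; the function ``split_by_gene`` and its input shape are hypothetical and do not describe the tool's internals.

```python
from collections import defaultdict


def split_by_gene(variants):
    """Group variant lines by gene symbol.

    variants: iterable of (variant_line, gene_symbols) pairs, where
    gene_symbols lists the genes impacted by that variant (assumed input).
    """
    per_gene = defaultdict(list)
    for line, genes in variants:
        # A variant impacting several genes is appended once per gene.
        for gene in sorted(set(genes)):
            per_gene[gene].append(line)
    return dict(per_gene)


buckets = split_by_gene([("v1", ["BRCA1", "NBR2"]), ("v2", ["BRCA1"])])
```

Here ``v1`` is listed under both ``BRCA1`` and ``NBR2``, which mirrors a variant line being written to two output files.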

----------

SplitFromDB
-----------

Generates two new VCF files with variants present/absent in 1kG/GnomAD.

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped
* :code:`--outdir ResultsDirectory` : The directory that will contain results files

**Description**

| Generates two new VCF files :

- **inDB.MYVCF.vcf** (with variants present in 1kG/GnomAD)
- **notInDB.MYVCF.vcf** (with variants absent from 1kG/GnomAD)

.. warning:: The input VCF File must have been previously annotated with vep.

.. note:: In case of multiallelic variants : If at least one alternate allele satisfies all the conditions, the whole variant line is kept in "inDB.MYVCF.vcf".

----------

StrictCompoundHeterozygous
--------------------------

Keeps only variants that strictly respect the Compound Heterozygous pattern of inheritance.

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped
* :code:`--nohomo TRUE|FALSE` : Reject if a case is homozygous to alternate allele or if a control has none of the allele ?
* :code:`--ped samples.ped` : File describing the VCF's samples (See File Formats in the documentation)

**Description**

| Keeps only variants that strictly respect the Compound Heterozygous pattern of inheritance.
| In the Compound Heterozygous pattern of inheritance, two variants V1 and V2 from the same gene are valid if

- All cases have V1 and V2 from their parents
- For each case, one parent (control) has one of V1/V2 while the other parent has the other one

| A variant is kept if and only if :

- All cases have V1 and V2
- All cases have a parent (control) with V1 and not V2, and the other parent with V2 and not V1

| The :code:`--nohomo` option rejects alternate alleles if a sample is homozygous to them.
| It might be difficult to read the results, since several combinations of valid variants might exist. An extra INFO field COMPOUND is therefore added, detailing the relations between variants.
| This field reads as
| :code:`COMPOUND=A1>P1(gA|gB|gC)&P2(gD|gE|gF),A2>P3(gG|gH|gI)&P4(gJ|gK|gL),...`
| Where:

1. :code:`Ax` is the number of the allele involved
2. :code:`Px` is the partner allele in the form chr:pos:ref:alt
3. :code:`gX` is the symbol of the gene common to this allele and its partner

.. warning:: The input VCF File must have been previously annotated with vep.

.. warning:: This function expects a complete definition of the samples, where all cases are affected children and both their parents are identified controls.

.. note:: In case of multiallelic variants : If at least one alternate allele satisfies all the conditions, the whole variant line is kept.

VCF Transformation Functions
============================

ClearInfoField
--------------

Replaces the Info column by "."

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped

**Description**

| Removes all annotations from the VCF file by replacing the content of the Info column by "."

----------

FilterCsqExtractGene
--------------------

Filters variants according to consequences. Replaces ID by gene_chr_pos.

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input.
  Can be bgzipped
* :code:`--csq vep.consequence` : Least severe consequence [empty | intergenic_variant | feature_truncation | regulatory_region_variant | feature_elongation | regulatory_region_amplification | regulatory_region_ablation | TF_binding_site_variant | TFBS_amplification | TFBS_ablation | downstream_gene_variant | upstream_gene_variant | non_coding_transcript_variant | NMD_transcript_variant | intron_variant | non_coding_transcript_exon_variant | 3_prime_UTR_variant | 5_prime_UTR_variant | mature_miRNA_variant | coding_sequence_variant | synonymous_variant | stop_retained_variant | start_retained_variant | incomplete_terminal_codon_variant | splice_region_variant | protein_altering_variant | missense_variant | inframe_deletion | inframe_insertion | transcript_amplification | start_lost | stop_lost | frameshift_variant | stop_gained | splice_donor_variant | splice_acceptor_variant | transcript_ablation]

**Description**

| Filters variants according to consequences.
| Replaces ID by gene_chr_pos.

.. warning:: The input VCF File must have been previously annotated with vep.

.. note:: In case of multiallelic variants : Only Ref and one alternate allele are kept. The kept alternate is (in this order) : 1. the most severe; 2. the most frequent in the file; 3. the first one.

----------

GenerateHomoRefForPosition
--------------------------

Generates a VCF with Homozygous-to-Reference Genotypes for every given position and each sample (Alternate is a transition A↔G, C↔T)

**Mandatory Arguments**

* :code:`--ref Reference.fasta` : Fasta File containing the reference genome
* :code:`--pos positions.tsv` : List of positions in the results VCF file
* :code:`--sample samples.txt` : List of samples in the results VCF File

**Description**

| Given a Reference Genome, a list of positions and a list of individuals, generates a VCF file with Homozygous-to-Reference Genotypes for every given position and each sample.
| The reference must be in .fasta format, with its associated .fai index in the same directory
| The position file must contain one position per line, in the format : chr pos
| The sample file must contain one sample per line
| Each given position is looked up
| The Ref of the VCF is taken from the given reference genome
| Alt = Transition(Ref) : A↔G / C↔T
| Format for each position is :code:`GT:DP:GQ:AD:PL`
| Each genotype is :code:`0/0:30:99:30,0:0,50,500`

----------

MissingToMajor
--------------

Replaces every missing genotype by the most frequent allele present

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped

**Description**

| Replaces every missing genotype by the most frequent allele present
| Updates AC/AN/AF annotations
| The genotype is homozygous to the most frequent allele A in the form :code:`A/A:0:0:0,0,0...`

.. note:: In case of multiallelic variants : The major allele is the most frequent allele from ref and each alternate.

----------

MissingToRef
------------

Replaces every missing genotype by :code:`0/0:0:0:0....`

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped

**Description**

| Replaces every missing genotype by :code:`0/0:0:0:0....`
| Updates AC/AN/AF annotations

----------

Scramble
--------

Outputs the same VCF but randomly reassigns the genotypes among the samples

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped

**Description**

| This function can be used to anonymize a VCF file. The AC/AN/AF of each variant will stay consistent, but the haplotypes will be broken.
| For each line, the genotypes are randomly reassigned among the samples.
| The random reassignment is different for each line

----------

SetGenotypeFromProbability
--------------------------

Assigns a genotype to each sample, at each position, from the GenotypeProbability annotation.
If a genotype is already present, it can be kept or replaced

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped
* :code:`--overwrite TRUE|FALSE` : overwrite existing genotypes ?

**Description**

| The highest probability (given by annotation GP=p1,p2,p3) determines the genotype that will be assigned

- highest=p1 → 0/0
- highest=p2 → 0/1
- highest=p3 → 1/1

| If a genotype is already present, it can be kept or replaced

.. warning:: The input VCF file must contain Genotype Probability (GP=p1,p2,p3) for each genotype

.. note:: In case of multiallelic variants : An error will be thrown, as this function expects only monoallelic variants. The affected variant line will be dropped.

----------

SplitMultiAllelic
-----------------

Splits multiallelic variants into several lines of monoallelic variants

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped

**Description**

| Splits multiallelic variants into several lines of monoallelic variants

----------

VCFToReference
--------------

Outputs the given VCF File and reverts genotypes when ref/alt alleles are inverted according to the given reference (as a fasta file)

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped
* :code:`--ref Reference.fasta` : Fasta File containing the reference genome

**Description**

| This function can be used when ref/alt alleles might be inverted (for example when the vcf file has been converted from a plink file)
| At each position, the given reference genome is checked to see which allele matches the reference
| If none of the alleles matches the reference, the line is dropped, and a warning is displayed
| AC/AN/AF are updated

.. warning:: Annotations (INFO/AD/PL/...) are not updated

.. note:: In case of multiallelic variants : An error will be thrown, as this function expects only monoallelic variants. The affected variant line will be dropped.
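The allele check described for VCFToReference can be sketched for one monoallelic site as follows. This is an illustration under stated assumptions, not the tool's code: the function ``revert_to_reference`` is hypothetical, the reference base is assumed to have been read from the fasta beforehand, and only the ``GT`` value is handled (other annotations are left aside, as the warning above states).

```python
def revert_to_reference(ref, alt, genotype, reference_base):
    """Orient a monoallelic site on the reference genome.

    Returns (ref, alt, genotype), or None when neither allele matches
    the reference base, in which case the line would be dropped.
    """
    if ref == reference_base:
        return ref, alt, genotype   # already oriented on the reference
    if alt == reference_base:
        # Swap the alleles and flip each allele index (0 <-> 1).
        # Separators ("/" or "|") and missing alleles (".") are kept as-is.
        flip = {"0": "1", "1": "0"}
        flipped = "".join(flip.get(ch, ch) for ch in genotype)
        return alt, ref, flipped
    return None


swapped = revert_to_reference("A", "G", "0/0", "G")
```

In this example the reference genome carries ``G``, so the alleles are swapped and the genotype ``0/0`` becomes ``1/1``.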
VCF Annotation Functions
========================

AddAlleleBalance
----------------

Adds the annotations : AB, ABhet, ABhom, OND to a VCF file

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped

**Description**

| Adds the following annotations :

- **AB** : Allele balance for each het genotype (alleleDepth(gt1) / (alleleDepth(gt1) + alleleDepth(gt2)))
- **ABhet** : Allele Balance for heterozygous calls (ref/(ref+alt)), for each variant
- **ABhom** : Allele Balance for homozygous calls (A/(A+O)) where A is the allele (ref or alt) and O is anything other, for each variant
- **OND** : Overall non-diploid ratio (alleles/(alleles+non-alleles)), for each variant

| The algorithm is taken from GATK, with the following change: results are also available for INDELs and multiallelic variants (use with caution)

----------

AddDbSNP
--------

Adds/updates dbSNP information to the VCF from a dbSNP release file

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped
* :code:`--ref dbsnp.vcf` : dbSNP reference VCF File (can be gzipped)

**Description**

| Adds dbSNP RS in ID field and INFO field
| Adds :code:`RS=` and :code:`dbSNPBuildID=` in INFO field from the input file :code:`--ref`.

.. note:: In case of multiallelic variants : Annotation is added/updated for each alternate allele (comma-separated).

----------

AddGroupACANAF
--------------

Adds AN, AC, AF annotations for each group described in the ped file

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped
* :code:`--ped samples.ped` : File describing the VCF's samples (See File Formats in the documentation)

**Description**

| For each group **G**, the info field has new annotations

- :code:`G_AN` AlleleNumber for this group
- :code:`G_AC` AlleleCounts for this group
- :code:`G_AF` AlleleFrequencies for this group

.. note:: In case of multiallelic variants : Annotation is added/updated for each alternate allele (comma-separated).

----------

AddWorstAndCanonicalConsequence
-------------------------------

For each variant, adds the most severe consequence from vep and the consequence from vep for the annotation marked as Canonical.

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped

**Description**

| For each variant, adds the most severe consequence from vep and the consequence from vep for the annotation marked as Canonical.
| The worst consequence is annotated with the keyword :code:`WORSTCSQ`
| The gene for the worst consequence is annotated with the keyword :code:`WORSTGENE`
| The canonical consequence is annotated with the keyword :code:`CANONICALCSQ`
| The gene for the canonical consequence is annotated with the keyword :code:`CANONICALGENE`
| If more than one annotation is marked as canonical, the most severe of them is kept
| If no annotation is marked as canonical, the most severe consequence is kept

.. warning:: The input VCF File must have been previously annotated with vep.

.. note:: In case of multiallelic variants : Annotation is added/updated for each alternate allele (comma-separated).

----------

ReaffectdbSNP
-------------

Puts all observed RS numbers in the ID column

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped

**Description**

| Takes the rs numbers from the *Existing_variation* annotation (from vep) and adds them to the ID column of the VCF. Puts "." if no RS has been found

.. warning:: The input VCF File must have been previously annotated with vep.

.. note:: In case of multiallelic variants : All RS IDs from all alternate alleles are listed in the ID column.
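
As a rough illustration of the ReaffectdbSNP behaviour, the Python sketch below collects rs numbers from *Existing_variation* values and builds the ID column. The function name ``rs_ids`` is hypothetical, and two assumptions are made that the text above does not state: that several identifiers within one *Existing_variation* value are joined with ``&`` (as in VEP's VCF output), and that several RS IDs are written to the ID column separated by ``;``.

```python
from typing import Iterable, List


def rs_ids(existing_variation: Iterable[str]) -> str:
    """Build the ID column from the Existing_variation values of one variant line.

    Hypothetical helper: one input string per VEP annotation of the line.
    """
    seen: List[str] = []
    for value in existing_variation:
        # Assumption: identifiers within one value are separated by '&'
        for name in value.split("&"):
            if name.startswith("rs") and name not in seen:
                seen.append(name)
    # Assumption: multiple IDs are joined with ';'; "." means no RS was found
    return ";".join(seen) if seen else "."
```

With this sketch, ``rs_ids(["rs123&COSM1", "rs456"])`` gives ``rs123;rs456``, and a line without any rs number gives ``.``.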
----------

UpdateACANAF
------------

Resets the :code:`AC` :code:`AN` and :code:`AF` values for the given VCF file

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped

**Description**

| Adds/Updates the :code:`AC` :code:`AN` and :code:`AF` values for the VCF.

.. note:: In case of multiallelic variants : Annotation is added/updated for each alternate allele (comma-separated).

Analysis Functions
==================

CheckReference
--------------

For every position in the vcf file, compares the reference from the VCF to the one in the fasta

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped
* :code:`--ref Reference.fasta` : Fasta File containing the reference genome

**Description**

| For every position in the vcf file, gets the reference from the given fasta and prints :

======= ===== ========= ===========
CHROM   POS   VCF_REF   FASTA_REF
======= ===== ========= ===========

.. warning:: Lines containing indels are ignored

----------

Chi2
----

Performs a chi² Association Test on an input VCF file

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped
* :code:`--ped samples.ped` : File describing the VCF's samples (See File Formats in the documentation)

**Description**

| Does a simple association test on the data present in the input vcf file.
| Computes the number of case samples with and without variants, and the number of control samples with and without variants.
| Then performs a chi² test on those values

.. note:: In case of multiallelic variants : Each alternate allele is processed independently.

----------

CommonVariants
--------------

Displays the list of variants that are common to two VCF files

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input.
  Can be bgzipped
* :code:`--file smallest.file.vcf` : the smallest of the two input VCF files (can be gzipped)

**Description**

| Displays the list of variants that are common to two VCF files
| Output is given as a list of canonical variants.

.. warning:: For faster execution, use --vcf with the largest file and --file with the smallest one

.. note:: In case of multiallelic variants : Each alternate allele is processed independently.

----------

CompareGenotype
---------------

Compares the genotypes of the samples in the first and second VCF file.

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped
* :code:`--ped samples.ped` : File describing the VCF's samples (See File Formats in the documentation)
* :code:`--vcf2 File2.vcf(.gz)` : the second input VCF file (can be bgzipped)

**Description**

| Compares the genotypes of the samples in the first and second VCF file.
| Both VCF files are supposed to contain the same samples. This function compares the genotypes of each sample for each variant across the files.
| This can be useful, for example, to compare 2 calling algorithms.
| Output format is :

======== ======= ======= ========= ========= ============= ============== ==========
Sample   Group   Total   Concord   Discord   LeftMissing   RightMissing   %Concord
======== ======= ======= ========= ========= ============= ============== ==========

.. note:: In case of multiallelic variants : Alternate alleles are expected to be the same and in the same order in both files

----------

CompareToGnomAD
---------------

Compares the variants present in a VCF file to those present in a GnomAD VCF file

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input.
  Can be bgzipped
* :code:`--file GnomAD.site.vcf.gz` : GnomAD VCF File (can be gzipped)

**Description**

| Compares the variants present in a VCF file to those present in a GnomAD VCF file
| Output format will be:

====== ===== ==== ===== ===== ====== ===== ====== ==== ==== ==== =========== =========== ===========
#CHR   POS   ID   REF   ALT   QUAL   CSQ   GENE   AC   AF   AN   GnomAD_AC   GnomAD_AF   GnomAD_AN
====== ===== ==== ===== ===== ====== ===== ====== ==== ==== ==== =========== =========== ===========

.. warning:: The input VCF File must have been previously annotated with vep.

.. note:: In case of multiallelic variants : Each alternate allele is processed independently.

----------

CountFromPublicDB
-----------------

Returns the number of Variants, SNVs, INDEL, in dbSNP, 1kG, GnomAD.

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped

**Description**

| Returns the number of Variants, SNVs, INDEL, in dbSNP, 1kG, GnomAD.
| Output format is (For all/SNVs/Indels):

======= ======= ===== ======== =========== ========= ============
Total   dbSNP   1kG   GnomAD   Not dbSNP   Not 1kG   Not GnomAD
======= ======= ===== ======== =========== ========= ============

.. warning:: The input VCF File must have been previously annotated with vep.

.. note:: In case of multiallelic variants : Each alternate allele is processed independently.

----------

CountGenotypes
--------------

Counts the genotypes :code:`0/1` and :code:`1/1` for each variant

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input.
  Can be bgzipped
* :code:`--ped samples.ped` : File describing the VCF's samples (See File Formats in the documentation)

**Description**

| Counts the genotypes :code:`0/1` and :code:`1/1` for each variant
| The output format is:

======= ===== ===== ===== ============= ==================== ======================
CHROM   POS   REF   ALT   CONSEQUENCE   TOTAL_HETEROZYGOUS   TOTAL_HOMOZYGOUS_ALT
======= ===== ===== ===== ============= ==================== ======================

| Followed by the number of heterozygous and homozygous genotypes for each group defined in the ped file.

.. warning:: The input VCF File must have been previously annotated with vep.

.. note:: In case of multiallelic variants : Each alternate allele is processed independently.

----------

CountMissing
------------

For each sample in the PED file, prints a summary of missingness

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped
* :code:`--ped samples.ped` : File describing the VCF's samples (See File Formats in the documentation)
* :code:`--threshold 0.0-1.0` : Maximum ratio of Missing Individuals per position

**Description**

| For each sample in the PED file, prints a summary, in the format

========= ======= =========== ============ =========== ===== ===== ================ ===============
#SAMPLE   TOTAL   GENOTYPED   NB_MISSING   %_MISSING   REF   ALT   Total_Variants   Kept_Variants
========= ======= =========== ============ =========== ===== ===== ================ ===============

| where

- :code:`SAMPLE` the sample name
- :code:`TOTAL` total variants kept
- :code:`GENOTYPED` variants with non missing genotypes for this sample
- :code:`NB_MISSING` variants with missing genotypes for this sample
- :code:`%_MISSING` percent of genotypes missing for this sample
- :code:`REF` number of variants homozygous to the ref for this sample
- :code:`ALT` number of variants not homozygous to the ref for this sample

| The header of the output also contains the total number of variants present
  in the input file and the number of variants that are kept

.. warning:: Kept variants are those whose ratio of missing genotypes is below :code:`--threshold`

----------

CountVariants
-------------

Counts the number of variants for each sample and prints a summary for each group

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped
* :code:`--ped samples.ped` : File describing the VCF's samples (See File Formats in the documentation)

**Description**

| Counts the number of variants for each sample and prints a summary for each group
| Results Format :

========== ==== ========== ========== ===== =========== ======= ============
FamilyID   ID   MotherID   FatherID   Sex   Phenotype   Group   NbVariants
========== ==== ========== ========== ===== =========== ======= ============

.. note:: In case of multiallelic variants : Each alternate allele is processed independently.

----------

DbSNPMismatch
-------------

Checks if there is a discrepancy between the ID Column and the VEP annotation for RS ID.

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped

**Description**

| Checks if there is a discrepancy between the ID Column and the VEP annotation for RS ID.
| Output for the lines with discrepancies has the following format :

===== ===== ==== ===== ===== ================
CHR   POS   ID   REF   ALT   VEP_Annotation
===== ===== ==== ===== ===== ================

.. warning:: The input VCF File must have been previously annotated with vep.

.. note:: In case of multiallelic variants : RS IDs of every alternate allele are put in the ID field.

----------

ExtractAlleleCounts
-------------------

For every variant, exports the variant allele count for each sample

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input.
  Can be bgzipped

**Description**

| For every variant, exports the variant allele count for each sample
| Output has the following format

======== ===== ==== ===== =====
#CHROM   POS   ID   REF   ALT
======== ===== ==== ===== =====

| Followed by the allele count for each sample
| Allele counts can be 0, 1 or 2 for diploids
| Missing genotypes have "." as an allele count

.. note:: In case of multiallelic variants : Each alternate allele is processed independently.

----------

ExtractNeighbours
-----------------

Creates a bed file of the positions where at least one sample has 2 SNVs that could be in the same triplet (regardless of the reading frame)

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped

**Description**

| Scans the whole VCF file; for each pair of successive variants V1 and V2,
| if at least one sample has both V1 and V2, then a bed region is printed. The region is defined as :

===== ======== ========
chr   V1_pos   V2_pos
===== ======== ========

| V1 and V2 must be on the same chromosome and V2_pos-V1_pos = 1 or V2_pos-V1_pos = 2

.. note:: In case of multiallelic variants : chr V1_pos V2_pos is printed if at least one alternate allele of V1_pos and of V2_pos is a SNP, and if one sample has a variant on both sides (not necessarily the SNP one).

----------

ExtractPrivateToGroup
---------------------

Extracts all variants that are private to a group.

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped
* :code:`--ped samples.ped` : File describing the VCF's samples (See File Formats in the documentation)

**Description**

| Extracts all variants that are private to a group.
| Only the variants found in a single group of samples (as defined in the ped file) are extracted
| The list of the N samples in the group that have the variant is given
| Output Format :

====== ===== ======= =========
#CHR   POS   GROUP   SAMPLES
====== ===== ======= =========

.. note:: In case of multiallelic variants : Each alternate allele is processed independently.

----------

F2
--

Computes F2 data.

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped
* :code:`--ped samples.ped` : File describing the VCF's samples (See File Formats in the documentation)
* :code:`--prefix prefix` : prefix of the output files
* :code:`--outdir ResultsDirectory` : The directory that will contain results files

**Description**

| Computes F2 data.
| F2 data are described in PubMedId: 23128226, figure 3a
| Six sets of results are given, one for:

1. All variants
2. All SNVs
3. variants without rs (new)
4. SNVs without rs (new)
5. variants with rs (known)
6. SNVs with rs (known)

.. warning:: The distinction between known and new is made by looking at the vep annotation, not the ID column.

.. warning:: The input VCF File must have been previously annotated with vep.

.. note:: In case of multiallelic variants : Each alternate allele is processed independently.

----------

F2Individuals
-------------

Computes F2 data by samples and not by groups (Each sample is its own group).

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped
* :code:`--ped samples.ped` : File describing the VCF's samples (See File Formats in the documentation)
* :code:`--prefix prefix` : prefix of the output files
* :code:`--outdir ResultsDirectory` : The directory that will contain results files

**Description**

| Computes F2 data by samples and not by groups (Each sample is its own group).
| F2 data are described in PubMedId: 23128226, figure 3a
| Six sets of results are given, one for:

1. All variants
2. All SNVs
3. variants without rs (new)
4. SNVs without rs (new)
5. variants with rs (known)
6. SNVs with rs (known)

.. warning:: The distinction between known and new is made by looking at the vep annotation, not the ID column.

.. warning:: The input VCF File must have been previously annotated with vep.

.. note:: In case of multiallelic variants : Each alternate allele is processed independently.

----------

FrequencyCorrelation
--------------------

Prints the frequency correlation of variants between local samples and GnomAD

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped
* :code:`--outdir ResultsDirectory` : The directory that will contain results files

**Description**

| Prints the frequency correlation of variants between local samples and GnomAD
| For each variant, prints :

===== ===== ===== ===== ======= ========
CHR   POS   REF   ALT   Local   GnomAD
===== ===== ===== ===== ======= ========

| Outputs one line per VEP Consequence

.. warning:: The input VCF File must have been previously annotated with vep.

.. note:: In case of multiallelic variants : Each alternate allele is processed independently.

----------

FrequencyForPrivate
-------------------

Prints the Allele frequency in the file and each group, for variants not found in dbSNP, 1kG or GnomAD.

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped
* :code:`--ped samples.ped` : File describing the VCF's samples (See File Formats in the documentation)

**Description**

| For each variant in the file, if the variant is not found in dbSNP, 1KG or GnomAD :
| Prints the frequency in the file, and in each group, as well as its consequence

.. warning:: The input VCF File must have been previously annotated with vep.

.. note:: In case of multiallelic variants : Each alternate allele is processed independently.

----------

GeneList
--------

Prints the list of all genes covered by the VCF file

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped

**Description**

| Prints the list of all genes covered by the VCF file
| The genes are extracted from the SYMBOL annotation from VEP.

.. warning:: The input VCF File must have been previously annotated with vep.

.. note:: In case of multiallelic variants : Each alternate allele is processed independently.

----------

GetWorstConsequence
-------------------

Prints the worst consequence/gene for each variant allele.

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped

**Description**

| Prints the worst consequence/gene for each variant allele.
| For each allele of each variant, the output is in the format:

====== ===== ==== ===== ===== =========== ===============
#CHR   POS   ID   REF   ALT   WORST_CSQ   AFFECTED_GENE
====== ===== ==== ===== ===== =========== ===============

.. warning:: The input VCF File must have been previously annotated with vep.

.. note:: In case of multiallelic variants : Each alternate allele is processed independently.

----------

InbreedingCoeffDistribution
---------------------------

Outputs a sorted list of all Inbreeding Coeff from a VCF File.

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped

**Description**

| Outputs a sorted list of all Inbreeding Coeff from a VCF File.
| The output file has no header; the values are sorted in ascending order

.. warning:: Input file must contain the Inbreeding Coeff. annotation

----------

IQSBySample
-----------

Computes the IQS score for each sample between sequencing data and data imputed from genotyping.

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped
* :code:`--ped samples.ped` : File describing the VCF's samples (See File Formats in the documentation)
* :code:`--cpu Integer` : number of cores
* :code:`--file imputed.vcf(.gz)` : VCF File Containing imputed data (can be gzipped)

**Description**

| Computes the IQS score between sequencing data and data imputed from genotyping.
| Ref PMID26458263, See http://lysine.univ-brest.fr/redmine/issues/84
| Here the IQS score is computed for each sample.
| Output format is :

========= ======= ===== ============= ================
#SAMPLE   GROUP   IQS   NB_VARIANTS   TOTAL_VARIANTS
========= ======= ===== ============= ================

.. note:: In case of multiallelic variants : Each alternate allele is processed independently.

----------

IQSByVariant
------------

Computes the IQS score for each variant between sequencing data and data imputed from genotyping.

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped
* :code:`--cpu Integer` : number of cores
* :code:`--file imputed.vcf(.gz)` : VCF File Containing imputed data (can be gzipped)

**Description**

| Computes the IQS score between sequencing data and data imputed from genotyping.
| Ref PMID26458263, See http://lysine.univ-brest.fr/redmine/issues/84
| Here the IQS score is computed for each variant.
| Output format is :

===== ===== ==== ===== ===== ====== ============= ========== ================= ============= ========= ===== ======
chr   pos   rs   ref   alt   gene   consequence   Freq_VCF   Freq_GnomAD_NFE   Freq_MaxPop   Max_Pop   IQS   Info
===== ===== ==== ===== ===== ====== ============= ========== ================= ============= ========= ===== ======

.. warning:: Extra information is available if the input file was annotated with VEP

.. note:: In case of multiallelic variants : Each alternate allele is processed independently.

----------

JFSSummary
----------

Outputs the Joint Site Frequency Spectrum Summary statistics

**Mandatory Arguments**

* :code:`--file GROUP1.GROUP2.XXX.YYY.ZZZ.tsv` : input tsv file

**Description**

| Outputs the Joint Site Frequency Spectrum Summary statistics
| See https://www.nature.com/articles/ejhg2013297
| The input file contains the JFS data comparing two groups of samples.
| Those data, generated by the function JointFrequencySpectrum_, are in the following format :
| a (2n+1)x(2n+1) matrix, where n is the number of samples in each population.
  The number at matrix[A][B] is the number of variants for which the first group has A variant alleles and the second group has B variant alleles
| The output information is

- N = Number of haplotypes in each population (2 x n, n being the number of samples per population)
- V = Total number of variants
- threshold : pooled sample allele frequency (i + j)/2N <= 0.05
- FST = overall measure of genetic diversity
- AS = allele sharing statistic (probability that two individuals carrying an allele count of n come from different populations, normalized by the expected probability in panmictic population)
- WS = weighted symmetry (measures how evenly rare alleles are distributed between the two populations)

----------

JointFrequencySpectrum
----------------------

Creates a JointFrequencySpectrum result file for each group defined in the ped file.

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped
* :code:`--ped samples.ped` : File describing the VCF's samples (See File Formats in the documentation)
* :code:`--outdir ResultsDirectory` : The directory that will contain results files

**Description**

| Creates a JointFrequencySpectrum result file for each group defined in the ped file.
| See https://www.nature.com/articles/ejhg2013297
| Samples from the same group MUST be split into 2 subgroups, so as to be compared
| Example : GroupA1 GroupA2 GroupB1 GroupB2 GroupC1 GroupC2
| Each group MUST HAVE the same number of samples.
| The format of the output is :
| G*G output files (where G is the number of groups). Each file is named VCFINPUTFILE.group1.group2.tsv
| Each of these files contains a (2n+1)x(2n+1) matrix, where n is the number of samples in each population. The number at matrix[A][B] is the number of variants for which the first group has A variant alleles and the second group has B variant alleles
| Output files must then be processed with the function JFSSummary

.. note:: In case of multiallelic variants : Each alternate allele is processed independently.

----------

Kappa
-----

Kappa Comparison between two vcf files.

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped
* :code:`--vcf2 File2.vcf` : the second input VCF File (can be gzipped)
* :code:`--tsv output.tsv` : the result TSV File
* :code:`--outdir ResultsDirectory` : The directory that will contain results files

**Description**

| Kappa Comparison between two vcf files.
| See : https://journals.sagepub.com/doi/abs/10.1177/001316446002000104?journalCode=epma and https://en.wikipedia.org/wiki/Cohen%27s_kappa
| Output format is :

======= ===== ==== =========== =========== ==================== ======================
CHROM   POS   ID   MAF_FILE1   MAF_FILE2   KAPPA_With_Missing   KAPPA_Ignore_Missing
======= ===== ==== =========== =========== ==================== ======================

.. note:: In case of multiallelic variants : Results are given for the first alternate allele, which is expected to be the same in both files.

----------

MaleFemale
----------

Shows Male/Female Allele Frequencies

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped
* :code:`--ped samples.ped` : File describing the VCF's samples (See File Formats in the documentation)

**Description**

| Shows Male/Female Allele Frequencies
| Output format:

======= ===== ==== ===== ===== ======== ========== ==== ========= ===========
CHROM   POS   ID   REF   ALT   FILTER   GENE/CSQ   AF   MALE_AF   FEMALE_AF
======= ===== ==== ===== ===== ======== ========== ==== ========= ===========

.. warning:: The input VCF File must have been previously annotated with vep.

.. note:: In case of multiallelic variants : Each alternate allele is processed independently.

----------

MeanQuality
-----------

Prints information and quality statistics for each variant.

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input.
  Can be bgzipped

**Description**

| For each variant in the given vcf file, prints :

======== ===== ========== =========== ===================== ===================== ======================== ========================
#CHROM   POS   IN_dbSBP   IN_GnomAD   meanDP_with_missing   meanGQ_with_missing   meanDP_without_missing   meanGQ_without_missing
======== ===== ========== =========== ===================== ===================== ======================== ========================

.. warning:: The input VCF File must have been previously annotated with vep.

----------

MultiAllelicProportion
----------------------

Slides a 1kb window over the genome and outputs a list of regions ordered by the proportion of multi-allelic variations (desc.)

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped

**Description**

| Slides a 1kb window over the genome and outputs a list of regions ordered by the proportion of multi-allelic variations (desc.)
| Output format is :

===== ======= =================== ===========================
Chr   pos_n   pos_n+Window_size   nb_multialleleic variants
===== ======= =================== ===========================

----------

NumberOfCsqPerGene
------------------

Given a VCF file and a list of genes, prints the number of variants per gene for each consequence

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped
* :code:`--genes genes.txt` : File listing genes

**Description**

| Given a VCF file and a list of genes, prints the number of variants per gene for each consequence
| Multiallelic sites are considered for each alternate allele

.. warning:: The input VCF File must have been previously annotated with vep.

.. note:: In case of multiallelic variants : Each alternate allele is processed independently.
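
The counting performed by NumberOfCsqPerGene can be pictured with the short Python sketch below. It is only an illustration under assumptions: the function name ``count_csq_per_gene`` is hypothetical, each input tuple stands for one (gene, consequence) annotation of one alternate allele, and the layout of the tool's actual output is not reproduced.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple


def count_csq_per_gene(annotations: Iterable[Tuple[str, str]],
                       genes: Set[str]) -> Dict[str, Counter]:
    """Count annotations per consequence, for the genes of interest only.

    Hypothetical helper: 'annotations' yields (gene symbol, consequence) pairs,
    one per alternate allele annotation.
    """
    counts: Dict[str, Counter] = {gene: Counter() for gene in genes}
    for gene, consequence in annotations:
        if gene in counts:  # genes outside the list are ignored
            counts[gene][consequence] += 1
    return counts
```

Genes from the list that carry no variant keep an empty counter, so they can still be reported with zero counts.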
----------

NumberOfLinesFromTabix
----------------------

Gets the number of lines indexed by a tabix file

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped

**Description**

| Gets the number of lines indexed by a tabix file

.. warning:: The bgzipped VCF file FILENAME.vcf.gz must have an associated tabix file FILENAME.vcf.gz.tbi

----------

PrivateAndShared
----------------

For the given VCF, gives the number of variants private to each group and shared among all groups.

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped
* :code:`--ped samples.ped` : File describing the VCF's samples (See File Formats in the documentation)

**Description**

| For the given VCF, gives the number of variants private to each group and shared among all groups.
| Output :

- The total number of variants in the file
- The number of variants present in ALL the groups defined in the Ped file
- The number of variants private to each group

.. warning:: The input VCF File must have been previously annotated with vep.

.. note:: In case of multiallelic variants : Each alternate allele is processed independently.

----------

PrivateVSPanel
--------------

Checks how many of the variants from the input file are filtered as Already_existing when adding samples from the reference file.

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped
* :code:`--ref reference.vcf` : the panel VCF File (can be gzipped)

**Description**

| Checks how many of the variants from the input file are filtered as Already_existing when adding samples from the reference file.
| Takes all the variants in the given vcffile
| Compares them to each sample from the reffile
| Gives a count of remaining (new) variants (by consequence) each time a sample is added.

.. warning:: The input VCF File must have been previously annotated with vep.

.. note:: In case of multiallelic variants : Each alternate allele is processed independently.

----------

QCParametersDistribution
------------------------

Reports the distributions of each parameter used by QC

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped
* :code:`--ped samples.ped` : File describing the VCF's samples (See File Formats in the documentation)

**Description**

| Reports the distributions of each parameter used by QC
| One parameter per line, sorted values

.. warning:: The VCF File must contain the following INFO : QD,FS,SOR,MQ,ReadPosRankSum,InbreedingCoeff,MQRankSum

----------

SampleStats
-----------

Prints stats about each sample (Mean Depths, TS/TV, Het/HomAlt).

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped
* :code:`--ped samples.ped` : File describing the VCF's samples (See File Formats in the documentation)

**Description**

| Prints stats about each sample (Mean Depths, TS/TV, Het/HomAlt).
| Output format is :

======== ======= ======= =========== ========= ========== ============ ========== ============ ==== ==== ======= ===== ========== ========
Sample   Group   Sites   Genotyped   Missing   %Missing   MeanDepths   Variants   Singletons   TS   TV   TS/TV   Het   HetRatio   HomAlt
======== ======= ======= =========== ========= ========== ============ ========== ============ ==== ==== ======= ===== ========== ========

.. note:: In case of multiallelic variants : Each alternate allele is processed independently.

----------

SharedAlleleMatrix
------------------

Returns a series of matrices [individuals/individuals] with the number of shared alleles.

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped
* :code:`--outdir ResultsDirectory` : The directory that will contain results files

**Description**

| Returns a series of matrices [individuals/individuals] with the number of shared alleles.
| Matrices are newSNP, SNP.f<0.005, SNP.f<0.01, SNP.f<0.05

.. warning:: The input VCF File must have been previously annotated with vep.

.. note:: In case of multiallelic variants : Each alternate allele is processed independently.

----------

VQSLod
------

Prints VQSLod statistics for each tranche.

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped

**Description**

| Prints VQSLod statistics for each tranche.
| Output format is :

========= ====== ===== ==== ==== ==== ==== ======== ==== ==== ==== ==== =====
Tranche   Mean   Min   D1   D2   D3   D4   Median   D6   D7   D8   D9   Max
========= ====== ===== ==== ==== ==== ==== ======== ==== ==== ==== ==== =====

.. warning:: File must contain VQSLOD annotations

Formatting Functions
====================

ShowFields
----------

Shows selected fields of a VCF File

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped
* :code:`--query "field1,field2,...,info:key1;key2;...,geno:key1;key2;..."` : Output columns

**Description**

| Shows selected fields of a VCF File
| Query Syntax is :code:`Field_1,Field_2,...,Field_n`
| where Field_x is one of CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO,FORMAT
| or :code:`info:key1;key2;...;keyN` ex: :code:`info:AbHet;AC;AN;AF`
| or :code:`geno:key1;key2;...;keyN` ex : :code:`geno:GT;AD;GQ`

----------

TSV2HTML
--------

Converts a TSV file to an HTML file

**Mandatory Arguments**

* :code:`--file table.tsv` : the input TSV File
* :code:`--link PositiveInteger` : put link in header, starting at column INDEX (counting from 0)
* :code:`--title MyTitle` : title of the result HTML page

**Description**

| Converts a TSV file to an HTML file

----------

VCF2HTML
--------

Generates a readable HTML file for the given VCF file

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped

**Description**

| Creates an HTML file that contains the variants of the VCF file.
| For each variant, all the VCF fields are displayed.
| All vep annotations are formatted and shown.

.. warning:: The input VCF File must have been previously annotated with vep.

----------

VCF2TSV
-------

Creates a TSV file, readable in Excel.

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped

**Description**

| Creates a TSV file that can be opened in Excel.
| For each variant, all the VCF fields are displayed.
| All vep annotations are formatted and shown.

----------

VCF2TSVGeneCsq
--------------

Creates a TSV file, readable in Excel, that keeps only the annotations for the given genes and consequences

**Mandatory Arguments**

* :code:`--vcf input.vcf(.gz)` : VCF file to use as an input. Can be bgzipped
* :code:`--genes genes.txt` : Filename of gene list
* :code:`--csq vep.consequence` : Least severe consequence [empty | intergenic_variant | feature_truncation | regulatory_region_variant | feature_elongation | regulatory_region_amplification | regulatory_region_ablation | TF_binding_site_variant | TFBS_amplification | TFBS_ablation | downstream_gene_variant | upstream_gene_variant | non_coding_transcript_variant | NMD_transcript_variant | intron_variant | non_coding_transcript_exon_variant | 3_prime_UTR_variant | 5_prime_UTR_variant | mature_miRNA_variant | coding_sequence_variant | synonymous_variant | stop_retained_variant | start_retained_variant | incomplete_terminal_codon_variant | splice_region_variant | protein_altering_variant | missense_variant | inframe_deletion | inframe_insertion | transcript_amplification | start_lost | stop_lost | frameshift_variant | stop_gained | splice_donor_variant | splice_acceptor_variant | transcript_ablation]

**Description**

| Creates a TSV file that can be opened in Excel.
| For each variant, all the VCF fields are displayed.
| All vep annotations are formatted and shown.
| Only the variants impacting a gene within the given list are displayed.
| Only the variants with a consequence at least as severe as the one given are displayed.

..
warning:: The input VCF File must have been previously annotated with vep.

Other Functions
===============

CoverageStats
-------------

Gets the coverage statistics for an input file

**Mandatory Arguments**

* :code:`--tsv cov.tsv.gz` : File containing depth-of-coverage
* :code:`--chrom chr1` : Chromosome name

**Description**

| Gets the coverage statistics for an input file
| The input file has one line per chromosome position [1-chrSize] and one column per sample. Each cell contains the depth of coverage for the given sample at the given position.
| The output format is :

===== ===== ====== ======== === === ==== ==== ==== ==== ==== ==== ==== =====
chr   pos   mean   median   1   5   10   15   20   25   30   40   50   100
===== ===== ====== ======== === === ==== ==== ==== ==== ==== ==== ==== =====

----------

ExtendBed
---------

Adds padding to the left and right of each region in the bed file, and merges overlapping regions

**Mandatory Arguments**

* :code:`--bed regions.bed` : the Bed file to pad
* :code:`--pad PositiveInteger` : number of bases to add left and right of each region

**Description**

| Adds padding to the left and right of each region in the bed file, and merges overlapping regions

----------

GeneCards
---------

Generates a script to retrieve GeneCards HTML pages for each gene in the given list.

**Mandatory Arguments**

* :code:`--file genes.txt` : file listing genes

**Description**

| Generates a script to retrieve GeneCards HTML pages for each gene in the given list.

----------

GeneCardsParser
---------------

Exports summary data from a GeneCards HTML file as an unformatted table

**Mandatory Arguments**

* :code:`--file input.html` : input GeneCards HTML file

**Description**

| Exports summary data from a GeneCards HTML file as an unformatted table
| All HTML markup is removed.
| The data are formatted as such :

======= =========== ============= ======================
#Gene   GeneCards   Entrez Gene   UniProtKB/Swiss-Prot
======= =========== ============= ======================

----------

GzPaste
-------

Unix paste command for gzipped files

**Mandatory Arguments**

* :code:`--files file1.gz,file2.gz,...,fileN.gz` : list (comma separated) of gzipped files to paste

**Description**

| Equivalent to the unix paste command without any special option.
| Each input file can be either gzipped or not (a mix is possible)
| Use :code:`--gz` to gzip the output

----------

IsInBed
-------

Checks whether a given chromosome:position is contained in a bed file

**Mandatory Arguments**

* :code:`--chrom chromosome` : chromosome name : (chr)[1-25]/X/Y/M/MT
* :code:`--pos PositiveInteger` : Position
* :code:`--bed region.bed` : the Bed File to process

**Description**

| Checks whether a given chromosome:position is contained in a bed file
| If it is : gives the region's limits
| Otherwise : gives the regions before and after the position

----------

NormalizePed
------------

Extracts x subgroups of y samples for each group present in the Ped file

**Mandatory Arguments**

* :code:`--ped samples.ped` : The input PED file to process
* :code:`--number PositiveInteger` : Number of subgroups for each group
* :code:`--size PositiveInteger` : Size of each subgroup

**Description**

| Extracts x subgroups of y samples for each group present in the Ped file
| If the input ped file has three groups A, B, C of 50 individuals each, using the command with :code:`--number 3 --size 10` will create 9 groups :
| A A2 A3 B B2 B3 C C2 C3, with 10 individuals in each, randomly picked from groups A, B and C.
| This function is useful to divide groups, for instance to have 1 learning set and several computing sets.
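The subgroup extraction described for NormalizePed can be sketched in Python as follows. This is only an illustration, not the tool's actual implementation: the function name :code:`split_into_subgroups` and its parameters are hypothetical, and the sketch assumes that the subgroups of a group are drawn from that group only and do not overlap, which the description above does not state explicitly.

```python
import random
from collections import defaultdict


def split_into_subgroups(samples, number, size, seed=None):
    """Hypothetical helper: samples is a list of (sample_id, group) pairs.

    Returns a dict mapping subgroup names (A, A2, A3, ...) to lists of
    sample ids, with `number` subgroups of `size` samples per group.
    """
    rng = random.Random(seed)

    # Collect the sample ids of each group found in the PED-like input.
    by_group = defaultdict(list)
    for sample_id, group in samples:
        by_group[group].append(sample_id)

    subgroups = {}
    for group, members in by_group.items():
        needed = number * size
        if len(members) < needed:
            raise ValueError(f"group {group} has fewer than {needed} samples")
        # Assumption: subgroups are disjoint, so draw without replacement.
        picked = rng.sample(members, needed)
        for i in range(number):
            # Naming follows the example above: A, A2, A3, ...
            name = group if i == 0 else f"{group}{i + 1}"
            subgroups[name] = picked[i * size:(i + 1) * size]
    return subgroups


# Three groups A, B, C of 50 individuals each, as in the example above.
samples = [(f"{g}_{i:02d}", g) for g in "ABC" for i in range(50)]
subgroups = split_into_subgroups(samples, number=3, size=10, seed=1)
print(sorted(subgroups))
```

With :code:`number=3` and :code:`size=10`, each of the nine subgroups holds 10 sample ids. Drawing with replacement across subgroups would be an equally plausible reading of the description.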
----------

RandomPed
---------

Keeps N random samples from a Ped File

**Mandatory Arguments**

* :code:`--ped samples.ped` : The input PED file to process
* :code:`--threshold PositiveInteger` : Number of samples

**Description**

| Keeps N random samples from a Ped File

----------

SimplifyBED
-----------

Returns a simplified bed (with the smallest number of regions covering all the positions in the input bed file).

**Mandatory Arguments**

* :code:`--bed region.bed` : the Bed File to process

**Description**

| Returns a simplified bed (with the smallest number of regions covering all the positions in the input bed file).
| This is useful when the input bed file contains several overlapping regions.

Graphics
========

GraphCompareFrequencies
-----------------------

Compares the frequencies of common variants in 2 populations (output of FrequencyCorrelation / CompareToGnomAD)

**Mandatory Arguments**

* :code:`--width PositiveInteger` : Graph's Width in Pixels
* :code:`--height PositiveInteger` : Graph's Height in Pixels
* :code:`--tsv input.tsv` : input data
* :code:`--name dataset` : Graph Title
* :code:`--outdir ResultsDirectory` : The directory that will contain results files
* :code:`--x PositiveInteger` : 0-based index of the column containing X values
* :code:`--y PositiveInteger` : 0-based index of the column containing Y values

**Description**

| Compares the frequencies of common variants in 2 populations (output of FrequencyCorrelation / CompareToGnomAD)
| 4 graphs will be created : linear/log JFS/graph

**Example**

..
image:: http://lysine.univ-brest.fr/media/GraphCompareFrequencies.png
   :width: 600
   :alt: GraphCompareFrequencies example

----------

GraphCountGenotypes
-------------------

Creates a graph for the results of CountGenotypes

**Mandatory Arguments**

* :code:`--width PositiveInteger` : Graph's Width in Pixels
* :code:`--height PositiveInteger` : Graph's Height in Pixels
* :code:`--tsv input.tsv` : input data
* :code:`--csq vep.consequence` : Least severe consequence [empty | intergenic_variant | feature_truncation | regulatory_region_variant | feature_elongation | regulatory_region_amplification | regulatory_region_ablation | TF_binding_site_variant | TFBS_amplification | TFBS_ablation | downstream_gene_variant | upstream_gene_variant | non_coding_transcript_variant | NMD_transcript_variant | intron_variant | non_coding_transcript_exon_variant | 3_prime_UTR_variant | 5_prime_UTR_variant | mature_miRNA_variant | coding_sequence_variant | synonymous_variant | stop_retained_variant | start_retained_variant | incomplete_terminal_codon_variant | splice_region_variant | protein_altering_variant | missense_variant | inframe_deletion | inframe_insertion | transcript_amplification | start_lost | stop_lost | frameshift_variant | stop_gained | splice_donor_variant | splice_acceptor_variant | transcript_ablation]
* :code:`--outdir ResultsDirectory` : The directory that will contain results files

**Description**

| Creates a graph for the results of CountGenotypes

**Example**

..
image:: http://lysine.univ-brest.fr/media/GraphCountGenotypes.png
   :width: 600
   :alt: GraphCountGenotypes example

----------

GraphF2
-------

Creates a graph for the results of F2 or F2Individuals

**Mandatory Arguments**

* :code:`--width PositiveInteger` : Graph's Width in Pixels
* :code:`--height PositiveInteger` : Graph's Height in Pixels
* :code:`--tsv input.tsv` : input data
* :code:`--name title` : Title (will be printed on the graph)
* :code:`--outdir ResultsDirectory` : The directory that will contain results files

**Description**

| Creates a graph for the results of F2 or F2Individuals

**Example**

.. image:: http://lysine.univ-brest.fr/media/GraphF2.png
   :width: 600
   :alt: GraphF2 example

----------

GraphJFS
--------

Creates a graph for the results of JointFrequencySpectrum

**Mandatory Arguments**

* :code:`--width PositiveInteger` : Graph's Width in Pixels
* :code:`--height PositiveInteger` : Graph's Height in Pixels
* :code:`--tsv input.tsv` : input data
* :code:`--name title` : Title (will be printed on the graph)
* :code:`--x Set1` : Name of the first Set
* :code:`--y Set2` : Name of the second Set
* :code:`--max Scale Max` : Highest number of variants shown on the legend. Enter "null" to use the maximal value from the data
* :code:`--outdir ResultsDirectory` : The directory that will contain results files

**Description**

| Creates a graph for the results of JointFrequencySpectrum

.. warning:: Expects an NxN matrix, where matrix[a][b] is the number of variants seen a times in the first set and b times in the second set.

**Example**

..
image:: http://lysine.univ-brest.fr/media/GraphJFS.png
   :width: 600
   :alt: GraphJFS example

----------

GraphSampleStats
----------------

Creates a graph for the results of SampleStats

**Mandatory Arguments**

* :code:`--width PositiveInteger` : Graph's Width in Pixels
* :code:`--height PositiveInteger` : Graph's Height in Pixels
* :code:`--tsv input.tsv` : input data
* :code:`--name title` : Title (will be printed on the graph)
* :code:`--outdir ResultsDirectory` : The directory that will contain results files

**Description**

| Creates a graph for the results of SampleStats
| There will be a graph for each of the following values

- Number of Variants
- Mean Depth
- TS/TV
- Het/HomAlt
- Missing

**Example**

.. image:: http://lysine.univ-brest.fr/media/GraphSampleStats.png
   :width: 600
   :alt: GraphSampleStats example
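To make the relation between the SampleStats output and the five graphed values concrete, here is a rough Python sketch that pulls those values out of a SampleStats-style TSV. It rests on several assumptions that this documentation does not confirm: that the TSV starts with a header row using the column names shown in the SampleStats output format, that "Missing" refers to the :code:`Missing` count rather than :code:`%Missing`, and that Het/HomAlt is the ratio of the :code:`Het` and :code:`HomAlt` columns. The function name :code:`plotted_values` is hypothetical.

```python
import csv
import io


def plotted_values(tsv_text):
    """Hypothetical helper: return {metric: {sample: value}} for the five metrics."""
    metrics = {
        "Number of Variants": {},
        "Mean Depth": {},
        "TS/TV": {},
        "Het/HomAlt": {},
        "Missing": {},
    }
    # Assumption: tab-separated input with a header row named as in SampleStats.
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        sample = row["Sample"]
        metrics["Number of Variants"][sample] = int(row["Variants"])
        metrics["Mean Depth"][sample] = float(row["MeanDepths"])
        metrics["TS/TV"][sample] = float(row["TS/TV"])
        hom_alt = int(row["HomAlt"])
        # Assumption: the ratio is Het divided by HomAlt; undefined when HomAlt is 0.
        metrics["Het/HomAlt"][sample] = (
            int(row["Het"]) / hom_alt if hom_alt else float("nan")
        )
        metrics["Missing"][sample] = int(row["Missing"])
    return metrics


# Made-up single-sample input, for illustration only.
header = ["Sample", "Group", "Sites", "Genotyped", "Missing", "%Missing",
          "MeanDepths", "Variants", "Singletons", "TS", "TV", "TS/TV",
          "Het", "HetRatio", "HomAlt"]
values = ["S1", "A", "100", "90", "10", "10.0",
          "35.5", "40", "5", "20", "10", "2.0",
          "30", "3.0", "10"]
example_tsv = "\t".join(header) + "\n" + "\t".join(values) + "\n"
example_metrics = plotted_values(example_tsv)
```

Each inner dictionary then holds one value per sample, which is the shape a per-sample graph would need.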