File Formats¶
Here are the File Formats Handled by VCFProcessor
VCF v4.2¶
VCF is a text file format (most likely stored in a compressed manner). It contains meta-information lines, a header line, and then data lines each containing information about a position in the genome. The format also has the ability to contain genotype information on samples for each position.
In the VCF file’s body there is one line per genomic position containing a variant, and the column for those lines are :
CHROM | POS | ID | REF | ALT | QUAL | FILTER | INFO | FORMAT | Sample1 | Sample2 | ... | SampleN |
---|---|---|---|---|---|---|---|---|---|---|---|---|
See the full specifications : https://samtools.github.io/hts-specs/VCFv4.2.pdf
PED¶
The PED file formats used in VCFProcessor is an adaptation of the PED/FAM format used in PLINK.
It is used to specify information about the samples present in the VCF files, and to describe relation between them. This tabulated text file has 7 columns :
FamID | ID | MID | FID | Sex | Pheno | Group |
---|---|---|---|---|---|---|
FamID : the family ID (all samples in the same family share the same FamID)
ID : the sample’s unique ID
MID : the ID of the sample’s mother
FID : the ID of the sample’s father
Sex : the sample’s sex (1 male / 2 female)
Pheno : the sample’s phenotype (1 unaffected / 2 affected)
Group : the group to which the sample belongs (batch id, population, symptoms, …)
BED¶
The BED format consists of one line per feature, each containing 3-12 columns of data, plus optional track definition lines.
The first three fields in each feature line are required:
chrom - name of the chromosome or scaffold. Any valid seq_region_name can be used, and chromosome names can be given with or without the ‘chr’ prefix.
chromStart - Start position of the feature in standard chromosomal coordinates (i.e. first base is 0).
chromEnd - End position of the feature in standard chromosomal coordinates
See the full specifications : https://m.ensembl.org/info/website/upload/bed.html