sgkit.read_scikit_allel_vcfzarr
- sgkit.read_scikit_allel_vcfzarr(path, field_defs=None)
Read a VCF Zarr file created using scikit-allel.
Loads VCF variant, sample, and genotype data as Dask arrays within a Dataset from a Zarr file created using scikit-allel’s vcf_to_zarr function.
This allows conversion from scikit-allel’s Zarr format to sgkit’s VCF Zarr format.
Since vcf_to_zarr does not preserve phasing information, there is no sgkit.variables.call_genotype_phased_spec variable in the resulting dataset.
- Parameters
- path : str | Path
Path to the Zarr file.
- field_defs : {str: {str: Any}} | None (default: None)
Per-field information that overrides the field definitions in the VCF header, or provides extra information needed in the dataset representation. Definitions are represented as a dictionary whose keys are the field names and whose values are dictionaries with any of the following keys: Number, Type, Description, dimension. The first three correspond to VCF header values, and dimension is the name of the final dimension in the array for the case where Number is a fixed integer larger than 1. For example, {"INFO/AC": {"Number": "A"}, "FORMAT/HQ": {"dimension": "haplotypes"}} overrides the INFO/AC field to be Number A (useful if the VCF defines it as having variable length with .), and names the final dimension of the HQ array (which is defined as Number 2 in the VCF header) as haplotypes. (Note that Number A is the number of alternate alleles; see section 1.4.2 of the VCF spec, https://samtools.github.io/hts-specs/VCFv4.3.pdf.)
- Return type
Dataset
- Returns
A dataset containing the following variables:
- sgkit.variables.variant_id_spec (variants)
- sgkit.variables.variant_contig_spec (variants)
- sgkit.variables.variant_position_spec (variants)
- sgkit.variables.variant_allele_spec (variants)
- sgkit.variables.sample_id_spec (samples)
- sgkit.variables.call_genotype_spec (variants, samples, ploidy)
- sgkit.variables.call_genotype_mask_spec (variants, samples, ploidy)
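Example

The following is a minimal sketch of how the field_defs argument described above might be built and passed in. The helper names check_field_defs and load, and the ALLOWED_KEYS constant, are hypothetical and not part of sgkit; the only sgkit call used is the function documented on this page, with the signature shown at the top. The key check is an illustration only and may be stricter or looser than what sgkit itself accepts.

```python
from pathlib import Path
from typing import Any, Dict, Optional, Union

# Keys that the parameter description lists as valid inside each
# per-field dictionary.
ALLOWED_KEYS = {"Number", "Type", "Description", "dimension"}

FieldDefs = Dict[str, Dict[str, Any]]


def check_field_defs(field_defs: FieldDefs) -> None:
    """Hypothetical helper (not part of sgkit): reject unknown per-field keys."""
    for field, definition in field_defs.items():
        unknown = set(definition) - ALLOWED_KEYS
        if unknown:
            raise ValueError(f"{field}: unsupported keys {sorted(unknown)}")


# The example overrides from the parameter description:
# INFO/AC is treated as Number A, and the last dimension of
# FORMAT/HQ is named "haplotypes".
field_defs: FieldDefs = {
    "INFO/AC": {"Number": "A"},
    "FORMAT/HQ": {"dimension": "haplotypes"},
}


def load(path: Union[str, Path], defs: Optional[FieldDefs] = None):
    """Hypothetical wrapper: validate the overrides, then read the Zarr file."""
    # Imported lazily so the check above can be used without sgkit installed.
    import sgkit

    if defs is not None:
        check_field_defs(defs)
    return sgkit.read_scikit_allel_vcfzarr(path, field_defs=defs)
```

Assuming a Zarr file written by scikit-allel’s vcf_to_zarr exists at a placeholder path such as "example.vcf.zarr", calling load("example.vcf.zarr", field_defs) would return the Dataset described under Returns.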