sgkit.read_scikit_allel_vcfzarr

sgkit.read_scikit_allel_vcfzarr(path, field_defs=None)

Read a VCF Zarr file created using scikit-allel.

Loads VCF variant, sample, and genotype data as Dask arrays within a Dataset from a Zarr file created using scikit-allel’s vcf_to_zarr function.

This allows conversion from scikit-allel’s Zarr format to sgkit’s VCF Zarr format.

Since vcf_to_zarr does not preserve phasing information, there is no sgkit.variables.call_genotype_phased_spec variable in the resulting dataset.
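A minimal usage sketch of the conversion described above, assuming both scikit-allel and sgkit are installed in a version that exposes this function. The helper name convert_and_load and its two path arguments are hypothetical; the imports sit inside the function so that defining it has no side effects.

```python
def convert_and_load(vcf_path, zarr_path):
    """Hypothetical helper: convert a VCF with scikit-allel, then load it with sgkit."""
    import allel  # scikit-allel
    import sgkit

    # Write all VCF fields to a Zarr store in scikit-allel's layout.
    allel.vcf_to_zarr(vcf_path, zarr_path, fields="*", overwrite=True)

    # Load the store as a Dataset backed by Dask arrays.
    # Phasing is not preserved by vcf_to_zarr, so no phased-genotype
    # variable should be expected in the result.
    return sgkit.read_scikit_allel_vcfzarr(zarr_path)
```

A call such as convert_and_load("sample.vcf", "sample.zarr") would then return the dataset; both file names are placeholders.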

Parameters
path : Union[str, Path]

Path to the Zarr file.

field_defs : Optional[Dict[str, Dict[str, Any]]] (default: None)

Per-field information that overrides the field definitions in the VCF header, or provides extra information needed in the dataset representation. Definitions are represented as a dictionary whose keys are the field names and whose values are dictionaries with any of the following keys: Number, Type, Description, dimension. The first three correspond to VCF header values; dimension is the name of the final dimension in the array for the case where Number is a fixed integer larger than 1. For example, {"INFO/AC": {"Number": "A"}, "FORMAT/HQ": {"dimension": "haplotypes"}} overrides the INFO/AC field to be Number A (useful if the VCF declares it as variable length, with Number set to "."), and names the final dimension of the HQ array (which is defined as Number 2 in the VCF header) as haplotypes. (Number A means one value per alternate allele; see section 1.4.2 of the VCF spec, https://samtools.github.io/hts-specs/VCFv4.3.pdf.)
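The field_defs structure from the example above can be sketched in Python as follows. The check_field_defs helper is hypothetical and not part of sgkit; it only rejects keys outside the four listed in this description. The final call is shown as a comment, with a placeholder store path.

```python
# Keys this parameter description lists as valid in a per-field definition.
ALLOWED_KEYS = {"Number", "Type", "Description", "dimension"}


def check_field_defs(field_defs):
    """Hypothetical helper: raise ValueError on keys outside the documented set."""
    for field, definition in field_defs.items():
        unknown = set(definition) - ALLOWED_KEYS
        if unknown:
            raise ValueError(f"{field}: unsupported keys {sorted(unknown)}")
    return field_defs


field_defs = check_field_defs(
    {
        # Treat INFO/AC as one value per alternate allele.
        "INFO/AC": {"Number": "A"},
        # Name the final (length-2) dimension of the HQ array.
        "FORMAT/HQ": {"dimension": "haplotypes"},
    }
)

# Assumed call, with a placeholder path:
# ds = sgkit.read_scikit_allel_vcfzarr("sample.zarr", field_defs=field_defs)
```

Keys are case-sensitive in this sketch, so a misspelling such as "number" is rejected before the dictionary is passed on.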

Return type

Dataset

Returns

A dataset containing the following variables: