sgkit.io.vcf.zarr_array_sizes

sgkit.io.vcf.zarr_array_sizes(input, *, regions=None, target_part_size='auto')

Make a pass through a VCF/BCF file to determine sizes for storage in Zarr.

Deprecated since version 0.9.0: Functions for reading VCF are deprecated; please use the bio2zarr package instead.

By default, the input is processed in parts in parallel. However, if the input is a single file, target_part_size is None, and regions is None, then the operation will be carried out sequentially.
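The rule above can be sketched as a small predicate. This is only an illustration of the documented behaviour, not sgkit's code: `runs_sequentially` is a hypothetical helper, and it assumes that "a single file" means a bare `str` or `Path` (how a one-element sequence is treated is not stated here).

```python
from pathlib import Path


def runs_sequentially(input, regions=None, target_part_size="auto"):
    """Hypothetical helper: True when the documented rule implies a sequential pass.

    Assumption: only a bare str or Path counts as "a single file".
    """
    single_file = isinstance(input, (str, Path))
    return single_file and target_part_size is None and regions is None


# With the default target_part_size="auto", the input is split into parts.
print(runs_sequentially("sample.vcf.gz"))
# Disabling both partitioning mechanisms on a single file gives a sequential pass.
print(runs_sequentially("sample.vcf.gz", target_part_size=None))
```

All three conditions must hold at once, so passing `target_part_size=None` alone is not enough if `regions` is also given or if several files are supplied.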

Parameters:
input str | Path | Sequence[Union[str, Path]]

A path (or paths) to the input BCF or VCF file (or files). VCF files should be compressed and have a .tbi or .csi index file. BCF files should have a .csi index file.

target_part_size None | int | str (default: 'auto')

The desired size, in bytes, of each (compressed) part of the input to be processed in parallel. Defaults to "auto", which selects a suitable size (currently 20MB). A value of None means that the input will be processed sequentially. This setting is ignored if regions is also specified.

regions None | Sequence[str] | Sequence[Optional[Sequence[str]]] (default: None)

Genomic region or regions to extract variants for. For multiple inputs, multiple input regions are specified as a sequence of values, each of which may be None or a sequence of region strings. Takes priority over target_part_size when both are specified.
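The nested shape of regions for multiple inputs is easier to see in a concrete value. The sketch below is illustrative only: `regions_match_inputs` is a hypothetical checker, the file names and region strings are made up, and it assumes that each entry of regions lines up by position with the corresponding input file.

```python
def regions_match_inputs(inputs, regions):
    """Hypothetical check that `regions` has the multi-input shape described above.

    Assumption: entry i of `regions` applies to entry i of `inputs`.
    """
    if regions is None:
        return True
    if len(regions) != len(inputs):
        return False
    for entry in regions:
        if entry is None:
            continue  # no region restriction for this input
        if isinstance(entry, str):
            return False  # each entry must be a sequence of strings, not a bare string
        if not all(isinstance(region, str) for region in entry):
            return False
    return True


inputs = ["chr20.vcf.gz", "chr21.vcf.gz"]  # made-up file names
regions = [["20:1-100000", "20:100001-200000"], None]  # made-up region strings
print(regions_match_inputs(inputs, regions))
```

For a single input, the simpler Sequence[str] form applies and no nesting is needed.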

Return type:

Dict[str, Any]