sgkit.io.vcf.vcf_to_zarrs

sgkit.io.vcf.vcf_to_zarrs(input, output, regions, chunk_length=10000, chunk_width=1000, compressor=Blosc(cname='zstd', clevel=7, shuffle=AUTOSHUFFLE, blocksize=0), encoding=None, output_storage_options=None, ploidy=2, mixed_ploidy=False, truncate_calls=False, max_alt_alleles=3, fields=None, exclude_fields=None, field_defs=None, read_chunk_length=None)

Convert VCF files to multiple Zarr on-disk stores, one per region.

Deprecated since version 0.9.0: Functions for reading VCF are deprecated; please use the bio2zarr package instead.

Parameters:
input str | Path | Sequence[Union[str, Path]]

A path (or paths) to the input BCF or VCF file (or files). VCF files should be compressed and have a .tbi or .csi index file. BCF files should have a .csi index file.

output str | Path

Path to directory containing the multiple Zarr output stores.

regions None | Sequence[str] | Sequence[Optional[Sequence[str]]]

Genomic region or regions to extract variants for. For multiple inputs, regions are specified as a sequence of values, each of which may be None or a sequence of region strings.
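To illustrate the shape of regions for multiple inputs, the following sketch pairs some inputs with per-input region lists. The file names and region strings are hypothetical placeholders, and the one-entry-per-input pairing is an assumption inferred from the type annotation above rather than a statement about sgkit's internals.

```python
# Hypothetical input files; these names are placeholders, not real data.
inputs = ["chr20.vcf.gz", "chr21.vcf.gz", "chrX.vcf.gz"]

# Assumed layout: one entry per input, each either a list of region
# strings or None (no regions given for that input).
regions = [
    ["20:1-1000000", "20:1000001-2000000"],
    ["21:1-5000000"],
    None,
]

# The two sequences are expected to line up position by position.
assert len(regions) == len(inputs)
for path, region_list in zip(inputs, regions):
    print(path, "->", region_list)
```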

chunk_length int (default: 10000)

Length (number of variants) of chunks in which data are stored, by default 10,000.

chunk_width int (default: 1000)

Width (number of samples) to use when storing chunks in output, by default 1,000.

compressor Any | None (default: Blosc(cname='zstd', clevel=7, shuffle=AUTOSHUFFLE, blocksize=0))

Zarr compressor, by default Blosc + zstd with compression level 7 and auto-shuffle. No compression is used when set as None.

encoding Any | None (default: None)

Variable-specific encodings for xarray, specified as a nested dictionary with variable names as keys and dictionaries of variable specific encodings as values. Can be used to override Zarr compressor and filters on a per-variable basis, e.g., {"call_genotype": {"compressor": Blosc("zstd", 9)}}.

output_storage_options {str: str} | None (default: None)

Any additional parameters for the storage backend, for the output (see fsspec.open).

ploidy int (default: 2)

The (maximum) ploidy of genotypes in the VCF file.

mixed_ploidy bool (default: False)

If True, genotype calls with fewer alleles than the specified ploidy will be padded with the fill (non-allele) sentinel value of -2. If False, calls with fewer alleles than the specified ploidy will be treated as incomplete and will be padded with the missing-allele sentinel value of -1.

truncate_calls bool (default: False)

If True, genotype calls with more alleles than the specified (maximum) ploidy value will be truncated to size ploidy. If False, calls with more alleles than the specified ploidy will raise an exception.
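The padding and truncation rules described for mixed_ploidy and truncate_calls can be sketched on a single genotype call as follows. This is a minimal plain-Python illustration of the documented behaviour, not sgkit's implementation; the function name normalize_call and the choice of ValueError are assumptions made for the example.

```python
FILL = -2     # fill (non-allele) sentinel, used when mixed_ploidy=True
MISSING = -1  # missing-allele sentinel, used when mixed_ploidy=False


def normalize_call(alleles, ploidy=2, mixed_ploidy=False, truncate_calls=False):
    """Hypothetical helper: pad or truncate one call to length `ploidy`."""
    alleles = list(alleles)
    if len(alleles) > ploidy:
        if not truncate_calls:
            # The docs only say "an exception"; ValueError is an assumption.
            raise ValueError("call has more alleles than the specified ploidy")
        return alleles[:ploidy]
    pad = FILL if mixed_ploidy else MISSING
    return alleles + [pad] * (ploidy - len(alleles))


haploid_call = [1]
print(normalize_call(haploid_call, ploidy=2, mixed_ploidy=True))
print(normalize_call(haploid_call, ploidy=2, mixed_ploidy=False))
print(normalize_call([0, 1, 1], ploidy=2, truncate_calls=True))
```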

max_alt_alleles int (default: 3)

The (maximum) number of alternate alleles in the VCF file. Any records with more than this number of alternate alleles will have the extra alleles dropped (the variant_allele variable will be truncated). Any call genotype fields with the extra alleles will be changed to the missing-allele sentinel value of -1.
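As a rough illustration of the effect of max_alt_alleles on one record, the sketch below drops the extra alternate alleles and replaces genotype indices that pointed at them with -1. It is plain Python written for this example (the function name is hypothetical) and only mirrors the behaviour described above.

```python
MISSING = -1  # missing-allele sentinel


def apply_max_alt_alleles(alleles, genotype, max_alt_alleles=3):
    """Hypothetical helper: `alleles` is [REF, ALT1, ALT2, ...]."""
    # Keep the reference allele plus at most `max_alt_alleles` alternates.
    kept = alleles[: 1 + max_alt_alleles]
    # Allele indices that referred to a dropped allele become missing;
    # existing sentinel values (negative numbers) pass through unchanged.
    new_genotype = [a if a < len(kept) else MISSING for a in genotype]
    return kept, new_genotype


alleles = ["A", "C", "G", "T", "AT"]  # one REF and four ALT alleles
kept, genotype = apply_max_alt_alleles(alleles, [1, 4], max_alt_alleles=3)
print(kept, genotype)
```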

fields Sequence[str] | None (default: None)

Extra fields to extract data for. A list of strings, with INFO or FORMAT prefixes. Wildcards are permitted too, for example: ["INFO/*", "FORMAT/DP"].
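To give a feel for how a wildcard such as "INFO/*" selects fields, here is a small sketch that uses shell-style matching from Python's standard library. The use of fnmatch is an assumption for illustration only, as are the header field names; sgkit's actual matching rules may differ.

```python
from fnmatch import fnmatchcase


def select_fields(available, patterns):
    """Hypothetical helper: keep names matching any shell-style pattern."""
    return [
        name
        for name in available
        if any(fnmatchcase(name, pattern) for pattern in patterns)
    ]


# Hypothetical fields declared in a VCF header.
header_fields = ["INFO/AC", "INFO/AN", "FORMAT/DP", "FORMAT/GQ"]
selected = select_fields(header_fields, ["INFO/*", "FORMAT/DP"])
print(selected)
```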

field_defs {str: {str: Any}} | None (default: None)

Per-field information that overrides the field definitions in the VCF header, or provides extra information needed in the dataset representation. Definitions are represented as a dictionary whose keys are the field names, and whose values are dictionaries with any of the following keys: Number, Type, Description, dimension. The first three correspond to VCF header values, and dimension is the name of the final dimension in the array for the case where Number is a fixed integer larger than 1. For example, {"INFO/AC": {"Number": "A"}, "FORMAT/HQ": {"dimension": "haplotypes"}} overrides the INFO/AC field to be Number A (useful if the VCF defines it as having variable length with .), and names the final dimension of the HQ array (which is defined as Number 2 in the VCF header) as haplotypes. (Note that Number A is the number of alternate alleles; see section 1.4.2 of the VCF spec https://samtools.github.io/hts-specs/VCFv4.3.pdf.)

read_chunk_length int | None (default: None)

Length (number of variants) of chunks to read from the VCF file at a time. Use this option to reduce memory usage by using a value lower than chunk_length, at a small cost in extra run time. The increase in run time grows as the ratio of read_chunk_length to chunk_length decreases. Defaults to None, which means that a value equal to chunk_length is used. The memory usage of the conversion process is proportional to read_chunk_length*n_samples*(1+n_ploidy), so this option is mainly useful for very large numbers of samples and/or where a large chunk_length is desirable to reduce the number of dask tasks needed in downstream analysis.
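The proportionality stated above can be turned into a quick back-of-the-envelope comparison. The sketch below computes only the relative quantity read_chunk_length*n_samples*(1+n_ploidy); it says nothing about absolute bytes, and the sample count used is a made-up example.

```python
def relative_memory(read_chunk_length, n_samples, ploidy):
    """Unitless quantity that memory usage is stated to be proportional to."""
    return read_chunk_length * n_samples * (1 + ploidy)


n_samples = 100_000  # hypothetical cohort size

# read_chunk_length=None means the chunk_length value (10,000 by default) is used.
full = relative_memory(10_000, n_samples, ploidy=2)
# Reading 1,000 variants at a time instead.
reduced = relative_memory(1_000, n_samples, ploidy=2)

# Fraction of the default memory footprint, under the stated proportionality.
ratio = reduced / full
print(ratio)
```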

Return type:

Sequence[str]

Returns:

A list of URLs to the Zarr outputs.
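A minimal usage sketch, assuming sgkit is installed with its VCF support, might wrap the call as shown below. The wrapper name convert_by_region and the paths in the commented usage line are hypothetical; only arguments that appear in the signature above are passed, with their documented default values.

```python
def convert_by_region(vcf_path, out_dir, regions):
    """Hypothetical wrapper that writes one Zarr store per region."""
    # Imported lazily so this sketch can be loaded without sgkit installed.
    from sgkit.io.vcf import vcf_to_zarrs

    return vcf_to_zarrs(
        vcf_path,
        out_dir,
        regions=regions,
        chunk_length=10_000,  # documented default
        chunk_width=1_000,    # documented default
    )


# Example call (placeholder paths and regions, not executed here):
# urls = convert_by_region("sample.vcf.gz", "out_zarrs",
#                          ["20:1-1000000", "20:1000001-2000000"])
```

Since these functions are deprecated as of version 0.9.0, new code is better served by the bio2zarr package.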