sgkit.io.vcf.partition_into_regions

sgkit.io.vcf.partition_into_regions(vcf_path, *, index_path=None, num_parts=None, target_part_size=None, storage_options=None)

Calculate genomic region strings to partition a compressed VCF or BCF file into roughly equal parts.

Deprecated since version 0.9.0: Functions for reading VCF are deprecated; please use the bio2zarr package instead.

A .tbi or .csi file is used to find BGZF boundaries in the compressed VCF file, which are then used to divide the file into parts.

The number of parts can be specified directly by providing num_parts, or indirectly by providing target_part_size, the desired size (in bytes) of each compressed part. Exactly one of num_parts or target_part_size must be provided.

Both num_parts and target_part_size serve as hints: the actual number of parts and their sizes may differ from the requested values.
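The following is a minimal sketch of why these parameters can only be hints: parts have to start at block boundaries, so an ideal cut point is snapped to the nearest available boundary. The helper choose_split_offsets, its arguments, and the snapping rule are all assumptions made for illustration; this is not sgkit's implementation.

```python
import bisect
from typing import List, Sequence


def choose_split_offsets(
    block_offsets: Sequence[int], file_size: int, target_part_size: int
) -> List[int]:
    """Snap ideal cut points to block boundaries (hypothetical helper).

    block_offsets are assumed to be compressed-file offsets at which a
    part is allowed to start, all smaller than file_size.
    """
    if target_part_size < 1:
        raise ValueError("target_part_size must be positive")
    offsets = sorted(set(block_offsets))
    splits: List[int] = []
    ideal = target_part_size
    while ideal < file_size:
        # First block boundary at or after the ideal cut point.
        i = bisect.bisect_left(offsets, ideal)
        # Skip the cut if it would repeat the previous split point.
        if i < len(offsets) and (not splits or offsets[i] > splits[-1]):
            splits.append(offsets[i])
        ideal += target_part_size
    return splits


# Made-up block boundaries for a 300-byte file.
offsets = [0, 40, 90, 100, 260]
# Asking for 3 parts (about 100 bytes each) yields 2 splits, so 3 parts.
three = choose_split_offsets(offsets, 300, 300 // 3)
# Asking for 6 parts (about 50 bytes each) yields 3 splits, so only 4 parts.
six = choose_split_offsets(offsets, 300, 300 // 6)
```

In this sketch a request for six parts produces four, because several ideal cut points fall inside the same block.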

Parameters:
vcf_path str | Path

The path to the VCF file.

index_path str | Path | None (default: None)

The path to the VCF index (.tbi or .csi). If not specified, the index path is constructed by appending the index suffix (.tbi or .csi) to the VCF path.

num_parts int | None (default: None)

The desired number of parts to partition the VCF file into.

target_part_size None | int | str (default: None)

The desired size, in bytes, of each (compressed) part of the partitioned VCF. If the value is a string, it may be specified using standard abbreviations, e.g. 100MB is equivalent to 100_000_000.
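As a rough illustration of how a size string such as 100MB maps to a byte count, here is a small parser. The function parse_size and its unit table are hypothetical and assume decimal (power-of-1000) units only, matching the 100MB example above; the exact set of abbreviations sgkit accepts is not stated here.

```python
import re
from typing import Union

# Assumed decimal units; hypothetical table for this sketch only.
_UNITS = {"": 1, "B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}


def parse_size(value: Union[int, str]) -> int:
    """Convert an int or a string like '100MB' to a number of bytes."""
    if isinstance(value, int):
        return value
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([A-Za-z]*)\s*", value)
    if match is None:
        raise ValueError(f"Cannot parse size: {value!r}")
    number, unit = match.groups()
    unit = unit.upper()
    if unit not in _UNITS:
        raise ValueError(f"Unknown unit in size: {value!r}")
    return int(float(number) * _UNITS[unit])


size_in_bytes = parse_size("100MB")
```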

storage_options {str: str} | None (default: None)

Any additional parameters for the storage backend (see fsspec.open).

Return type:

Optional[Sequence[str]]

Returns:

The region strings that partition the VCF file, or None if the VCF file should not be partitioned (so there is only a single partition).
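Because the return value may be None, calling code typically needs to treat that case as a single partition covering the whole file. The sketch below shows one way to do this; regions_to_tasks and plan_partitions are hypothetical helper names, and the region strings used in the example are made up for illustration.

```python
from typing import List, Optional, Sequence


def regions_to_tasks(regions: Optional[Sequence[str]]) -> List[Optional[str]]:
    """Return one work item per partition (hypothetical helper).

    A None item stands for "read the whole file without a region filter".
    """
    if regions is None:
        return [None]
    return list(regions)


def plan_partitions(vcf_path: str, num_parts: int) -> List[Optional[str]]:
    # Imported lazily so regions_to_tasks can be used without sgkit installed.
    # Assumes an sgkit version that still ships this deprecated function.
    from sgkit.io.vcf import partition_into_regions

    return regions_to_tasks(partition_into_regions(vcf_path, num_parts=num_parts))


# Made-up region strings, for illustration only.
tasks = regions_to_tasks(["chr1:1-500000", "chr1:500001-"])
single = regions_to_tasks(None)
```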

Raises:
  • ValueError – If neither num_parts nor target_part_size has been specified.

  • ValueError – If both num_parts and target_part_size have been specified.

  • ValueError – If either of num_parts or target_part_size is not a positive integer.
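The three error conditions listed above can be sketched as a standalone check. The function check_partition_args is a hypothetical name, the error messages are invented, and the sketch assumes that a string target_part_size has already been converted to an integer number of bytes; it is not taken from sgkit's source.

```python
from typing import Optional


def check_partition_args(
    num_parts: Optional[int] = None, target_part_size: Optional[int] = None
) -> None:
    """Raise ValueError for the argument combinations listed under Raises."""
    if num_parts is None and target_part_size is None:
        raise ValueError("One of num_parts or target_part_size must be specified")
    if num_parts is not None and target_part_size is not None:
        raise ValueError("Only one of num_parts or target_part_size may be specified")
    # Exactly one value is set at this point; it must be a positive integer.
    value = num_parts if num_parts is not None else target_part_size
    if not isinstance(value, int) or value < 1:
        raise ValueError("num_parts and target_part_size must be positive integers")


check_partition_args(num_parts=4)  # valid: passes silently
```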