sgkit.io.vcf.partition_into_regions
- sgkit.io.vcf.partition_into_regions(vcf_path, *, index_path=None, num_parts=None, target_part_size=None, storage_options=None)
Calculate genomic region strings to partition a compressed VCF or BCF file into roughly equal parts.
Deprecated since version 0.9.0: Functions for reading VCF are deprecated, please use the bio2zarr package.
A .tbi or .csi file is used to find BGZF boundaries in the compressed VCF file, which are then used to divide the file into parts.
The number of parts can be specified directly by providing num_parts, or by specifying the desired size (in bytes) of each (compressed) part by providing target_part_size. Exactly one of num_parts or target_part_size must be provided.
Both num_parts and target_part_size serve as hints: the actual number of parts and their sizes may be larger or smaller than the requested values.
- Parameters:
- vcf_path (str | Path): The path to the VCF file.
- index_path (str | Path | None, default: None): The path to the VCF index (.tbi or .csi). If not specified, the index path is constructed by appending the index suffix (.tbi or .csi) to the VCF path.
- num_parts (int | None, default: None): The desired number of parts to partition the VCF file into.
- target_part_size (None | int | str, default: None): The desired size, in bytes, of each (compressed) part of the partitioned VCF. If the value is a string, it may be specified using standard abbreviations, e.g. 100MB is equivalent to 100_000_000.
- storage_options ({str: str} | None, default: None): Any additional parameters for the storage backend (see fsspec.open).
- Return type:
- Returns:
The region strings that partition the VCF file, or None if the VCF file should not be partitioned (so there is only a single partition).
- Raises:
- ValueError – If neither num_parts nor target_part_size has been specified.
- ValueError – If both num_parts and target_part_size have been specified.
- ValueError – If either num_parts or target_part_size is not a positive integer.
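
The argument rules documented above (exactly one of num_parts or target_part_size, positive values only, and size strings such as 100MB) can be sketched in plain Python. This is a minimal illustration and not sgkit's implementation: the helper names parse_size and check_partition_args are hypothetical, and the unit table assumes decimal abbreviations, which is consistent with 100MB being equivalent to 100_000_000.

```python
import re
from typing import Optional, Tuple, Union

# Assumption: decimal units, so that "100MB" == 100_000_000 as in the text above.
_UNITS = {"": 1, "B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9}


def parse_size(size: Union[int, str]) -> int:
    """Convert an int, or a string such as "100MB", to a number of bytes."""
    if isinstance(size, int):
        return size
    match = re.fullmatch(r"\s*(\d+)\s*([A-Za-z]*)\s*", size)
    if match is None or match.group(2).upper() not in _UNITS:
        raise ValueError(f"Unrecognised size: {size!r}")
    return int(match.group(1)) * _UNITS[match.group(2).upper()]


def check_partition_args(
    num_parts: Optional[int] = None,
    target_part_size: Union[None, int, str] = None,
) -> Tuple[Optional[int], Optional[int]]:
    """Apply the documented rules and return (num_parts, target size in bytes)."""
    if num_parts is None and target_part_size is None:
        raise ValueError("One of num_parts or target_part_size must be specified")
    if num_parts is not None and target_part_size is not None:
        raise ValueError("Only one of num_parts or target_part_size may be specified")
    if num_parts is not None:
        if not isinstance(num_parts, int) or num_parts < 1:
            raise ValueError("num_parts must be a positive integer")
        return num_parts, None
    size = parse_size(target_part_size)
    if size < 1:
        raise ValueError("target_part_size must be a positive integer")
    return None, size
```

With sgkit installed, a real call would follow the signature shown at the top of this page, for example partition_into_regions("calls.vcf.gz", target_part_size="100MB"), where the file name is a placeholder. Because both parameters are hints, the result should be checked for None (a single partition) before iterating over the region strings.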