sgkit.io.vcf.partition_into_regions
- sgkit.io.vcf.partition_into_regions(vcf_path, *, index_path=None, num_parts=None, target_part_size=None, storage_options=None)
Calculate genomic region strings to partition a compressed VCF or BCF file into roughly equal parts.
A .tbi or .csi file is used to find BGZF boundaries in the compressed VCF file, which are then used to divide the file into parts.
The number of parts can be specified directly by providing num_parts, or by specifying the desired size (in bytes) of each (compressed) part by providing target_part_size. Exactly one of num_parts or target_part_size must be provided.
Both num_parts and target_part_size serve as hints: the number of parts and their sizes may be more or less than these parameters.
- Parameters
- vcf_path : str | Path
The path to the VCF file.
- index_path : str | Path | None (default: None)
The path to the VCF index (.tbi or .csi). If not specified, the index path is constructed by appending the index suffix (.tbi or .csi) to the VCF path.
- num_parts : int | None (default: None)
The desired number of parts to partition the VCF file into.
- target_part_size : None | int | str (default: None)
The desired size, in bytes, of each (compressed) part of the partitioned VCF. If the value is a string, it may be specified using standard abbreviations, e.g. 100MB is equivalent to 100_000_000.
- storage_options : {str: str} | None (default: None)
Any additional parameters for the storage backend (see fsspec.open).
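To make the string form of target_part_size concrete, here is a small illustrative parser for decimal size abbreviations. It is a hypothetical helper (parse_size is not part of sgkit) that only demonstrates the equivalence between 100MB and 100_000_000; the parser sgkit actually relies on may accept more unit spellings.

```python
import re

# Hypothetical helper for illustration only; not part of sgkit.
# Assumes decimal (SI) units, matching the "100MB == 100_000_000" example.
_UNITS = {"": 1, "B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}


def parse_size(value):
    """Convert an int or a string such as '100MB' to a number of bytes."""
    if isinstance(value, int):
        return value
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([A-Za-z]*)\s*", value)
    if match is None:
        raise ValueError(f"Cannot parse size: {value!r}")
    number, unit = match.groups()
    unit = unit.upper()
    if unit not in _UNITS:
        raise ValueError(f"Unknown size unit: {unit!r}")
    return int(float(number) * _UNITS[unit])


print(parse_size("100MB"))  # 100000000
```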
- Return type
- Returns
The region strings that partition the VCF file, or None if the VCF file should not be partitioned (so there is only a single partition).
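Because the function returns None when the file should not be split, callers generally need to handle both cases. The sketch below assumes a compressed, indexed file at the hypothetical path sample.vcf.gz and uses a hypothetical helper, regions_or_whole_file, to normalize the return value. The sgkit call is wrapped in a function so that nothing runs on import.

```python
def regions_or_whole_file(regions):
    """Return a list of regions to process; [None] stands for the whole file.

    Hypothetical helper, not part of sgkit.
    """
    return [None] if regions is None else list(regions)


def plan_partitions(vcf_path, num_parts):
    # Imported lazily so the helper above is usable without sgkit installed.
    from sgkit.io.vcf import partition_into_regions

    regions = partition_into_regions(vcf_path, num_parts=num_parts)
    return regions_or_whole_file(regions)


# Example call (requires sgkit and an index file next to the VCF):
# for region in plan_partitions("sample.vcf.gz", num_parts=4):
#     print(region)  # a region string, or None for the whole file
```

Since num_parts is only a hint, the loop should iterate over whatever list comes back rather than assume exactly four entries.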
- Raises
ValueError – If neither num_parts nor target_part_size has been specified.
ValueError – If both num_parts and target_part_size have been specified.
ValueError – If either num_parts or target_part_size is not a positive integer.
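The three error conditions can be summarized as a small check. This is a sketch of the documented rules, not sgkit's source: check_partition_args is a hypothetical name, and it assumes target_part_size has already been converted to an integer number of bytes.

```python
def check_partition_args(num_parts=None, target_part_size=None):
    """Raise ValueError under the documented conditions.

    Hypothetical helper; assumes target_part_size is already an int.
    """
    if num_parts is None and target_part_size is None:
        raise ValueError("One of num_parts or target_part_size must be specified")
    if num_parts is not None and target_part_size is not None:
        raise ValueError("Only one of num_parts or target_part_size may be specified")
    # Exactly one argument is set at this point; check that it is a positive int.
    value = num_parts if num_parts is not None else target_part_size
    if not isinstance(value, int) or value < 1:
        raise ValueError("num_parts and target_part_size must be positive integers")


check_partition_args(num_parts=4)  # passes silently
```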