sgkit.io.vcf.vcf_to_zarr
sgkit.io.vcf.vcf_to_zarr(input, output, *, regions=None, chunk_length=10000, chunk_width=1000, temp_chunk_length=None, tempdir=None, tempdir_storage_options=None, ploidy=2, mixed_ploidy=False, truncate_calls=False)
Convert VCF files to a single Zarr on-disk store.
For a single input and a single region, the conversion is carried out sequentially.
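A minimal sketch of the single-input case, assuming sgkit is installed with its VCF dependencies. The wrapper name convert_single_vcf and the file paths in the usage note are hypothetical and used only for illustration.

```python
from pathlib import Path


def convert_single_vcf(vcf_path, zarr_path):
    """Convert one indexed, compressed VCF into a Zarr store.

    With a single input and no regions, the conversion is expected to
    run sequentially, as described above.
    """
    # Imported lazily so this module can be loaded without sgkit installed.
    from sgkit.io.vcf import vcf_to_zarr

    vcf_to_zarr(
        str(vcf_path),
        str(zarr_path),
        chunk_length=10_000,  # variants per chunk (the documented default)
        chunk_width=1_000,    # samples per chunk (the documented default)
    )
    return Path(zarr_path)
```

A call might look like convert_single_vcf("sample.vcf.gz", "sample.zarr"), where sample.vcf.gz is a placeholder for a compressed VCF with a .tbi or .csi index next to it.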
For multiple inputs or regions, the conversion is carried out in parallel, by writing the output for each region to a separate, intermediate Zarr store in tempdir. Then, in a second step, the intermediate outputs are concatenated and rechunked into the final output Zarr store in output.
For more control over these two steps, consider using vcf_to_zarrs() followed by zarrs_to_dataset(), then saving the dataset using Xarray’s xarray.Dataset.to_zarr() method.
- Parameters
- input :
Union[str, Path, Sequence[Union[str, Path]]] A path (or paths) to the input BCF or VCF file (or files). VCF files should be compressed and have a .tbi or .csi index file. BCF files should have a .csi index file.
- output :
Union[str, Path, MutableMapping[str, bytes]] Zarr store or path to directory in file system.
- regions :
Union[None, Sequence[str], Sequence[Optional[Sequence[str]]]] (default: None) Genomic region or regions to extract variants for. For multiple inputs, multiple input regions are specified as a sequence of values which may be None, or a sequence of region strings.
- chunk_length :
int (default: 10000) Length (number of variants) of chunks in which data are stored, by default 10,000.
- chunk_width :
int (default: 1000) Width (number of samples) to use when storing chunks in output, by default 1,000.
- temp_chunk_length :
Optional[int] (default: None) Length (number of variants) of chunks for temporary intermediate files. Set this to be smaller than chunk_length to avoid memory errors when loading files with very large numbers of samples. Must be evenly divisible into chunk_length. Defaults to chunk_length if not set.
- tempdir :
Union[str, Path, None] (default: None) Temporary directory where intermediate files are stored. The default None means use the system default temporary directory.
- tempdir_storage_options :
Optional[Dict[str, str]] (default: None) Any additional parameters for the storage backend for tempdir (see fsspec.open).
- ploidy :
int (default: 2) The (maximum) ploidy of genotypes in the VCF file.
- mixed_ploidy :
bool (default: False) If True, genotype calls with fewer alleles than the specified ploidy will be padded with the non-allele sentinel value of -2. If False, calls with fewer alleles than the specified ploidy will be treated as incomplete and will be padded with the missing-allele sentinel value of -1.
- truncate_calls :
bool (default: False) If True, genotype calls with more alleles than the specified (maximum) ploidy value will be truncated to size ploidy. If False, calls with more alleles than the specified ploidy will raise an exception.
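The way ploidy, mixed_ploidy and truncate_calls interact can be sketched in plain Python. This is an illustration of the rules stated in the parameter descriptions, not sgkit's implementation; normalize_call and the constant names are hypothetical, and the exception type raised by sgkit is not specified here, so ValueError is an assumption.

```python
MISSING = -1      # missing-allele sentinel (used when mixed_ploidy=False)
NON_ALLELE = -2   # non-allele sentinel (used when mixed_ploidy=True)


def normalize_call(alleles, ploidy=2, mixed_ploidy=False, truncate_calls=False):
    """Pad or truncate one genotype call to exactly `ploidy` alleles."""
    alleles = list(alleles)
    if len(alleles) > ploidy:
        if not truncate_calls:
            # Assumption: the real exception type may differ.
            raise ValueError(
                f"call has {len(alleles)} alleles, more than ploidy={ploidy}"
            )
        return alleles[:ploidy]
    # Shorter calls are padded; the sentinel depends on mixed_ploidy.
    fill = NON_ALLELE if mixed_ploidy else MISSING
    return alleles + [fill] * (ploidy - len(alleles))


haploid = normalize_call([0], ploidy=2)                          # [0, -1]
haploid_mixed = normalize_call([0], ploidy=2, mixed_ploidy=True)  # [0, -2]
triploid_cut = normalize_call([0, 1, 1], ploidy=2, truncate_calls=True)  # [0, 1]
```

With the defaults, a haploid call in a diploid dataset is indistinguishable from a diploid call with one missing allele; setting mixed_ploidy=True keeps the two cases apart by using -2 as the fill value.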
- Return type