sgkit.io.vcf.vcf_to_zarrs
-
sgkit.io.vcf.vcf_to_zarrs(input, output, regions, chunk_length=10000, chunk_width=1000, output_storage_options=None, ploidy=2, mixed_ploidy=False, truncate_calls=False)
Convert VCF files to multiple Zarr on-disk stores, one per region.
- Parameters
- input : Union[str, Path, Sequence[Union[str, Path]]]
A path (or paths) to the input BCF or VCF file (or files). VCF files should be compressed and have a .tbi or .csi index file. BCF files should have a .csi index file.
- output : Union[str, Path]
Path to the directory containing the multiple Zarr output stores.
- regions : Union[None, Sequence[str], Sequence[Optional[Sequence[str]]]]
Genomic region or regions to extract variants for. For multiple inputs, regions are specified as a sequence of values, each of which may be None or a sequence of region strings.
- chunk_length : int (default: 10000)
Length (number of variants) of chunks in which data are stored.
- chunk_width : int (default: 1000)
Width (number of samples) to use when storing chunks in output.
- output_storage_options : Optional[Dict[str, str]] (default: None)
Any additional parameters for the storage backend, for the output (see fsspec.open).
- ploidy : int (default: 2)
The (maximum) ploidy of genotypes in the VCF file.
- mixed_ploidy : bool (default: False)
If True, genotype calls with fewer alleles than the specified ploidy will be padded with the non-allele sentinel value of -2. If False, calls with fewer alleles than the specified ploidy will be treated as incomplete and will be padded with the missing-allele sentinel value of -1.
- truncate_calls : bool (default: False)
If True, genotype calls with more alleles than the specified (maximum) ploidy value will be truncated to size ploidy. If False, calls with more alleles than the specified ploidy will raise an exception.
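The interaction of ploidy, mixed_ploidy and truncate_calls can be illustrated on a single genotype call. This is a minimal pure-Python sketch of the rules as documented above, not sgkit's implementation: the helper name `normalize_call` is hypothetical, and `ValueError` is an assumption, since the documentation only says that "an exception" is raised.

```python
from typing import List

MISSING = -1      # missing-allele sentinel, as documented for mixed_ploidy=False
NON_ALLELE = -2   # non-allele sentinel, as documented for mixed_ploidy=True


def normalize_call(
    alleles: List[int],
    ploidy: int = 2,
    mixed_ploidy: bool = False,
    truncate_calls: bool = False,
) -> List[int]:
    """Hypothetical helper mirroring the documented padding and truncation rules."""
    if len(alleles) > ploidy:
        if not truncate_calls:
            # The exception type is an assumption made for this sketch.
            raise ValueError(
                f"call has {len(alleles)} alleles, more than ploidy={ploidy}"
            )
        # Keep only the first `ploidy` alleles.
        return alleles[:ploidy]
    # Pad short calls with the sentinel selected by mixed_ploidy.
    fill = NON_ALLELE if mixed_ploidy else MISSING
    return alleles + [fill] * (ploidy - len(alleles))


print(normalize_call([0]))                                  # [0, -1]
print(normalize_call([0], mixed_ploidy=True))               # [0, -2]
print(normalize_call([0, 1, 1], truncate_calls=True))       # [0, 1]
```

With the defaults, a haploid call in a diploid dataset is therefore recorded as incomplete; setting mixed_ploidy=True records it as a genuinely shorter call.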
- Return type
Sequence[str]
- Returns
A list of URLs to the Zarr outputs.
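As a hedged sketch of how a call might be assembled for several inputs: the file names and output directory below are hypothetical, the helper `region_strings` is not part of sgkit, and the "contig:start-end" region format (1-based, inclusive, bcftools style) is an assumption. Passing None for an input is assumed here to mean that no regions are given for that file. The sgkit import is deferred inside `convert` so that the helper can be tried without sgkit or any VCF files present.

```python
from pathlib import Path
from typing import List, Optional, Sequence, Union

PathType = Union[str, Path]


def region_strings(contig: str, length: int, step: int) -> List[str]:
    """Split a contig into 1-based, inclusive "contig:start-end" strings."""
    return [
        f"{contig}:{start}-{min(start + step - 1, length)}"
        for start in range(1, length + 1, step)
    ]


def convert(
    inputs: Sequence[PathType],
    output_dir: PathType,
    regions: Sequence[Optional[Sequence[str]]],
) -> Sequence[str]:
    """Call vcf_to_zarrs with one regions entry per input file."""
    # Imported lazily so the helper above works without sgkit installed.
    from sgkit.io.vcf import vcf_to_zarrs

    return vcf_to_zarrs(
        inputs,
        output_dir,
        regions,
        chunk_length=10_000,  # variants per chunk (the documented default)
        chunk_width=1_000,    # samples per chunk (the documented default)
    )


inputs = ["a.vcf.gz", "b.vcf.gz"]  # hypothetical, compressed and indexed
regions = [region_strings("20", 250, 100), None]
print(regions)
# [['20:1-100', '20:101-200', '20:201-250'], None]

# urls = convert(inputs, "zarr_parts", regions)  # requires sgkit and real files
```

The returned list of URLs points at the Zarr stores written under the output directory.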