sgkit.io.vcf.vcf_to_zarr
- sgkit.io.vcf.vcf_to_zarr(input, output, *, target_part_size='auto', regions=None, chunk_length=10000, chunk_width=1000, compressor=Blosc(cname='zstd', clevel=7, shuffle=AUTOSHUFFLE, blocksize=0), encoding=None, temp_chunk_length=None, tempdir=None, tempdir_storage_options=None, ploidy=2, mixed_ploidy=False, truncate_calls=False, max_alt_alleles=3, fields=None, exclude_fields=None, field_defs=None, read_chunk_length=None, retain_temp_files=None)
Convert VCF files to a single Zarr on-disk store.
Deprecated since version 0.9.0: Functions for reading VCF are deprecated, please use the bio2zarr package.
By default, the conversion is carried out in parallel, by writing the output for each part to a separate, intermediate Zarr store in tempdir. Then, in a second step, the intermediate outputs are concatenated and rechunked into the final output Zarr store in output.
Conversion is carried out sequentially if target_part_size is None and regions is None.
For more control over these two steps, consider using vcf_to_zarrs() followed by concat_zarrs().
- Parameters:
- input
str | Path | Sequence[Union[str, Path]]
A path (or paths) to the input BCF or VCF file (or files). VCF files should be compressed and have a .tbi or .csi index file. BCF files should have a .csi index file.
- output
str | Path | MutableMapping[str, bytes]
Zarr store or path to directory in file system.
- target_part_size
None | int | str (default: 'auto')
The desired size, in bytes, of each (compressed) part of the input to be processed in parallel. Defaults to "auto", which will pick a good size (currently 20MB). A value of None means that the input will be processed sequentially. The setting will be ignored if regions is also specified.
- regions
None | Sequence[str] | Sequence[Optional[Sequence[str]]] (default: None)
Genomic region or regions to extract variants for. For multiple inputs, multiple input regions are specified as a sequence of values which may be None, or a sequence of region strings. Takes priority over target_part_size if both are not None.
- chunk_length
int (default: 10000)
Length (number of variants) of chunks in which data are stored, by default 10,000.
- chunk_width
int (default: 1000)
Width (number of samples) to use when storing chunks in output, by default 1,000.
- compressor
Any | None (default: Blosc(cname='zstd', clevel=7, shuffle=AUTOSHUFFLE, blocksize=0))
Zarr compressor, by default Blosc + zstd with compression level 7 and auto-shuffle. No compression is used when set as None.
- encoding
Any | None (default: None)
Variable-specific encodings for xarray, specified as a nested dictionary with variable names as keys and dictionaries of variable-specific encodings as values. Can be used to override Zarr compressor and filters on a per-variable basis, e.g., {"call_genotype": {"compressor": Blosc("zstd", 9)}}.
- temp_chunk_length
int | None (default: None)
Length (number of variants) of chunks for temporary intermediate files. Set this to be smaller than chunk_length to avoid memory errors when loading files with very large numbers of samples. Must divide evenly into chunk_length. Defaults to chunk_length if not set.
- tempdir
str | Path | None (default: None)
Temporary directory where intermediate files are stored. The default None means use the system default temporary directory.
- tempdir_storage_options
{str: str} | None (default: None)
Any additional parameters for the storage backend for tempdir (see fsspec.open).
- ploidy
int (default: 2)
The (maximum) ploidy of genotypes in the VCF file.
- mixed_ploidy
bool (default: False)
If True, genotype calls with fewer alleles than the specified ploidy will be padded with the fill (non-allele) sentinel value of -2. If False, calls with fewer alleles than the specified ploidy will be treated as incomplete and will be padded with the missing-allele sentinel value of -1.
- truncate_calls
bool (default: False)
If True, genotype calls with more alleles than the specified (maximum) ploidy value will be truncated to size ploidy. If False, calls with more alleles than the specified ploidy will raise an exception.
- max_alt_alleles
int (default: 3)
The (maximum) number of alternate alleles in the VCF file. Any records with more than this number of alternate alleles will have the extra alleles dropped (the variant_allele variable will be truncated). Any call genotype fields with the extra alleles will be changed to the missing-allele sentinel value of -1.
- fields
Sequence[str] | None (default: None)
Extra fields to extract data for. A list of strings, with INFO or FORMAT prefixes. Wildcards are permitted too, for example: ["INFO/*", "FORMAT/DP"].
- field_defs
{str: {str: Any}} | None (default: None)
Per-field information that overrides the field definitions in the VCF header, or provides extra information needed in the dataset representation. Definitions are represented as a dictionary whose keys are the field names, and whose values are dictionaries with any of the following keys: Number, Type, Description, dimension. The first three correspond to VCF header values, and dimension is the name of the final dimension in the array for the case where Number is a fixed integer larger than 1. For example, {"INFO/AC": {"Number": "A"}, "FORMAT/HQ": {"dimension": "haplotypes"}} overrides the INFO/AC field to be Number A (useful if the VCF defines it as having variable length with .), and names the final dimension of the HQ array (which is defined as Number 2 in the VCF header) as haplotypes. (Note that Number A is the number of alternate alleles; see section 1.4.2 of the VCF spec https://samtools.github.io/hts-specs/VCFv4.3.pdf.)
- read_chunk_length
int | None (default: None)
Length (number of variants) of chunks to read from the VCF file at a time. Use this option to reduce memory usage by using a value lower than chunk_length, with a small cost in extra run time. The increase in runtime becomes higher as the ratio of read_chunk_length to chunk_length decreases. Defaults to None, which means that a value equal to chunk_length is used. The memory usage of the conversion process is proportional to read_chunk_length*n_samples*(1+n_ploidy), so this option is mainly useful for very large numbers of samples and/or where a large chunk_length is desirable to reduce the number of dask tasks needed in downstream analysis.
- retain_temp_files
bool | None (default: None)
If True, intermediate files are retained after the final output is written. Defaults to deleting intermediate files. Intermediate files are deleted in a single process, so for large VCF files this can be slow.
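The sentinel rules described for ploidy, mixed_ploidy, truncate_calls and max_alt_alleles can be illustrated with a small pure-Python sketch. It does not call sgkit and is not sgkit's implementation: normalize_call is a hypothetical helper, and the order in which truncation and allele dropping are applied is an assumption made for illustration only.

```python
MISSING = -1  # missing-allele sentinel from the parameter descriptions
FILL = -2     # fill (non-allele) sentinel from the parameter descriptions


def normalize_call(alleles, ploidy=2, mixed_ploidy=False,
                   truncate_calls=False, max_alt_alleles=3):
    """Hypothetical helper (not sgkit code): apply the documented rules to one call.

    `alleles` holds allele indices: 0 is the reference, 1 and up are alternates.
    """
    call = list(alleles)
    if len(call) > ploidy:
        if not truncate_calls:
            raise ValueError(f"call has {len(call)} alleles, ploidy is {ploidy}")
        call = call[:ploidy]  # truncate_calls=True: keep the first `ploidy` alleles
    # Alternates beyond max_alt_alleles are dropped, so calls using them become missing.
    call = [MISSING if a > max_alt_alleles else a for a in call]
    # Short calls are padded with the fill value (mixed ploidy) or the missing value.
    pad = FILL if mixed_ploidy else MISSING
    return call + [pad] * (ploidy - len(call))


print(normalize_call([0, 1]))                           # [0, 1]
print(normalize_call([1]))                              # [1, -1]
print(normalize_call([1], mixed_ploidy=True))           # [1, -2]
print(normalize_call([0, 1, 1], truncate_calls=True))   # [0, 1]
print(normalize_call([0, 4]))                           # [0, -1]
```

With the defaults, a call with more alleles than ploidy raises ValueError, matching the truncate_calls=False behaviour described above.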
- Return type:
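As a closing illustration of how the chunking parameters interact, the sketch below derives a temp_chunk_length that divides chunk_length evenly and compares the memory proportionality term quoted for read_chunk_length. The helper names relative_memory, plan_chunks and convert are hypothetical, as are the file paths; reusing the same reduced value for both temp_chunk_length and read_chunk_length is an assumption, not a requirement stated above. The call to vcf_to_zarr is imported lazily and left commented out, since it needs sgkit with VCF support and real input files.

```python
def relative_memory(read_chunk_length, n_samples, ploidy=2):
    """Unitless proportionality term: read_chunk_length*n_samples*(1+n_ploidy)."""
    return read_chunk_length * n_samples * (1 + ploidy)


def plan_chunks(chunk_length=10_000, shrink=1):
    """Hypothetical helper: build keyword arguments with reduced temp/read chunks.

    `shrink` must divide chunk_length, which guarantees that the reduced
    value divides chunk_length evenly, as temp_chunk_length requires.
    """
    if chunk_length % shrink != 0:
        raise ValueError("shrink must divide chunk_length evenly")
    smaller = chunk_length // shrink
    return {
        "chunk_length": chunk_length,
        "temp_chunk_length": smaller,
        "read_chunk_length": smaller,
    }


def convert(vcf_path, zarr_path, **kwargs):
    """Run the conversion; requires sgkit with VCF support to be installed."""
    # Imported lazily so the helpers above can be used without sgkit.
    from sgkit.io.vcf import vcf_to_zarr

    vcf_to_zarr(vcf_path, zarr_path, **kwargs)


kwargs = plan_chunks(chunk_length=10_000, shrink=4)
print(kwargs)
# {'chunk_length': 10000, 'temp_chunk_length': 2500, 'read_chunk_length': 2500}

# Reading 2,500 variants at a time instead of 10,000 cuts the memory term by 4x.
print(relative_memory(10_000, 1_000) // relative_memory(2_500, 1_000))  # 4

# convert("input.vcf.gz", "output.zarr", **kwargs)  # hypothetical paths
```

Because the function is deprecated since version 0.9.0, a sketch like this is mainly relevant to existing code that has not yet moved to bio2zarr.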