sgkit.io.vcf.write_vcf

sgkit.io.vcf.write_vcf(input, output, *, vcf_header=None)

Convert a dataset to a VCF file.

The VCF header to use is dictated by either the vcf_header parameter or the vcf_header attribute on the input dataset.

If specified, the vcf_header parameter will be used, and any variables in the dataset that are not in this header will not be included in the output.

If the vcf_header parameter is left as the default (None) and a vcf_header attribute is present in the dataset (such as one created by vcf_to_zarr()), it will be used to generate the new VCF header. In this case, any variables in the dataset that are not specified in this header will have corresponding header lines added, and any lines in the header without a corresponding variable in the dataset will be omitted.

In the case of no vcf_header parameter or attribute, a VCF header will be generated, and will include all variables in the dataset.
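The three cases above amount to a simple precedence rule. The following is a minimal sketch of that rule only, not sgkit's implementation; the helper name `header_source` and its return labels are hypothetical and exist purely for illustration.

```python
from typing import Mapping, Optional


def header_source(vcf_header: Optional[str], attrs: Mapping[str, object]) -> str:
    """Return which documented case applies (hypothetical helper, not sgkit API)."""
    if vcf_header is not None:
        # Explicit parameter wins: dataset variables missing from this
        # header are left out of the output.
        return "parameter"
    if "vcf_header" in attrs:
        # Dataset attribute: header lines are added for extra variables and
        # dropped for lines with no matching variable.
        return "attribute"
    # Neither is present: a header covering all variables is generated.
    return "generated"
```

For example, `header_source(None, {"vcf_header": "##fileformat=VCFv4.3\n"})` falls into the second case, which is the situation for a dataset produced by vcf_to_zarr().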

Float fields are written with up to 3 decimal places of precision. Exponent/scientific notation is not supported, so values with a magnitude less than 5e-4 will be rounded to zero.
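To see the effect of this rounding on small values, the sketch below formats floats with fixed-point notation and at most 3 decimal places. It is plain Python written for illustration, not the Numba routine sgkit uses, and it assumes that "up to 3 decimal places" means trailing zeros are dropped; the name `format_float_3dp` is hypothetical.

```python
def format_float_3dp(value: float) -> str:
    """Fixed-point text with at most 3 decimal places (illustrative only)."""
    text = f"{value:.3f}"               # never uses exponent notation
    text = text.rstrip("0").rstrip(".")  # assumption: drop trailing zeros
    # A tiny negative value rounds to "-0"; report it as plain zero.
    return "0" if text == "-0" else text


print(format_float_3dp(0.1234))  # 0.123
print(format_float_3dp(0.0004))  # 0
print(format_float_3dp(1e-5))    # 0
```

Fields that carry very small values (such as p-values) lose all information under this scheme, so it may be worth rescaling or transforming them before export.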

Data is written sequentially to VCF, using Numba to optimize write throughput. Speeds in the region of 100 MB/s have been observed on an Apple M1 machine from 2020.

Data is loaded into memory in chunks sized according to the chunking along the variants dimension. Chunking in other dimensions (such as samples) is ignored for the purposes of writing VCF. If the dataset is not chunked (because it does not originate from Zarr or Dask, for example), then it will all be loaded into memory at once.
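Because the amount of data held in memory follows the chunk size along the variants dimension, one way to bound it for a lazily loaded dataset is to rechunk along that dimension before writing. The sketch below assumes `ds` is an xarray Dataset with a `variants` dimension and that Dask is installed; the helper name `rechunk_for_vcf` and the default chunk size are hypothetical.

```python
def rechunk_for_vcf(ds, variants_per_chunk=10_000):
    """Rechunk along `variants` only (hypothetical helper, not sgkit API).

    xarray's Dataset.chunk accepts a mapping from dimension name to chunk
    size. Other dimensions such as `samples` are left alone, since their
    chunking is ignored when writing VCF.
    """
    return ds.chunk({"variants": variants_per_chunk})
```

This does not reduce memory use for a dataset whose arrays are already fully loaded as NumPy arrays; it only controls how much of a lazily loaded dataset is materialized at a time.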

The output is not compressed or indexed. It is therefore recommended to post-process the output using external tools such as bgzip(1), bcftools(1), or tabix(1).

This example shows how to convert a Zarr dataset to bgzip-compressed VCF by writing it to standard output then applying an external compressor:

python -c 'import sys; from sgkit.io.vcf import zarr_to_vcf; zarr_to_vcf("in.zarr", sys.stdout)' \
    | bgzip > out.vcf.gz

Parameters:

input Dataset: Dataset to convert to VCF.
output str | Path | TextIO: A path or text file object that the output VCF should be written to.
vcf_header str | None (default: None): The VCF header to use (including the line starting with #CHROM). If None, then a header will be generated from the dataset vcf_header attribute (if present), or from scratch otherwise.

Return type:

None
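A minimal usage sketch, based only on the signature shown above: it passes an open text file as `output` and forwards an optional header string through the keyword-only `vcf_header` parameter. The wrapper name `export_to_vcf` is hypothetical, and `ds` is assumed to be a dataset in the form sgkit expects. The import sits inside the function so the module can be loaded without sgkit installed.

```python
from pathlib import Path


def export_to_vcf(ds, out_path, header=None):
    """Write `ds` to an uncompressed VCF file (hypothetical wrapper)."""
    # Imported lazily; requires sgkit to be installed when called.
    from sgkit.io.vcf import write_vcf

    out_path = Path(out_path)
    with open(out_path, "w") as f:
        # `output` may be a path or a text file object; a file object is used here.
        # header=None lets write_vcf fall back to the dataset's vcf_header
        # attribute, or generate a header from scratch.
        write_vcf(ds, f, vcf_header=header)
    return out_path
```

The resulting file is plain text, so it can then be compressed and indexed with bgzip(1) and tabix(1) as recommended above.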