sgkit.io.vcf.write_vcf
- sgkit.io.vcf.write_vcf(input, output, *, vcf_header=None)
Convert a dataset to a VCF file.
The VCF header to use is dictated by either the vcf_header parameter or the vcf_header attribute on the input dataset.
If specified, the vcf_header parameter will be used, and any variables in the dataset that are not in this header will not be included in the output.
If the vcf_header parameter is left as the default (None) and a vcf_header attribute is present in the dataset (such as one created by vcf_to_zarr()), the attribute will be used to generate the new VCF header. In this case, any variables in the dataset that are not specified in this header will have corresponding header lines added, and any lines in the header without a corresponding variable in the dataset will be omitted.
If there is neither a vcf_header parameter nor a vcf_header attribute, a VCF header will be generated that includes all variables in the dataset.
Float fields are written with up to 3 decimal places of precision. Exponent/scientific notation is not supported, so values less than 5e-4 will be rounded to zero.
Data is written sequentially to VCF, using Numba to optimize write throughput. Speeds in the region of 100 MB/s have been observed on an Apple M1 machine from 2020.
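The header rules described above can be summarized in a small sketch. This is only a model of the documented behaviour, not sgkit's implementation: the function name resolve_header_fields is hypothetical, and fields are represented as plain sets of names rather than real header text or dataset variables.

```python
from typing import Optional, Set, Tuple


def resolve_header_fields(
    dataset_fields: Set[str],
    param_header_fields: Optional[Set[str]] = None,
    attr_header_fields: Optional[Set[str]] = None,
) -> Tuple[str, Set[str]]:
    """Return (header source, fields written) under the documented rules.

    Hypothetical helper for illustration only; sgkit works on header text
    and dataset variables, not on sets of names.
    """
    if param_header_fields is not None:
        # The explicit parameter takes priority. Dataset variables that the
        # header does not declare are left out of the output.
        return "parameter", dataset_fields & param_header_fields
    if attr_header_fields is not None:
        # The attribute is a starting point: header lines are added for
        # undeclared variables and lines without a variable are omitted,
        # so the output covers exactly the dataset's variables.
        return "attribute", set(dataset_fields)
    # No header anywhere: one is generated covering every variable.
    return "generated", set(dataset_fields)
```

For instance, with dataset fields {"DP", "AF"} and a parameter header declaring only "DP", the model reports that only "DP" is written, whereas the same header supplied as a dataset attribute would lead to both fields being written.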
Data is loaded into memory in chunks sized according to the chunking along the variants dimension. Chunking in other dimensions (such as samples) is ignored for the purposes of writing VCF. If the dataset is not chunked (because it does not originate from Zarr or Dask, for example), then it will all be loaded into memory at once.
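As a rough illustration of the chunked loading described above, the following sketch computes which variant ranges would be held in memory one at a time. The function variant_chunk_ranges is a hypothetical helper, and chunk_sizes is assumed to be a tuple of sizes along the variants dimension, in the form xarray reports via ds.chunks["variants"] for a Dask-backed dataset.

```python
from typing import Iterator, Optional, Sequence, Tuple


def variant_chunk_ranges(
    n_variants: int, chunk_sizes: Optional[Sequence[int]] = None
) -> Iterator[Tuple[int, int]]:
    """Yield (start, stop) variant index ranges, one per in-memory load.

    Hypothetical helper for illustration; not part of sgkit.
    """
    if chunk_sizes is None:
        # Unchunked dataset: all variants are loaded at once.
        chunk_sizes = [n_variants]
    if sum(chunk_sizes) != n_variants:
        raise ValueError("chunk sizes must add up to the number of variants")
    start = 0
    for size in chunk_sizes:
        stop = start + size
        yield start, stop
        start = stop
```

Under this model, peak memory is governed by the largest chunk along variants. If a dataset is unchunked or has very large chunks, rechunking it first with xarray, for example ds.chunk({"variants": 10_000}), should reduce the amount of data loaded at a time; the chunk size shown is an arbitrary example value.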
The output is not compressed or indexed. It is therefore recommended to post-process the output using external tools such as bgzip(1), bcftools(1), or tabix(1).
This example shows how to convert a Zarr dataset to bgzip-compressed VCF by writing it to standard output then applying an external compressor:
python -c 'import sys; from sgkit.io.vcf import zarr_to_vcf; zarr_to_vcf("in.zarr", sys.stdout)' | bgzip > out.vcf.gz
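The same pipeline can be driven from within Python, since output accepts a text file object. The sketch below is one possible approach using the standard subprocess module; write_through_filter is a hypothetical helper name, not an sgkit function, and it assumes the external command reads from standard input and writes to standard output.

```python
import subprocess
import sys
from typing import Callable, Sequence, TextIO


def write_through_filter(
    write: Callable[[TextIO], None], command: Sequence[str], out_path: str
) -> None:
    """Stream text produced by `write` through `command` into `out_path`.

    Hypothetical helper for illustration; not part of sgkit.
    """
    with open(out_path, "wb") as out:
        # The child's stdin is a text-mode pipe; its stdout goes to the file.
        proc = subprocess.Popen(
            command, stdin=subprocess.PIPE, stdout=out, text=True
        )
        try:
            write(proc.stdin)
        finally:
            # Closing stdin signals end of input so the child can finish.
            proc.stdin.close()
            returncode = proc.wait()
    if returncode != 0:
        raise subprocess.CalledProcessError(returncode, command)
```

Assuming ds is a dataset already loaded in the session and bgzip is on the PATH, a call such as write_through_filter(lambda f: write_vcf(ds, f), ["bgzip"], "out.vcf.gz") should produce the compressed file without an intermediate uncompressed copy.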
- Parameters:
- input
Dataset
Dataset to convert to VCF.
- output
str | Path | TextIO
A path or text file object that the output VCF should be written to.
- vcf_header
str | None (default: None)
The VCF header to use (including the line starting with #CHROM). If None, then a header will be generated from the dataset vcf_header attribute (if present), or from scratch otherwise.
- Return type: