vcf2zarr#
Convert VCF data to the VCF Zarr specification reliably, in parallel or distributed over a cluster.
See the Tutorial for a step-by-step introduction and the CLI Reference for detailed documentation on command line options.
Quickstart#
First install bio2zarr.
Get some indexed VCF data:
curl -O https://raw.githubusercontent.com/sgkit-dev/bio2zarr/main/tests/data/vcf/sample.vcf.gz
curl -O https://raw.githubusercontent.com/sgkit-dev/bio2zarr/main/tests/data/vcf/sample.vcf.gz.tbi
Convert to VCF Zarr in two steps:
vcf2zarr explode sample.vcf.gz sample.icf
vcf2zarr encode sample.icf sample.vcz
Tip
If the vcf2zarr executable doesn’t work, try python -m bio2zarr vcf2zarr instead.
Have a look at the results:
vcf2zarr inspect sample.vcz
What next?#
VCF Zarr is a starting point in what we hope will become a diverse ecosystem of packages that efficiently process VCF data in Zarr format. However, this ecosystem does not exist yet, and there isn’t much software available for working with the format. As such, VCF Zarr isn’t yet suitable for end users who just want to get their work done.
Having said that, you can:
Look at the VCF Zarr specification to see how data is mapped from VCF to Zarr.
Use the mature Zarr Python package or one of the other Zarr implementations to access your data.
Use the many functions in our sgkit sister project to analyse the data. Note, however, that sgkit is under active development and its documentation may not be fully in sync with this project.
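As a minimal sketch of the Zarr Python option above, the helper below opens a store read-only and lists its top-level arrays. The function name summarise_store is hypothetical, and zarr is a third-party package that must be installed separately. Only generic group calls are used, so no particular VCF Zarr array names are assumed.

```python
def summarise_store(path):
    """Return {array_name: (shape, dtype)} for the top-level arrays in a Zarr store."""
    # zarr is third-party; it is imported lazily so that defining this
    # helper does not itself require the package to be installed.
    import zarr

    root = zarr.open(path, mode="r")  # read-only, so the store cannot be modified
    return {
        name: (root[name].shape, str(root[name].dtype))
        for name in sorted(root.array_keys())
    }

# Example call (assumes the Quickstart output exists in the working directory):
# for name, (shape, dtype) in summarise_store("sample.vcz").items():
#     print(name, shape, dtype)
```

This prints much the same kind of overview as vcf2zarr inspect, but from Python, which may be a convenient starting point for further processing.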
How does it work?#
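The conversion of VCF data to Zarr is a two-step process:
Convert the VCF file(s) to an intermediate columnar format (ICF) using vcf2zarr explode.
Encode the ICF into VCF Zarr using vcf2zarr encode.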
This two-step process allows vcf2zarr to determine the correct dimension of Zarr arrays corresponding to each VCF field, and to keep memory usage tightly bounded while writing the arrays.
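The benefit of splitting the work into two passes can be illustrated with a toy example. This is only a sketch of the general idea, not bio2zarr's implementation: the function names, the list-of-lists input and the padding value are all assumptions made for the illustration. The first pass scans every record to find the widest value for a field, and the second pass writes fixed-width rows a chunk at a time, so only one chunk is held in memory.

```python
from itertools import islice


def scan_max_len(records):
    """Pass 1: find the largest number of values any record holds for a field."""
    return max((len(values) for values in records), default=0)


def write_padded(records, width, chunk_size, fill=-1):
    """Pass 2: yield fixed-width chunks, holding at most chunk_size rows at once.

    The fill value is an arbitrary placeholder chosen for this illustration.
    """
    it = iter(records)
    while True:
        rows = list(islice(it, chunk_size))
        if not rows:
            break
        yield [list(row) + [fill] * (width - len(row)) for row in rows]


# Hypothetical per-variant values for one field, with varying lengths.
records = [[1], [2, 3, 4], [], [5, 6]]
width = scan_max_len(records)  # the array's second dimension is known before writing
chunks = list(write_padded(records, width, chunk_size=2))
```

Without the first pass, the width of the array would not be known until the last record had been read, which would force either resizing or holding everything in memory.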
Important
The intermediate columnar format is not intended for any use other than temporary storage while converting VCF to Zarr. The format may change between versions of bio2zarr.
Both explode and encode can be performed in parallel across cores on a single machine (via the --worker-processes argument) or distributed across a cluster by the three-part init, partition and finalise commands.
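For the single-machine case, a small driver script might look like the sketch below. It passes --worker-processes to both steps, as described above; the helper names, the file names and the worker count of 4 are illustrative assumptions. The distributed init, partition and finalise commands are not shown, because their exact names and arguments are not given here; see the CLI Reference for those.

```python
import shutil
import subprocess


def conversion_commands(vcf, icf, vcz, workers):
    """Build the explode and encode command lines for a single machine."""
    parallel = ["--worker-processes", str(workers)]
    return [
        ["vcf2zarr", "explode", *parallel, vcf, icf],
        ["vcf2zarr", "encode", *parallel, icf, vcz],
    ]


def run_conversion(vcf, icf, vcz, workers=4):
    """Run both steps in order, stopping at the first failure."""
    if shutil.which("vcf2zarr") is None:
        raise RuntimeError("vcf2zarr executable not found on PATH")
    for command in conversion_commands(vcf, icf, vcz, workers):
        # check=True raises CalledProcessError if a step exits with a non-zero status.
        subprocess.run(command, check=True)

# Example call (not executed here):
# run_conversion("sample.vcf.gz", "sample.icf", "sample.vcz", workers=4)
```

Keeping command construction separate from execution makes it easy to log or review the exact commands before running them.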
Copying to object stores#
Todo
Document process of copying VCF Zarr datasets to an object store like S3. See Issue 234