vcf2zarr#
Convert VCF data to the VCF Zarr specification reliably, in parallel or distributed over a cluster.
See the Tutorial for a step-by-step introduction and the CLI Reference for detailed documentation of command line options.
See the bioRxiv preprint for further details.
Quickstart#
First install bio2zarr.
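bio2zarr is available from PyPI, so a typical installation into an active Python environment looks like this:

# Install bio2zarr (provides the vcf2zarr command).
python -m pip install bio2zarr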
Get some indexed VCF data:
curl -O https://raw.githubusercontent.com/sgkit-dev/bio2zarr/main/tests/data/vcf/sample.vcf.gz
curl -O https://raw.githubusercontent.com/sgkit-dev/bio2zarr/main/tests/data/vcf/sample.vcf.gz.tbi
Convert to VCF Zarr in two steps:
vcf2zarr explode sample.vcf.gz sample.icf
vcf2zarr encode sample.icf sample.vcz
Tip
If the vcf2zarr executable doesn’t work, try python -m bio2zarr vcf2zarr instead.
Have a look at the results:
vcf2zarr inspect sample.vcz
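The inspect command can also be pointed at the intermediate columnar format (assuming the sample.icf produced above), which is useful for checking the first conversion step:

# Summarise the fields stored in the intermediate columnar format.
vcf2zarr inspect sample.icf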
What next?#
VCF Zarr is a starting point in what we hope will become a diverse ecosystem of packages that efficiently process VCF data in Zarr format. This ecosystem is in its infancy, and there isn’t much software available yet for performing off-the-shelf bioinformatics tasks with the format. As such, VCF Zarr isn’t yet suitable for end users who just want to get their work done, and is currently aimed at methods developers and early adopters.
Having said that, you can:
Use vcztools as a drop-in replacement for bcftools, transparently using Zarr on local storage or cloud stores as the backend.
Look at the VCF Zarr specification to see how data is mapped from VCF to Zarr
Use the mature Zarr Python package or one of the other Zarr implementations to access your data (a minimal example follows this list).
Use the many functions in our sgkit sister project to analyse the data. Note, however, that sgkit is under active development and its documentation may not be fully in sync with this project.
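As a taste of direct access, here is a minimal Python sketch using the Zarr Python package, assuming the sample.vcz store from the Quickstart (the call_genotype array name comes from the VCF Zarr specification):

import zarr

# Open the converted store read-only.
root = zarr.open("sample.vcz", mode="r")

# Each VCF field maps to a named Zarr array.
print(list(root.array_keys()))

# Genotype calls form a (variants, samples, ploidy) array.
gt = root["call_genotype"]
print(gt.shape)
print(gt[0])  # genotypes for the first variant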
For more information, please see our bioRxiv preprint Analysis-ready VCF at Biobank scale using Zarr.
How does it work?#
The conversion of VCF data to Zarr is a two-step process: explode first converts the VCF(s) to an intermediate columnar format (ICF), and encode then converts the ICF to VCF Zarr.
This two-step process allows vcf2zarr to determine the correct dimension of Zarr arrays corresponding to each VCF field, and to keep memory usage tightly bounded while writing the arrays.
Important
The intermediate columnar format is not intended for any use other than temporary storage while converting VCF to Zarr. The format may change between versions of bio2zarr.
Both explode and encode can be performed in parallel across cores on a single machine (via the --worker-processes argument), or distributed across a cluster using the three-part init, partition and finalise commands.
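A sketch of both patterns follows; the worker and partition counts are illustrative, each dexplode-partition invocation would normally be a separate cluster job, and the corresponding dencode-init, dencode-partition and dencode-finalise commands exist for the encode step (check vcf2zarr --help for the current options):

# Parallel conversion on a single machine, using 8 worker processes.
vcf2zarr explode sample.vcf.gz sample.icf --worker-processes 8
vcf2zarr encode sample.icf sample.vcz --worker-processes 8

# Distributed explode: initialise with 16 partitions, process each
# partition (typically one cluster job each), then finalise.
vcf2zarr dexplode-init sample.vcf.gz sample.icf -n 16
vcf2zarr dexplode-partition sample.icf 0   # repeat for partitions 0..15
vcf2zarr dexplode-finalise sample.icf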
Local alleles#
As discussed in our preprint, vcf2zarr has an experimental implementation of the local alleles data reduction technique. This essentially reduces the inner dimension of large fields such as AD by storing information relevant only to the alleles involved in a particular variant call, rather than information for all alleles. This can make a substantial difference when there is a large number of alleles.
To use local alleles, you must generate a storage schema (see the Medium dataset section of the tutorial) using the mkschema command with the --local-alleles option. This will generate the call_LA field, which lists the alleles observed for each genotype call, and translate supported fields from their global alleles representation to a local alleles representation, as sketched below.
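Here is a minimal sketch of that workflow, assuming the sample.icf from the Quickstart (check vcf2zarr mkschema --help for the current options):

# Generate a storage schema with local-alleles fields enabled.
vcf2zarr mkschema --local-alleles sample.icf > sample.schema.json

# Encode using that schema.
vcf2zarr encode sample.icf sample.vcz -s sample.schema.json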
Warning
Support for local alleles is preliminary and may change as we settle the details of how the alleles for a particular call are chosen and how many alleles are retained. Please open an issue on GitHub if you would like to help improve bio2zarr’s local alleles implementation.
Note
Only the PL and AD fields are currently supported for local alleles data reduction. Please comment on our local alleles fields tracking issue if you would like to see other fields supported, or to help out with implementing more.
Debugging#
When things go wrong with conversion, it’s very useful to generate some debugging information using the -v or -vv options. These can help identify what’s going wrong (usually running out of memory).
Warning
To get the full logging output you must use -p0, such that no multiprocessing is used. This means that tracking down problems can be slow, unfortunately. This is due to a bug in the way logging from child processes is handled; see issue 302 for details.
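For example, a minimal debugging run of the encode step might look like this (using the sample.icf from the Quickstart):

# Full debug logging, with multiprocessing disabled (-p0) so that no
# log messages are lost in child processes.
vcf2zarr encode -vv -p0 sample.icf sample.vcz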
Copying to object stores#
Todo
Document process of copying VCF Zarr datasets to an object store like S3. See Issue 234