Tutorial
This is a step-by-step tutorial showing you how to convert your VCF data into Zarr format. There are three different ways to convert your data, providing different levels of convenience and flexibility to suit small, intermediate and large datasets.
Warning
The documentation of vcf2zarr is under development, and some bits are more polished than others. This “tutorial” is experimental, and will likely evolve into a slightly different format in the near future. It is a work in progress and incomplete. The CLI Reference should be complete and authoritative, however.
Small dataset
The simplest way to convert VCF data to Zarr is to use the vcf2zarr convert command:
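vcf2zarr convert sample.vcf.gz sample.vcz
(Here sample.vcf.gz is the example input used throughout this tutorial; substitute your own VCF and output path.)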
This converts the input VCF file to the output sample.vcz Zarr data in a single pass. The Zarr dataset is stored in the file system, and you can have a look around using the usual tools:
ls sample.vcz
call_DP filter_id variant_DP variant_id_mask
call_GQ sample_id variant_H2 variant_length
call_HQ variant_AA variant_NS variant_position
call_genotype variant_AC variant_allele variant_quality
call_genotype_mask variant_AF variant_contig
call_genotype_phased variant_AN variant_filter
contig_id variant_DB variant_id
Each of the directories corresponds to an array in Zarr, or one of the fields in your VCF. The chunk data for each of these arrays is then stored hierarchically within these directories:
tree sample.vcz
sample.vcz
├── call_DP
│   └── 0
│       └── 0
├── call_GQ
│   └── 0
│       └── 0
├── call_HQ
│   └── 0
│       └── 0
│           └── 0
├── call_genotype
│   └── 0
│       └── 0
│           └── 0
├── call_genotype_mask
│   └── 0
│       └── 0
│           └── 0
├── call_genotype_phased
│   └── 0
│       └── 0
├── contig_id
│   └── 0
├── filter_id
│   └── 0
├── sample_id
│   └── 0
├── variant_AA
│   └── 0
├── variant_AC
│   └── 0
│       └── 0
├── variant_AF
│   └── 0
│       └── 0
├── variant_AN
│   └── 0
├── variant_DB
│   └── 0
├── variant_DP
│   └── 0
├── variant_H2
│   └── 0
├── variant_NS
│   └── 0
├── variant_allele
│   └── 0
│       └── 0
├── variant_contig
│   └── 0
├── variant_filter
│   └── 0
│       └── 0
├── variant_id
│   └── 0
├── variant_id_mask
│   └── 0
├── variant_length
│   └── 0
├── variant_position
│   └── 0
└── variant_quality
    └── 0
38 directories, 25 files
You can get a better idea of what’s being stored and the sizes of the different fields using the vcf2zarr inspect command:
vcf2zarr inspect sample.vcz
name dtype stored size ratio nchunks chunk_size avg_chunk_stored shape chunk_shape compressor filters
--------------------- ------- --------- --------- ------- --------- ------------ ------------------ --------- ---------------- -------------------------------------------------------------- ------------
/call_genotype int8 14.19 KiB 54 bytes 0.0037 1 54.0 bytes 14.19 KiB (9, 3, 2) (10000, 1000, 2) Blosc(cname='zstd', clevel=7, shuffle=BITSHUFFLE, blocksize=0) None
/call_HQ int8 14.18 KiB 54 bytes 0.0037 1 54.0 bytes 14.18 KiB (9, 3, 2) (10000, 1000, 2) Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0) None
/call_genotype_mask bool 14.12 KiB 54 bytes 0.0037 1 54.0 bytes 14.12 KiB (9, 3, 2) (10000, 1000, 2) Blosc(cname='zstd', clevel=7, shuffle=BITSHUFFLE, blocksize=0) None
/call_GQ int8 9.34 KiB 27 bytes 0.0028 1 27.0 bytes 9.34 KiB (9, 3) (10000, 1000) Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0) None
/call_DP int8 9.33 KiB 27 bytes 0.0028 1 27.0 bytes 9.33 KiB (9, 3) (10000, 1000) Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0) None
/call_genotype_phased bool 9.3 KiB 27 bytes 0.0028 1 27.0 bytes 9.3 KiB (9, 3) (10000, 1000) Blosc(cname='zstd', clevel=7, shuffle=BITSHUFFLE, blocksize=0) None
/variant_allele object 8.66 KiB 288 bytes 0.032 1 288.0 bytes 8.66 KiB (9, 4) (10000, 4) Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0) [VLenUTF8()]
/variant_AC int8 8.58 KiB 18 bytes 0.002 1 18.0 bytes 8.58 KiB (9, 2) (10000, 2) Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0) None
/variant_AF float32 8.55 KiB 72 bytes 0.0082 1 72.0 bytes 8.55 KiB (9, 2) (10000, 2) Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0) None
/variant_filter bool 8.53 KiB 27 bytes 0.0031 1 27.0 bytes 8.53 KiB (9, 3) (10000, 3) Blosc(cname='zstd', clevel=7, shuffle=BITSHUFFLE, blocksize=0) None
/variant_id object 4.6 KiB 72 bytes 0.015 1 72.0 bytes 4.6 KiB (9,) (10000,) Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0) [VLenUTF8()]
/variant_contig int8 4.57 KiB 9 bytes 0.0019 1 9.0 bytes 4.57 KiB (9,) (10000,) Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0) None
/variant_AA object 4.54 KiB 72 bytes 0.015 1 72.0 bytes 4.54 KiB (9,) (10000,) Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0) [VLenUTF8()]
/variant_quality float32 4.52 KiB 36 bytes 0.0078 1 36.0 bytes 4.52 KiB (9,) (10000,) Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0) None
/variant_position int32 4.51 KiB 36 bytes 0.0078 1 36.0 bytes 4.51 KiB (9,) (10000,) Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0) None
/variant_AN int8 4.51 KiB 9 bytes 0.002 1 9.0 bytes 4.51 KiB (9,) (10000,) Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0) None
/variant_length int8 4.5 KiB 9 bytes 0.002 1 9.0 bytes 4.5 KiB (9,) (10000,) Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0) None
/variant_NS int8 4.49 KiB 9 bytes 0.002 1 9.0 bytes 4.49 KiB (9,) (10000,) Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0) None
/variant_DB bool 4.49 KiB 9 bytes 0.002 1 9.0 bytes 4.49 KiB (9,) (10000,) Blosc(cname='zstd', clevel=7, shuffle=BITSHUFFLE, blocksize=0) None
/variant_DP int8 4.48 KiB 9 bytes 0.002 1 9.0 bytes 4.48 KiB (9,) (10000,) Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0) None
/variant_H2 bool 4.48 KiB 9 bytes 0.002 1 9.0 bytes 4.48 KiB (9,) (10000,) Blosc(cname='zstd', clevel=7, shuffle=BITSHUFFLE, blocksize=0) None
/sample_id object 4.47 KiB 24 bytes 0.0052 1 24.0 bytes 4.47 KiB (3,) (1000,) Blosc(cname='zstd', clevel=7, shuffle=SHUFFLE, blocksize=0) [VLenUTF8()]
/variant_id_mask bool 4.46 KiB 9 bytes 0.002 1 9.0 bytes 4.46 KiB (9,) (10000,) Blosc(cname='zstd', clevel=7, shuffle=BITSHUFFLE, blocksize=0) None
/filter_id object 4.45 KiB 24 bytes 0.0053 1 24.0 bytes 4.45 KiB (3,) (3,) Blosc(cname='zstd', clevel=7, shuffle=SHUFFLE, blocksize=0) [VLenUTF8()]
/contig_id object 4.44 KiB 24 bytes 0.0053 1 24.0 bytes 4.44 KiB (3,) (3,) Blosc(cname='zstd', clevel=7, shuffle=SHUFFLE, blocksize=0) [VLenUTF8()]
The stored and size columns here are important, and tell you how much storage space is being used by the compressed chunks, and the full size of the uncompressed array. The ratio column is the ratio of these two values, with large numbers indicating a high compression ratio. In this example the compression ratios are less than one because the compressed chunks are larger than the actual arrays. This is because it’s a tiny example, with only 9 variants and 3 samples (see the shape column): call_genotype, for example, is only 54 bytes uncompressed but occupies 14.19 KiB on disk, giving the ratio 54 / 14530 ≈ 0.0037 shown in the table.
Medium dataset
Conversion in vcf2zarr is a two-step process. First we convert the VCF(s) to an “intermediate columnar format” (ICF), and then use this as the basis of generating the final Zarr.
Important
We would recommend using this two-step process for all but the smallest datasets.
In the simplest case, we just call two commands instead of one:
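vcf2zarr explode sample.vcf.gz sample.icf
vcf2zarr encode sample.icf sample.vcz
(As above, sample.vcf.gz is the example input; the intermediate ICF is written to sample.icf, which we inspect below.)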
Tip
Both the explode and encode commands allow you to perform the conversion in parallel using the -p/--worker-processes option.
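For example, to run the explode step with eight worker processes (an arbitrary choice — match it to your available cores):
vcf2zarr explode sample.vcf.gz sample.icf -p 8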
The ICF form gives us useful information about the VCF data, and can help us to decide some of the finer details of the final Zarr encoding. We can also run the vcf2zarr inspect command on an ICF:
vcf2zarr inspect sample.icf
name type chunks size compressed max_n min_val max_val
--------- ------- -------- ---------- ------------ ------- --------- ---------
CHROM String 3 1008 bytes 543 bytes 1 n/a n/a
POS Integer 3 1008 bytes 573 bytes 1 10 1.2e+06
QUAL Float 3 912 bytes 576 bytes 1 3 67
ID String 3 528 bytes 432 bytes 1 n/a n/a
FILTERS String 3 1.07 KiB 576 bytes 1 n/a n/a
REF String 3 1008 bytes 550 bytes 1 n/a n/a
ALT String 3 1.07 KiB 582 bytes 3 n/a n/a
rlen Integer 3 1008 bytes 547 bytes 1 1 2
INFO/NS Integer 3 624 bytes 259 bytes 1 2 3
INFO/AN Integer 3 240 bytes 247 bytes 1 6 6
INFO/AC Integer 3 256 bytes 243 bytes 2 1 3
INFO/DP Integer 3 624 bytes 267 bytes 1 9 14
INFO/AF Float 3 448 bytes 280 bytes 2 0.017 0.67
INFO/AA String 3 444 bytes 253 bytes 1 n/a n/a
INFO/DB Flag 3 336 bytes 251 bytes 1 1 1
INFO/H2 Flag 3 240 bytes 240 bytes 1 1 1
FORMAT/GT Integer 3 1.28 KiB 597 bytes 3 -2 2
FORMAT/GQ Integer 3 764 bytes 291 bytes 1 2 61
FORMAT/DP Integer 3 764 bytes 282 bytes 1 0 8
FORMAT/HQ Integer 3 960 bytes 457 bytes 2 2 65
This tells us about the fields found in the input VCF, and provides details about their data type, raw and compressed size, the maximum dimension of a value, and the maximum and minimum values observed in the field. These extreme values are used to determine the width of integer types in the final Zarr encoding. The defaults can be overridden by generating a “storage schema” using the vcf2zarr mkschema command on an ICF:
vcf2zarr mkschema sample.icf > sample.schema.json
head -n 20 sample.schema.json
{
  "format_version": "0.4",
  "samples_chunk_size": 1000,
  "variants_chunk_size": 10000,
  "samples": [
    {
      "id": "NA00001"
    },
    {
      "id": "NA00002"
    },
    {
      "id": "NA00003"
    }
  ],
  "contigs": [
    {
      "id": "19",
      "length": null
    },
We’ve displayed the first 20 lines here so you can get a feel for the JSON format. The jq tool provides a useful way of manipulating these schemas. Let’s look at the schema for just the call_genotype field, for example:
jq '.fields[] | select(.name == "call_genotype")' < sample.schema.json
{
  "name": "call_genotype",
  "dtype": "i1",
  "shape": [
    9,
    3,
    2
  ],
  "chunks": [
    10000,
    1000,
    2
  ],
  "dimensions": [
    "variants",
    "samples",
    "ploidy"
  ],
  "description": "",
  "vcf_field": null,
  "compressor": {
    "id": "blosc",
    "cname": "zstd",
    "clevel": 7,
    "shuffle": 2,
    "blocksize": 0
  },
  "filters": []
}
If we wanted to trade off file size for better decoding performance, we could turn off the shuffle option here (which is set to 2, for Blosc BitShuffle). This can be done by loading the sample.schema.json file into your favourite editor and changing the value “2” to “0”.
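The same edit can also be scripted. As a sketch using jq (the output name sample_noshuffle.schema.json is just an example):
vcf2zarr mkschema sample.icf \
| jq '(.fields[] | select(.name == "call_genotype") | .compressor.shuffle) = 0' \
> sample_noshuffle.schema.json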
Todo
There is a lot going on here. We would want to explain the numbers provided for shuffle, and provide links to the Blosc documentation. Better mechanisms for updating schemas via a Python API would be a useful addition here.
A common thing we might want to do is to drop some fields entirely from the final Zarr, either because they are too big and unwieldy in the standard encoding (e.g. call_PL) or because we just don’t want them. Suppose we wanted to drop the FORMAT/HQ (call_HQ) field. This can be done using jq like:
vcf2zarr mkschema sample.icf \
| jq 'del(.fields[] | select(.name == "call_HQ"))' > sample_noHQ.schema.json
Then we can use the updated schema as input to encode:
vcf2zarr encode sample.icf -Qs sample_noHQ.schema.json sample_noHQ.vcz
Tip
Use the -Q/--no-progress flag to suppress progress bars.
We can then inspect to see that there is no call_HQ array in the output:
vcf2zarr inspect sample_noHQ.vcz
name dtype stored size ratio nchunks chunk_size avg_chunk_stored shape chunk_shape compressor filters
--------------------- ------- --------- --------- ------- --------- ------------ ------------------ --------- ---------------- -------------------------------------------------------------- ------------
/call_genotype int8 14.19 KiB 54 bytes 0.0037 1 54.0 bytes 14.19 KiB (9, 3, 2) (10000, 1000, 2) Blosc(cname='zstd', clevel=7, shuffle=BITSHUFFLE, blocksize=0) None
/call_genotype_mask bool 14.12 KiB 54 bytes 0.0037 1 54.0 bytes 14.12 KiB (9, 3, 2) (10000, 1000, 2) Blosc(cname='zstd', clevel=7, shuffle=BITSHUFFLE, blocksize=0) None
/call_GQ int8 9.34 KiB 27 bytes 0.0028 1 27.0 bytes 9.34 KiB (9, 3) (10000, 1000) Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0) None
/call_DP int8 9.33 KiB 27 bytes 0.0028 1 27.0 bytes 9.33 KiB (9, 3) (10000, 1000) Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0) None
/call_genotype_phased bool 9.3 KiB 27 bytes 0.0028 1 27.0 bytes 9.3 KiB (9, 3) (10000, 1000) Blosc(cname='zstd', clevel=7, shuffle=BITSHUFFLE, blocksize=0) None
/variant_allele object 8.66 KiB 288 bytes 0.032 1 288.0 bytes 8.66 KiB (9, 4) (10000, 4) Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0) [VLenUTF8()]
/variant_AC int8 8.58 KiB 18 bytes 0.002 1 18.0 bytes 8.58 KiB (9, 2) (10000, 2) Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0) None
/variant_AF float32 8.55 KiB 72 bytes 0.0082 1 72.0 bytes 8.55 KiB (9, 2) (10000, 2) Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0) None
/variant_filter bool 8.53 KiB 27 bytes 0.0031 1 27.0 bytes 8.53 KiB (9, 3) (10000, 3) Blosc(cname='zstd', clevel=7, shuffle=BITSHUFFLE, blocksize=0) None
/variant_id object 4.6 KiB 72 bytes 0.015 1 72.0 bytes 4.6 KiB (9,) (10000,) Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0) [VLenUTF8()]
/variant_contig int8 4.57 KiB 9 bytes 0.0019 1 9.0 bytes 4.57 KiB (9,) (10000,) Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0) None
/variant_AA object 4.54 KiB 72 bytes 0.015 1 72.0 bytes 4.54 KiB (9,) (10000,) Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0) [VLenUTF8()]
/variant_quality float32 4.52 KiB 36 bytes 0.0078 1 36.0 bytes 4.52 KiB (9,) (10000,) Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0) None
/variant_position int32 4.51 KiB 36 bytes 0.0078 1 36.0 bytes 4.51 KiB (9,) (10000,) Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0) None
/variant_AN int8 4.51 KiB 9 bytes 0.002 1 9.0 bytes 4.51 KiB (9,) (10000,) Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0) None
/variant_length int8 4.5 KiB 9 bytes 0.002 1 9.0 bytes 4.5 KiB (9,) (10000,) Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0) None
/variant_NS int8 4.49 KiB 9 bytes 0.002 1 9.0 bytes 4.49 KiB (9,) (10000,) Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0) None
/variant_DB bool 4.49 KiB 9 bytes 0.002 1 9.0 bytes 4.49 KiB (9,) (10000,) Blosc(cname='zstd', clevel=7, shuffle=BITSHUFFLE, blocksize=0) None
/variant_DP int8 4.48 KiB 9 bytes 0.002 1 9.0 bytes 4.48 KiB (9,) (10000,) Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0) None
/variant_H2 bool 4.48 KiB 9 bytes 0.002 1 9.0 bytes 4.48 KiB (9,) (10000,) Blosc(cname='zstd', clevel=7, shuffle=BITSHUFFLE, blocksize=0) None
/sample_id object 4.47 KiB 24 bytes 0.0052 1 24.0 bytes 4.47 KiB (3,) (1000,) Blosc(cname='zstd', clevel=7, shuffle=SHUFFLE, blocksize=0) [VLenUTF8()]
/variant_id_mask bool 4.46 KiB 9 bytes 0.002 1 9.0 bytes 4.46 KiB (9,) (10000,) Blosc(cname='zstd', clevel=7, shuffle=BITSHUFFLE, blocksize=0) None
/filter_id object 4.45 KiB 24 bytes 0.0053 1 24.0 bytes 4.45 KiB (3,) (3,) Blosc(cname='zstd', clevel=7, shuffle=SHUFFLE, blocksize=0) [VLenUTF8()]
/contig_id object 4.44 KiB 24 bytes 0.0053 1 24.0 bytes 4.44 KiB (3,) (3,) Blosc(cname='zstd', clevel=7, shuffle=SHUFFLE, blocksize=0) [VLenUTF8()]
Tip
Use the max-variants-chunks option to encode the first few chunks of your dataset while doing these kinds of schema tuning operations!
Large dataset
The explode and encode commands have powerful features for conversion on a single machine, and can take full advantage of large servers with many cores. Current biobank-scale datasets, however, are so large that we must go a step further and distribute computations over a cluster. Vcf2zarr provides some low-level utilities that allow you to do this, which should be compatible with any cluster scheduler.
The distributed commands are split into three phases:
init <num_partitions>: Initialise the computation, setting up the data structures needed for the bulk computation to be split into num_partitions independent partitions.
partition <j>: Perform the computation of partition j.
finalise: Complete the full process.
When performing large-scale computations like this on a cluster, errors and job failures are essentially inevitable, and the commands are resilient to various failure modes.
Let’s go through the example above using the distributed commands. First, we run dexplode-init to create an ICF directory:
vcf2zarr dexplode-init sample.vcf.gz sample-dist.icf -n 5 -Q
num_partitions 3
num_samples 3
num_variants 9
Here we asked dexplode-init to set up an ICF store in which the data is split into 5 partitions. The number of partitions determines the level of parallelism, so we would usually set this to the number of parallel jobs we would like to use. The output of dexplode-init is important though, as it tells us the actual number of partitions that we have (partitioning is based on the VCF indexes, which have a limited granularity). You should be careful to use this value in your scripts (the format is designed to be machine readable using e.g. cut and grep). In this case there are only 3 possible partitions.
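For example, a minimal shell sketch for capturing this value in a script (assuming the two-column output shown above; awk is used instead of cut to tolerate variable whitespace):
NUM_PARTITIONS=$(vcf2zarr dexplode-init sample.vcf.gz sample-dist.icf -n 5 -Q \
| grep num_partitions | awk '{print $2}')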
Once dexplode-init is done and we know how many partitions we have, we need to call dexplode-partition this number of times:
vcf2zarr dexplode-partition sample-dist.icf 0
vcf2zarr dexplode-partition sample-dist.icf 1
vcf2zarr dexplode-partition sample-dist.icf 2
This is not how it would be done in practice, of course: you would use your cluster scheduler of choice to dispatch these operations.
Todo
Document how to do this conveniently over some popular schedulers.
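As a first sketch in that direction, dispatching the three partitions as a SLURM array job might look like this (adapt the array range and paths to your own run; other schedulers have equivalent array-job mechanisms):
#!/bin/bash
#SBATCH --array=0-2
vcf2zarr dexplode-partition sample-dist.icf $SLURM_ARRAY_TASK_ID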
Tip
Use the --one-based argument in cases in which it’s more convenient to index the partitions from 1 to n, rather than 0 to n - 1.
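For example, assuming --one-based is passed to dexplode-partition itself, the first partition would then be:
vcf2zarr dexplode-partition sample-dist.icf 1 --one-based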
Finally we need to call dexplode-finalise:
vcf2zarr dexplode-finalise sample-dist.icf
Todo
Document the process for dencode, noting the information output about memory requirements.
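In the meantime, the distributed encode follows the same three-phase pattern described above. A rough sketch by analogy with dexplode (the -n option and exact arguments are assumptions mirrored from dexplode-init; check the CLI Reference, and note the memory information that the init step reports):
vcf2zarr dencode-init sample-dist.icf sample-dist.vcz -n 5
vcf2zarr dencode-partition sample-dist.vcz 0
# ...one dencode-partition call per partition reported by dencode-init...
vcf2zarr dencode-finalise sample-dist.vcz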