Tutorial#

This is a step-by-step tutorial showing you how to convert your VCF data into Zarr format. There are three different ways to convert your data, providing different levels of convenience and flexibility corresponding to what you might need for small, intermediate and large datasets.

Warning

The documentation of vcf2zarr is under development, and some parts are more polished than others. This “tutorial” is experimental and will likely evolve into a slightly different format in the near future; it is a work in progress and incomplete. The CLI Reference, however, should be complete and authoritative.

Small dataset#

The simplest way to convert VCF data to Zarr is to use the vcf2zarr convert command:

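For example (the input filename sample.vcf.gz is an assumption here; substitute your own compressed, indexed VCF):

```shell
vcf2zarr convert sample.vcf.gz sample.vcz
```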

This converts the input VCF file to the output sample.vcz Zarr data in a single pass. The Zarr dataset is stored in the file system, and you can have a look around using the usual tools:

ls sample.vcz
call_DP               filter_id   variant_DP      variant_id_mask
call_GQ               sample_id   variant_H2      variant_length
call_HQ               variant_AA  variant_NS      variant_position
call_genotype         variant_AC  variant_allele  variant_quality
call_genotype_mask    variant_AF  variant_contig
call_genotype_phased  variant_AN  variant_filter
contig_id             variant_DB  variant_id

Each of these directories corresponds to a Zarr array, i.e. one of the fields in your VCF. The chunk data for each array is then stored hierarchically within its directory:

tree sample.vcz
sample.vcz
├── call_DP
│   └── 0
│       └── 0
├── call_GQ
│   └── 0
│       └── 0
├── call_HQ
│   └── 0
│       └── 0
│           └── 0
├── call_genotype
│   └── 0
│       └── 0
│           └── 0
├── call_genotype_mask
│   └── 0
│       └── 0
│           └── 0
├── call_genotype_phased
│   └── 0
│       └── 0
├── contig_id
│   └── 0
├── filter_id
│   └── 0
├── sample_id
│   └── 0
├── variant_AA
│   └── 0
├── variant_AC
│   └── 0
│       └── 0
├── variant_AF
│   └── 0
│       └── 0
├── variant_AN
│   └── 0
├── variant_DB
│   └── 0
├── variant_DP
│   └── 0
├── variant_H2
│   └── 0
├── variant_NS
│   └── 0
├── variant_allele
│   └── 0
│       └── 0
├── variant_contig
│   └── 0
├── variant_filter
│   └── 0
│       └── 0
├── variant_id
│   └── 0
├── variant_id_mask
│   └── 0
├── variant_length
│   └── 0
├── variant_position
│   └── 0
└── variant_quality
    └── 0
38 directories, 25 files

You can get a better idea of what’s being stored and the sizes of the different fields using the vcf2zarr inspect command:

vcf2zarr inspect sample.vcz
name                   dtype    stored     size         ratio    nchunks  chunk_size    avg_chunk_stored    shape      chunk_shape       compressor                                                      filters
---------------------  -------  ---------  ---------  -------  ---------  ------------  ------------------  ---------  ----------------  --------------------------------------------------------------  ------------
/call_genotype         int8     14.19 KiB  54 bytes    0.0037          1  54.0 bytes    14.19 KiB           (9, 3, 2)  (10000, 1000, 2)  Blosc(cname='zstd', clevel=7, shuffle=BITSHUFFLE, blocksize=0)  None
/call_HQ               int8     14.18 KiB  54 bytes    0.0037          1  54.0 bytes    14.18 KiB           (9, 3, 2)  (10000, 1000, 2)  Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/call_genotype_mask    bool     14.12 KiB  54 bytes    0.0037          1  54.0 bytes    14.12 KiB           (9, 3, 2)  (10000, 1000, 2)  Blosc(cname='zstd', clevel=7, shuffle=BITSHUFFLE, blocksize=0)  None
/call_GQ               int8     9.34 KiB   27 bytes    0.0028          1  27.0 bytes    9.34 KiB            (9, 3)     (10000, 1000)     Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/call_DP               int8     9.33 KiB   27 bytes    0.0028          1  27.0 bytes    9.33 KiB            (9, 3)     (10000, 1000)     Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/call_genotype_phased  bool     9.3 KiB    27 bytes    0.0028          1  27.0 bytes    9.3 KiB             (9, 3)     (10000, 1000)     Blosc(cname='zstd', clevel=7, shuffle=BITSHUFFLE, blocksize=0)  None
/variant_allele        object   8.66 KiB   288 bytes   0.032           1  288.0 bytes   8.66 KiB            (9, 4)     (10000, 4)        Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   [VLenUTF8()]
/variant_AC            int8     8.58 KiB   18 bytes    0.002           1  18.0 bytes    8.58 KiB            (9, 2)     (10000, 2)        Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/variant_AF            float32  8.55 KiB   72 bytes    0.0082          1  72.0 bytes    8.55 KiB            (9, 2)     (10000, 2)        Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/variant_filter        bool     8.53 KiB   27 bytes    0.0031          1  27.0 bytes    8.53 KiB            (9, 3)     (10000, 3)        Blosc(cname='zstd', clevel=7, shuffle=BITSHUFFLE, blocksize=0)  None
/variant_id            object   4.6 KiB    72 bytes    0.015           1  72.0 bytes    4.6 KiB             (9,)       (10000,)          Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   [VLenUTF8()]
/variant_contig        int8     4.57 KiB   9 bytes     0.0019          1  9.0 bytes     4.57 KiB            (9,)       (10000,)          Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/variant_AA            object   4.54 KiB   72 bytes    0.015           1  72.0 bytes    4.54 KiB            (9,)       (10000,)          Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   [VLenUTF8()]
/variant_quality       float32  4.52 KiB   36 bytes    0.0078          1  36.0 bytes    4.52 KiB            (9,)       (10000,)          Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/variant_position      int32    4.51 KiB   36 bytes    0.0078          1  36.0 bytes    4.51 KiB            (9,)       (10000,)          Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/variant_AN            int8     4.51 KiB   9 bytes     0.002           1  9.0 bytes     4.51 KiB            (9,)       (10000,)          Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/variant_length        int8     4.5 KiB    9 bytes     0.002           1  9.0 bytes     4.5 KiB             (9,)       (10000,)          Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/variant_NS            int8     4.49 KiB   9 bytes     0.002           1  9.0 bytes     4.49 KiB            (9,)       (10000,)          Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/variant_DB            bool     4.49 KiB   9 bytes     0.002           1  9.0 bytes     4.49 KiB            (9,)       (10000,)          Blosc(cname='zstd', clevel=7, shuffle=BITSHUFFLE, blocksize=0)  None
/variant_DP            int8     4.48 KiB   9 bytes     0.002           1  9.0 bytes     4.48 KiB            (9,)       (10000,)          Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/variant_H2            bool     4.48 KiB   9 bytes     0.002           1  9.0 bytes     4.48 KiB            (9,)       (10000,)          Blosc(cname='zstd', clevel=7, shuffle=BITSHUFFLE, blocksize=0)  None
/sample_id             object   4.47 KiB   24 bytes    0.0052          1  24.0 bytes    4.47 KiB            (3,)       (1000,)           Blosc(cname='zstd', clevel=7, shuffle=SHUFFLE, blocksize=0)     [VLenUTF8()]
/variant_id_mask       bool     4.46 KiB   9 bytes     0.002           1  9.0 bytes     4.46 KiB            (9,)       (10000,)          Blosc(cname='zstd', clevel=7, shuffle=BITSHUFFLE, blocksize=0)  None
/filter_id             object   4.45 KiB   24 bytes    0.0053          1  24.0 bytes    4.45 KiB            (3,)       (3,)              Blosc(cname='zstd', clevel=7, shuffle=SHUFFLE, blocksize=0)     [VLenUTF8()]
/contig_id             object   4.44 KiB   24 bytes    0.0053          1  24.0 bytes    4.44 KiB            (3,)       (3,)              Blosc(cname='zstd', clevel=7, shuffle=SHUFFLE, blocksize=0)     [VLenUTF8()]

The stored and size columns here are important: they tell you how much storage space is being used by the compressed chunks, and the full size of the uncompressed array. The ratio column is the ratio of these two values, with large numbers indicating a high compression ratio. In this example the compression ratios are less than one because the compressed chunks are larger than the actual arrays. This is because it’s a tiny example, with only 9 variants and 3 samples (see the shape column), so that, for example, call_genotype is only 54 bytes uncompressed.

Medium dataset#

Conversion in vcf2zarr is a two-step process. First we convert the VCF(s) to an “intermediate columnar format” (ICF), and then use this as the basis for generating the final Zarr.

Important

We would recommend using this two-step process for all but the smallest datasets.

In the simplest case, we just call two commands instead of one:
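For example (again assuming an input file named sample.vcf.gz):

```shell
vcf2zarr explode sample.vcf.gz sample.icf
vcf2zarr encode sample.icf sample.vcz
```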

Tip

Both the explode and encode commands allow you to perform the conversion in parallel using the -p/--worker-processes option.

The ICF form gives us useful information about the VCF data, and can help us decide some of the finer details of the final Zarr encoding. We can also run the vcf2zarr inspect command on an ICF:

vcf2zarr inspect sample.icf
name       type       chunks  size        compressed      max_n  min_val    max_val
---------  -------  --------  ----------  ------------  -------  ---------  ---------
CHROM      String          3  1008 bytes  543 bytes           1  n/a        n/a
POS        Integer         3  1008 bytes  573 bytes           1  10         1.2e+06
QUAL       Float           3  912 bytes   576 bytes           1  3          67
ID         String          3  528 bytes   432 bytes           1  n/a        n/a
FILTERS    String          3  1.07 KiB    576 bytes           1  n/a        n/a
REF        String          3  1008 bytes  550 bytes           1  n/a        n/a
ALT        String          3  1.07 KiB    582 bytes           3  n/a        n/a
rlen       Integer         3  1008 bytes  547 bytes           1  1          2
INFO/NS    Integer         3  624 bytes   259 bytes           1  2          3
INFO/AN    Integer         3  240 bytes   247 bytes           1  6          6
INFO/AC    Integer         3  256 bytes   243 bytes           2  1          3
INFO/DP    Integer         3  624 bytes   267 bytes           1  9          14
INFO/AF    Float           3  448 bytes   280 bytes           2  0.017      0.67
INFO/AA    String          3  444 bytes   253 bytes           1  n/a        n/a
INFO/DB    Flag            3  336 bytes   251 bytes           1  1          1
INFO/H2    Flag            3  240 bytes   240 bytes           1  1          1
FORMAT/GT  Integer         3  1.28 KiB    597 bytes           3  -2         2
FORMAT/GQ  Integer         3  764 bytes   291 bytes           1  2          61
FORMAT/DP  Integer         3  764 bytes   282 bytes           1  0          8
FORMAT/HQ  Integer         3  960 bytes   457 bytes           2  2          65

This tells us about the fields found in the input VCF, and provides details about their data type, raw and compressed size, the maximum number of values per record (the max_n column) and the maximum and minimum values observed in the field. These extreme values are used to determine the width of integer types in the final Zarr encoding. The defaults can be overridden by generating a “storage schema” using the vcf2zarr mkschema command on an ICF:

vcf2zarr mkschema sample.icf > sample.schema.json
head -n 20 sample.schema.json
{
"format_version": "0.4",
"samples_chunk_size": 1000,
"variants_chunk_size": 10000,
"samples": [
{
"id": "NA00001"
},
{
"id": "NA00002"
},
{
"id": "NA00003"
}
],
"contigs": [
{
"id": "19",
"length": null
},

We’ve displayed the first 20 lines here so you can get a feel for the JSON format. The jq tool provides a useful way of manipulating these schemas. Let’s look at the schema for just the call_genotype field, for example:

jq '.fields[] | select(.name == "call_genotype")' < sample.schema.json
{
"name": "call_genotype",
"dtype": "i1",
"shape": [
9,
3,
2
],
"chunks": [
10000,
1000,
2
],
"dimensions": [
"variants",
"samples",
"ploidy"
],
"description": "",
"vcf_field": null,
"compressor": {
"id": "blosc",
"cname": "zstd",
"clevel": 7,
"shuffle": 2,
"blocksize": 0
},
"filters": []
}

If we wanted to trade off file size for better decoding performance, we could turn off the shuffle option here (which is set to 2, for Blosc BitShuffle). This can be done by loading the sample.schema.json file into your favourite editor and changing the value “2” to “0”.
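If you prefer to script the change rather than hand-editing, the same tweak can be made with Python’s json module. This is a sketch against the schema structure shown above: only the relevant keys are included inline, and in practice you would json.load() the full sample.schema.json and json.dump() the modified schema back out.

```python
import json

# A minimal schema fragment mirroring the jq output above; a real
# sample.schema.json contains every field and is read with json.load().
schema = {
    "fields": [
        {
            "name": "call_genotype",
            "compressor": {
                "id": "blosc",
                "cname": "zstd",
                "clevel": 7,
                "shuffle": 2,  # 2 = BITSHUFFLE
                "blocksize": 0,
            },
        }
    ]
}

# Switch shuffling off (0 = NOSHUFFLE) for the call_genotype field only.
for field in schema["fields"]:
    if field["name"] == "call_genotype":
        field["compressor"]["shuffle"] = 0

print(json.dumps(schema, indent=2))
```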

Todo

There is a lot going on here. It would be good to give some specific instructions for how to do this using jq. We would also want to explain the numbers provided for shuffle, as well as links to the Blosc documentation. Better mechanisms for updating schemas via a Python API would be a useful addition here.

A common thing we might want to do is drop some fields entirely from the final Zarr, either because they are too big and unwieldy in the standard encoding (e.g. call_PL), or because we just don’t want them.

Suppose here we wanted to drop the FORMAT/HQ (call_HQ) field. This can be done using jq like this:

vcf2zarr mkschema sample.icf \
| jq 'del(.fields[] | select(.name == "call_HQ"))' > sample_noHQ.schema.json 

Then we can use the updated schema as input to encode:

vcf2zarr encode sample.icf -Qs sample_noHQ.schema.json sample_noHQ.vcz

Tip

Use the -Q/--no-progress flag to suppress progress bars.

We can then inspect to see that there is no call_HQ array in the output:

vcf2zarr inspect sample_noHQ.vcz
name                   dtype    stored     size         ratio    nchunks  chunk_size    avg_chunk_stored    shape      chunk_shape       compressor                                                      filters
---------------------  -------  ---------  ---------  -------  ---------  ------------  ------------------  ---------  ----------------  --------------------------------------------------------------  ------------
/call_genotype         int8     14.19 KiB  54 bytes    0.0037          1  54.0 bytes    14.19 KiB           (9, 3, 2)  (10000, 1000, 2)  Blosc(cname='zstd', clevel=7, shuffle=BITSHUFFLE, blocksize=0)  None
/call_genotype_mask    bool     14.12 KiB  54 bytes    0.0037          1  54.0 bytes    14.12 KiB           (9, 3, 2)  (10000, 1000, 2)  Blosc(cname='zstd', clevel=7, shuffle=BITSHUFFLE, blocksize=0)  None
/call_GQ               int8     9.34 KiB   27 bytes    0.0028          1  27.0 bytes    9.34 KiB            (9, 3)     (10000, 1000)     Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/call_DP               int8     9.33 KiB   27 bytes    0.0028          1  27.0 bytes    9.33 KiB            (9, 3)     (10000, 1000)     Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/call_genotype_phased  bool     9.3 KiB    27 bytes    0.0028          1  27.0 bytes    9.3 KiB             (9, 3)     (10000, 1000)     Blosc(cname='zstd', clevel=7, shuffle=BITSHUFFLE, blocksize=0)  None
/variant_allele        object   8.66 KiB   288 bytes   0.032           1  288.0 bytes   8.66 KiB            (9, 4)     (10000, 4)        Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   [VLenUTF8()]
/variant_AC            int8     8.58 KiB   18 bytes    0.002           1  18.0 bytes    8.58 KiB            (9, 2)     (10000, 2)        Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/variant_AF            float32  8.55 KiB   72 bytes    0.0082          1  72.0 bytes    8.55 KiB            (9, 2)     (10000, 2)        Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/variant_filter        bool     8.53 KiB   27 bytes    0.0031          1  27.0 bytes    8.53 KiB            (9, 3)     (10000, 3)        Blosc(cname='zstd', clevel=7, shuffle=BITSHUFFLE, blocksize=0)  None
/variant_id            object   4.6 KiB    72 bytes    0.015           1  72.0 bytes    4.6 KiB             (9,)       (10000,)          Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   [VLenUTF8()]
/variant_contig        int8     4.57 KiB   9 bytes     0.0019          1  9.0 bytes     4.57 KiB            (9,)       (10000,)          Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/variant_AA            object   4.54 KiB   72 bytes    0.015           1  72.0 bytes    4.54 KiB            (9,)       (10000,)          Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   [VLenUTF8()]
/variant_quality       float32  4.52 KiB   36 bytes    0.0078          1  36.0 bytes    4.52 KiB            (9,)       (10000,)          Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/variant_position      int32    4.51 KiB   36 bytes    0.0078          1  36.0 bytes    4.51 KiB            (9,)       (10000,)          Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/variant_AN            int8     4.51 KiB   9 bytes     0.002           1  9.0 bytes     4.51 KiB            (9,)       (10000,)          Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/variant_length        int8     4.5 KiB    9 bytes     0.002           1  9.0 bytes     4.5 KiB             (9,)       (10000,)          Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/variant_NS            int8     4.49 KiB   9 bytes     0.002           1  9.0 bytes     4.49 KiB            (9,)       (10000,)          Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/variant_DB            bool     4.49 KiB   9 bytes     0.002           1  9.0 bytes     4.49 KiB            (9,)       (10000,)          Blosc(cname='zstd', clevel=7, shuffle=BITSHUFFLE, blocksize=0)  None
/variant_DP            int8     4.48 KiB   9 bytes     0.002           1  9.0 bytes     4.48 KiB            (9,)       (10000,)          Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/variant_H2            bool     4.48 KiB   9 bytes     0.002           1  9.0 bytes     4.48 KiB            (9,)       (10000,)          Blosc(cname='zstd', clevel=7, shuffle=BITSHUFFLE, blocksize=0)  None
/sample_id             object   4.47 KiB   24 bytes    0.0052          1  24.0 bytes    4.47 KiB            (3,)       (1000,)           Blosc(cname='zstd', clevel=7, shuffle=SHUFFLE, blocksize=0)     [VLenUTF8()]
/variant_id_mask       bool     4.46 KiB   9 bytes     0.002           1  9.0 bytes     4.46 KiB            (9,)       (10000,)          Blosc(cname='zstd', clevel=7, shuffle=BITSHUFFLE, blocksize=0)  None
/filter_id             object   4.45 KiB   24 bytes    0.0053          1  24.0 bytes    4.45 KiB            (3,)       (3,)              Blosc(cname='zstd', clevel=7, shuffle=SHUFFLE, blocksize=0)     [VLenUTF8()]
/contig_id             object   4.44 KiB   24 bytes    0.0053          1  24.0 bytes    4.44 KiB            (3,)       (3,)              Blosc(cname='zstd', clevel=7, shuffle=SHUFFLE, blocksize=0)     [VLenUTF8()]

Tip

Use the max-variant-chunks option to encode only the first few chunks of your dataset while doing these kinds of schema tuning operations!

Large dataset#

The explode and encode commands have powerful features for conversion on a single machine, and can take full advantage of large servers with many cores. Current biobank-scale datasets, however, are so large that we must go a step further and distribute computations over a cluster. Vcf2zarr provides some low-level utilities for doing this, which should be compatible with any cluster scheduler.

The distributed commands are split into three phases:

  • init <num_partitions>: initialise the computation, setting up the data structures needed for the bulk computation to be split into num_partitions independent partitions

  • partition <j>: perform the computation for partition j

  • finalise: complete the full process

When performing large-scale computations like this on a cluster, errors and job failures are essentially inevitable, and the commands are resilient to various failure modes.

Let’s go through the example above using the distributed commands. First, we run dexplode-init to create an ICF directory:

vcf2zarr dexplode-init sample.vcf.gz sample-dist.icf -n 5 -Q
num_partitions  3
num_samples     3
num_variants    9

Here we asked dexplode-init to set up an ICF store in which the data is split into 5 partitions. The number of partitions determines the level of parallelism, so we would usually set it to the number of parallel jobs we would like to use. The output of dexplode-init is important, though, as it tells us the actual number of partitions we got (partitioning is based on the VCF indexes, which have limited granularity). You should be careful to use this value in your scripts (the format is designed to be machine readable using e.g. cut and grep). In this case there are only 3 possible partitions.
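Since the summary is machine readable, a script can capture the real partition count rather than assuming it. A minimal sketch, parsing a saved copy of the output above (in a real script you would pipe the stdout of dexplode-init straight into awk):

```shell
# Output captured from the dexplode-init run above; normally you would run
#   vcf2zarr dexplode-init sample.vcf.gz sample-dist.icf -n 5 -Q
# and capture its stdout directly.
summary="num_partitions  3
num_samples     3
num_variants    9"

# Pick out the actual number of partitions from the two-column layout.
num_partitions=$(echo "$summary" | awk '$1 == "num_partitions" {print $2}')
echo "$num_partitions"
```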

Once dexplode-init is done and we know how many partitions we have, we need to call dexplode-partition this number of times:

vcf2zarr dexplode-partition sample-dist.icf 0
vcf2zarr dexplode-partition sample-dist.icf 1
vcf2zarr dexplode-partition sample-dist.icf 2

This is not how it would be done in practice, of course: you would use your cluster scheduler of choice to dispatch these operations.
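As a sketch of what a scheduler-driven version might look like, here is a loop that generates one submission per partition. Note that submit_job is a hypothetical placeholder for your scheduler’s submission command (sbatch, qsub, and so on), not part of vcf2zarr; here it just echoes what would be run.

```shell
# Hypothetical stand-in for a real scheduler submission command.
submit_job() {
    echo "submitting: $*"
}

num_partitions=3  # taken from the dexplode-init output above
j=0
while [ "$j" -lt "$num_partitions" ]; do
    submit_job vcf2zarr dexplode-partition sample-dist.icf "$j"
    j=$((j + 1))
done
```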

Todo

Document how to do this conveniently over some popular schedulers.

Tip

Use the --one-based argument in cases where it’s more convenient to index the partitions from 1 to n, rather than 0 to n - 1.

Finally we need to call dexplode-finalise:

vcf2zarr dexplode-finalise sample-dist.icf

Todo

Document the process for dencode, noting the information output about memory requirements.