Tutorial#

This is a step-by-step tutorial showing you how to convert your VCF data into Zarr format. There are three different ways to convert your data, providing different levels of convenience and flexibility corresponding to what you might need for small, intermediate and large datasets.

Warning

The documentation of vcf2zarr is under development, and some parts are more polished than others. This “tutorial” is experimental and will likely evolve into a slightly different format in the near future; it is a work in progress and incomplete. The CLI Reference, however, should be complete and authoritative.

Small dataset#

The simplest way to convert VCF data to Zarr is to use the vcf2zarr convert command:

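For example (the input filename sample.vcf.gz is an assumption here; substitute your own compressed, indexed VCF):

```shell
vcf2zarr convert sample.vcf.gz sample.vcz
```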

This converts the input VCF file to the output sample.vcz Zarr data in a single pass. The Zarr dataset is stored in the file system, and you can have a look around using the usual tools:

ls sample.vcz
call_DP               filter_id   variant_DP      variant_id_mask
call_GQ               sample_id   variant_H2      variant_length
call_HQ               variant_AA  variant_NS      variant_position
call_genotype         variant_AC  variant_allele  variant_quality
call_genotype_mask    variant_AF  variant_contig
call_genotype_phased  variant_AN  variant_filter
contig_id             variant_DB  variant_id

Each of these directories corresponds to a Zarr array, i.e. one of the fields in your VCF. The chunk data for each array is then stored hierarchically within its directory:

tree sample.vcz
sample.vcz
├── call_DP
│   └── 0
│       └── 0
├── call_GQ
│   └── 0
│       └── 0
├── call_HQ
│   └── 0
│       └── 0
│           └── 0
├── call_genotype
│   └── 0
│       └── 0
│           └── 0
├── call_genotype_mask
│   └── 0
│       └── 0
│           └── 0
├── call_genotype_phased
│   └── 0
│       └── 0
├── contig_id
│   └── 0
├── filter_id
│   └── 0
├── sample_id
│   └── 0
├── variant_AA
│   └── 0
├── variant_AC
│   └── 0
│       └── 0
├── variant_AF
│   └── 0
│       └── 0
├── variant_AN
│   └── 0
├── variant_DB
│   └── 0
├── variant_DP
│   └── 0
├── variant_H2
│   └── 0
├── variant_NS
│   └── 0
├── variant_allele
│   └── 0
│       └── 0
├── variant_contig
│   └── 0
├── variant_filter
│   └── 0
│       └── 0
├── variant_id
│   └── 0
├── variant_id_mask
│   └── 0
├── variant_length
│   └── 0
├── variant_position
│   └── 0
└── variant_quality
    └── 0
38 directories, 25 files

You can get a better idea of what’s being stored and the sizes of the different fields using the vcf2zarr inspect command:

vcf2zarr inspect sample.vcz
name                   dtype    stored     size         ratio    nchunks  chunk_size    avg_chunk_stored    shape      chunk_shape       compressor                                                      filters
---------------------  -------  ---------  ---------  -------  ---------  ------------  ------------------  ---------  ----------------  --------------------------------------------------------------  ------------
/call_genotype         int8     14.19 KiB  54 bytes    0.0037          1  54.0 bytes    14.19 KiB           (9, 3, 2)  (10000, 1000, 2)  Blosc(cname='zstd', clevel=7, shuffle=BITSHUFFLE, blocksize=0)  None
/call_HQ               int8     14.18 KiB  54 bytes    0.0037          1  54.0 bytes    14.18 KiB           (9, 3, 2)  (10000, 1000, 2)  Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/call_genotype_mask    bool     14.12 KiB  54 bytes    0.0037          1  54.0 bytes    14.12 KiB           (9, 3, 2)  (10000, 1000, 2)  Blosc(cname='zstd', clevel=7, shuffle=BITSHUFFLE, blocksize=0)  None
/call_GQ               int8     9.34 KiB   27 bytes    0.0028          1  27.0 bytes    9.34 KiB            (9, 3)     (10000, 1000)     Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/call_DP               int8     9.33 KiB   27 bytes    0.0028          1  27.0 bytes    9.33 KiB            (9, 3)     (10000, 1000)     Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/call_genotype_phased  bool     9.3 KiB    27 bytes    0.0028          1  27.0 bytes    9.3 KiB             (9, 3)     (10000, 1000)     Blosc(cname='zstd', clevel=7, shuffle=BITSHUFFLE, blocksize=0)  None
/variant_allele        object   8.66 KiB   288 bytes   0.032           1  288.0 bytes   8.66 KiB            (9, 4)     (10000, 4)        Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   [VLenUTF8()]
/variant_AC            int8     8.58 KiB   18 bytes    0.002           1  18.0 bytes    8.58 KiB            (9, 2)     (10000, 2)        Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/variant_AF            float32  8.55 KiB   72 bytes    0.0082          1  72.0 bytes    8.55 KiB            (9, 2)     (10000, 2)        Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/variant_filter        bool     8.53 KiB   27 bytes    0.0031          1  27.0 bytes    8.53 KiB            (9, 3)     (10000, 3)        Blosc(cname='zstd', clevel=7, shuffle=BITSHUFFLE, blocksize=0)  None
/variant_id            object   4.6 KiB    72 bytes    0.015           1  72.0 bytes    4.6 KiB             (9,)       (10000,)          Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   [VLenUTF8()]
/variant_contig        int8     4.57 KiB   9 bytes     0.0019          1  9.0 bytes     4.57 KiB            (9,)       (10000,)          Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/variant_AA            object   4.54 KiB   72 bytes    0.015           1  72.0 bytes    4.54 KiB            (9,)       (10000,)          Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   [VLenUTF8()]
/variant_quality       float32  4.52 KiB   36 bytes    0.0078          1  36.0 bytes    4.52 KiB            (9,)       (10000,)          Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/variant_position      int32    4.51 KiB   36 bytes    0.0078          1  36.0 bytes    4.51 KiB            (9,)       (10000,)          Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/variant_AN            int8     4.51 KiB   9 bytes     0.002           1  9.0 bytes     4.51 KiB            (9,)       (10000,)          Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/variant_length        int8     4.5 KiB    9 bytes     0.002           1  9.0 bytes     4.5 KiB             (9,)       (10000,)          Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/variant_NS            int8     4.49 KiB   9 bytes     0.002           1  9.0 bytes     4.49 KiB            (9,)       (10000,)          Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/variant_DB            bool     4.49 KiB   9 bytes     0.002           1  9.0 bytes     4.49 KiB            (9,)       (10000,)          Blosc(cname='zstd', clevel=7, shuffle=BITSHUFFLE, blocksize=0)  None
/variant_DP            int8     4.48 KiB   9 bytes     0.002           1  9.0 bytes     4.48 KiB            (9,)       (10000,)          Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/variant_H2            bool     4.48 KiB   9 bytes     0.002           1  9.0 bytes     4.48 KiB            (9,)       (10000,)          Blosc(cname='zstd', clevel=7, shuffle=BITSHUFFLE, blocksize=0)  None
/sample_id             object   4.47 KiB   24 bytes    0.0052          1  24.0 bytes    4.47 KiB            (3,)       (1000,)           Blosc(cname='zstd', clevel=7, shuffle=SHUFFLE, blocksize=0)     [VLenUTF8()]
/variant_id_mask       bool     4.46 KiB   9 bytes     0.002           1  9.0 bytes     4.46 KiB            (9,)       (10000,)          Blosc(cname='zstd', clevel=7, shuffle=BITSHUFFLE, blocksize=0)  None
/filter_id             object   4.45 KiB   24 bytes    0.0053          1  24.0 bytes    4.45 KiB            (3,)       (3,)              Blosc(cname='zstd', clevel=7, shuffle=SHUFFLE, blocksize=0)     [VLenUTF8()]
/contig_id             object   4.44 KiB   24 bytes    0.0053          1  24.0 bytes    4.44 KiB            (3,)       (3,)              Blosc(cname='zstd', clevel=7, shuffle=SHUFFLE, blocksize=0)     [VLenUTF8()]

The stored and size columns here are important: they tell you how much storage space is being used by the compressed chunks, and the full size of the uncompressed array. The ratio column is the ratio of these two values, with large numbers indicating a high compression ratio. In this example the compression ratios are less than one because the compressed chunks are larger than the actual arrays. This is because it’s a tiny example, with only 9 variants and 3 samples (see the shape column), so that, for example, call_genotype is only 54 bytes uncompressed.

Medium dataset#

Conversion in vcf2zarr is a two-step process. First we convert the VCF(s) to an “intermediate columnar format” (ICF), and then use this as the basis for generating the final Zarr.

Important

We would recommend using this two-step process for all but the smallest datasets.

In the simplest case, we just call two commands instead of one:
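For example (again assuming an input file named sample.vcf.gz):

```shell
vcf2zarr explode sample.vcf.gz sample.icf
vcf2zarr encode sample.icf sample.vcz
```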

Tip

Both the explode and encode commands allow you to perform the conversion in parallel using the -p/--worker-processes option.

The ICF form gives us useful information about the VCF data, and can help us decide some of the finer details of the final Zarr encoding. We can also run the vcf2zarr inspect command on an ICF:

vcf2zarr inspect sample.icf
name       type       chunks  size        compressed      max_n  min_val    max_val
---------  -------  --------  ----------  ------------  -------  ---------  ---------
CHROM      String          3  1008 bytes  543 bytes           1  n/a        n/a
POS        Integer         3  1008 bytes  573 bytes           1  10         1.2e+06
QUAL       Float           3  912 bytes   576 bytes           1  3          67
ID         String          3  528 bytes   432 bytes           1  n/a        n/a
FILTERS    String          3  1.07 KiB    576 bytes           1  n/a        n/a
REF        String          3  1008 bytes  550 bytes           1  n/a        n/a
ALT        String          3  1.07 KiB    582 bytes           3  n/a        n/a
rlen       Integer         3  1008 bytes  547 bytes           1  1          2
INFO/NS    Integer         3  624 bytes   259 bytes           1  2          3
INFO/AN    Integer         3  240 bytes   247 bytes           1  6          6
INFO/AC    Integer         3  256 bytes   243 bytes           2  1          3
INFO/DP    Integer         3  624 bytes   267 bytes           1  9          14
INFO/AF    Float           3  448 bytes   280 bytes           2  0.017      0.67
INFO/AA    String          3  444 bytes   253 bytes           1  n/a        n/a
INFO/DB    Flag            3  336 bytes   251 bytes           1  1          1
INFO/H2    Flag            3  240 bytes   240 bytes           1  1          1
FORMAT/GT  Integer         3  1.28 KiB    597 bytes           3  -2         2
FORMAT/GQ  Integer         3  764 bytes   291 bytes           1  2          61
FORMAT/DP  Integer         3  764 bytes   282 bytes           1  0          8
FORMAT/HQ  Integer         3  960 bytes   457 bytes           2  2          65

This tells us about the fields found in the input VCF, and provides details about their data type, raw and compressed size, the maximum number of values per record (the max_n column) and the maximum and minimum values observed in the field. These extreme values are used to determine the width of integer types in the final Zarr encoding. The defaults can be overridden by generating a “storage schema” using the vcf2zarr mkschema command on an ICF:

vcf2zarr mkschema sample.icf > sample.schema.json
head -n 20 sample.schema.json
{
"format_version": "0.4",
"samples_chunk_size": 1000,
"variants_chunk_size": 10000,
"samples": [
{
"id": "NA00001"
},
{
"id": "NA00002"
},
{
"id": "NA00003"
}
],
"contigs": [
{
"id": "19",
"length": null
},

We’ve displayed the first 20 lines here so you can get a feel for the JSON format. The jq tool provides a useful way of manipulating these schemas. Let’s look at the schema for just the call_genotype field, for example:

jq '.fields[] | select(.name == "call_genotype")' < sample.schema.json
{
"name": "call_genotype",
"dtype": "i1",
"shape": [
9,
3,
2
],
"chunks": [
10000,
1000,
2
],
"dimensions": [
"variants",
"samples",
"ploidy"
],
"description": "",
"vcf_field": null,
"compressor": {
"id": "blosc",
"cname": "zstd",
"clevel": 7,
"shuffle": 2,
"blocksize": 0
},
"filters": []
}

If we wanted to trade off file size for better decoding performance, we could turn off the shuffle option here (which is set to 2, for Blosc BitShuffle). This can be done by loading the sample.schema.json file into your favourite editor and changing the value “2” to “0”.
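If you prefer to script the change rather than hand-editing, the same tweak can be made with Python’s json module. This is a sketch against the schema structure shown above: only the relevant keys are included inline, and in practice you would json.load() the full sample.schema.json and json.dump() the modified schema back out.

```python
import json

# A minimal schema fragment mirroring the jq output above; a real
# sample.schema.json contains every field and is read with json.load().
schema = {
    "fields": [
        {
            "name": "call_genotype",
            "compressor": {
                "id": "blosc",
                "cname": "zstd",
                "clevel": 7,
                "shuffle": 2,  # 2 = BITSHUFFLE
                "blocksize": 0,
            },
        }
    ]
}

# Switch shuffling off (0 = NOSHUFFLE) for the call_genotype field only.
for field in schema["fields"]:
    if field["name"] == "call_genotype":
        field["compressor"]["shuffle"] = 0

print(json.dumps(schema, indent=2))
```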

Todo

There is a lot going on here. It would be good to give some specific instructions for how to do this using jq. We would also want to explain the numbers provided for shuffle, as well as links to the Blosc documentation. Better mechanisms for updating schemas via a Python API would be a useful addition here.

A common thing we might want to do is drop some fields entirely from the final Zarr, either because they are too big and unwieldy in the standard encoding (e.g. call_PL), or because we just don’t want them.

Suppose here we wanted to drop the FORMAT/HQ (call_HQ) field. This can be done using jq like this:

vcf2zarr mkschema sample.icf \
| jq 'del(.fields[] | select(.name == "call_HQ"))' > sample_noHQ.schema.json 

Then we can use the updated schema as input to encode:

vcf2zarr encode sample.icf -Qs sample_noHQ.schema.json sample_noHQ.vcz

Tip

Use the -Q/--no-progress flag to suppress progress bars.

We can then inspect to see that there is no call_HQ array in the output:

vcf2zarr inspect sample_noHQ.vcz
name                   dtype    stored     size         ratio    nchunks  chunk_size    avg_chunk_stored    shape      chunk_shape       compressor                                                      filters
---------------------  -------  ---------  ---------  -------  ---------  ------------  ------------------  ---------  ----------------  --------------------------------------------------------------  ------------
/call_genotype         int8     14.19 KiB  54 bytes    0.0037          1  54.0 bytes    14.19 KiB           (9, 3, 2)  (10000, 1000, 2)  Blosc(cname='zstd', clevel=7, shuffle=BITSHUFFLE, blocksize=0)  None
/call_genotype_mask    bool     14.12 KiB  54 bytes    0.0037          1  54.0 bytes    14.12 KiB           (9, 3, 2)  (10000, 1000, 2)  Blosc(cname='zstd', clevel=7, shuffle=BITSHUFFLE, blocksize=0)  None
/call_GQ               int8     9.34 KiB   27 bytes    0.0028          1  27.0 bytes    9.34 KiB            (9, 3)     (10000, 1000)     Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/call_DP               int8     9.33 KiB   27 bytes    0.0028          1  27.0 bytes    9.33 KiB            (9, 3)     (10000, 1000)     Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/call_genotype_phased  bool     9.3 KiB    27 bytes    0.0028          1  27.0 bytes    9.3 KiB             (9, 3)     (10000, 1000)     Blosc(cname='zstd', clevel=7, shuffle=BITSHUFFLE, blocksize=0)  None
/variant_allele        object   8.66 KiB   288 bytes   0.032           1  288.0 bytes   8.66 KiB            (9, 4)     (10000, 4)        Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   [VLenUTF8()]
/variant_AC            int8     8.58 KiB   18 bytes    0.002           1  18.0 bytes    8.58 KiB            (9, 2)     (10000, 2)        Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/variant_AF            float32  8.55 KiB   72 bytes    0.0082          1  72.0 bytes    8.55 KiB            (9, 2)     (10000, 2)        Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/variant_filter        bool     8.53 KiB   27 bytes    0.0031          1  27.0 bytes    8.53 KiB            (9, 3)     (10000, 3)        Blosc(cname='zstd', clevel=7, shuffle=BITSHUFFLE, blocksize=0)  None
/variant_id            object   4.6 KiB    72 bytes    0.015           1  72.0 bytes    4.6 KiB             (9,)       (10000,)          Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   [VLenUTF8()]
/variant_contig        int8     4.57 KiB   9 bytes     0.0019          1  9.0 bytes     4.57 KiB            (9,)       (10000,)          Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/variant_AA            object   4.54 KiB   72 bytes    0.015           1  72.0 bytes    4.54 KiB            (9,)       (10000,)          Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   [VLenUTF8()]
/variant_quality       float32  4.52 KiB   36 bytes    0.0078          1  36.0 bytes    4.52 KiB            (9,)       (10000,)          Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/variant_position      int32    4.51 KiB   36 bytes    0.0078          1  36.0 bytes    4.51 KiB            (9,)       (10000,)          Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/variant_AN            int8     4.51 KiB   9 bytes     0.002           1  9.0 bytes     4.51 KiB            (9,)       (10000,)          Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/variant_length        int8     4.5 KiB    9 bytes     0.002           1  9.0 bytes     4.5 KiB             (9,)       (10000,)          Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/variant_NS            int8     4.49 KiB   9 bytes     0.002           1  9.0 bytes     4.49 KiB            (9,)       (10000,)          Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/variant_DB            bool     4.49 KiB   9 bytes     0.002           1  9.0 bytes     4.49 KiB            (9,)       (10000,)          Blosc(cname='zstd', clevel=7, shuffle=BITSHUFFLE, blocksize=0)  None
/variant_DP            int8     4.48 KiB   9 bytes     0.002           1  9.0 bytes     4.48 KiB            (9,)       (10000,)          Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/variant_H2            bool     4.48 KiB   9 bytes     0.002           1  9.0 bytes     4.48 KiB            (9,)       (10000,)          Blosc(cname='zstd', clevel=7, shuffle=BITSHUFFLE, blocksize=0)  None
/sample_id             object   4.47 KiB   24 bytes    0.0052          1  24.0 bytes    4.47 KiB            (3,)       (1000,)           Blosc(cname='zstd', clevel=7, shuffle=SHUFFLE, blocksize=0)     [VLenUTF8()]
/variant_id_mask       bool     4.46 KiB   9 bytes     0.002           1  9.0 bytes     4.46 KiB            (9,)       (10000,)          Blosc(cname='zstd', clevel=7, shuffle=BITSHUFFLE, blocksize=0)  None
/filter_id             object   4.45 KiB   24 bytes    0.0053          1  24.0 bytes    4.45 KiB            (3,)       (3,)              Blosc(cname='zstd', clevel=7, shuffle=SHUFFLE, blocksize=0)     [VLenUTF8()]
/contig_id             object   4.44 KiB   24 bytes    0.0053          1  24.0 bytes    4.44 KiB            (3,)       (3,)              Blosc(cname='zstd', clevel=7, shuffle=SHUFFLE, blocksize=0)     [VLenUTF8()]

Tip

Use the max-variant-chunks option to encode only the first few chunks of your dataset while doing these kinds of schema tuning operations!

Large dataset#

The explode and encode commands have powerful features for conversion on a single machine, and can take full advantage of large servers with many cores. Current biobank-scale datasets, however, are so large that we must go a step further and distribute computations over a cluster. Vcf2zarr provides some low-level utilities for doing this, which should be compatible with any cluster scheduler.

The distributed commands are split into three phases:

  • init <num_partitions>: initialise the computation, setting up the data structures needed for the bulk computation to be split into num_partitions independent partitions

  • partition <j>: perform the computation for partition j

  • finalise: complete the full process

When performing large-scale computations like this on a cluster, errors and job failures are essentially inevitable, and the commands are resilient to various failure modes.

Let’s go through the example above using the distributed commands. First, we run dexplode-init to create an ICF directory:

vcf2zarr dexplode-init sample.vcf.gz sample-dist.icf -n 5 -Q
num_partitions  3
num_samples     3
num_variants    9

Here we asked dexplode-init to set up an ICF store in which the data is split into 5 partitions. The number of partitions determines the level of parallelism, so we would usually set it to the number of parallel jobs we would like to use. The output of dexplode-init is important, though, as it tells us the actual number of partitions we got (partitioning is based on the VCF indexes, which have limited granularity). You should be careful to use this value in your scripts (the format is designed to be machine readable using e.g. cut and grep). In this case there are only 3 possible partitions.
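Since the summary is machine readable, a script can capture the real partition count rather than assuming it. A minimal sketch, parsing a saved copy of the output above (in a real script you would pipe the stdout of dexplode-init straight into awk):

```shell
# Output captured from the dexplode-init run above; normally you would run
#   vcf2zarr dexplode-init sample.vcf.gz sample-dist.icf -n 5 -Q
# and capture its stdout directly.
summary="num_partitions  3
num_samples     3
num_variants    9"

# Pick out the actual number of partitions from the two-column layout.
num_partitions=$(echo "$summary" | awk '$1 == "num_partitions" {print $2}')
echo "$num_partitions"
```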

Once dexplode-init is done and we know how many partitions we have, we need to call dexplode-partition this number of times:

vcf2zarr dexplode-partition sample-dist.icf 0
vcf2zarr dexplode-partition sample-dist.icf 1
vcf2zarr dexplode-partition sample-dist.icf 2

This is not how it would be done in practice, of course: you would use your cluster scheduler of choice to dispatch these operations.
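As a sketch of what a scheduler-driven version might look like, here is a loop that generates one submission per partition. Note that submit_job is a hypothetical placeholder for your scheduler’s submission command (sbatch, qsub, and so on), not part of vcf2zarr; here it just echoes what would be run.

```shell
# Hypothetical stand-in for a real scheduler submission command.
submit_job() {
    echo "submitting: $*"
}

num_partitions=3  # taken from the dexplode-init output above
j=0
while [ "$j" -lt "$num_partitions" ]; do
    submit_job vcf2zarr dexplode-partition sample-dist.icf "$j"
    j=$((j + 1))
done
```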

Todo

Document how to do this conveniently over some popular schedulers.

Tip

Use the --one-based argument in cases where it’s more convenient to index the partitions from 1 to n, rather than 0 to n - 1.

Finally we need to call dexplode-finalise:

vcf2zarr dexplode-finalise sample-dist.icf

Todo

Document the process for dencode, noting the information output about memory requirements.