vcfpartition#

Overview#

The vcfpartition utility outputs a set of region strings that partition indexed VCF/BCF files into either an approximate number of parts, or into parts of approximately a given size. This is useful for parallel processing of large VCF files.

Why is this in bio2zarr?

The vcfpartition program is packaged with bio2zarr because the underlying functionality was developed for vcf2zarr, and there is currently no other easy way to split up the processing of large VCFs.

Partitioning into a number of parts#

Here, we partition a BCF file into three parts using the --num-parts/-n argument:

vcfpartition CEUTrio.20.21.gatk3.4.g.bcf -n 3
20:1-	CEUTrio.20.21.gatk3.4.g.bcf
21:1-10092544	CEUTrio.20.21.gatk3.4.g.bcf
21:10092557-	CEUTrio.20.21.gatk3.4.g.bcf

The output is a tab-delimited stream of region strings and the corresponding file paths.

Tip

The file path is included in the output to make it easy to work with multiple files at once, and also to simplify shell scripting tasks.
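
Because the output is tab-delimited, the two columns are also easy to pull apart with standard tools. As an illustrative sketch, cut can extract just the region strings or just the file paths:

vcfpartition CEUTrio.20.21.gatk3.4.g.bcf -n 3 | cut -f 1
vcfpartition CEUTrio.20.21.gatk3.4.g.bcf -n 3 | cut -f 2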

We can use this, for example, in a shell loop to count the number of variants in each partition:

vcfpartition CEUTrio.20.21.gatk3.4.g.bcf -n 3 | while read split; 
do
    bcftools view -Hr $split | wc -l
done
3450
5772
10688
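
The unquoted $split above relies on shell word-splitting to pass the region and the file path to bcftools as separate arguments. An equivalent, slightly more explicit form reads the two tab-separated fields into separate variables:

vcfpartition CEUTrio.20.21.gatk3.4.g.bcf -n 3 | while read -r region file;
do
    bcftools view -H -r "$region" "$file" | wc -l
done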

Note

The number of variants in each partition is quite uneven; this is generally true for files of all scales.

Another important point is that there is a granularity limit to the partitions:

vcfpartition CEUTrio.20.21.gatk3.4.g.bcf -n 30
20:1-	CEUTrio.20.21.gatk3.4.g.bcf
21:1-10092544	CEUTrio.20.21.gatk3.4.g.bcf
21:10092557-	CEUTrio.20.21.gatk3.4.g.bcf

Here, we asked for 30 partitions, but the underlying indexes only support a maximum of 3.

Warning

Do not assume that the number of partitions you ask for is what you get!
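
A quick way to check how many partitions you actually get is to count the output lines:

vcfpartition CEUTrio.20.21.gatk3.4.g.bcf -n 30 | wc -l
3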

Partitioning into a fixed size#

It is also possible to partition a VCF file into chunks of approximately a given size.

ls -lh  NA12878.prod.chr20snippet.g.vcf.gz
-rw-r--r-- 1 runner docker 3.8M Nov 26 09:12 NA12878.prod.chr20snippet.g.vcf.gz

In this example, we have a 3.8M file that we would like to process in chunks of approximately 500K at a time:

vcfpartition  NA12878.prod.chr20snippet.g.vcf.gz -s 500K
20:60001-196608	NA12878.prod.chr20snippet.g.vcf.gz
20:196611-327680	NA12878.prod.chr20snippet.g.vcf.gz
20:327695-425984	NA12878.prod.chr20snippet.g.vcf.gz
20:425988-557056	NA12878.prod.chr20snippet.g.vcf.gz
20:557078-671744	NA12878.prod.chr20snippet.g.vcf.gz
20:671751-786432	NA12878.prod.chr20snippet.g.vcf.gz
20:786434-917504	NA12878.prod.chr20snippet.g.vcf.gz
20:917522-	NA12878.prod.chr20snippet.g.vcf.gz

Tip

Suffixes like M, MiB, G, GB, or raw numbers in bytes are all supported.

We get 8 partitions in this example. Note again that the partition sizes are only approximately equal to the requested target size.
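
The size-based partitions can be consumed in exactly the same way as the count-based ones. For example, the counting loop from the previous section works unchanged (the per-partition counts are not shown here, since they depend on the data):

vcfpartition NA12878.prod.chr20snippet.g.vcf.gz -s 500K | while read -r region file;
do
    bcftools view -H -r "$region" "$file" | wc -l
done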

Parallel example#

Here we illustrate using vcfpartition with xargs to count the variants in each partition in parallel. In this case we use 3 partitions with 3 processes, but because the number of variants per partition can be quite uneven, it is a good idea to split the work into (say) four times the number of cores available for processing.

vcfpartition CEUTrio.20.21.gatk3.4.g.bcf -n 3 \
    | xargs -P 3 -I {} sh -c "bcftools view -Hr {} | wc -l"
3450
5772
10688
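
As a sketch of that advice, on systems where the GNU nproc command is available (an assumption; elsewhere use getconf _NPROCESSORS_ONLN or a hard-coded count), the partition count can be tied to the number of cores. Keep in mind that the underlying indexes may still cap the number of partitions actually produced:

vcfpartition CEUTrio.20.21.gatk3.4.g.bcf -n $(( $(nproc) * 4 )) \
    | xargs -P "$(nproc)" -I {} sh -c "bcftools view -Hr {} | wc -l"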