vcfpartition
Overview
The vcfpartition utility outputs a set of region strings that partition indexed VCF/BCF files into either an approximate number of parts, or into parts of approximately a given size. This is useful for parallel processing of large VCF files.
Why is this in bio2zarr?
The vcfpartition program is packaged with bio2zarr because the underlying functionality was developed for vcf2zarr, and there is currently no easy way to split up the processing of large VCFs.
Partitioning into a number of parts
Here, we partition a BCF file into three parts using the --num-parts/-n argument:
vcfpartition CEUTrio.20.21.gatk3.4.g.bcf -n 3
20:1- CEUTrio.20.21.gatk3.4.g.bcf
21:1-10092544 CEUTrio.20.21.gatk3.4.g.bcf
21:10092557- CEUTrio.20.21.gatk3.4.g.bcf
The output is a tab-delimited stream of region strings and the file path.
Tip
The file path is included in the output to make it easy to work with multiple files at once, and also to simplify shell scripting tasks.
We can use this, for example, in a shell loop to count the number of variants in each partition:
# $split is left unquoted on purpose: it expands to "<region> <path>", two arguments.
vcfpartition CEUTrio.20.21.gatk3.4.g.bcf -n 3 | while read -r split;
do
    bcftools view -Hr $split | wc -l
done
3450
5772
10688
Note
The number of variants in each partition is quite uneven, which is generally true of files at all scales.
Another important point is that there is a granularity limit to the partitions:
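Because the output is tab-delimited, each line can also be read into two separate variables rather than one. The following is a minimal sketch of that pattern; `split_fields` is a hypothetical helper name, and two sample lines copied from the output above stand in for a real vcfpartition run.

```shell
# Print the region and path from each tab-delimited line on stdin.
# split_fields is a hypothetical helper, not part of vcfpartition.
split_fields() {
    tab=$(printf '\t')
    while IFS="$tab" read -r region path; do
        # With real data you might instead run:
        #   bcftools view -H -r "$region" "$path" | wc -l
        printf 'region=%s path=%s\n' "$region" "$path"
    done
}

# Sample lines copied from the output above (assumed tab-separated).
printf '20:1-\tCEUTrio.20.21.gatk3.4.g.bcf\n21:1-10092544\tCEUTrio.20.21.gatk3.4.g.bcf\n' \
    | split_fields
```

Splitting on the tab alone keeps the two fields separate, so both can be quoted safely even if a file path contains spaces.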
vcfpartition CEUTrio.20.21.gatk3.4.g.bcf -n 30
20:1- CEUTrio.20.21.gatk3.4.g.bcf
21:1-10092544 CEUTrio.20.21.gatk3.4.g.bcf
21:10092557- CEUTrio.20.21.gatk3.4.g.bcf
Here, we asked for 30 partitions, but the underlying indexes provide a maximum of 3.
Warning
Do not assume that the number of partitions you ask for is what you get!
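One way to guard against this in a script is to count the lines actually returned and compare that with the number requested. This is a sketch only: `count_partitions` is a hypothetical helper, and the sample lines are copied from the -n 30 run above rather than produced by calling vcfpartition.

```shell
# count_partitions is a hypothetical helper: it counts non-empty lines on
# stdin, which is one line per partition in the output shown above.
count_partitions() {
    awk 'NF { n++ } END { print n + 0 }'
}

requested=30
# Sample output copied from the -n 30 run above; with real data, pipe
#   vcfpartition CEUTrio.20.21.gatk3.4.g.bcf -n "$requested"
# into count_partitions instead.
actual=$(printf '20:1-\tCEUTrio.20.21.gatk3.4.g.bcf\n21:1-10092544\tCEUTrio.20.21.gatk3.4.g.bcf\n21:10092557-\tCEUTrio.20.21.gatk3.4.g.bcf\n' \
    | count_partitions)

if [ "$actual" -ne "$requested" ]; then
    echo "requested $requested partitions, got $actual"
fi
```

Downstream steps, such as sizing a job array, can then use the counted value rather than the requested one.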
Partitioning into a fixed size
It is also possible to partition a VCF file into chunks of approximately a given size.
ls -lh NA12878.prod.chr20snippet.g.vcf.gz
-rw-r--r-- 1 runner docker 3.8M Jan 17 17:06 NA12878.prod.chr20snippet.g.vcf.gz
In this example, we have a 3.8M file and would like to process it in chunks of approximately 500K at a time:
vcfpartition NA12878.prod.chr20snippet.g.vcf.gz -s 500K
20:60001-196608 NA12878.prod.chr20snippet.g.vcf.gz
20:196611-327680 NA12878.prod.chr20snippet.g.vcf.gz
20:327695-425984 NA12878.prod.chr20snippet.g.vcf.gz
20:425988-557056 NA12878.prod.chr20snippet.g.vcf.gz
20:557078-671744 NA12878.prod.chr20snippet.g.vcf.gz
20:671751-786432 NA12878.prod.chr20snippet.g.vcf.gz
20:786434-917504 NA12878.prod.chr20snippet.g.vcf.gz
20:917522- NA12878.prod.chr20snippet.g.vcf.gz
Tip
Suffixes like M, MiB, G, GB, or raw numbers in bytes are all supported.
We get 8 partitions in this example. Note again that these target sizes are quite approximate.
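As a rough sanity check, the expected number of partitions can be estimated by dividing the file size by the target size and rounding up. The sketch below is plain arithmetic, not a description of how vcfpartition chooses boundaries; `estimate_parts` is a hypothetical helper, and treating 3.8M and 500K as multiples of 1024 is an assumption.

```shell
# estimate_parts is a hypothetical helper: ceiling division of a file size
# by a target chunk size, both given in bytes.
estimate_parts() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# 3.8M is roughly 3984589 bytes; 500K is taken here as 512000 bytes.
# Both conversions are assumptions made for illustration.
estimate_parts 3984589 512000
```

For these inputs the estimate agrees with the 8 partitions shown above, but since the target sizes are approximate, the actual count for other files may differ.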
Parallel example
Here we illustrate using vcfpartition to count the variants in each partition in parallel using xargs. In this case we use 3 partitions with 3 processes, but because the number of variants per partition can be quite uneven, it is a good idea to partition the work into (say) four times the number of cores available for processing.
vcfpartition CEUTrio.20.21.gatk3.4.g.bcf -n 3 \
| xargs -P 3 -I {} sh -c "bcftools view -Hr {} | wc -l"
3450
5772
10688
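The suggestion of using about four times as many partitions as cores can be scripted along these lines. This is a sketch under stated assumptions: `parts_for_cores` is a hypothetical helper, the factor of 4 is simply the figure suggested above, and the core count comes from `getconf _NPROCESSORS_ONLN`, which is available on most Linux and macOS systems.

```shell
# parts_for_cores is a hypothetical helper: multiply a core count by an
# oversubscription factor (4 by default, following the suggestion above).
parts_for_cores() {
    echo $(( $1 * ${2:-4} ))
}

# Fall back to a single core if getconf is unavailable.
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
num_parts=$(parts_for_cores "$cores")
echo "cores=$cores num_parts=$num_parts"

# With real data, something like:
#   vcfpartition CEUTrio.20.21.gatk3.4.g.bcf -n "$num_parts" \
#       | xargs -P "$cores" -I {} sh -c "bcftools view -Hr {} | wc -l"
```

Two caveats apply: the number of partitions returned may be smaller than `num_parts`, and when xargs runs jobs in parallel the counts are not guaranteed to be printed in partition order.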