sgkit.io.bgen.rechunk_bgen
- sgkit.io.bgen.rechunk_bgen(ds, output, *, chunk_length=10000, chunk_width=1000, compressor=Blosc(cname='zstd', clevel=7, shuffle=BITSHUFFLE, blocksize=0), probability_dtype='uint8', max_mem='4GB', pack=True, tempdir=None)
Rechunk BGEN dataset as Zarr.
This function uses the rechunker algorithm (https://rechunker.readthedocs.io/en/latest/) to rechunk certain fields in a provided Dataset for better downstream performance. Depending on the system memory available (and the max_mem setting), this rechunking may occur without the need for any intermediate data store. Otherwise, the intermediate store requires approximately as much disk space as was needed to store the original BGEN data. Experiments show that the resulting Zarr representation is ~20% larger than the original BGEN data, even with all available optimizations and fairly aggressive compression (i.e. the default clevel 7).
Note that this function is not evaluated lazily. The rechunking algorithm will run inline so calls to it may be slow. The resulting Dataset is generated based on the final, serialized Zarr data.
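To get a feel for how chunk_length and chunk_width relate to the max_mem budget, the sketch below estimates the number of chunks and the uncompressed size of one full chunk of genotype probabilities. The helper name chunk_layout and its values_per_call and bytes_per_value arguments are hypothetical, chosen here to mirror the defaults (two stored genotypes with pack=True, one byte per value with probability_dtype='uint8'). It ignores every other variable in the Dataset and any overhead of the rechunking itself.

```python
import math


def chunk_layout(n_variants, n_samples, chunk_length=10_000, chunk_width=1_000,
                 values_per_call=2, bytes_per_value=1):
    """Return (number of chunks, uncompressed bytes in one full chunk).

    Hypothetical helper for back-of-the-envelope sizing only; it is not
    part of sgkit and does not reflect how rechunker budgets memory.
    """
    n_chunks = math.ceil(n_variants / chunk_length) * math.ceil(n_samples / chunk_width)
    chunk_bytes = chunk_length * chunk_width * values_per_call * bytes_per_value
    return n_chunks, chunk_bytes


# Example: 1,000,000 variants and 50,000 samples with the default chunking.
n_chunks, chunk_bytes = chunk_layout(n_variants=1_000_000, n_samples=50_000)
print(n_chunks, chunk_bytes)
```

Under these assumptions a default chunk holds 20,000,000 bytes before compression, which is small relative to the default max_mem of 4GB.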
- Parameters:
- ds
Dataset
Dataset to rechunk, typically the result from read_bgen.
- output
str | Path | MutableMapping[str, bytes]
Zarr store or path to directory in file system.
- chunk_length
int (default: 10000)
Length (number of variants) of chunks in which data are stored, by default 10_000.
- chunk_width
int (default: 1000)
Width (number of samples) to use when storing chunks in output, by default 1_000.
- compressor
Any | None (default: Blosc(cname='zstd', clevel=7, shuffle=BITSHUFFLE, blocksize=0))
Zarr compressor. No compression is used when set to None.
- probability_dtype
Any | None (default: 'uint8')
Data type used to encode genotype probabilities, must be either uint8 or uint16. Setting this parameter results in a loss of precision. If None, probabilities will not be altered when stored.
- max_mem
str (default: '4GB')
The amount of memory (in bytes) that workers are allowed to use. A string (e.g. 100MB) can also be used.
- pack
bool (default: True)
Whether or not to optimize variable representations by removing unnecessary dimensions and elements. This includes storing 2 genotypes instead of 3, omitting dosage and collapsing the genotype probability mask to 2 dimensions. All of the above are restored in the resulting Dataset at the expense of extra computations on read.
- tempdir
str | Path | None (default: None)
Temporary directory where intermediate files are stored. The default None means use the system default temporary directory.
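The loss of precision mentioned for probability_dtype can be illustrated with a generic linear quantization of a probability in [0, 1] to an unsigned integer code. This is a sketch only: the functions quantize and dequantize are hypothetical, and sgkit's actual encoding may differ (for example, it may reserve a code for missing values).

```python
def quantize(p, bits=8):
    """Map a probability in [0, 1] to an unsigned integer code.

    Generic linear quantization, assumed here for illustration;
    not necessarily the scheme sgkit uses.
    """
    levels = (1 << bits) - 1  # 255 for 8 bits, 65535 for 16 bits
    return round(p * levels)


def dequantize(code, bits=8):
    """Map an integer code back to an approximate probability."""
    levels = (1 << bits) - 1
    return code / levels


p = 0.1234
for bits in (8, 16):
    restored = dequantize(quantize(p, bits), bits)
    # The round-trip error is at most half a quantization step.
    print(bits, abs(restored - p))
```

With this scheme the round-trip error is bounded by 1/(2 * 255) for 8 bits and 1/(2 * 65535) for 16 bits, which is the kind of trade-off to weigh when choosing between uint8 and uint16.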
Warning
This function is only applicable to diploid, bi-allelic BGEN datasets.
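The diploid, bi-allelic restriction is what makes the pack option possible: such a call has exactly three genotype classes whose probabilities sum to 1, so one of them is redundant and dosage can be recomputed. The sketch below shows the idea with plain tuples. The function names and the choice of which genotype to drop (homozygous reference here) are assumptions made for illustration, not a description of sgkit's internals.

```python
def pack(gp):
    """Keep two of the three genotype probabilities.

    gp is (P(hom ref), P(het), P(hom alt)); dropping the first
    component is an assumption made for this illustration.
    """
    _p_hom_ref, p_het, p_hom_alt = gp
    return (p_het, p_hom_alt)


def unpack(packed):
    """Restore all three probabilities and the expected alternate allele dosage."""
    p_het, p_hom_alt = packed
    p_hom_ref = 1.0 - p_het - p_hom_alt  # probabilities sum to 1
    dosage = p_het + 2.0 * p_hom_alt     # expected number of alternate alleles
    return (p_hom_ref, p_het, p_hom_alt), dosage


gp, dosage = unpack(pack((0.25, 0.5, 0.25)))
print(gp, dosage)
```

The restore step is the "extra computations on read" that the pack parameter trades for a smaller stored representation.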
- Return type:
Dataset
- Returns:
The rechunked dataset.
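A typical call might look like the following minimal sketch. The wrapper name rechunk_to_zarr and the file paths are hypothetical; the keyword arguments simply restate the documented defaults. It assumes sgkit is installed with BGEN support, and the import is placed inside the function so that the sketch can be loaded without sgkit present.

```python
def rechunk_to_zarr(bgen_path, zarr_path):
    """Read a BGEN file and write it out as rechunked Zarr (hypothetical wrapper)."""
    # Imported lazily so this sketch can be loaded without sgkit installed.
    from sgkit.io.bgen import read_bgen, rechunk_bgen

    ds = read_bgen(bgen_path)
    # Not lazy: the rechunking runs inline, so this call may take a while.
    return rechunk_bgen(
        ds,
        zarr_path,
        chunk_length=10_000,
        chunk_width=1_000,
        probability_dtype="uint8",
        max_mem="4GB",
        pack=True,
    )


# Hypothetical paths, shown for illustration only:
# ds = rechunk_to_zarr("example.bgen", "example.zarr")
```

The returned Dataset is built from the serialized Zarr data, so later computations read from the rechunked store and not from the original BGEN file.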