sgkit.io.bgen.bgen_to_zarr
- sgkit.io.bgen.bgen_to_zarr(input, output, region=None, chunk_length=10000, chunk_width=1000, temp_chunk_length=100, compressor=Blosc(cname='zstd', clevel=7, shuffle=BITSHUFFLE, blocksize=0), probability_dtype='uint8', max_mem='4GB', pack=True, tempdir=None)
Convert a BGEN file to a Zarr on-disk store.
This function is a convenience for calling read_bgen() followed by rechunk_bgen().
- Parameters
- input : str | Path
Path to local BGEN dataset.
- output : str | Path | MutableMapping[str, bytes]
Zarr store or path to directory in file system.
- region : Mapping[Hashable, Any] | None (default: None)
Indexers on dataset dimensions used to define a subset of data to convert. Must be None or a dict with keys matching dimension names and values equal to integers or slice objects. This is passed directly to Dataset.isel so it has the same semantics.
- chunk_length : int (default: 10000)
Length (number of variants) of chunks in which data are stored, by default 10_000.
- chunk_width : int (default: 1000)
Width (number of samples) to use when storing chunks in output, by default 1_000.
- temp_chunk_length : int (default: 100)
Length of chunks used in raw BGEN read, by default 100. This defines the vertical chunking (i.e. in the variants dimension) used when reading the raw data. Because there is no horizontal chunking at this phase (i.e. in the samples dimension), this value should be much smaller than the target chunk_length.
- compressor : Any | None (default: Blosc(cname='zstd', clevel=7, shuffle=BITSHUFFLE, blocksize=0))
Zarr compressor, by default Blosc + zstd with compression level 7. No compression is used when set to None.
- probability_dtype : Any | None (default: 'uint8')
Data type used to encode genotype probabilities; must be either uint8 or uint16. Setting this parameter results in a loss of precision. If None, probabilities will not be altered when stored.
- max_mem : str (default: '4GB')
The amount of memory (in bytes) that workers are allowed to use. A string (e.g. 100MB) can also be used.
- pack : bool (default: True)
Whether or not to optimize variable representations by removing unnecessary dimensions and elements. This includes storing 2 genotypes instead of 3, omitting dosage and collapsing the genotype probability mask to 2 dimensions. All of the above are restored in the resulting Dataset at the expense of extra computations on read.
- tempdir : str | Path | None (default: None)
Temporary directory where intermediate files are stored. The default None means use the system default temporary directory.
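To get a feel for how chunk_length and chunk_width shape the output, the following is a minimal back-of-the-envelope sketch in plain Python. It does not call sgkit. The helper names (chunk_grid, uncompressed_chunk_bytes) and the dataset dimensions are hypothetical, and the byte estimate assumes a packed store (2 genotype probabilities per call) encoded as uint8, before compression.

```python
import math


def chunk_grid(n_variants, n_samples, chunk_length=10_000, chunk_width=1_000):
    """Return (chunks along variants, chunks along samples) for the output store."""
    return (
        math.ceil(n_variants / chunk_length),
        math.ceil(n_samples / chunk_width),
    )


def uncompressed_chunk_bytes(
    chunk_length=10_000, chunk_width=1_000, n_genotypes=2, itemsize=1
):
    """Estimate the uncompressed size of one full probability chunk.

    Assumes pack=True (2 stored genotypes) and uint8 (1 byte per value).
    """
    return chunk_length * chunk_width * n_genotypes * itemsize


# Hypothetical dataset: 1,000,000 variants by 50,000 samples.
print(chunk_grid(1_000_000, 50_000))  # (100, 50)
print(uncompressed_chunk_bytes())     # 20000000, i.e. about 20 MB per chunk
```

Under the same assumptions, switching to uint16 (itemsize=2) doubles the per-chunk estimate, which is one reason to revisit chunk_length or chunk_width when changing probability_dtype.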
Warning
This function is only applicable to diploid, bi-allelic BGEN datasets.
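A call using the documented parameters might look like the sketch below. The wrapper name convert_bgen and any paths passed to it are hypothetical, and the region key "variants" assumes that the variants dimension of the dataset carries that name. The import is done inside the function so the snippet can be loaded even where sgkit is not installed.

```python
from pathlib import Path


def convert_bgen(bgen_path, zarr_path):
    """Convert the first 100,000 variants of a BGEN file to a Zarr store.

    Hypothetical wrapper: only the keyword arguments come from the
    bgen_to_zarr signature documented above.
    """
    # Lazy import: sgkit is only required when the function is called.
    from sgkit.io.bgen import bgen_to_zarr

    return bgen_to_zarr(
        Path(bgen_path),
        Path(zarr_path),
        # Assumption: the variants dimension is named "variants".
        region={"variants": slice(0, 100_000)},
        chunk_length=10_000,        # variants per stored chunk
        chunk_width=1_000,          # samples per stored chunk
        temp_chunk_length=100,      # much smaller than chunk_length
        probability_dtype="uint8",  # lossy encoding; None keeps values unaltered
        max_mem="4GB",
        pack=True,
    )
```

Because region is passed to Dataset.isel, the slice selects by position along the dimension, not by genomic coordinate.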