sgkit.io.bgen.read_bgen

sgkit.io.bgen.read_bgen(path, metafile_path=None, sample_path=None, chunks='auto', lock=False, persist=True, contig_dtype='str', gp_dtype='float32')

Read BGEN dataset.

Loads a single BGEN dataset as dask arrays within a Dataset from a .bgen file.

Parameters:
path str | Path

Path to BGEN file.

metafile_path str | Path | None (default: None)

Path to the companion index file used to determine BGEN byte offsets. Defaults to path + “.metafile” if not provided. This file is required for reading BGEN genotype probabilities. If it does not already exist, it is generated the first time the file is read, which can make the first call to this function much slower than subsequent calls.

sample_path str | Path | None (default: None)

Path to .sample file, by default None. This is used to fetch sample identifiers and, when provided, takes precedence over sample identifiers embedded in the .bgen file.

chunks str | int | Tuple[int, int, int] (default: 'auto')

Chunk size for genotype probability data (3 dimensions), by default “auto”.

lock bool (default: False)

Whether or not to synchronize concurrent reads of file blocks, by default False. This is passed through to [dask.array.from_array](https://docs.dask.org/en/latest/array-api.html#dask.array.from_array).

persist bool (default: True)

Whether or not to persist variant information in memory, by default True. This is an important performance consideration as the metadata file for this data will be read multiple times when False.

contig_dtype Any (default: 'str')

Data type for contig names, by default “str”. This may also be an integer type (e.g. “int”), but will fail if any of the contig names cannot be converted to integers.

gp_dtype Any (default: 'float32')

Data type for genotype probabilities, by default “float32”.

Warning

Only bi-allelic, diploid BGEN files are currently supported.
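The parameter descriptions above can be tied together in a short calling sketch. It is an illustration under stated assumptions, not part of sgkit: `default_metafile_path`, `pick_contig_dtype`, and `load_bgen` are hypothetical helper names, and Python's `int()` stands in for whatever integer conversion sgkit applies to contig names, so edge cases may differ.

```python
from pathlib import Path


def default_metafile_path(bgen_path):
    """Hypothetical helper mirroring the documented default: path + ".metafile"."""
    return Path(str(bgen_path) + ".metafile")


def pick_contig_dtype(contig_names):
    """Hypothetical helper: choose "int" only if every contig name parses as an integer.

    Uses Python's int() as a stand-in for the conversion performed by sgkit.
    """
    try:
        for name in contig_names:
            int(name)
    except ValueError:
        return "str"
    return "int"


def load_bgen(bgen_path, sample_path=None, known_contigs=None):
    """Sketch of a read_bgen call; requires sgkit installed with BGEN support."""
    # Imported lazily so the helpers above can be used without sgkit installed.
    from sgkit.io.bgen import read_bgen

    # Fall back to the documented default "str" when contig names are unknown.
    contig_dtype = pick_contig_dtype(known_contigs) if known_contigs else "str"
    return read_bgen(
        bgen_path,
        metafile_path=default_metafile_path(bgen_path),
        sample_path=sample_path,
        contig_dtype=contig_dtype,
    )
```

Passing `metafile_path` explicitly here only makes the documented default visible; leaving it as None should resolve to the same location. Checking contig names up front avoids requesting an integer `contig_dtype` for files that contain names such as "X".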

Return type:

Dataset

Returns:

A dataset containing the following variables: