sgkit.io.bgen.read_bgen
- sgkit.io.bgen.read_bgen(path, metafile_path=None, sample_path=None, chunks='auto', lock=False, persist=True, contig_dtype='str', gp_dtype='float32')
Read BGEN dataset.
Loads a single BGEN dataset as dask arrays within a Dataset from a .bgen file.
- Parameters:
- path
Union[str, Path]
Path to BGEN file.
- metafile_path
Union[str, Path, None] (default: None)
Path to companion index file used to determine BGEN byte offsets. Defaults to path + ".metafile" if not provided. This file is necessary for reading BGEN genotype probabilities, and it is generated the first time the file is read if it does not already exist. Creating it can make the first call to this function much slower than subsequent calls.
- sample_path
Union[str, Path, None] (default: None)
Path to .sample file. This is used to fetch sample identifiers; when provided, it is preferred over sample identifiers embedded in the .bgen file.
- chunks
Union[str, int, Tuple[int, int, int]] (default: 'auto')
Chunk size for genotype probability data (3 dimensions).
- lock
bool (default: False)
Whether or not to synchronize concurrent reads of file blocks. This is passed through to [dask.array.from_array](https://docs.dask.org/en/latest/array-api.html#dask.array.from_array).
- persist
bool (default: True)
Whether or not to persist variant information in memory. This is an important performance consideration: when False, the metadata file for this data is read multiple times.
- contig_dtype
Any (default: 'str')
Data type for contig names. This may also be an integer type (e.g. "int"), but reading will fail if any contig name cannot be converted to an integer.
- gp_dtype
Any (default: 'float32')
Data type for genotype probabilities.
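The contig_dtype note above implies that an integer dtype is only safe when every contig name is numeric. A minimal sketch of a pre-check follows, assuming the contig names are already known (for example from an earlier read with the default "str"). The helper name `choose_contig_dtype` is hypothetical and not part of sgkit, and Python's `int()` parsing is used as a stand-in for whatever conversion the library performs.

```python
def choose_contig_dtype(contig_names):
    """Return "int" if every contig name parses as an integer, else "str".

    Hypothetical helper: it only approximates the library's conversion
    by using Python's built-in int() parser.
    """
    try:
        for name in contig_names:
            int(name)  # raises ValueError for names such as "X" or "MT"
    except ValueError:
        return "str"
    return "int"


# Purely numeric names are safe to read with an integer dtype.
numeric_choice = choose_contig_dtype(["1", "2", "22"])
# A single non-numeric name forces the fallback to strings.
mixed_choice = choose_contig_dtype(["1", "2", "X"])
```

The returned value can then be passed as contig_dtype, which avoids the failure case described for non-numeric contig names.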
Warning
Only bi-allelic, diploid BGEN files are currently supported.
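As an illustration of how the parameters above fit together, here is a hedged sketch that collects the arguments in one place and calls read_bgen. The helper names `bgen_read_options` and `load_bgen`, the chunk sizes, and the file path in the usage note are all hypothetical. The metafile location follows the documented "path + .metafile" rule literally, and the last chunk dimension of 3 assumes three genotype states per call for bi-allelic, diploid data.

```python
from pathlib import Path


def bgen_read_options(bgen_path, sample_path=None, variant_chunk=1000, sample_chunk=500):
    """Collect keyword arguments for read_bgen (hypothetical helper, not part of sgkit)."""
    bgen_path = Path(bgen_path)
    return {
        "path": bgen_path,
        # Spell out the documented default location: path + ".metafile".
        "metafile_path": Path(str(bgen_path) + ".metafile"),
        "sample_path": sample_path,
        # Assumption: 3 genotype states per call for bi-allelic, diploid data.
        "chunks": (variant_chunk, sample_chunk, 3),
        "lock": False,
        "persist": True,
        "contig_dtype": "str",
        "gp_dtype": "float32",
    }


def load_bgen(bgen_path, **overrides):
    """Call read_bgen with the options above; requires sgkit with BGEN support."""
    # Imported lazily so that bgen_read_options works without sgkit installed.
    from sgkit.io.bgen import read_bgen

    options = bgen_read_options(bgen_path)
    options.update(overrides)
    return read_bgen(**options)
```

With a real file, a call such as `load_bgen("data/example.bgen", gp_dtype="float64")` would return the dataset; the path here is a placeholder. Passing metafile_path explicitly makes it easy to see where the index file will be created on the first, slower read.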
- Return type:
Dataset
- Returns:
A dataset containing the following variables:
- sgkit.variables.variant_id_spec (variants)
- sgkit.variables.variant_contig_spec (variants)
- sgkit.variables.variant_position_spec (variants)
- sgkit.variables.variant_allele_spec (variants)
- sgkit.variables.sample_id_spec (samples)
- sgkit.variables.call_dosage_spec (variants, samples)
- sgkit.variables.call_dosage_mask_spec (variants, samples)
- sgkit.variables.call_genotype_probability_spec (variants, samples, genotypes)
- sgkit.variables.call_genotype_probability_mask_spec (variants, samples, genotypes)
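The returned variables include both a dosage (variants, samples) and genotype probabilities (variants, samples, genotypes). The sketch below shows one common way the two relate for bi-allelic, diploid data. It rests on two assumptions that are not stated in this page: that the genotypes axis is ordered as homozygous reference, heterozygous, homozygous alternate, and that dosage is the expected count of the alternate allele. The function name is hypothetical.

```python
def dosage_from_probabilities(gp):
    """Expected alternate-allele count from three genotype probabilities.

    Assumed order of gp: [P(hom ref), P(het), P(hom alt)].
    """
    p_hom_ref, p_het, p_hom_alt = gp
    # A heterozygous call carries one alternate allele, a homozygous
    # alternate call carries two, and homozygous reference carries none.
    return p_het + 2.0 * p_hom_alt


certain_het = dosage_from_probabilities([0.0, 1.0, 0.0])
certain_hom_alt = dosage_from_probabilities([0.0, 0.0, 1.0])
uncertain = dosage_from_probabilities([0.25, 0.5, 0.25])
```

This is only meant to help interpret the two variables side by side; the library's own computation should be checked before relying on the correspondence.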