sgkit.window_by_variant

sgkit.window_by_variant(ds, *, size, step=None, variant_contig='variant_contig', merge=True)

Add window information to a dataset, measured by number of variants.

Windows are defined over the variants dimension, and are used by some downstream functions to calculate statistics for each window. Windows never span contigs.
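The rule that windows never span contigs can be illustrated with a small pure-Python sketch. This is not sgkit's implementation: the helper name `sketch_windows_by_variant` is hypothetical, and it assumes variants are ordered so that each contig forms one contiguous run. It only mirrors the behaviour described here, where windows are counted in variants and restart at every contig boundary.

```python
def sketch_windows_by_variant(variant_contig, size, step=None):
    """Illustrative sketch: variant-count windows that restart at each contig.

    Hypothetical helper, not part of sgkit. Assumes variants are ordered so
    that each contig forms one contiguous run of indexes.
    """
    step = size if step is None else step  # step falls back to size
    window_contig, window_start, window_stop = [], [], []
    n = len(variant_contig)
    begin = 0
    while begin < n:
        # Find the end (exclusive) of the current contig's run of variants.
        end = begin
        while end < n and variant_contig[end] == variant_contig[begin]:
            end += 1
        for start in range(begin, end, step):
            window_contig.append(variant_contig[begin])
            window_start.append(start)
            # Clip the stop so a window never extends past its contig.
            window_stop.append(min(start + size, end))
        begin = end
    return window_contig, window_start, window_stop


contigs, starts, stops = sketch_windows_by_variant(
    [0, 0, 0, 0, 0, 1, 1, 1, 1, 1], size=3
)
print(contigs)  # [0, 0, 1, 1]
print(starts)   # [0, 3, 5, 8]
print(stops)    # [3, 5, 8, 10]
```

In this sketch the last window of each contig is shorter than `size` because its stop is clipped at the contig boundary rather than borrowing variants from the next contig.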

Parameters:
ds Dataset

Genotype call dataset.

size int

The window size, measured by number of variants.

step int | None (default: None)

The distance, in number of variants, between the start positions of consecutive windows. If None, defaults to size, so that windows are contiguous and do not overlap.

variant_contig Hashable (default: 'variant_contig')

Name of variable containing variant contig indexes. Defined by sgkit.variables.variant_contig_spec.

merge bool (default: True)

If True (the default), merge the input dataset and the computed output variables into a single dataset, otherwise return only the computed output variables. See Dataset merge behavior for more details.
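As a rough illustration of how size and step interact, the sketch below lists the (start, stop) pairs for a single contig. The helper `single_contig_windows` is hypothetical and not an sgkit function. It assumes that a window begins every `step` variants and that its stop is clipped at the end of the contig.

```python
def single_contig_windows(n_variants, size, step=None):
    """Hypothetical helper: (start, stop) pairs for one contig of n_variants.

    The stop index is exclusive, so a window covers variants start..stop-1.
    """
    step = size if step is None else step  # step defaults to size
    return [(s, min(s + size, n_variants)) for s in range(0, n_variants, step)]


contiguous = single_contig_windows(5, size=3)           # step == size
overlapping = single_contig_windows(5, size=3, step=2)  # step < size

print(contiguous)   # [(0, 3), (3, 5)]
print(overlapping)  # [(0, 3), (2, 5), (4, 5)]
```

With `step` smaller than `size`, consecutive full-length windows in this sketch share `size - step` variants. With `step` equal to `size`, they are back to back.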

Return type:

Dataset

Returns:

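A dataset containing the following variables:

window_contig (windows): The contig index of each window.

window_start (windows): The index of the first variant in each window.

window_stop (windows): The index one past the last variant in each window.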

Examples

>>> import sgkit as sg
>>> ds = sg.simulate_genotype_call_dataset(n_variant=10, n_sample=2, n_contig=2)
>>> ds.variant_contig.values
array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
>>> ds.variant_position.values
array([0, 1, 2, 3, 4, 0, 1, 2, 3, 4])
>>> # Contiguous windows, each with the same number of variants (3)
>>> # except for the last window of each contig
>>> sg.window_by_variant(ds, size=3, merge=False)
<xarray.Dataset> Size: 96B
Dimensions:        (windows: 4)
Dimensions without coordinates: windows
Data variables:
    window_contig  (windows) int64 32B 0 0 1 1
    window_start   (windows) int64 32B 0 3 5 8
    window_stop    (windows) int64 32B 3 5 8 10
>>> # Overlapping windows
>>> sg.window_by_variant(ds, size=3, step=2, merge=False)
<xarray.Dataset> Size: 144B
Dimensions:        (windows: 6)
Dimensions without coordinates: windows
Data variables:
    window_contig  (windows) int64 48B 0 0 0 1 1 1
    window_start   (windows) int64 48B 0 2 4 5 7 9
    window_stop    (windows) int64 48B 3 5 5 8 10 10
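
The window variables are meant to be consumed by downstream functions that compute one statistic per window. The following is a minimal sketch of that idea in plain Python, not a description of how sgkit does it. The `per_variant` values and the `window_means` helper are hypothetical. The boundaries are copied from the first example above, with `window_stop` treated as exclusive.

```python
# Window boundaries copied from the first example (size=3, two contigs).
window_start = [0, 3, 5, 8]
window_stop = [3, 5, 8, 10]

# Hypothetical per-variant values, one per variant; not sgkit output.
per_variant = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]


def window_means(values, starts, stops):
    """Mean of values[start:stop] for each window; stop is exclusive."""
    return [sum(values[a:b]) / (b - a) for a, b in zip(starts, stops)]


means = window_means(per_variant, window_start, window_stop)
print(means)  # [2.0, 4.5, 7.0, 9.5]
```

Because no window crosses a contig boundary, each mean in this sketch is computed only from variants on a single contig.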