sgkit.window_by_variant

sgkit.window_by_variant(ds, *, size, step=None, variant_contig='variant_contig', merge=True)

Add window information to a dataset, measured by number of variants.

Windows are defined over the variants dimension, and are used by some downstream functions to calculate statistics for each window. Windows never span contigs.
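As a rough illustration of how per-window statistics can be derived from window boundaries, the sketch below averages a per-variant value inside each window. It is plain Python rather than sgkit code: `per_window_mean` is a hypothetical helper, and the start/stop index lists are assumed to be half-open (stop exclusive).

```python
def per_window_mean(values, window_start, window_stop):
    """Hypothetical helper (not part of sgkit): mean of values in each window."""
    means = []
    for start, stop in zip(window_start, window_stop):
        chunk = values[start:stop]  # assumption: stop index is exclusive
        means.append(sum(chunk) / len(chunk))
    return means


# Six variants on one contig, split into two windows of three variants each.
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
means = per_window_mean(values, window_start=[0, 3], window_stop=[3, 6])
```

Here `means` holds one value per window, which is the general shape of the output that window-aware functions produce.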

Parameters
ds Dataset

Genotype call dataset.

size int

The window size, measured by number of variants.

step Optional[int] (default: None)

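The distance (number of variants) between start positions of windows. Defaults to size, which gives contiguous, non-overlapping windows; a step smaller than size gives overlapping windows.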

variant_contig Hashable (default: 'variant_contig')

Name of variable containing variant contig indexes. Defined by sgkit.variables.variant_contig_spec.

merge bool (default: True)

If True (the default), merge the input dataset and the computed output variables into a single dataset, otherwise return only the computed output variables. See Dataset merge behavior for more details.

Return type

Dataset

Returns

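A dataset containing the following variables:

window_contig (windows): The contig index of each window.

window_start (windows): The index of the first variant in each window.

window_stop (windows): The index one past the last variant in each window (exclusive).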

Examples

>>> import sgkit as sg
>>> ds = sg.simulate_genotype_call_dataset(n_variant=10, n_sample=2, n_contig=2)
>>> ds.variant_contig.values
array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
>>> ds.variant_position.values
array([0, 1, 2, 3, 4, 0, 1, 2, 3, 4])
>>> # Contiguous windows, each with the same number of variants (3)
>>> # except for the last window of each contig
>>> sg.window_by_variant(ds, size=3, merge=False)
<xarray.Dataset> Size: 96B
Dimensions:        (windows: 4)
Dimensions without coordinates: windows
Data variables:
    window_contig  (windows) int64 32B 0 0 1 1
    window_start   (windows) int64 32B 0 3 5 8
    window_stop    (windows) int64 32B 3 5 8 10
>>> # Overlapping windows
>>> sg.window_by_variant(ds, size=3, step=2, merge=False)
<xarray.Dataset> Size: 144B
Dimensions:        (windows: 6)
Dimensions without coordinates: windows
Data variables:
    window_contig  (windows) int64 48B 0 0 0 1 1 1
    window_start   (windows) int64 48B 0 2 4 5 7 9
    window_stop    (windows) int64 48B 3 5 5 8 10 10
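
The window boundaries shown in the examples above can be reproduced with a short pure-Python sketch. This is an illustrative re-derivation under the assumption that variants are sorted by contig, not sgkit's actual implementation; `sketch_windows_by_variant` is a hypothetical name.

```python
def sketch_windows_by_variant(variant_contig, size, step=None):
    """Illustrative only (not sgkit code): variant-count windows per contig."""
    step = size if step is None else step  # step defaults to size
    contigs, starts, stops = [], [], []
    n = len(variant_contig)
    i = 0
    while i < n:
        # Find the exclusive end of the current contig's run of variants.
        j = i
        while j < n and variant_contig[j] == variant_contig[i]:
            j += 1
        # Place a window start every `step` variants within [i, j).
        for start in range(i, j, step):
            contigs.append(variant_contig[i])
            starts.append(start)
            # Clip the stop at the contig end so windows never span contigs.
            stops.append(min(start + size, j))
        i = j
    return contigs, starts, stops


variant_contig = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
contiguous = sketch_windows_by_variant(variant_contig, size=3)
overlapping = sketch_windows_by_variant(variant_contig, size=3, step=2)
```

With these inputs, `contiguous` and `overlapping` hold the same contig, start, and stop values as the two outputs above, including the shorter final window on each contig.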